Nutanix Cluster ZooKeeper Error

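For the past few days, the Nutanix system at my company has kept raising Cluster alerts. This is no joke: if the Cluster cannot run properly, data safety can no longer be guaranteed. After I filed a case with the vendor, it was quickly assigned to a Support engineer, but the experience left me a little worried, so here are my notes on how to check whether the ZooKeeper service is healthy.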

The system kept reporting that the Cluster alert was caused by a ZooKeeper service error, so the culprit was clearly ZooKeeper.

1. Run a Nutanix Cluster Check (NCC for short)
Command: ncc health_checks system_checks zkinfo_check_plugin

2. Check the ZooKeeper system records on the CVM
(1). Log in to a node's CVM with an SSH client
(2). Check the "/etc/hosts" file
Command: cat /etc/hosts

Note: my environment has 3 nodes.
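To pull just the ZooKeeper lines out of "/etc/hosts" for the comparison in the next step, a small helper like the one below may be handy. This is only a sketch: the function name list_zk_hosts is my own, and it assumes the ZooKeeper nodes are aliased zk1, zk2, and so on, which is what the sed pattern in step 3 relies on.

```shell
# List "alias ip" pairs for the ZooKeeper entries in a hosts-format file.
# Assumption: ZooKeeper nodes are aliased zk1, zk2, ... (the pattern the
# sed expression in step 3 relies on). list_zk_hosts is a hypothetical name.
list_zk_hosts() {
    hosts_file="${1:-/etc/hosts}"
    # Strip comments, then print alias and IP for every zkN alias found.
    sed -e 's/#.*//' "$hosts_file" |
        awk '{ for (i = 2; i <= NF; i++) if ($i ~ /^zk[0-9]+$/) print $i, $1 }'
}
```

Run with no argument on a CVM, it reads "/etc/hosts"; passing a path lets you try it on a copy of the file first.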

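(3). Check that the ZooKeeper and host IP information matches the contents of "/etc/hosts"
Command:
zeus_config_printer 2>/dev/null | grep -B20 myid | egrep -i "myid|external_ip"
Pay attention here to whether each ID number and its IP match the "/etc/hosts" contents from the previous step.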
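Comparing the IDs and IPs by eye gets tedious, so the comparison could be scripted roughly as follows. Treat this as a sketch under several assumptions of mine: the filtered output is assumed to contain lines shaped like external_ip: "10.0.0.11" followed by a line ending in myid: 1, ID N is assumed to correspond to the alias zkN in "/etc/hosts", and check_zk_mapping is a made-up name. Verify the real output format on your own cluster before trusting it.

```shell
# Read filtered zeus_config_printer output on stdin and compare each
# myid/external_ip pair with the zkN entries of a hosts-format file.
# Assumptions (not confirmed against real output): each node shows a line
# like  external_ip: "10.0.0.11"  before its  myid: 1  line, and myid N
# corresponds to the alias zkN. check_zk_mapping is a hypothetical name.
check_zk_mapping() {
    hosts_file="${1:-/etc/hosts}"
    awk -v hosts="$hosts_file" '
        BEGIN {
            # Load alias -> IP for every zkN alias in the hosts file.
            while ((getline line < hosts) > 0) {
                sub(/#.*/, "", line)
                n = split(line, f)
                for (i = 2; i <= n; i++)
                    if (f[i] ~ /^zk[0-9]+$/) ip_of[f[i]] = f[1]
            }
        }
        /external_ip/ { gsub(/"/, "", $NF); ip = $NF }
        /myid/ {
            alias = "zk" $NF
            if (ip_of[alias] == ip) {
                print alias, ip, "OK"
            } else {
                print alias, ip, "MISMATCH (hosts: " ip_of[alias] ")"
                bad = 1
            }
        }
        END { exit bad + 0 }   # non-zero exit status if any pair differs
    '
}
```

It would be fed from the command in this step, for example by appending "| check_zk_mapping" to that pipeline; a non-zero exit status means at least one pair did not match.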

3. Check that the ZooKeeper service is healthy on every node in the Cluster
Command: for i in $(sed -ne "s/#.*//; s/zk. //p" /etc/hosts) ; do echo -n "$i: ZK " ; ssh $i "source /etc/profile ; zkServer.sh status" 2>&1 | grep -viE "nut|config|fips|jmx" ; done

Note: in a healthy cluster, no message like "ZK Error contacting service. It is probably not running" should appear in the output.
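If you would rather not read the output line by line, the result of the loop could be summarized with a filter like this one. It is a sketch, not an official tool: zk_status_summary is a name I made up, and it assumes each line starts with the node address printed by the loop and that a running ZooKeeper reports a "Mode:" line, as zkServer.sh status normally does.

```shell
# Summarize "host: ZK ..." lines produced by the status loop in step 3.
# Assumptions: field 1 is the node address followed by a colon, a healthy
# node prints "Mode: <role>", and a stopped one prints the
# "Error contacting service" message. zk_status_summary is a hypothetical name.
zk_status_summary() {
    awk '
        { host = $1; sub(/:$/, "", host) }
        /Error contacting service/ { print "FAIL", host; bad = 1; next }
        /Mode:/ { print "OK", host }
        END { exit bad + 0 }   # non-zero exit status if any node failed
    '
}
```

Piping the loop from this step into zk_status_summary prints one OK or FAIL line per node and returns a non-zero status when any node reports the error.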

Side note:
It is still a good idea to run "ncc health_checks run_all" on a Nutanix system periodically to see whether anything else is wrong.