The health status of my self-built cluster with 3 OSD nodes often sat in the WARN state: the pool replica count was set to 3, there were more than 3 OSDs, and only a small amount of data was stored, yet ceph -s did not report the expected HEALTH_OK; instead some PGs were stuck in active+undersized+degraded. This problem bothered me for quite a while, and because I did not know Ceph well I could not find a solution until I recently mailed the ceph-users list and got an answer [1].
Meaning of the PG states
The abnormal PG states are explained in [2]; the meanings of undersized and degraded are quoted here:
undersized
The placement group has fewer copies than the configured pool replication level.
degraded
Ceph has not replicated some objects in the placement group the correct number of times yet.
These two states usually appear together. Roughly speaking, some PGs do not have the configured number of replicas, and some objects inside those PGs are likewise under-replicated. Look at the PG details:
ceph health detail
HEALTH_WARN 2 pgs degraded; 2 pgs stuck degraded; 2 pgs stuck unclean; 2 pgs
stuck undersized; 2 pgs undersized
pg 17.58 is stuck unclean for 61033.947719, current state
active+undersized+degraded, last acting [2,0]
pg 17.16 is stuck unclean for 61033.948201, current state
active+undersized+degraded, last acting [0,2]
pg 17.58 is stuck undersized for 61033.343824, current state
active+undersized+degraded, last acting [2,0]
pg 17.16 is stuck undersized for 61033.327566, current state
active+undersized+degraded, last acting [0,2]
pg 17.58 is stuck degraded for 61033.343835, current state
active+undersized+degraded, last acting [2,0]
pg 17.16 is stuck degraded for 61033.327576, current state
active+undersized+degraded, last acting [0,2]
pg 17.16 is active+undersized+degraded, acting [0,2]
pg 17.58 is active+undersized+degraded, acting [2,0]
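To dig deeper into an individual stuck PG, the PG can also be queried directly. A minimal sketch, assuming a Jewel-era CLI and the PG id 17.58 reported above:
$ ceph pg dump_stuck undersized    # list all PGs stuck in the undersized state
$ ceph pg 17.58 query              # show this PG's up/acting OSD sets and recovery state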
Solution
Although the configured replica count is 3, PGs 17.58 and 17.16 have only two copies each, stored on OSD 0 and OSD 2 respectively.
The root cause is that the disks backing our OSDs are not of the same size, so each OSD gets a different weight, and Ceph does not handle such heterogeneous OSDs very well; as a result some PGs cannot meet the configured replica count.
The OSD tree:
ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 5.89049 root default
-2 1.81360 host ceph3
2 1.81360 osd.2 up 1.00000 1.00000
-3 0.44969 host ceph4
3 0.44969 osd.3 up 1.00000 1.00000
-4 3.62720 host ceph1
0 1.81360 osd.0 up 1.00000 1.00000
1 1.81360 osd.1 up 1.00000 1.00000
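Before rebuilding anything, it is worth confirming the per-OSD weights and utilization as well as the pool's replication settings. A small sketch; the pool name rbd is only a placeholder for whatever pool you created:
$ ceph osd df tree                 # weight, size and utilization of every OSD, grouped by host
$ ceph osd pool get rbd size       # configured replica count (3 in our case)
$ ceph osd pool get rbd min_size   # minimum number of replicas required to serve I/O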
The solution was to build another OSD whose capacity matches the other nodes'. Can the capacity deviate somewhat? My guess is that there is an acceptable range of deviation. The rebuilt OSD tree looks like this:
$ ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 7.25439 root default
-2 1.81360 host ceph3
2 1.81360 osd.2 up 1.00000 1.00000
-3 0 host ceph4
-4 3.62720 host ceph1
0 1.81360 osd.0 up 1.00000 1.00000
1 1.81360 osd.1 up 1.00000 1.00000
-5 1.81360 host ceph2
3 1.81360 osd.3 up 1.00000 1.00000
The OSD on node ceph4 was removed (its now-empty host bucket still shows up with weight 0), and another OSD node, ceph2, was added.
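For reference, here is a rough sketch of the usual Jewel-era steps for retiring the small osd.3 on ceph4 and creating an equally sized OSD on ceph2; the use of systemd and ceph-deploy and the device name /dev/sdb are assumptions that must be adapted to your environment:
$ ceph osd out 3                           # stop placing new data on the OSD
$ systemctl stop ceph-osd@3                # run on ceph4 (assumes a systemd-based install)
$ ceph osd crush remove osd.3              # remove it from the CRUSH map
$ ceph auth del osd.3                      # delete its authentication key
$ ceph osd rm 3                            # delete the OSD id from the cluster
$ ceph-deploy osd create ceph2:/dev/sdb    # /dev/sdb is a placeholder for the new, matching-size disk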
$ ceph -s
cluster 20ab1119-a072-4bdf-9402-9d0ce8c256f4
health HEALTH_OK
monmap e2: 2 mons at {ceph2=192.168.17.21:6789/0,ceph4=192.168.17.23:6789/0}
election epoch 26, quorum 0,1 ceph2,ceph4
osdmap e599: 4 osds: 4 up, 4 in
flags sortbitwise,require_jewel_osds
pgmap v155011: 100 pgs, 1 pools, 18628 bytes data, 1 objects
1129 MB used, 7427 GB / 7428 GB avail
100 active+clean
In addition, to meet the HA requirement the OSDs must be spread across different nodes: with a replica count of 3, three OSD hosts are needed to carry those OSDs. If the three OSDs sit on only two hosts, the "active+undersized+degraded" state can still appear.
The official documentation puts it this way:
This, combined with the default CRUSH failure domain, ensures that replicas or erasure code shards are separated across hosts and a single host failure will not affect availability.
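Whether this separation actually applies depends on the CRUSH rule the pool uses: the default replicated rule contains the step "chooseleaf firstn 0 type host", which is what forces each replica onto a different host. A quick way to check, again with rbd as a placeholder pool name (on Jewel the pool property is still called crush_ruleset):
$ ceph osd pool get rbd crush_ruleset    # which CRUSH rule the pool uses
$ ceph osd crush rule dump               # look for "step chooseleaf ... type host" in that rule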
If my understanding is wrong, I would appreciate being corrected.
[1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg47070.html
[2] http://docs.ceph.com/docs/master/rados/operations/pg-states/
Reposted from: https://blog.csdn.net/chenwei8280/article/details/80785595