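1. Introduction to PGs

Following the previous post, "Ceph Introduction and Architecture Overview", this post explains the states of a Ceph PG (placement group) in detail. The PG is one of the most complex and hardest-to-understand concepts in Ceph. Its complexity shows up in several ways:

Architecturally, the PG sits in the middle of the RADOS layer.
a. Upward, it receives and processes requests from clients.
b. Downward, it translates those requests into transactions that the local object store can understand.
The PG is the basic unit that makes up a storage pool; many pool features are implemented directly on top of PGs.
Because replication policies are defined across failure domains, a PG generally has to perform distributed writes across nodes, so data synchronization between nodes and data repair during recovery also depend on the PG.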

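2. PG State Table

In a healthy cluster, 100% of PGs are active + clean, which means every PG is accessible and all replicas of every PG are available.
Otherwise, Ceph reports warning or error states for the affected PGs. The PG state table: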

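State	Description
Activating	Peering is complete; the PG is waiting for all PG instances to synchronize and persist the Peering result (Info, Log, and so on)
Active	The PG can serve client read and write requests normally
Backfilling	Backfill in progress. Backfill is a special case of recovery: after Peering completes, if some PG instances in the Up Set cannot be synchronized incrementally from the current authoritative log (for example, the OSDs hosting them were offline too long, or a new OSD joined the cluster and whole PG instances are migrating), they are fully synchronized by copying all objects from the current Primary
Backfill-toofull	The OSD hosting a PG instance that needs Backfill does not have enough free space, so the Backfill is currently suspended
Backfill-wait	Waiting for Backfill resource reservation
Clean	The PG has no objects awaiting repair; the Acting Set and Up Set are identical, and their size equals the pool's replica count
Creating	The PG is being created
Deep	The PG is running, or is about to run, a deep scrub of object consistency
Degraded	After Peering completes, the PG found that at least one PG instance has inconsistent objects (needing synchronization or repair), or the current Acting Set is smaller than the pool's replica count
Down	During Peering, the PG found an Interval that cannot be skipped (for example, one during which the PG completed Peering and switched to Active, so it may have served client reads and writes), and the OSDs still online are not enough to complete data repair
Incomplete	Peering cannot complete normally because a. no authoritative log can be selected, or b. the Acting Set chosen by choose_acting is not sufficient to complete data repair
Inconsistent	A scrub or deep scrub found that replicas of an object in the PG disagree, for example the object sizes differ, or a replica of an object is missing after Recovery finished
Peered	Peering is complete, but the PG's current Acting Set is smaller than the pool's minimum replica count (min_size)
Peering	The PG is performing Peering, in which its OSDs synchronize their view of the PG
Recovering	The cluster is migrating or synchronizing objects and their replicas
Recovery-wait	Waiting for Recovery resource reservation
Remapped	The PG's acting set has changed and data is migrating from the old acting set to the new one. During migration the primary OSD of the old acting set keeps serving client requests; once migration completes, the primary OSD of the new acting set takes over
Repair	While scrubbing, the PG found inconsistent objects that can be repaired, and is repairing them automatically
Scrubbing	The PG is running, or is about to run, an object consistency scrub
Inactive	The PG cannot serve read or write requests
Unclean	The PG has not been able to recover from a previous failure
Stale	The PG's state has not been updated by any OSD. Either all OSDs storing this PG may be down, or the Mon has not received statistics from the Primary (for example, because of network jitter)
Undersized	The PG's current Acting Set is smaller than the pool's replica count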
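
Ceph reports a PG's state as several of the flags above joined with "+", and `ceph pg stat` prints a count per combination. The following is a minimal sketch of how such output could be split into flags and counts. The function names are hypothetical helpers written for this article, not part of any Ceph library, and the parser assumes the `ceph pg stat` layout shown in the transcripts later in this post.

```python
from __future__ import annotations


def parse_pg_state(state: str) -> frozenset[str]:
    """Split a composite state such as 'active+undersized+degraded' into flags."""
    return frozenset(flag for flag in state.strip().split("+") if flag)


def is_active_clean(state: str) -> bool:
    """True only when the flags are exactly 'active' and 'clean'."""
    return parse_pg_state(state) == {"active", "clean"}


def parse_pg_stat(line: str) -> dict[str, int]:
    """Return {composite_state: pg_count} from one line of `ceph pg stat` output.

    Assumed layout: '<total> pgs: <n> <state>, <n> <state>, ...; <usage ...>'.
    """
    summary = line.split(";", 1)[0]           # keep only the per-state counts
    _, _, per_state = summary.partition(":")  # drop the leading '<total> pgs'
    counts: dict[str, int] = {}
    for item in per_state.split(","):
        if not item.strip():
            continue
        count, state = item.split()
        counts[state] = int(count)
    return counts
```

For example, `parse_pg_stat("20 pgs: 20 active+undersized+degraded; 14512 kB data")` yields a single entry with 20 PGs, none of which passes `is_active_clean`.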

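3. State Details and Fault Simulation

3.1 Degraded

3.1.1 Description

Degraded: with a replica count of three, each PG has three replicas stored on different OSDs. Under normal conditions the PG is active+clean. If an OSD holding one of the replicas (osd.1 in the simulation below) goes down, the PG becomes degraded.

3.1.2 Fault Simulation

a. Stop osd.1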

$ systemctl stop ceph-osd@1

b. Check the PG status

$ bin/ceph pg stat
20 pgs: 20 active+undersized+degraded; 14512 kB data, 302 GB used, 6388 GB / 6691 GB avail; 12/36 objects degraded (33.333%)

c. Check the cluster health

$ bin/ceph health detail
HEALTH_WARN 1 osds down; Degraded data redundancy: 12/36 objects degraded (33.333%), 20 pgs unclean, 20 pgs degraded; application not enabled on 1 pool(s)
OSD_DOWN 1 osds down
    osd.1 (root=default,host=ceph-xx-cc00) is down
PG_DEGRADED Degraded data redundancy: 12/36 objects degraded (33.333%), 20 pgs unclean, 20 pgs degraded
    pg 1.0 is active+undersized+degraded, acting [0,2]
    pg 1.1 is active+undersized+degraded, acting [2,0]

d. Client I/O

# Write an object
$ bin/rados -p test_pool put myobject ceph.conf

# Read the object into a file
$ bin/rados -p test_pool get myobject ceph.conf.old

# List the files
$ ll ceph.conf*
-rw-r--r-- 1 root root 6211 Jun 25 14:01 ceph.conf
-rw-r--r-- 1 root root 6211 Jul  3 19:57 ceph.conf.old

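Fault summary:
To simulate the fault (size = 3, min_size = 2), we stopped osd.1 manually and then checked the PG status, which was active+undersized+degraded. When an OSD hosting a PG goes down, the PG enters the undersized+degraded state. The trailing [0,2] means that two replicas survive, on osd.0 and osd.2. Client reads and writes still work at this point.

3.1.3 Summary

Degraded means that after a failure such as an OSD going down, Ceph marks all PGs on that OSD as Degraded.
A degraded cluster can still read and write data normally; a degraded PG is a minor ailment, not a serious problem.
Undersized means the number of surviving PG replicas (2 here) is below the pool's replica count (3). The flag signals that there are too few surviving replicas; this is not a serious problem either.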
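
The "12/36 objects degraded (33.333%)" figure above can be reproduced with simple arithmetic. The sketch below assumes that the denominator counts object copies (objects times replica count) and that every PG lost the same number of replicas, which holds in this three-OSD test cluster. The function name is a hypothetical helper, not a Ceph API.

```python
from __future__ import annotations


def degraded_ratio(num_objects: int, pool_size: int, surviving_replicas: int) -> tuple[int, int, float]:
    """Return (missing_copies, total_copies, percent_degraded).

    Assumes every object lost the same number of copies, as happens when
    all PGs of a small pool share the same failed OSDs.
    """
    total_copies = num_objects * pool_size
    missing_copies = num_objects * (pool_size - surviving_replicas)
    percent = 100.0 * missing_copies / total_copies if total_copies else 0.0
    return missing_copies, total_copies, percent


# 12 objects, size = 3, one OSD stopped so 2 replicas survive
missing, total, percent = degraded_ratio(12, 3, 2)
print(f"{missing}/{total} objects degraded ({percent:.3f}%)")
```

With 12 objects and one of three replicas gone, the helper reports 12 missing copies out of 36, matching the health output.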

3.2 Peered

3.2.1 Description

Peering is complete, but the PG's current Acting Set is smaller than the pool's minimum replica count (min_size).

3.2.2 Fault Simulation

a. Stop two replicas, osd.1 and osd.0

$ systemctl stop ceph-osd@1
$ systemctl stop ceph-osd@0

b. Check the cluster health

$ bin/ceph health detail
HEALTH_WARN 1 osds down; Reduced data availability: 4 pgs inactive; Degraded data redundancy: 26/39 objects degraded (66.667%), 20 pgs unclean, 20 pgs degraded; application not enabled on 1 pool(s)
OSD_DOWN 1 osds down
    osd.0 (root=default,host=ceph-xx-cc00) is down
PG_AVAILABILITY Reduced data availability: 4 pgs inactive
    pg 1.6 is stuck inactive for 516.741081, current state undersized+degraded+peered, last acting [2]
    pg 1.10 is stuck inactive for 516.737888, current state undersized+degraded+peered, last acting [2]
    pg 1.11 is stuck inactive for 516.737408, current state undersized+degraded+peered, last acting [2]
    pg 1.12 is stuck inactive for 516.736955, current state undersized+degraded+peered, last acting [2]
PG_DEGRADED Degraded data redundancy: 26/39 objects degraded (66.667%), 20 pgs unclean, 20 pgs degraded
    pg 1.0 is undersized+degraded+peered, acting [2]
    pg 1.1 is undersized+degraded+peered, acting [2]

c. Client I/O (hangs)

# Reading the object into a file hangs
$ bin/rados -p test_pool get myobject  ceph.conf.old

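Fault summary:

The PG now survives only on osd.2, and it has gained one more state: peered. "To peer" literally means to look closely; here it is better understood as negotiating or searching for peers.
Reading the object at this point hangs indefinitely. The read cannot proceed because we set min_size=2: when fewer than two replicas are alive (one, in this case), the PG does not respond to external I/O requests.

d. Setting min_size=1 resolves the I/O hang

# Set min_size = 1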
$ bin/ceph osd pool set test_pool min_size 1
set pool 1 min_size to 1

e. Check the cluster health

$ bin/ceph health detail
HEALTH_WARN 1 osds down; Degraded data redundancy: 26/39 objects degraded (66.667%), 20 pgs unclean, 20 pgs degraded, 20 pgs undersized; application not enabled on 1 pool(s)
OSD_DOWN 1 osds down
    osd.0 (root=default,host=ceph-xx-cc00) is down
PG_DEGRADED Degraded data redundancy: 26/39 objects degraded (66.667%), 20 pgs unclean, 20 pgs degraded, 20 pgs undersized
    pg 1.0 is stuck undersized for 65.958983, current state active+undersized+degraded, last acting [2]
    pg 1.1 is stuck undersized for 65.960092, current state active+undersized+degraded, last acting [2]
    pg 1.2 is stuck undersized for 65.960974, current state active+undersized+degraded, last acting [2]

f. Client I/O

# Reads now complete; list the resulting files
$ ll -lh ceph.conf*
-rw-r--r-- 1 root root 6.1K Jun 25 14:01 ceph.conf
-rw-r--r-- 1 root root 6.1K Jul  3 20:11 ceph.conf.old
-rw-r--r-- 1 root root 6.1K Jul  3 20:11 ceph.conf.old.1

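Fault summary:

The Peered state is gone, and client I/O reads and writes work normally again.
With min_size=1, the PG responds to external I/O requests as long as one replica in the cluster is alive.

3.2.3 Summary

The Peered state can be understood as the PG waiting for other replicas to come online.
With min_size = 2, the Peered state clears only once at least two replicas are alive.
A PG in the Peered state cannot respond to external requests, and I/O is suspended.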
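
The relationship between the Acting Set size, the pool's size, and min_size in the two simulations so far can be summarized in a few lines. This is a simplified model of the outcomes observed above, not Ceph's actual state machine, and `pg_state` is a hypothetical name.

```python
def pg_state(acting_size: int, pool_size: int, min_size: int) -> str:
    """Model the state observed for a replicated PG with acting_size live replicas."""
    if acting_size < 1:
        raise ValueError("this sketch only models PGs with at least one live replica")
    if acting_size >= pool_size:
        return "active+clean"
    if acting_size >= min_size:
        # Fewer replicas than the pool wants, but enough to serve I/O.
        return "active+undersized+degraded"
    # Below min_size: peering finished, but client I/O is blocked.
    return "undersized+degraded+peered"


for acting, min_size in [(3, 2), (2, 2), (1, 2), (1, 1)]:
    print(acting, min_size, pg_state(acting, pool_size=3, min_size=min_size))
```

The last two cases mirror the experiment: one surviving replica is peered with min_size=2 and becomes active again once min_size is lowered to 1.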

3.3 Remapped

3.3.1 Description

Once Peering completes, the PG enters the Remapped state if its current Acting Set differs from its Up Set.

3.3.2 Fault Simulation

a. Stop osd.x

$ systemctl stop ceph-osd@x

b. After 5 minutes, start osd.x

$ systemctl start ceph-osd@x

c. Check the PG status

$ ceph pg stat
1416 pgs: 6 active+clean+remapped, 1288 active+clean, 3 stale+active+clean, 119 active+undersized+degraded; 74940 MB data, 250 GB used, 185 TB / 185 TB avail; 1292/48152 objects degraded (2.683%)
$ ceph pg dump | grep remapped
dumped all
13.cd         0                  0        0         0       0         0    2        2      active+clean+remapped 2018-07-03 20:26:14.478665       9453'2   20716:11343    [10,23]         10 [10,23,14]             10       9453'2 2018-07-03 20:26:14.478597          9453'2 2018-07-01 13:11:43.262605
3.1a         44                  0        0         0       0 373293056 1500     1500      active+clean+remapped 2018-07-03 20:25:47.885366  20272'79063  20716:109173     [9,23]          9  [9,23,12]              9  20272'79063 2018-07-03 03:14:23.960537     20272'79063 2018-07-03 03:14:23.960537
5.f           0                  0        0         0       0         0    0        0      active+clean+remapped 2018-07-03 20:25:47.888430          0'0   20716:15530     [23,8]         23  [23,8,22]             23          0'0 2018-07-03 06:44:05.232179             0'0 2018-06-30 22:27:16.778466
3.4a         45                  0        0         0       0 390070272 1500     1500      active+clean+remapped 2018-07-03 20:25:47.886669  20272'78385  20716:108086     [7,23]          7  [7,23,17]              7  20272'78385 2018-07-03 13:49:08.190133      7998'78363 2018-06-28 10:30:38.201993
13.102        0                  0        0         0       0         0    5        5      active+clean+remapped 2018-07-03 20:25:47.884983       9453'5   20716:11334     [1,23]          1  [1,23,14]              1       9453'5 2018-07-02 21:10:42.028288          9453'5 2018-07-02 21:10:42.028288
13.11d        1                  0        0         0       0   4194304 1539     1539      active+clean+remapped 2018-07-03 20:25:47.886535  20343'22439   20716:86294     [4,23]          4  [4,23,15]              4  20343'22439 2018-07-03 17:21:18.567771     20343'22439 2018-07-03 17:21:18.567771

# Query again 2 minutes later
$ ceph pg stat
1416 pgs: 2 active+undersized+degraded+remapped+backfilling, 10 active+undersized+degraded+remapped+backfill_wait, 1401 active+clean, 3 stale+active+clean; 74940 MB data, 247 GB used, 179 TB / 179 TB avail; 260/48152 objects degraded (0.540%); 49665 kB/s, 9 objects/s recovering
$ ceph pg dump | grep remapped
dumped all
13.1e8 2 0 2 0 0 8388608 1527 1527 active+undersized+degraded+remapped+backfill_wait 2018-07-03 20:30:13.999637 9493'38727 20754:165663 [18,33,10] 18 [18,10] 18 9493'38727 2018-07-03 19:53:43.462188 0'0 2018-06-28 20:09:36.303126

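d. Client I/O

# rados reads and writes work normally
$ rados -p test_pool put myobject /tmp/test.log

3.3.3 Summary

When an OSD goes down or the cluster is expanded, the CRUSH algorithm reassigns the OSDs that each PG maps to, and the PG is remapped to other OSDs.
In the Remapped state, the PG's current Acting Set differs from its Up Set.
Client I/O reads and writes work normally.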
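
To see the Up Set and Acting Set difference in the dump rows above, one can pull the two bracketed lists out of each row. The sketch assumes the column order of this particular `ceph pg dump` output, where the first bracketed list is the Up Set and the second is the Acting Set. Both function names are hypothetical helpers.

```python
from __future__ import annotations

import re


def up_and_acting(row: str) -> tuple[list[int], list[int]]:
    """Extract (up_set, acting_set) from one `ceph pg dump` row.

    Assumes the row contains exactly two bracketed OSD lists,
    with the Up Set first and the Acting Set second.
    """
    lists = re.findall(r"\[([\d,]*)\]", row)
    if len(lists) != 2:
        raise ValueError("expected two bracketed OSD lists in the row")
    up, acting = ([int(x) for x in item.split(",") if x] for item in lists)
    return up, acting


def is_remapped(up: list[int], acting: list[int]) -> bool:
    """A PG is remapped when its Acting Set differs from its Up Set."""
    return up != acting


row = "13.cd 0 0 0 0 0 0 2 2 active+clean+remapped 2018-07-03 20:26:14.478665 9453'2 20716:11343 [10,23] 10 [10,23,14] 10"
up, acting = up_and_acting(row)
print(up, acting, is_remapped(up, acting))
```

For PG 13.cd the Up Set is [10,23] while the Acting Set is still [10,23,14], which is why the row carries the remapped flag.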

3.4 Recovery

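3.4.1 Description

Recovery is the process in which a PG uses its PGLog to synchronize and repair objects whose data is inconsistent.

3.4.2 Fault Simulation

a. Stop osd.x

$ systemctl stop ceph-osd@x

b. After 1 minute, start osd.x

$ systemctl start ceph-osd@x

c. Check the cluster health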

$ ceph health detail
HEALTH_WARN Degraded data redundancy: 183/57960 objects degraded (0.316%), 17 pgs unclean, 17 pgs degraded
PG_DEGRADED Degraded data redundancy: 183/57960 objects degraded (0.316%), 17 pgs unclean, 17 pgs degraded
    pg 1.19 is active+recovery_wait+degraded, acting [29,9,17]

3.4.3 Summary

Recovery restores data from the recorded PGLog.
As long as the recorded PGLog stays within osd_max_pg_log_entries=10000 entries, data can be recovered incrementally from the PGLog.

3.5 Backfill

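3.5.1 Description

When a PG replica cannot be recovered from the PGLog, a full synchronization is required, which copies all objects from the current Primary.

3.5.2 Fault Simulation

a. Stop osd.x

$ systemctl stop ceph-osd@x

b. After 10 minutes, start osd.x

$ systemctl start ceph-osd@x

c. Check the cluster health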

$ ceph health detail
HEALTH_WARN Degraded data redundancy: 6/57927 objects degraded (0.010%), 1 pg unclean, 1 pg degraded
PG_DEGRADED Degraded data redundancy: 6/57927 objects degraded (0.010%), 1 pg unclean, 1 pg degraded
    pg 3.7f is active+undersized+degraded+remapped+backfilling, acting [21,29]

3.5.3 Summary

When data cannot be recovered from the recorded PGLog, the Backfill process performs a full data recovery.
Once a replica falls more than osd_max_pg_log_entries=10000 entries behind, a full recovery is needed.
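
The choice between Recovery and Backfill described in sections 3.4 and 3.5 can be sketched as a threshold check. This is a deliberate simplification: `entries_behind` is a hypothetical quantity standing for how many PGLog entries a replica missed, whereas real Ceph compares the replica's last update against the bounds of the authoritative log. Only the 10000 default comes from the text above.

```python
OSD_MAX_PG_LOG_ENTRIES = 10000  # value quoted in this article


def choose_sync_method(entries_behind: int, max_log_entries: int = OSD_MAX_PG_LOG_ENTRIES) -> str:
    """Pick how a lagging replica would be brought up to date in this model."""
    if entries_behind < 0:
        raise ValueError("entries_behind must not be negative")
    if entries_behind == 0:
        return "none"      # replica is already current
    if entries_behind <= max_log_entries:
        return "recovery"  # incremental replay from the PGLog
    return "backfill"      # log no longer covers the gap: full copy


for behind in (0, 183, 10000, 10001):
    print(behind, choose_sync_method(behind))
```

A short outage under light write load usually stays inside the log window and leads to Recovery, while a long outage or a heavy write load pushes the replica past it and triggers Backfill.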

3.6 Stale

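3.6.1 Description

A PG becomes Stale in any of the following cases:
The mon detects that the OSD hosting the PG's Primary is down.
The Primary fails to report PG information to the mon in time (for example, because of network congestion).
All three replicas of the PG are down.

3.6.2 Fault Simulation

a. Stop the three replica OSDs of the PG one at a time, starting with osd.23

$ systemctl stop ceph-osd@23

b. Then stop osd.24

$ systemctl stop ceph-osd@24

c. Check the state of PG 1.45 with two replicas stopped (undersized+degraded+peered)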

$ ceph health detail
HEALTH_WARN 2 osds down; Reduced data availability: 9 pgs inactive; Degraded data redundancy: 3041/47574 objects degraded (6.392%), 149 pgs unclean, 149 pgs degraded, 149 pgs undersized
OSD_DOWN 2 osds down
    osd.23 (root=default,host=ceph-xx-osd02) is down
    osd.24 (root=default,host=ceph-xx-osd03) is down
PG_AVAILABILITY Reduced data availability: 9 pgs inactive
    pg 1.45 is stuck inactive for 281.355588, current state undersized+degraded+peered, last acting [10]

d. Then stop the third replica of PG 1.45, osd.10

$ systemctl stop ceph-osd@10

e. Check the state of PG 1.45 with all three replicas stopped (stale+undersized+degraded+peered)

$ ceph health detail
HEALTH_WARN 3 osds down; Reduced data availability: 26 pgs inactive, 2 pgs stale; Degraded data redundancy: 4770/47574 objects degraded (10.026%), 222 pgs unclean, 222 pgs degraded, 222 pgs undersized
OSD_DOWN 3 osds down
    osd.10 (root=default,host=ceph-xx-osd01) is down
    osd.23 (root=default,host=ceph-xx-osd02) is down
    osd.24 (root=default,host=ceph-xx-osd03) is down
PG_AVAILABILITY Reduced data availability: 26 pgs inactive, 2 pgs stale
    pg 1.9 is stuck inactive for 171.200290, current state undersized+degraded+peered, last acting [13]
    pg 1.45 is stuck stale for 171.206909, current state stale+undersized+degraded+peered, last acting [10]
    pg 1.89 is stuck inactive for 435.573694, current state undersized+degraded+peered, last acting [32]
    pg 1.119 is stuck inactive for 435.574626, current state undersized+degraded+peered, last acting [28]

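f. Client I/O

# Reads and writes to the mounted disk hang
$ ll /mnt/

Fault summary:
With two replicas of the same PG stopped, the state is undersized+degraded+peered.
With all three replicas of the same PG stopped, the state is stale+undersized+degraded+peered.

3.6.3 Summary

When all three replicas of a PG are down, the PG enters the stale state.
The PG then cannot serve client reads or writes, and I/O hangs.
The stale state also appears when the Primary fails to report PG information to the mon in time (for example, because of network congestion).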
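
When many PGs are affected, it helps to extract the per-PG lines from `ceph health detail` and filter by flag. The sketch below is a hypothetical parser written against the two line shapes that appear in the transcripts in this post; other Ceph releases may format these lines differently.

```python
from __future__ import annotations

import re

# Matches both "pg 1.0 is active+undersized+degraded, acting [0,2]" and
# "pg 1.45 is stuck stale for 171.2, current state stale+..., last acting [10]".
PG_LINE = re.compile(
    r"\bpg (?P<pgid>\S+) is (?:stuck \w+ for [\d.]+, current state )?"
    r"(?P<state>[a-z_+]+), (?:last )?acting \[(?P<acting>[\d,]*)\]"
)


def parse_health_detail(text: str) -> list[dict]:
    """Return one dict per PG line with its id, state flags and acting OSDs."""
    pgs = []
    for line in text.splitlines():
        match = PG_LINE.search(line)
        if match:
            pgs.append({
                "pgid": match.group("pgid"),
                "flags": set(match.group("state").split("+")),
                "acting": [int(x) for x in match.group("acting").split(",") if x],
            })
    return pgs


def pgs_with_flag(text: str, flag: str) -> list[str]:
    """List the ids of PGs whose state contains the given flag."""
    return [pg["pgid"] for pg in parse_health_detail(text) if flag in pg["flags"]]
```

Calling `pgs_with_flag(output, "stale")` on the last health output above would single out pg 1.45, the only PG whose three replicas were all stopped.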

3.7 Inconsistent

3.7.1 Description

Through Scrub, the PG detected that one or more objects are inconsistent between PG instances.

3.7.2 Fault Simulation

a. Delete an object's head file from replica osd.34 of PG 3.0

$ rm -rf /var/lib/ceph/osd/ceph-34/current/3.0_head/DIR_0/1000000697c.0000122c__head_19785300__3

b. Manually scrub PG 3.0

$ ceph pg scrub 3.0
instructing pg 3.0 on osd.34 to scrub

c. Check the cluster health

$ ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 3.0 is active+clean+inconsistent, acting [34,23,1]

d. Repair PG 3.0

$ ceph pg repair 3.0
instructing pg 3.0 on osd.34 to repair

# Check the cluster health
$ ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent, 1 pg repair
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent, 1 pg repair
    pg 3.0 is active+clean+scrubbing+deep+inconsistent+repair, acting [34,23,1]

# The cluster health is back to normal
$ ceph health detail
HEALTH_OK

Fault summary:
When the three replicas of a PG hold inconsistent data, repairing the inconsistent files only requires running ceph pg repair; Ceph copies the missing file from the other replicas to repair the data.
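
The idea behind scrub, comparing each object across the PG's replicas and flagging the copies that disagree, can be illustrated with a toy model. Note the assumptions: the majority vote used here is a stand-in chosen for illustration, not the way Ceph selects its authoritative copy, and the digests and object names are made up.

```python
from __future__ import annotations

from collections import Counter


def find_inconsistent(replicas: dict[int, dict[str, str]]) -> dict[str, list[int]]:
    """Map object name -> OSDs whose copy is missing or differs from the majority.

    `replicas` maps an OSD id to {object_name: digest} for one PG.
    """
    all_objects = set().union(*(objs.keys() for objs in replicas.values()))
    bad: dict[str, list[int]] = {}
    for name in sorted(all_objects):
        # A missing object shows up as digest None for that OSD.
        digests = Counter(objs.get(name) for objs in replicas.values())
        majority, _ = digests.most_common(1)[0]
        outliers = sorted(osd for osd, objs in replicas.items() if objs.get(name) != majority)
        if outliers:
            bad[name] = outliers
    return bad


# osd.34 lost one object file, as in the simulation above
pg_3_0 = {
    34: {"obj_a": "d1"},
    23: {"obj_a": "d1", "obj_b": "d2"},
    1: {"obj_a": "d1", "obj_b": "d2"},
}
print(find_inconsistent(pg_3_0))
```

In this model the deleted file on osd.34 is reported as the single inconsistent copy, which corresponds to the "1 scrub errors" line in the health output.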

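3.7.3 Fault Simulation (Self-Healing)

When an OSD goes down briefly, writes still succeed because two replicas remain in the cluster, but the data on osd.34 is not updated. When osd.34 comes back online a little later, its data is stale, so the other OSDs recover data to osd.34 to bring it up to date. During this recovery the PG state goes from inconsistent -> recover -> clean and finally returns to normal.
This is one scenario of cluster self-healing.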

3.8 Down

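3.8.1 Description

During Peering, the PG found an Interval that cannot be skipped (for example, one during which the PG completed Peering and switched to Active, so it may have served client reads and writes), and the OSDs still online are not enough to complete data repair.

3.8.2 Fault Simulation

a. Check the replicas of PG 3.7f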

$ ceph pg dump | grep ^3.7f
dumped all
3.7f         43                  0        0         0       0 494927872 1569     1569               active+clean 2018-07-05 02:52:51.512598  21315'80115  21356:111666  [5,21,29]          5  [5,21,29]              5  21315'80115 2018-07-05 02:52:51.512568      6206'80083 2018-06-29 22:51:05.831219

b. Stop osd.21, a replica of PG 3.7f

$ systemctl stop ceph-osd@21

c. Check the state of PG 3.7f

$ ceph pg dump | grep ^3.7f
dumped all
3.7f         66                  0       89         0       0 591396864 1615     1615 active+undersized+degraded 2018-07-05 15:29:15.741318  21361'80161  21365:128307     [5,29]          5     [5,29]              5  21315'80115 2018-07-05 02:52:51.512568      6206'80083 2018-06-29 22:51:05.831219

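d. Have the client write data, making sure it lands on the remaining replicas [5,29] of PG 3.7f (the fio run below first lays out a 2 GB file, which produces the writes)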

$ fio -filename=/mnt/xxxsssss -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=4M -size=2G -numjobs=30 -runtime=200 -group_reporting -name=read-libaio
read-libaio: (g=0): rw=read, bs=4M-4M/4M-4M/4M-4M, ioengine=libaio, iodepth=1
...
fio-2.2.8
Starting 30 threads
read-libaio: Laying out IO file(s) (1 file(s) / 2048MB)
Jobs: 5 (f=5): [_(5),R(1),_(5),R(1),_(3),R(1),_(2),R(1),_(1),R(1),_(9)] [96.5% done] [1052MB/0KB/0KB /s] [263/0/0 iops] [eta 00m:02s]
read-libaio: (groupid=0, jobs=30): err= 0: pid=32966: Thu Jul  5 15:35:16 2018
  read : io=61440MB, bw=1112.2MB/s, iops=278, runt= 55203msec
    slat (msec): min=18, max=418, avg=103.77, stdev=46.19
    clat (usec): min=0, max=33, avg= 2.51, stdev= 1.45
     lat (msec): min=18, max=418, avg=103.77, stdev=46.19
    clat percentiles (usec):
     |  1.00th=[    1],  5.00th=[    1], 10.00th=[    1], 20.00th=[    2],
     | 30.00th=[    2], 40.00th=[    2], 50.00th=[    2], 60.00th=[    2],
     | 70.00th=[    3], 80.00th=[    3], 90.00th=[    4], 95.00th=[    5],
     | 99.00th=[    7], 99.50th=[    8], 99.90th=[   10], 99.95th=[   14],
     | 99.99th=[   32]
    bw (KB  /s): min=15058, max=185448, per=3.48%, avg=39647.57, stdev=12643.04
    lat (usec) : 2=19.59%, 4=64.52%, 10=15.78%, 20=0.08%, 50=0.03%
  cpu          : usr=0.01%, sys=0.37%, ctx=491792, majf=0, minf=15492
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=15360/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
   READ: io=61440MB, aggrb=1112.2MB/s, minb=1112.2MB/s, maxb=1112.2MB/s, mint=55203msec, maxt=55203msec

e. Stop osd.29, a replica of PG 3.7f, and check the state of PG 3.7f (undersized+degraded+peered)

# Stop replica osd.29 of this PG
$ systemctl stop ceph-osd@29

# PG 3.7f is now undersized+degraded+peered
$ ceph pg dump | grep ^3.7f
dumped all
3.7f         70                  0      140         0       0 608174080 1623     1623 undersized+degraded+peered 2018-07-05 15:35:51.629636  21365'80169  21367:132165        [5]          5        [5]              5  21315'80115 2018-07-05 02:52:51.512568      6206'80083 2018-06-29 22:51:05.831219

f. Stop osd.5, a replica of PG 3.7f, and check the state of PG 3.7f (stale+undersized+degraded+peered)

# Stop replica osd.5 of this PG
$ systemctl stop ceph-osd@5

# PG 3.7f is now stale+undersized+degraded+peered
$ ceph pg dump | grep ^3.7f
dumped all
3.7f         70                  0      140         0       0 608174080 1623     1623 stale+undersized+degraded+peered 2018-07-05 15:35:51.629636  21365'80169  21367:132165        [5]          5        [5]              5  21315'80115 2018-07-05 02:52:51.512568      6206'80083 2018-06-29 22:51:05.831219

g. Start osd.21, a replica of PG 3.7f (its data is now stale), and check the PG state (down)

# Start osd.21 of this PG
$ systemctl start ceph-osd@21

# The PG state is now down
$ ceph pg dump | grep ^3.7f
dumped all
3.7f         66                  0        0         0       0 591396864 1548     1548                          down 2018-07-05 15:36:38.365500  21361'80161  21370:111729       [21]         21       [21]             21  21315'80115 2018-07-05 02:52:51.512568      6206'80083 2018-06-29 22:51:05.831219

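h. Client I/O

# All client I/O now hangs
$ ll /mnt/

Fault summary:
PG 3.7f starts with three replicas, [5,21,29]. After osd.21 is stopped, data is written to osd.5 and osd.29. Then osd.29 and osd.5 are stopped, and finally osd.21 is started again. Because the data on osd.21 is stale, the PG goes down and client I/O hangs. The only fix is to bring the stopped OSDs back up.

3.8.3 When an OSD of a Down PG Is Lost or Cannot Be Started

Repair procedure (verified in production)

      a. Remove the OSD that cannot be started
      b. Create an OSD with the same ID
      c. The PG's Down state then clears
      d. For PGs with unfound objects, choose either delete or revert
         ceph pg {pg-id} mark_unfound_lost revert|delete

3.8.4 Conclusion

A typical scenario: A (primary), B, C

      a. First kill B
      b. Write new data to A and C
      c. Kill A and C
      d. Start B

A PG goes Down because the data on the OSD that is up is too old and the other online OSDs are not enough to complete data repair.
The PG then cannot serve client I/O, and I/O hangs.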
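
The A, B, C scenario can be replayed with a small model of past intervals. This is a sketch of the reasoning only, under the assumption that a replicated PG needs at least one currently-up OSD from every past interval that may have accepted writes. The data structure and function name are hypothetical and much simpler than what Ceph tracks during Peering.

```python
from __future__ import annotations


def is_down(past_intervals: list[tuple[set[int], bool]], up_osds: set[int]) -> bool:
    """Return True if some interval that may have served writes has no survivor up.

    Each interval is (acting_set, maybe_went_rw), listed oldest first.
    """
    for acting, maybe_went_rw in past_intervals:
        if maybe_went_rw and not (acting & up_osds):
            return True
    return False


# PG 3.7f from the simulation: A = osd.5, B = osd.21, C = osd.29
intervals = [
    ({5, 21, 29}, True),  # all replicas up, PG active
    ({5, 29}, True),      # osd.21 stopped, writes went to osd.5 and osd.29
    ({5}, False),         # below min_size: peered only, no writes accepted
]
print(is_down(intervals, up_osds={21}))     # only the stale replica is back
print(is_down(intervals, up_osds={21, 5}))  # osd.5 is started again
```

With only osd.21 running, no OSD from the second interval is available, so the model reports the PG as down; starting osd.5 or osd.29 again clears the condition, which matches the fault summary.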

3.9 Incomplete

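Peering cannot complete normally because a. no authoritative log can be selected, or b. the Acting Set chosen by choose_acting is not sufficient to complete data repair.
This commonly happens when servers are restarted repeatedly, or lose power, while the Ceph cluster is peering.

3.9.1 Summary

Repair approach (from "wanted: command to clear 'incomplete' PGs")
For example, if pg 1.1 is incomplete, first compare the object counts of pg 1.1 on the primary and the replicas, and export the PG copy that holds the most objects.
Then import it into the PG copies with fewer objects and mark them complete. Always export the PG as a backup first.

Simple approach (data may be lost)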

   a. stop the osd that is primary for the incomplete PG;
   b. run: ceph-objectstore-tool --data-path ... --journal-path ... --pgid $PGID --op mark-complete
   c. start the osd.

Preserving data integrity

# 1. Check the object counts of pg 1.1 on the primary and the replicas. Assuming the primary has more objects, run this on the node hosting the primary's OSD
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ --journal-path /var/lib/ceph/osd/ceph-0/journal --pgid 1.1 --op export --file /home/pg1.1

# 2. Then scp /home/pg1.1 to the node hosting the replica (with several replicas, repeat for each one) and run this on that node
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --journal-path /var/lib/ceph/osd/ceph-1/journal --pgid 1.1 --op import --file /home/pg1.1

# 3. Then mark complete
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1/ --journal-path /var/lib/ceph/osd/ceph-1/journal --pgid 1.1 --op mark-complete

# 4. Finally, start the OSD
$ systemctl start ceph-osd@x

Verification procedure

# 1. Mark the incomplete PG as complete. Before doing so, validate the procedure in a test environment and get familiar with ceph-objectstore-tool.
PS: Stop the OSD you are operating on before using ceph-objectstore-tool; otherwise the tool reports an error.

# 2. Query the details of pg 7.123 (this query runs online)
ceph pg 7.123 query > /export/pg-7.123-query.txt

# 3. Query each replica's OSD node
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-641/ --type bluestore --pgid 7.123 --op info > /export/pg-7.123-info-osd641.txt
For example:
pg 7.123 OSD 1 holds objects 1,2,3,4,5
pg 7.123 OSD 2 holds objects 1,2,3,6
pg 7.123 OSD 3 holds objects 1,2,3,7

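# 4. Query and compare the data
# 4.1 Export the PG's object list
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-641/ --type bluestore --pgid 7.123 --op list > /export/pg-7.123-object-list-osd-641.txt

# 4.2 Count the PG's objects
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-641/ --type bluestore --pgid 7.123 --op list|wc -l

# 4.3 Check whether the object lists of all replicas match
diff -u /export/pg-7.123-object-list-osd-1.txt /export/pg-7.123-object-list-osd-2.txt
For example, if pg 7.123 is incomplete, compare the object counts of all replicas of 7.123.
 - After the diff, check whether the object list of every replica (primary and all secondaries) matches, to avoid inconsistent data. Use the copy that has the most objects and that, per the diff, contains all objects.
 - If the diff shows that the counts differ and the largest copy does not contain all objects, consider importing without overwriting and then exporting again, so that the final import uses a complete set of all objects. Note: import requires removing the PG first, so it amounts to an overwriting import.
 - If the diff shows that the data matches, take the copy with the most objects, import it into the PG copies with fewer objects, and then mark complete on all replicas. Always export the PG on every replica's OSD node first, so the PG can be restored if something goes wrong.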
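
The comparison in step 4.3 can also be done on sets instead of pairwise diffs. The sketch below assumes the object lists exported in step 4.1 have already been loaded into one set of object names per OSD. The labels `osd.1` to `osd.3` and the function name are hypothetical, and the input must contain at least one replica.

```python
from __future__ import annotations


def compare_replicas(objects_by_osd: dict[str, set[str]]) -> dict:
    """Summarize how the object lists of a PG's replicas relate to each other."""
    union = set().union(*objects_by_osd.values())
    # The replica holding the most objects is the export candidate.
    largest = max(objects_by_osd, key=lambda osd: len(objects_by_osd[osd]))
    missing_from_largest = union - objects_by_osd[largest]
    return {
        "largest": largest,
        "consistent": all(objs == union for objs in objects_by_osd.values()),
        "largest_is_superset": not missing_from_largest,
        "missing_from_largest": sorted(missing_from_largest),
    }


# The three-replica example from step 3
report = compare_replicas({
    "osd.1": {"1", "2", "3", "4", "5"},
    "osd.2": {"1", "2", "3", "6"},
    "osd.3": {"1", "2", "3", "7"},
})
print(report)
```

Here the largest copy does not contain objects 6 and 7, so this is the second case in the list above: exporting the largest copy alone and importing it everywhere would drop data.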

# 5. Export a backup
Check the object counts of all replicas of pg 7.123. Following the cases above, assume osd-641 has the most objects and the diff shows matching data; run the export on the OSD node of the replica with the most objects and a matching object list (ideally export a backup from every replica, even one with 0 objects)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-641/ --type bluestore --pgid 7.123 --op export --file /export/pg1.414-osd-1.obj

# 6. Import the backup
Then scp /export/pg1.414-osd-1.obj to the replica nodes and run the import on the OSD nodes of the replicas with fewer objects (ideally export a backup from every replica first, even one with 0 objects)
Importing the PG data into the current PG requires removing the current PG first (export a backup of the PG before the remove); otherwise the import fails because the PG already exists.
# The remove needs --force to take effect
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-57/ --type bluestore --pgid 7.123 --op remove --force
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-57/ --type bluestore --pgid 7.123 --op import --file /export/pg1.414-osd-1.obj

# 7. Mark the PG complete (run on the primary and all secondary replicas)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-57/ --type bluestore --pgid 7.123 --op mark-complete

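About the Author

Author: 李航 (Li Hang)
Bio: Many years of low-level development experience, with extensive experience in high-performance nginx development and the redis cluster distributed cache; has worked on Ceph for about two years.
Previously worked at 58.com, Autohome, and Youku Tudou Group.
Currently a technical expert in the DiDi Infrastructure Platform Operations department, responsible for development and operations of distributed Ceph clusters.
Main technical interests: high-performance Nginx development, distributed caching, and distributed storage.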
