Failover
Building on the sentinel network constructed in the previous chapter, this chapter analyzes sentinel failover. Sentinel is redis's high-availability solution for its distributed storage, and failover is the core problem any high-availability solution exists to solve. As before, before analyzing sentinel's failover scheme, consider three questions:
- How is the occurrence of a failure confirmed?
- Once a failure occurs, who performs the failover?
- How is the failover carried out?
I would argue these are three problems every failover scheme must solve. In redis, the party that faces them is the sentinel node, while master and slave are the objects sentinel operates on; sentinel therefore plays the role of supervisor. In practice a group of sentinels jointly monitors the master node, which gives the sentinel layer a degree of fault tolerance: if one sentinel node fails, the sentinel architecture can continue to serve.
Confirming the failure
Subjective down
From the network-construction code in the previous chapter we know that, driven by the periodic time event, sentinel sends the info command to each master and slave every 10s and the ping command at least once per second. The ping command serves as the probe of the master node.
- When sentinel sends the ping command to a master, any reply other than +PONG, -LOADING or -MASTERDOWN counts as invalid. If sentinel keeps receiving invalid replies for the configured down-after-milliseconds window, it sets the SRI_S_DOWN flag in the flags field of the corresponding sentinelRedisInstance, considering the master subjectively down.
- flags is an int, so there is ample room for bit flags; the ones defined are:
#define SRI_MASTER (1<<0)
#define SRI_SLAVE (1<<1)
#define SRI_SENTINEL (1<<2)
#define SRI_S_DOWN (1<<3) /* Subjectively down (no quorum). */
#define SRI_O_DOWN (1<<4) /* Objectively down (confirmed by others). */
#define SRI_MASTER_DOWN (1<<5) /* A Sentinel with this flag set thinks that
its master is down. */
#define SRI_FAILOVER_IN_PROGRESS (1<<6) /* Failover is in progress for
this master. */
#define SRI_PROMOTED (1<<7) /* Slave selected for promotion. */
#define SRI_RECONF_SENT (1<<8) /* SLAVEOF <newmaster> sent. */
#define SRI_RECONF_INPROG (1<<9) /* Slave synchronization in progress. */
#define SRI_RECONF_DONE (1<<10) /* Slave synchronized with new master. */
#define SRI_FORCE_FAILOVER (1<<11) /* Force failover with master up. */
#define SRI_SCRIPT_KILL_SENT (1<<12) /* SCRIPT KILL already sent on -BUSY */
For example, the low 16 bits with the 13 defined flags all set:
15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Bits 0-12 correspond, from low to high, to the flags listed above. This encoding not only saves space but also represents several states at once: a bit's 0 or 1 records whether the instance is in that state, and a simple bitwise OR sets one flag without affecting the others.
- down-after-milliseconds can be set in the sentinel.conf configuration file or via a command from a client connected to the sentinel. The setting is scoped per master: a sentinel may configure a different down-after-milliseconds for each master it monitors, and that master's slaves and sentinels inherit the value, which is then used to decide whether those nodes are down.
- The ping probing scheme actually applies not only to the master but to slave and sentinel instances as well; the master is just used as the example here.
- Since the subjective-down decision depends on the down-after-milliseconds value, and that value is configurable, sentinels monitoring the same master may use different subjective-down times.
In sentinel, every periodic cycle walks the monitored instances and checks whether each is subjectively down; this periodic event was covered in the previous chapter.
sentinelHandleRedisInstance
/* Perform scheduled operations for the specified Redis instance. */
void sentinelHandleRedisInstance(sentinelRedisInstance *ri) {
/* ========== MONITORING HALF ============ */
/* Every kind of instance */
sentinelReconnectInstance(ri);
sentinelSendPeriodicCommands(ri);
/* ============== ACTING HALF ============= */
/* We don't proceed with the acting half if we are in TILT mode.
* TILT happens when we find something odd with the time, like a
* sudden change in the clock. */
if (sentinel.tilt) {
if (mstime()-sentinel.tilt_start_time < SENTINEL_TILT_PERIOD) return;
sentinel.tilt = 0;
sentinelEvent(LL_WARNING,"-tilt",NULL,"#tilt mode exited");
}
/* Every kind of instance */
sentinelCheckSubjectivelyDown(ri);
/* Masters and slaves */
if (ri->flags & (SRI_MASTER|SRI_SLAVE)) {
/* Nothing so far. */
}
/* Only masters */
if (ri->flags & SRI_MASTER) {
sentinelCheckObjectivelyDown(ri);
if (sentinelStartFailoverIfNeeded(ri))
sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_ASK_FORCED);
sentinelFailoverStateMachine(ri);
sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_NO_FLAGS);
}
}
sentinelCheckSubjectivelyDown
/* Is this instance down from our point of view? */
void sentinelCheckSubjectivelyDown(sentinelRedisInstance *ri) {
mstime_t elapsed = 0;
if (ri->link->act_ping_time)
elapsed = mstime() - ri->link->act_ping_time;
else if (ri->link->disconnected)
elapsed = mstime() - ri->link->last_avail_time;
/* Check if we are in need for a reconnection of one of the
* links, because we are detecting low activity.
*
* 1) Check if the command link seems connected, was connected not less
* than SENTINEL_MIN_LINK_RECONNECT_PERIOD, but still we have a
* pending ping for more than half the timeout. */
if (ri->link->cc &&
(mstime() - ri->link->cc_conn_time) >
SENTINEL_MIN_LINK_RECONNECT_PERIOD &&
ri->link->act_ping_time != 0 && /* Ther is a pending ping... */
/* The pending ping is delayed, and we did not received
* error replies as well. */
(mstime() - ri->link->act_ping_time) > (ri->down_after_period/2) &&
(mstime() - ri->link->last_pong_time) > (ri->down_after_period/2))
{
instanceLinkCloseConnection(ri->link,ri->link->cc);
}
/* 2) Check if the pubsub link seems connected, was connected not less
* than SENTINEL_MIN_LINK_RECONNECT_PERIOD, but still we have no
* activity in the Pub/Sub channel for more than
* SENTINEL_PUBLISH_PERIOD * 3.
*/
if (ri->link->pc &&
(mstime() - ri->link->pc_conn_time) >
SENTINEL_MIN_LINK_RECONNECT_PERIOD &&
(mstime() - ri->link->pc_last_activity) > (SENTINEL_PUBLISH_PERIOD*3))
{
instanceLinkCloseConnection(ri->link,ri->link->pc);
}
/* Update the SDOWN flag. We believe the instance is SDOWN if:
*
* 1) It is not replying.
* 2) We believe it is a master, it reports to be a slave for enough time
* to meet the down_after_period, plus enough time to get two times
* INFO report from the instance. */
if (elapsed > ri->down_after_period ||
(ri->flags & SRI_MASTER &&
ri->role_reported == SRI_SLAVE && mstime() - ri->role_reported_time >
(ri->down_after_period+SENTINEL_INFO_PERIOD*2)))
{
/* Is subjectively down */
if ((ri->flags & SRI_S_DOWN) == 0) {
sentinelEvent(LL_WARNING,"+sdown",ri,"%@");
ri->s_down_since_time = mstime();
ri->flags |= SRI_S_DOWN;
}
} else {
/* Is subjectively up */
if (ri->flags & SRI_S_DOWN) {
sentinelEvent(LL_WARNING,"-sdown",ri,"%@");
ri->flags &= ~(SRI_S_DOWN|SRI_SCRIPT_KILL_SENT);
}
}
}
- Check whether the command link needs to be closed.
- Check whether the pubsub link needs to be closed.
- Update the SDOWN flag. The rules: first, the instance has not replied within the configured window (30s by default); second, an instance we believe to be a master has been reporting itself as a slave for longer than down_after_period+SENTINEL_INFO_PERIOD*2 (i.e. 30s + 20s), which also counts as subjectively down, since it clearly is no longer acting as the master.
Objective down
When a sentinel detects that a master node has gone offline and has set SRI_S_DOWN in its own view of the instance, it cannot act alone: it is part of a sentinel cluster, and each node may judge the master down on a different schedule, so it must ask the other sentinels whether the monitored master is down. This raises three questions:
- How do we broadcast a command asking the other sentinels for their down-detection result for a given master?
- How do we tally the results of that probe?
- How do we make every node's view of the master's state converge?
These three questions expose two problems common to distributed systems:
- broadcasting commands;
- reaching consensus, and ultimately keeping state consistent.
Before looking at sentinel's solution, it is worth pausing to sketch our own answers to these problems and asking whether we could do better than sentinel does.
Broadcasting the query for the master's down state
The entry point is the sentinelHandleRedisInstance code shown above:
...
/* Only masters */
if (ri->flags & SRI_MASTER) {
sentinelCheckObjectivelyDown(ri);
if (sentinelStartFailoverIfNeeded(ri))
sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_ASK_FORCED);
sentinelFailoverStateMachine(ri);
sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_NO_FLAGS);
}
In this code, sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_NO_FLAGS); is what makes a sentinel, after detecting that a master it monitors is subjectively down, go and ask the other sentinels.
/* If we think the master is down, we start sending
* SENTINEL IS-MASTER-DOWN-BY-ADDR requests to other sentinels
* in order to get the replies that allow to reach the quorum
* needed to mark the master in ODOWN state and trigger a failover. */
#define SENTINEL_ASK_FORCED (1<<0)
void sentinelAskMasterStateToOtherSentinels(sentinelRedisInstance *master, int flags) {
dictIterator *di;
dictEntry *de;
di = dictGetIterator(master->sentinels);
while((de = dictNext(di)) != NULL) {
sentinelRedisInstance *ri = dictGetVal(de);
mstime_t elapsed = mstime() - ri->last_master_down_reply_time;
char port[32];
int retval;
/* If the master state from other sentinel is too old, we clear it. */
if (elapsed > SENTINEL_ASK_PERIOD*5) {
ri->flags &= ~SRI_MASTER_DOWN;
sdsfree(ri->leader);
ri->leader = NULL;
}
/* Only ask if master is down to other sentinels if:
*
* 1) We believe it is down, or there is a failover in progress.
* 2) Sentinel is connected.
* 3) We did not received the info within SENTINEL_ASK_PERIOD ms. */
if ((master->flags & SRI_S_DOWN) == 0) continue;
if (ri->link->disconnected) continue;
if (!(flags & SENTINEL_ASK_FORCED) &&
mstime() - ri->last_master_down_reply_time < SENTINEL_ASK_PERIOD)
continue;
/* Ask */
ll2string(port,sizeof(port),master->addr->port);
retval = redisAsyncCommand(ri->link->cc,
sentinelReceiveIsMasterDownReply, ri,
"SENTINEL is-master-down-by-addr %s %s %llu %s",
master->addr->ip, port,
sentinel.current_epoch,
(master->failover_state > SENTINEL_FAILOVER_STATE_NONE) ?
sentinel.myid : "*");
if (retval == C_OK) ri->link->pending_commands++;
}
dictReleaseIterator(di);
}
Iterating over its own sentinels dict, the sentinel sends each peer the SENTINEL is-master-down-by-addr command, in the format SENTINEL is-master-down-by-addr <master-ip> <master-port> <current_epoch> <leader_id>. On the first round of objective-down querying, the leader_id parameter defaults to *. Next, the reply handler for this command, sentinelReceiveIsMasterDownReply:
/* Receive the SENTINEL is-master-down-by-addr reply, see the
* sentinelAskMasterStateToOtherSentinels() function for more information. */
void sentinelReceiveIsMasterDownReply(redisAsyncContext *c, void *reply, void *privdata) {
sentinelRedisInstance *ri = privdata;
instanceLink *link = c->data;
redisReply *r;
if (!reply || !link) return;
link->pending_commands--;
r = reply;
/* Ignore every error or unexpected reply.
* Note that if the command returns an error for any reason we'll
* end clearing the SRI_MASTER_DOWN flag for timeout anyway. */
if (r->type == REDIS_REPLY_ARRAY && r->elements == 3 &&
r->element[0]->type == REDIS_REPLY_INTEGER &&
r->element[1]->type == REDIS_REPLY_STRING &&
r->element[2]->type == REDIS_REPLY_INTEGER)
{
ri->last_master_down_reply_time = mstime();
if (r->element[0]->integer == 1) {
ri->flags |= SRI_MASTER_DOWN;
} else {
ri->flags &= ~SRI_MASTER_DOWN;
}
if (strcmp(r->element[1]->str,"*")) {
/* If the runid in the reply is not "*" the Sentinel actually
* replied with a vote. */
sdsfree(ri->leader);
if ((long long)ri->leader_epoch != r->element[2]->integer)
serverLog(LL_WARNING,
"%s voted for %s %llu", ri->name,
r->element[1]->str,
(unsigned long long) r->element[2]->integer);
ri->leader = sdsnew(r->element[1]->str);
ri->leader_epoch = r->element[2]->integer;
}
}
}
The command returns three values, for example:
127.0.0.1:26380> sentinel is-master-down-by-addr 127.0.0.1 6379 0 *
1) (integer) 0
2) "*"
3) (integer) 0
- <down_state>: the master's down state; 0 means up, 1 means down. When the reply reports down, the corresponding SRI_MASTER_DOWN flag bit (bit 5) in flags is set to 1.
- <leader_runid>: the runid of the leader sentinel. On the first objective-down check it is *, because the command was sent with <leader_id> set to *.
- <leader_epoch>: the current voting epoch; when the runid is *, this value is always 0.
The same command is also used during leader election, described shortly. After the other sentinels have been asked about the master's state, the objective-down check runs in the next cycle. But before that, one more piece of logic is worth analyzing: how a sentinel handles the sentinel is-master-down-by-addr command it receives. Recall the command table loaded during initialization in the previous chapter; in sentinelCommand, the function registered for the sentinel command, the relevant part reads:
...
else if (!strcasecmp(c->argv[1]->ptr,"is-master-down-by-addr")) {
/* SENTINEL IS-MASTER-DOWN-BY-ADDR <ip> <port> <current-epoch> <runid>
*
* Arguments:
*
* ip and port are the ip and port of the master we want to be
* checked by Sentinel. Note that the command will not check by
* name but just by master, in theory different Sentinels may monitor
* differnet masters with the same name.
*
* current-epoch is needed in order to understand if we are allowed
* to vote for a failover leader or not. Each Sentinel can vote just
* one time per epoch.
*
* runid is "*" if we are not seeking for a vote from the Sentinel
* in order to elect the failover leader. Otherwise it is set to the
* runid we want the Sentinel to vote if it did not already voted.
*/
sentinelRedisInstance *ri;
long long req_epoch;
uint64_t leader_epoch = 0;
char *leader = NULL;
long port;
int isdown = 0;
if (c->argc != 6) goto numargserr;
if (getLongFromObjectOrReply(c,c->argv[3],&port,NULL) != C_OK ||
getLongLongFromObjectOrReply(c,c->argv[4],&req_epoch,NULL)
!= C_OK)
return;
ri = getSentinelRedisInstanceByAddrAndRunID(sentinel.masters,
c->argv[2]->ptr,port,NULL);
/* It exists? Is actually a master? Is subjectively down? It's down.
* Note: if we are in tilt mode we always reply with "0". */
if (!sentinel.tilt && ri && (ri->flags & SRI_S_DOWN) &&
(ri->flags & SRI_MASTER))
isdown = 1;
/* Vote for the master (or fetch the previous vote) if the request
* includes a runid, otherwise the sender is not seeking for a vote. */
if (ri && ri->flags & SRI_MASTER && strcasecmp(c->argv[5]->ptr,"*")) {
leader = sentinelVoteLeader(ri,(uint64_t)req_epoch,
c->argv[5]->ptr,
&leader_epoch);
}
...
As this code shows, when a sentinel receives another sentinel's query about a master being down, it simply reads the flags state stored in the corresponding master instance; it does not trigger a fresh probe or any other operation.
Checking the master's objective-down state
Only master nodes are subject to the objective-down judgment; the code is as follows:
sentinelCheckObjectivelyDown
/* Is this instance down according to the configured quorum?
*
* Note that ODOWN is a weak quorum, it only means that enough Sentinels
* reported in a given time range that the instance was not reachable.
* However messages can be delayed so there are no strong guarantees about
* N instances agreeing at the same time about the down state. */
void sentinelCheckObjectivelyDown(sentinelRedisInstance *master) {
dictIterator *di;
dictEntry *de;
unsigned int quorum = 0, odown = 0;
if (master->flags & SRI_S_DOWN) {
/* Is down for enough sentinels? */
quorum = 1; /* the current sentinel. */
/* Count all the other sentinels. */
di = dictGetIterator(master->sentinels);
while((de = dictNext(di)) != NULL) {
sentinelRedisInstance *ri = dictGetVal(de);
if (ri->flags & SRI_MASTER_DOWN) quorum++;
}
dictReleaseIterator(di);
if (quorum >= master->quorum) odown = 1;
}
/* Set the flag accordingly to the outcome. */
if (odown) {
if ((master->flags & SRI_O_DOWN) == 0) {
sentinelEvent(LL_WARNING,"+odown",master,"%@ #quorum %d/%d",
quorum, master->quorum);
master->flags |= SRI_O_DOWN;
master->o_down_since_time = mstime(); }
} else {
if (master->flags & SRI_O_DOWN) {
sentinelEvent(LL_WARNING,"-odown",master,"%@");
master->flags &= ~SRI_O_DOWN;
}
}
}
Once the query replies have been processed and recorded in the corresponding sentinels structures, the sentinel can check whether the number of peers judging the master down has reached the quorum value configured in sentinel.conf. If it has, the master is considered objectively down: the SRI_O_DOWN flag is set to 1 and the leader election formally begins.
Question: on the way from subjective down to objective down, neither the command broadcast nor the objective-down judgment uses any special consensus algorithm, least of all for the objective-down state itself. Will sentinel's leader election therefore only begin once every sentinel node has reached the same view?
Answer: judging from the code, sentinel makes almost no deliberate effort to synchronize the master's objective state across the sentinel cluster. The master becomes objectively down only once most nodes have, each through its own periodic events, judged the master subjectively down; objective down is a consensus reached gradually through continuous command exchange. While that consensus forms, from one node's subjective down to cluster-wide objective down, sentinel is-master-down-by-addr commands flood the sentinel network. As soon as some sentinel's objective condition is satisfied, the election of a failover leader begins. Each node's objective condition (the quorum) is configured by hand, with no algorithm to adjust it automatically; this is arguably something to learn from blockchain systems, since in my view this trigger condition is an important factor in tuning how quickly the cluster reaches consensus. Finally, note that viewed cluster-wide, when leader election begins some sentinel nodes may not yet have noticed that the master is down at all.
Electing the leader node
Once some node in the sentinel cluster has recognized the master as objectively down, it initiates the leader-election vote. Again the sentinelHandleRedisInstance code from above:
...
/* Only masters */
if (ri->flags & SRI_MASTER) {
sentinelCheckObjectivelyDown(ri);
if (sentinelStartFailoverIfNeeded(ri))
sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_ASK_FORCED);
sentinelFailoverStateMachine(ri);
sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_NO_FLAGS);
}
This time the focus is the sentinelStartFailoverIfNeeded method. As the name suggests, it decides whether a failover should start; if so, the leader election begins. The code:
/* This function checks if there are the conditions to start the failover,
* that is:
*
* 1) Master must be in ODOWN condition.
* 2) No failover already in progress.
* 3) No failover already attempted recently.
*
* We still don't know if we'll win the election so it is possible that we
* start the failover but that we'll not be able to act.
*
* Return non-zero if a failover was started. */
int sentinelStartFailoverIfNeeded(sentinelRedisInstance *master) {
/* We can't failover if the master is not in O_DOWN state. */
if (!(master->flags & SRI_O_DOWN)) return 0;
/* Failover already in progress? */
if (master->flags & SRI_FAILOVER_IN_PROGRESS) return 0;
/* Last failover attempt started too little time ago? */
if (mstime() - master->failover_start_time <
master->failover_timeout*2)
{
if (master->failover_delay_logged != master->failover_start_time) {
time_t clock = (master->failover_start_time +
master->failover_timeout*2) / 1000;
char ctimebuf[26];
ctime_r(&clock,ctimebuf);
ctimebuf[24] = '\0'; /* Remove newline. */
master->failover_delay_logged = master->failover_start_time;
serverLog(LL_WARNING,
"Next failover delay: I will not start a failover before %s",
ctimebuf);
}
return 0;
}
sentinelStartFailover(master);
return 1;
}
- The master must be objectively down.
- No failover for this master may already be in progress.
- The last failover attempt must not have started less than 2× the failover timeout ago (the default timeout is 3 minutes); in other words, after a failover times out, at least six minutes pass before the next round can start.
- If all three conditions hold, sentinelStartFailover is executed.
Before reading the sentinelStartFailover method, two more questions that the code analysis below should answer:
- During a failover, when is the in-progress state set? Is it a state synchronized across the sentinel cluster, or internal to the single elected sentinel node?
- What exactly does the failover timeout cover, what situations trigger it, and does a timeout cause the leader to be re-elected?
With those questions in mind, the sentinelStartFailover code:
/* Setup the master state to start a failover. */
void sentinelStartFailover(sentinelRedisInstance *master) {
serverAssert(master->flags & SRI_MASTER);
master->failover_state = SENTINEL_FAILOVER_STATE_WAIT_START;
master->flags |= SRI_FAILOVER_IN_PROGRESS;
master->failover_epoch = ++sentinel.current_epoch;
sentinelEvent(LL_WARNING,"+new-epoch",master,"%llu",
(unsigned long long) sentinel.current_epoch);
sentinelEvent(LL_WARNING,"+try-failover",master,"%@");
master->failover_start_time = mstime()+rand()%SENTINEL_MAX_DESYNC;
master->failover_state_change_time = mstime();
}
This mainly initializes the configuration for the new failover epoch:
- Set the state machine to SENTINEL_FAILOVER_STATE_WAIT_START.
- Set SRI_FAILOVER_IN_PROGRESS in flags to indicate a failover is underway, so the periodic event does not start another one.
- Bump the failover epoch.
- Set the failover start time, plus a random offset under SENTINEL_MAX_DESYNC (1000 ms), which staggers the sentinels so they are less likely to start competing elections at exactly the same moment.
- Set the state-machine change time.
This method answers the first question: the SRI_FAILOVER_IN_PROGRESS flag marks that the current sentinel node is performing a failover. Then control returns to sentinelAskMasterStateToOtherSentinels, the method that asks the other sentinels about the master's state, this time with flags=SENTINEL_ASK_FORCED: the sentinel again sends the SENTINEL is-master-down-by-addr command to the other nodes, except that now the <runid> parameter is no longer * but the current sentinel's runid, asking the other nodes to elect it leader.
Rules for electing the leader node
The SENTINEL is-master-down-by-addr command is handled just as in the objective-down case, except that sentinelVoteLeader is additionally called:
...
else if (!strcasecmp(c->argv[1]->ptr,"is-master-down-by-addr")) {
...
/* Vote for the master (or fetch the previous vote) if the request
* includes a runid, otherwise the sender is not seeking for a vote. */
if (ri && ri->flags & SRI_MASTER && strcasecmp(c->argv[5]->ptr,"*")) {
leader = sentinelVoteLeader(ri,(uint64_t)req_epoch,
c->argv[5]->ptr,
&leader_epoch);
}
...
sentinelVoteLeader
/* Vote for the sentinel with 'req_runid' or return the old vote if already
* voted for the specifed 'req_epoch' or one greater.
*
* If a vote is not available returns NULL, otherwise return the Sentinel
* runid and populate the leader_epoch with the epoch of the vote. */
char *sentinelVoteLeader(sentinelRedisInstance *master, uint64_t req_epoch, char *req_runid, uint64_t *leader_epoch) {
if (req_epoch > sentinel.current_epoch) {
sentinel.current_epoch = req_epoch;
sentinelFlushConfig();
sentinelEvent(LL_WARNING,"+new-epoch",master,"%llu",
(unsigned long long) sentinel.current_epoch);
}
if (master->leader_epoch < req_epoch && sentinel.current_epoch <= req_epoch)
{
sdsfree(master->leader);
master->leader = sdsnew(req_runid);
master->leader_epoch = sentinel.current_epoch;
sentinelFlushConfig();
sentinelEvent(LL_WARNING,"+vote-for-leader",master,"%s %llu",
master->leader, (unsigned long long) master->leader_epoch);
/* If we did not voted for ourselves, set the master failover start
* time to now, in order to force a delay before we can start a
* failover for the same master. */
if (strcasecmp(master->leader,sentinel.myid))
master->failover_start_time = mstime()+rand()%SENTINEL_MAX_DESYNC;
}
*leader_epoch = master->leader_epoch;
return master->leader ? sdsnew(master->leader) : NULL;
}
- Synchronize the voting epoch.
- If no leader is recorded for this master in the epoch, record the runid carried by the request as the leader; that runid may be the receiving sentinel's own.
- If the recorded leader is not itself, set the failover start time.
- Every one of these state changes is persisted to the configuration file.
Here we see the sentinel cluster synchronize one piece of failover state, its start time: a non-leader sentinel considers the failover started when it receives the vote-request broadcast. The whole voting exchange flows as follows:
- A sentinel sends the SENTINEL is-master-down-by-addr command asking the receiver to make it leader. Two cases: 1) the receiver has not yet set a leader for the current epoch, and records the requester as leader; 2) the receiver has already set a leader for the current epoch, and returns that leader.
- On receiving the reply, the requesting sentinel records the leader runid in the peer's sentinelRedisInstance structure, for the later tally.
- After one round of queries, the asking sentinel holds the other sentinels' votes.
- It enters the SENTINEL_FAILOVER_STATE_WAIT_START state of the state machine to count the ballots.
void sentinelFailoverWaitStart(sentinelRedisInstance *ri) {
char *leader;
int isleader;
/* Check if we are the leader for the failover epoch. */
leader = sentinelGetLeader(ri, ri->failover_epoch);
isleader = leader && strcasecmp(leader,sentinel.myid) == 0;
sdsfree(leader);
/* If I'm not the leader, and it is not a forced failover via
* SENTINEL FAILOVER, then I can't continue with the failover. */
if (!isleader && !(ri->flags & SRI_FORCE_FAILOVER)) {
int election_timeout = SENTINEL_ELECTION_TIMEOUT;
/* The election timeout is the MIN between SENTINEL_ELECTION_TIMEOUT
* and the configured failover timeout. */
if (election_timeout > ri->failover_timeout)
election_timeout = ri->failover_timeout;
/* Abort the failover if I'm not the leader after some time. */
if (mstime() - ri->failover_start_time > election_timeout) {
sentinelEvent(LL_WARNING,"-failover-abort-not-elected",ri,"%@");
sentinelAbortFailover(ri);
}
return;
}
sentinelEvent(LL_WARNING,"+elected-leader",ri,"%@");
if (sentinel.simfailure_flags & SENTINEL_SIMFAILURE_CRASH_AFTER_ELECTION)
sentinelSimFailureCrash();
ri->failover_state = SENTINEL_FAILOVER_STATE_SELECT_SLAVE;
ri->failover_state_change_time = mstime();
sentinelEvent(LL_WARNING,"+failover-state-select-slave",ri,"%@");
}
sentinelGetLeader
/* Scan all the Sentinels attached to this master to check if there
* is a leader for the specified epoch.
*
* To be a leader for a given epoch, we should have the majority of
* the Sentinels we know (ever seen since the last SENTINEL RESET) that
* reported the same instance as leader for the same epoch. */
char *sentinelGetLeader(sentinelRedisInstance *master, uint64_t epoch) {
dict *counters;
dictIterator *di;
dictEntry *de;
unsigned int voters = 0, voters_quorum;
char *myvote;
char *winner = NULL;
uint64_t leader_epoch;
uint64_t max_votes = 0;
serverAssert(master->flags & (SRI_O_DOWN|SRI_FAILOVER_IN_PROGRESS));
counters = dictCreate(&leaderVotesDictType,NULL);
voters = dictSize(master->sentinels)+1; /* All the other sentinels and me.*/
/* Count other sentinels votes */
di = dictGetIterator(master->sentinels);
while((de = dictNext(di)) != NULL) {
sentinelRedisInstance *ri = dictGetVal(de);
if (ri->leader != NULL && ri->leader_epoch == sentinel.current_epoch)
sentinelLeaderIncr(counters,ri->leader);
}
dictReleaseIterator(di);
/* Check what's the winner. For the winner to win, it needs two conditions:
* 1) Absolute majority between voters (50% + 1).
* 2) And anyway at least master->quorum votes. */
di = dictGetIterator(counters);
while((de = dictNext(di)) != NULL) {
uint64_t votes = dictGetUnsignedIntegerVal(de);
if (votes > max_votes) {
max_votes = votes;
winner = dictGetKey(de);
}
}
dictReleaseIterator(di);
/* Count this Sentinel vote:
* if this Sentinel did not voted yet, either vote for the most
* common voted sentinel, or for itself if no vote exists at all. */
if (winner)
myvote = sentinelVoteLeader(master,epoch,winner,&leader_epoch);
else
myvote = sentinelVoteLeader(master,epoch,sentinel.myid,&leader_epoch);
if (myvote && leader_epoch == epoch) {
uint64_t votes = sentinelLeaderIncr(counters,myvote);
if (votes > max_votes) {
max_votes = votes;
winner = myvote;
}
}
voters_quorum = voters/2+1;
if (winner && (max_votes < voters_quorum || max_votes < master->quorum))
winner = NULL;
winner = winner ? sdsnew(winner) : NULL;
sdsfree(myvote);
dictRelease(counters);
return winner;
}
- First tally the ballots, using redis's own leaderVotesDictType structure (essentially a key-value map) to group the votes and sum the count per runid.
- Pick the runid with the most votes.
- Then cast this sentinel's own vote: for the current winner if there is one, otherwise for itself.
- The winner must hold both an absolute majority of all nodes (voters/2+1) and at least the quorum configured when monitoring the master; only then does a winner stand, and that winner is the elected leader.
NOTE: at the very start of leader election a sentinel does not vote for itself, but it may end up doing so: each sentinel has two opportunities to vote, yet may cast only one vote per epoch. The vote-soliciting process is asynchronous, so some queries may not get a timely reply.
From the election process above, producing a leader requires several conditions:
- The winner needs at least voters/2+1 votes (the node itself included), and also at least master->quorum votes; the effective threshold is the larger of the two.
- The most-voted node must reach that absolute majority to become leader.
- The result is only valid within the current epoch.
Because the whole solicitation is asynchronous, if nodes are offline or the most-voted node cannot meet the requirements above, no leader emerges in the current epoch; the cluster can only wait for the timeout and open a new epoch, repeating until a leader for this failover is elected and the state machine moves to its next state.
Leader election summary
The code above walked through sentinel's leader election flow; now a summary of how sentinel reaches a consensus state.
For all nodes in a cluster to reach consensus, they must exchange cluster-level state. Sentinel nodes exchange it by sending the SENTINEL is-master-down-by-addr command and reading each other's answers. Every sentinel node maintains its own per-master data structure, which in some sense can be viewed as a routing table, with SENTINEL is-master-down-by-addr as the exchange protocol. Because of the particular shape of the sentinel network, sentinel nodes discover one another indirectly, by subscribing to the hello channel of a master they monitor in common; the master node acts as the medium, even though master nodes and sentinel nodes belong to two entirely different functional roles.
In short, the foundations that an abstract highly available distributed cluster must lay are:
- building the communication network between cluster nodes
- the propagation mechanism for the protocol exchanged between nodes
- the exchange protocol itself
Beyond these foundations, what the sentinel design must provide is a way to reach consensus. Both the master's objective down and the failover leader election are consensus-forming processes. Naturally we want the rules and criteria by which consensus forms to be identical on every node (every sentinel node, that is), so that every node is treated fairly. Sentinel does have the manually configured quorum; in my view this quorum is an important factor tuning how the whole sentinel cluster reaches consensus. Unfortunately it is configurable per node rather than per cluster, which hands a single node outsized decision power and somewhat undermines the cluster's stability.
Back to the main thread. Every sentinel node runs its own periodic probing of the master. When one node observes the master offline and judges it subjectively down, the sentinel system's first consensus decision begins: that node keeps asking the others whether they also consider the master down, updating flags in the corresponding sentinelRedisInstance as answers arrive. As time passes it accumulates more and more peers that report the master down, up to some threshold. Seen from another angle, every node in the cluster is, through its own periodic probes, independently converging on the judgment that the master is objectively down: this state needs no mutual notification, each node senses it by itself, and each node learns the others' state by polling. When some node is first to satisfy the objective-down condition (>= the quorum value), the second round of consensus, calling a vote, begins. As mentioned earlier, there is no sharp boundary between these two rounds of consensus within the cluster; each node arrives at them on its own, unaffected by the state of the others. Also worth noting: every sentinel node has the right to vote in the election, including nodes that have not confirmed the master is down. This round carries the condition that the failover leader's vote count must exceed half of the cluster's nodes, and the election has a time limit; if no leader is obtained within it, another round of voting follows, until in some round a leader emerges within the limit and the consensus is reached. Accordingly each round of voting is governed by an epoch, somewhat like a version number. From these two rounds of consensus we can abstract the points a sentinel node needs in order to reach consensus:
- Every node may stand for election and may vote, with exactly one vote in the current epoch, guaranteeing that every node computes the same result for the round.
- The election results are exchanged.
- A trigger rule decides when the result counts as consensus (votes/2+1).
- Reaching consensus on the result has a time limit.
Other failover operations
State machine
With the leader election over, the elected leader node begins the failover proper. As the earlier code showed, sentinel drives the failover through a state machine.
void sentinelFailoverStateMachine(sentinelRedisInstance *ri) {
serverAssert(ri->flags & SRI_MASTER);
if (!(ri->flags & SRI_FAILOVER_IN_PROGRESS)) return;
switch(ri->failover_state) {
case SENTINEL_FAILOVER_STATE_WAIT_START:
sentinelFailoverWaitStart(ri);
break;
case SENTINEL_FAILOVER_STATE_SELECT_SLAVE:
sentinelFailoverSelectSlave(ri);
break;
case SENTINEL_FAILOVER_STATE_SEND_SLAVEOF_NOONE:
sentinelFailoverSendSlaveOfNoOne(ri);
break;
case SENTINEL_FAILOVER_STATE_WAIT_PROMOTION:
sentinelFailoverWaitPromotion(ri);
break;
case SENTINEL_FAILOVER_STATE_RECONF_SLAVES:
sentinelFailoverReconfNextSlave(ri);
break;
}
}
The state transitions look like:
SENTINEL_FAILOVER_STATE_WAIT_START
||
\/
SENTINEL_FAILOVER_STATE_SELECT_SLAVE
||
\/
SENTINEL_FAILOVER_STATE_SEND_SLAVEOF_NOONE
||
\/
SENTINEL_FAILOVER_STATE_WAIT_PROMOTION
||
\/
SENTINEL_FAILOVER_STATE_RECONF_SLAVES
||
\/
SENTINEL_FAILOVER_STATE_UPDATE_CONFIG
Selecting the slave
The selection rule lives in the call chain sentinelFailoverSelectSlave->sentinelSelectSlave
sentinelRedisInstance *sentinelSelectSlave(sentinelRedisInstance *master) {
sentinelRedisInstance **instance =
zmalloc(sizeof(instance[0])*dictSize(master->slaves));
sentinelRedisInstance *selected = NULL;
int instances = 0;
dictIterator *di;
dictEntry *de;
mstime_t max_master_down_time = 0;
if (master->flags & SRI_S_DOWN)
max_master_down_time += mstime() - master->s_down_since_time;
max_master_down_time += master->down_after_period * 10;
di = dictGetIterator(master->slaves);
while((de = dictNext(di)) != NULL) {
sentinelRedisInstance *slave = dictGetVal(de);
mstime_t info_validity_time;
if (slave->flags & (SRI_S_DOWN|SRI_O_DOWN)) continue;
if (slave->link->disconnected) continue;
if (mstime() - slave->link->last_avail_time > SENTINEL_PING_PERIOD*5) continue;
if (slave->slave_priority == 0) continue;
/* If the master is in SDOWN state we get INFO for slaves every second.
* Otherwise we get it with the usual period so we need to account for
* a larger delay. */
if (master->flags & SRI_S_DOWN)
info_validity_time = SENTINEL_PING_PERIOD*5;
else
info_validity_time = SENTINEL_INFO_PERIOD*3;
if (mstime() - slave->info_refresh > info_validity_time) continue;
if (slave->master_link_down_time > max_master_down_time) continue;
instance[instances++] = slave;
}
dictReleaseIterator(di);
if (instances) {
qsort(instance,instances,sizeof(sentinelRedisInstance*),
compareSlavesForPromotion);
selected = instance[0];
}
zfree(instance);
return selected;
}
Selection strategy:
- Exclude slaves already judged subjectively or objectively down.
- Exclude slaves whose link is disconnected.
- Exclude slaves that have had no ping activity for more than 5*SENTINEL_PING_PERIOD (i.e. 5s).
- Exclude slaves whose priority is 0.
- If the master is in the SRI_S_DOWN state, sentinel sends info to the slaves every 1s, so exclude slaves whose info is older than SENTINEL_PING_PERIOD*5 (5s); otherwise exclude those whose info is older than SENTINEL_INFO_PERIOD*3 (30s).
- Exclude slaves whose master-link down time exceeds the master's time spent subjectively down plus master->down_after_period * 10. This ensures, as far as possible, that the chosen slave was still connected to the master before it went down.
- Sort the remaining slaves and pick one by the following rules: first the highest priority (the smallest slave_priority value); then the largest replication offset, meaning data completeness closest to the master's; and finally, on a tie, the lexicographically smallest runid.
int compareSlavesForPromotion(const void *a, const void *b) {
sentinelRedisInstance **sa = (sentinelRedisInstance **)a,
**sb = (sentinelRedisInstance **)b;
char *sa_runid, *sb_runid;
if ((*sa)->slave_priority != (*sb)->slave_priority)
return (*sa)->slave_priority - (*sb)->slave_priority;
/* If priority is the same, select the slave with greater replication
* offset (processed more data from the master). */
if ((*sa)->slave_repl_offset > (*sb)->slave_repl_offset) {
return -1; /* a < b */
} else if ((*sa)->slave_repl_offset < (*sb)->slave_repl_offset) {
return 1; /* a > b */
}
/* If the replication offset is the same select the slave with that has
* the lexicographically smaller runid. Note that we try to handle runid
* == NULL as there are old Redis versions that don't publish runid in
* INFO. A NULL runid is considered bigger than any other runid. */
sa_runid = (*sa)->runid;
sb_runid = (*sb)->runid;
if (sa_runid == NULL && sb_runid == NULL) return 0;
else if (sa_runid == NULL) return 1; /* a > b */
else if (sb_runid == NULL) return -1; /* a < b */
return strcasecmp(sa_runid, sb_runid);
}
Sending the command that promotes the slave to master
Call chain: sentinelFailoverSendSlaveOfNoOne->sentinelSendSlaveOf
void sentinelFailoverSendSlaveOfNoOne(sentinelRedisInstance *ri) {
int retval;
/* We can't send the command to the promoted slave if it is now
* disconnected. Retry again and again with this state until the timeout
* is reached, then abort the failover. */
if (ri->promoted_slave->link->disconnected) {
if (mstime() - ri->failover_state_change_time > ri->failover_timeout) {
sentinelEvent(LL_WARNING,"-failover-abort-slave-timeout",ri,"%@");
sentinelAbortFailover(ri);
}
return;
}
/* Send SLAVEOF NO ONE command to turn the slave into a master.
* We actually register a generic callback for this command as we don't
* really care about the reply. We check if it worked indirectly observing
* if INFO returns a different role (master instead of slave). */
retval = sentinelSendSlaveOf(ri->promoted_slave,NULL,0);
if (retval != C_OK) return;
sentinelEvent(LL_NOTICE, "+failover-state-wait-promotion",
ri->promoted_slave,"%@");
ri->failover_state = SENTINEL_FAILOVER_STATE_WAIT_PROMOTION;
ri->failover_state_change_time = mstime();
}
A slaveof no one command is sent to the chosen slave to promote it to master. Notably, no reply handler is registered for this command; instead, sentinel learns whether the slave's role has changed from the info command it keeps sending to the slave. If, before sending, the connection to the chosen slave is found broken and the state has not changed within the failover timeout, the failover is declared failed by timeout, reset, and a new round of voting and election begins.
Waiting for the slave's promotion
Once the slaveof no one command has been sent, the failover state machine enters the SENTINEL_FAILOVER_STATE_WAIT_PROMOTION state. In this state the sentinel merely checks whether failover_state_change_time has exceeded the timeout; if it has, the failover is declared failed by timeout, reset, and a new round of voting and election begins.
/* We actually wait for promotion indirectly checking with INFO when the
* slave turns into a master. */
void sentinelFailoverWaitPromotion(sentinelRedisInstance *ri) {
/* Just handle the timeout. Switching to the next state is handled
* by the function parsing the INFO command of the promoted slave. */
if (mstime() - ri->failover_state_change_time > ri->failover_timeout) {
sentinelEvent(LL_WARNING,"-failover-abort-slave-timeout",ri,"%@");
sentinelAbortFailover(ri);
}
}
As noted when the slaveof no one command was sent, sentinel registers no response callback; it detects the slave's role change through the periodic info command. The parsing of the info reply was covered in the previous chapter; back to that code:
/* Process the INFO output from masters. */
void sentinelRefreshInstanceInfo(sentinelRedisInstance *ri, const char *info) {
sds *lines;
int numlines, j;
int role = 0;
/* cache full INFO output for instance */
sdsfree(ri->info);
ri->info = sdsnew(info);
/* The following fields must be reset to a given value in the case they
* are not found at all in the INFO output. */
ri->master_link_down_time = 0;
...
/* Handle slave -> master role switch. */
if ((ri->flags & SRI_SLAVE) && role == SRI_MASTER) {
/* If this is a promoted slave we can change state to the
* failover state machine. */
if ((ri->flags & SRI_PROMOTED) &&
(ri->master->flags & SRI_FAILOVER_IN_PROGRESS) &&
(ri->master->failover_state ==
SENTINEL_FAILOVER_STATE_WAIT_PROMOTION))
{
/* Now that we are sure the slave was reconfigured as a master
* set the master configuration epoch to the epoch we won the
* election to perform this failover. This will force the other
* Sentinels to update their config (assuming there is not
* a newer one already available). */
ri->master->config_epoch = ri->master->failover_epoch;
ri->master->failover_state = SENTINEL_FAILOVER_STATE_RECONF_SLAVES;
ri->master->failover_state_change_time = mstime();
sentinelFlushConfig();
sentinelEvent(LL_WARNING,"+promoted-slave",ri,"%@");
if (sentinel.simfailure_flags &
SENTINEL_SIMFAILURE_CRASH_AFTER_PROMOTION)
sentinelSimFailureCrash();
sentinelEvent(LL_WARNING,"+failover-state-reconf-slaves",
ri->master,"%@");
sentinelCallClientReconfScript(ri->master,SENTINEL_LEADER,
"start",ri->master->addr,ri->addr);
sentinelForceHelloUpdateForMaster(ri->master);
} else {
/* A slave turned into a master. We want to force our view and
* reconfigure as slave. Wait some time after the change before
* going forward, to receive new configs if any. */
mstime_t wait_time = SENTINEL_PUBLISH_PERIOD*4;
if (!(ri->flags & SRI_PROMOTED) &&
sentinelMasterLooksSane(ri->master) &&
sentinelRedisInstanceNoDownFor(ri,wait_time) &&
mstime() - ri->role_reported_time > wait_time)
{
int retval = sentinelSendSlaveOf(ri,
ri->master->addr->ip,
ri->master->addr->port);
if (retval == C_OK)
sentinelEvent(LL_NOTICE,"+convert-to-slave",ri,"%@");
}
}
}
/* Handle slaves replicating to a different master address. */
if ((ri->flags & SRI_SLAVE) &&
role == SRI_SLAVE &&
(ri->slave_master_port != ri->master->addr->port ||
strcasecmp(ri->slave_master_host,ri->master->addr->ip)))
{
mstime_t wait_time = ri->master->failover_timeout;
/* Make sure the master is sane before reconfiguring this instance
* into a slave. */
if (sentinelMasterLooksSane(ri->master) &&
sentinelRedisInstanceNoDownFor(ri,wait_time) &&
mstime() - ri->slave_conf_change_time > wait_time)
{
int retval = sentinelSendSlaveOf(ri,
ri->master->addr->ip,
ri->master->addr->port);
if (retval == C_OK)
sentinelEvent(LL_NOTICE,"+fix-slave-config",ri,"%@");
}
}
/* Detect if the slave that is in the process of being reconfigured
* changed state. */
if ((ri->flags & SRI_SLAVE) && role == SRI_SLAVE &&
(ri->flags & (SRI_RECONF_SENT|SRI_RECONF_INPROG)))
{
/* SRI_RECONF_SENT -> SRI_RECONF_INPROG. */
if ((ri->flags & SRI_RECONF_SENT) &&
ri->slave_master_host &&
strcmp(ri->slave_master_host,
ri->master->promoted_slave->addr->ip) == 0 &&
ri->slave_master_port == ri->master->promoted_slave->addr->port)
{
ri->flags &= ~SRI_RECONF_SENT;
ri->flags |= SRI_RECONF_INPROG;
sentinelEvent(LL_NOTICE,"+slave-reconf-inprog",ri,"%@");
}
/* SRI_RECONF_INPROG -> SRI_RECONF_DONE */
if ((ri->flags & SRI_RECONF_INPROG) &&
ri->slave_master_link_status == SENTINEL_MASTER_LINK_STATUS_UP)
{
ri->flags &= ~SRI_RECONF_INPROG;
ri->flags |= SRI_RECONF_DONE;
sentinelEvent(LL_NOTICE,"+slave-reconf-done",ri,"%@");
}
}
}
Only the code concerned with the slave -> master switch is excerpted here.
- First verify that the slave this sentinel supervises really is the one being promoted in an in-progress failover.
- If so, update the configuration epoch, advance to the next state SENTINEL_FAILOVER_STATE_RECONF_SLAVES, record the state-change time, and persist the change to the configuration file.
- Invoke the client reconfiguration script.
- Call sentinelForceHelloUpdateForMaster -> sentinelForceHelloUpdateDictOfRedisInstances so that a hello msg is broadcast in the very next cycle.
/* Reset last_pub_time in all the instances in the specified dictionary
* in order to force the delivery of an Hello update ASAP. */
void sentinelForceHelloUpdateDictOfRedisInstances(dict *instances) {
dictIterator *di;
dictEntry *de;
di = dictGetSafeIterator(instances);
while((de = dictNext(di)) != NULL) {
sentinelRedisInstance *ri = dictGetVal(de);
if (ri->last_pub_time >= (SENTINEL_PUBLISH_PERIOD+1))
ri->last_pub_time -= (SENTINEL_PUBLISH_PERIOD+1);
}
dictReleaseIterator(di);
}
/* This function forces the delivery of an "Hello" message (see
* sentinelSendHello() top comment for further information) to all the Redis
* and Sentinel instances related to the specified 'master'.
*
* It is technically not needed since we send an update to every instance
* with a period of SENTINEL_PUBLISH_PERIOD milliseconds, however when a
* Sentinel upgrades a configuration it is a good idea to deliever an update
* to the other Sentinels ASAP. */
int sentinelForceHelloUpdateForMaster(sentinelRedisInstance *master) {
if (!(master->flags & SRI_MASTER)) return C_ERR;
if (master->last_pub_time >= (SENTINEL_PUBLISH_PERIOD+1))
master->last_pub_time -= (SENTINEL_PUBLISH_PERIOD+1);
sentinelForceHelloUpdateDictOfRedisInstances(master->sentinels);
sentinelForceHelloUpdateDictOfRedisInstances(master->slaves);
return C_OK;
}
Reconfiguring the remaining slaves
Once the `info` command confirms that the elected slave has successfully become a master, the state machine enters the SENTINEL_FAILOVER_STATE_RECONF_SLAVES state and reconfigures the remaining slaves.
/* Send SLAVE OF <new master address> to all the remaining slaves that
* still don't appear to have the configuration updated. */
void sentinelFailoverReconfNextSlave(sentinelRedisInstance *master) {
dictIterator *di;
dictEntry *de;
int in_progress = 0;
di = dictGetIterator(master->slaves);
while((de = dictNext(di)) != NULL) {
sentinelRedisInstance *slave = dictGetVal(de);
if (slave->flags & (SRI_RECONF_SENT|SRI_RECONF_INPROG))
in_progress++;
}
dictReleaseIterator(di);
di = dictGetIterator(master->slaves);
while(in_progress < master->parallel_syncs &&
(de = dictNext(di)) != NULL)
{
sentinelRedisInstance *slave = dictGetVal(de);
int retval;
/* Skip the promoted slave, and already configured slaves. */
if (slave->flags & (SRI_PROMOTED|SRI_RECONF_DONE)) continue;
/* If too much time elapsed without the slave moving forward to
* the next state, consider it reconfigured even if it is not.
* Sentinels will detect the slave as misconfigured and fix its
* configuration later. */
if ((slave->flags & SRI_RECONF_SENT) &&
(mstime() - slave->slave_reconf_sent_time) >
SENTINEL_SLAVE_RECONF_TIMEOUT)
{
sentinelEvent(LL_NOTICE,"-slave-reconf-sent-timeout",slave,"%@");
slave->flags &= ~SRI_RECONF_SENT;
slave->flags |= SRI_RECONF_DONE;
}
/* Nothing to do for instances that are disconnected or already
* in RECONF_SENT state. */
if (slave->flags & (SRI_RECONF_SENT|SRI_RECONF_INPROG)) continue;
if (slave->link->disconnected) continue;
/* Send SLAVEOF <new master>. */
retval = sentinelSendSlaveOf(slave,
master->promoted_slave->addr->ip,
master->promoted_slave->addr->port);
if (retval == C_OK) {
slave->flags |= SRI_RECONF_SENT;
slave->slave_reconf_sent_time = mstime();
sentinelEvent(LL_NOTICE,"+slave-reconf-sent",slave,"%@");
in_progress++;
}
}
dictReleaseIterator(di);
/* Check if all the slaves are reconfigured and handle timeout. */
sentinelFailoverDetectEnd(master);
}
A `SLAVEOF <new master address>` command is sent to the other slaves. This command gets no reply callback either; whether each slave has been reconfigured to follow the new master is, once again, learned by probing with the `info` command.
/* Process the INFO output from masters. */
void sentinelRefreshInstanceInfo(sentinelRedisInstance *ri, const char *info) {
sds *lines;
int numlines, j;
int role = 0;
/* cache full INFO output for instance */
sdsfree(ri->info);
ri->info = sdsnew(info);
/* The following fields must be reset to a given value in the case they
* are not found at all in the INFO output. */
ri->master_link_down_time = 0;
...
/* Detect if the slave that is in the process of being reconfigured
* changed state. */
if ((ri->flags & SRI_SLAVE) && role == SRI_SLAVE &&
(ri->flags & (SRI_RECONF_SENT|SRI_RECONF_INPROG)))
{
/* SRI_RECONF_SENT -> SRI_RECONF_INPROG. */
if ((ri->flags & SRI_RECONF_SENT) &&
ri->slave_master_host &&
strcmp(ri->slave_master_host,
ri->master->promoted_slave->addr->ip) == 0 &&
ri->slave_master_port == ri->master->promoted_slave->addr->port)
{
ri->flags &= ~SRI_RECONF_SENT;
ri->flags |= SRI_RECONF_INPROG;
sentinelEvent(LL_NOTICE,"+slave-reconf-inprog",ri,"%@");
}
/* SRI_RECONF_INPROG -> SRI_RECONF_DONE */
if ((ri->flags & SRI_RECONF_INPROG) &&
ri->slave_master_link_status == SENTINEL_MASTER_LINK_STATUS_UP)
{
ri->flags &= ~SRI_RECONF_INPROG;
ri->flags |= SRI_RECONF_DONE;
sentinelEvent(LL_NOTICE,"+slave-reconf-done",ri,"%@");
}
}
}
As the code above shows, there is one more state progression to wait through at the end, while the other slaves switch over to the promoted slave:
SRI_RECONF_SENT -> SRI_RECONF_INPROG -> SRI_RECONF_DONE
- SRI_RECONF_SENT: `SLAVEOF <new master address>` has been sent, as described above.
- SRI_RECONF_INPROG: the slave that received `SLAVEOF <new master address>` is now configured as a slave of the new master.
- SRI_RECONF_DONE: the slave has finished reconfiguring its master. Reaching this state additionally requires master_link_status:up, which indicates the slave has completed its resynchronization with the new master.
Finally, sentinelFailoverReconfNextSlave calls sentinelFailoverDetectEnd to check whether every slave has been properly configured with the new master. Once they all have, the state machine moves on to the next state, SENTINEL_FAILOVER_STATE_UPDATE_CONFIG.
void sentinelFailoverDetectEnd(sentinelRedisInstance *master) {
int not_reconfigured = 0, timeout = 0;
dictIterator *di;
dictEntry *de;
mstime_t elapsed = mstime() - master->failover_state_change_time;
/* We can't consider failover finished if the promoted slave is
* not reachable. */
if (master->promoted_slave == NULL ||
master->promoted_slave->flags & SRI_S_DOWN) return;
/* The failover terminates once all the reachable slaves are properly
* configured. */
di = dictGetIterator(master->slaves);
while((de = dictNext(di)) != NULL) {
sentinelRedisInstance *slave = dictGetVal(de);
if (slave->flags & (SRI_PROMOTED|SRI_RECONF_DONE)) continue;
if (slave->flags & SRI_S_DOWN) continue;
not_reconfigured++;
}
dictReleaseIterator(di);
/* Force end of failover on timeout. */
if (elapsed > master->failover_timeout) {
not_reconfigured = 0;
timeout = 1;
sentinelEvent(LL_WARNING,"+failover-end-for-timeout",master,"%@");
}
if (not_reconfigured == 0) {
sentinelEvent(LL_WARNING,"+failover-end",master,"%@");
master->failover_state = SENTINEL_FAILOVER_STATE_UPDATE_CONFIG;
master->failover_state_change_time = mstime();
}
/* If I'm the leader it is a good idea to send a best effort SLAVEOF
* command to all the slaves still not reconfigured to replicate with
* the new master. */
if (timeout) {
dictIterator *di;
dictEntry *de;
di = dictGetIterator(master->slaves);
while((de = dictNext(di)) != NULL) {
sentinelRedisInstance *slave = dictGetVal(de);
int retval;
if (slave->flags & (SRI_RECONF_DONE|SRI_RECONF_SENT)) continue;
if (slave->link->disconnected) continue;
retval = sentinelSendSlaveOf(slave,
master->promoted_slave->addr->ip,
master->promoted_slave->addr->port);
if (retval == C_OK) {
sentinelEvent(LL_NOTICE,"+slave-reconf-sent-be",slave,"%@");
slave->flags |= SRI_RECONF_SENT;
}
}
dictReleaseIterator(di);
}
}
Updating the master's address
After all the slaves have been converted, the failover reaches the SENTINEL_FAILOVER_STATE_UPDATE_CONFIG state. This state is handled in the periodic method rather than inside the state machine, mainly because the master in this state is about to be replaced by the elected slave and only the old master's address needs to change, and because the periodic method is recursive: if this state were not handled after the periodic processing of the old master and of its slaves and sentinels, it could cause unnecessary problems. The reset code is shown below: a fresh slaves dict is created, the old master is added back into that dict as a slave, and the address of the old master's sentinelRedisInstance is replaced while its state and connections are all reset by sentinelResetMaster.
/* Perform scheduled operations for all the instances in the dictionary.
* Recursively call the function against dictionaries of slaves. */
void sentinelHandleDictOfRedisInstances(dict *instances) {
dictIterator *di;
dictEntry *de;
sentinelRedisInstance *switch_to_promoted = NULL;
/* There are a number of things we need to perform against every master. */
di = dictGetIterator(instances);
while((de = dictNext(di)) != NULL) {
sentinelRedisInstance *ri = dictGetVal(de);
sentinelHandleRedisInstance(ri);
if (ri->flags & SRI_MASTER) {
sentinelHandleDictOfRedisInstances(ri->slaves);
sentinelHandleDictOfRedisInstances(ri->sentinels);
if (ri->failover_state == SENTINEL_FAILOVER_STATE_UPDATE_CONFIG) {
switch_to_promoted = ri;
}
}
}
if (switch_to_promoted)
sentinelFailoverSwitchToPromotedSlave(switch_to_promoted);
dictReleaseIterator(di);
}
sentinelFailoverSwitchToPromotedSlave->sentinelResetMasterAndChangeAddress
/* Reset the specified master with sentinelResetMaster(), and also change
* the ip:port address, but take the name of the instance unmodified.
*
* This is used to handle the +switch-master event.
*
* The function returns C_ERR if the address can't be resolved for some
* reason. Otherwise C_OK is returned. */
int sentinelResetMasterAndChangeAddress(sentinelRedisInstance *master, char *ip, int port) {
sentinelAddr *oldaddr, *newaddr;
sentinelAddr **slaves = NULL;
int numslaves = 0, j;
dictIterator *di;
dictEntry *de;
newaddr = createSentinelAddr(ip,port);
if (newaddr == NULL) return C_ERR;
/* Make a list of slaves to add back after the reset.
* Don't include the one having the address we are switching to. */
di = dictGetIterator(master->slaves);
while((de = dictNext(di)) != NULL) {
sentinelRedisInstance *slave = dictGetVal(de);
if (sentinelAddrIsEqual(slave->addr,newaddr)) continue;
slaves = zrealloc(slaves,sizeof(sentinelAddr*)*(numslaves+1));
slaves[numslaves++] = createSentinelAddr(slave->addr->ip,
slave->addr->port);
}
dictReleaseIterator(di);
/* If we are switching to a different address, include the old address
* as a slave as well, so that we'll be able to sense / reconfigure
* the old master. */
if (!sentinelAddrIsEqual(newaddr,master->addr)) {
slaves = zrealloc(slaves,sizeof(sentinelAddr*)*(numslaves+1));
slaves[numslaves++] = createSentinelAddr(master->addr->ip,
master->addr->port);
}
/* Reset and switch address. */
sentinelResetMaster(master,SENTINEL_RESET_NO_SENTINELS);
oldaddr = master->addr;
master->addr = newaddr;
master->o_down_since_time = 0;
master->s_down_since_time = 0;
/* Add slaves back. */
for (j = 0; j < numslaves; j++) {
sentinelRedisInstance *slave;
slave = createSentinelRedisInstance(NULL,SRI_SLAVE,slaves[j]->ip,
slaves[j]->port, master->quorum, master);
releaseSentinelAddr(slaves[j]);
if (slave) sentinelEvent(LL_NOTICE,"+slave",slave,"%@");
}
zfree(slaves);
/* Release the old address at the end so we are safe even if the function
* gets the master->addr->ip and master->addr->port as arguments. */
releaseSentinelAddr(oldaddr);
sentinelFlushConfig();
return C_OK;
}
The failover is finally over, but one question is still unresolved: only the elected leader may perform the failover, so how do the other sentinel nodes learn about the newly elected master and update their corresponding structures? Recall that while processing the `info` reply, when the reported role changes from slave to master, the code forces the hello msg publish period forward so the hello msg is broadcast as soon as possible. So we return to part of the hello msg handling code.
/* Process an hello message received via Pub/Sub in master or slave instance,
* or sent directly to this sentinel via the (fake) PUBLISH command of Sentinel.
*
* If the master name specified in the message is not known, the message is
* discarded. */
void sentinelProcessHelloMessage(char *hello, int hello_len) {
/* Format is composed of 8 tokens:
* 0=ip,1=port,2=runid,3=current_epoch,4=master_name,
* 5=master_ip,6=master_port,7=master_config_epoch. */
int numtokens, port, removed, master_port;
uint64_t current_epoch, master_config_epoch;
char **token = sdssplitlen(hello, hello_len, ",", 1, &numtokens);
sentinelRedisInstance *si, *master;
if (numtokens == 8) {
/* Obtain a reference to the master this hello message is about */
master = sentinelGetMasterByName(token[4]);
if (!master) goto cleanup; /* Unknown master, skip the message. */
/* First, try to see if we already have this sentinel. */
port = atoi(token[1]);
master_port = atoi(token[6]);
si = getSentinelRedisInstanceByAddrAndRunID(
master->sentinels,token[0],port,token[2]);
current_epoch = strtoull(token[3],NULL,10);
master_config_epoch = strtoull(token[7],NULL,10);
...
/* Update master info if received configuration is newer. */
if (si && master->config_epoch < master_config_epoch) {
master->config_epoch = master_config_epoch;
if (master_port != master->addr->port ||
strcmp(master->addr->ip, token[5]))
{
sentinelAddr *old_addr;
sentinelEvent(LL_WARNING,"+config-update-from",si,"%@");
sentinelEvent(LL_WARNING,"+switch-master",
master,"%s %s %d %s %d",
master->name,
master->addr->ip, master->addr->port,
token[5], master_port);
old_addr = dupSentinelAddr(master->addr);
sentinelResetMasterAndChangeAddress(master, token[5], master_port);
sentinelCallClientReconfScript(master,
SENTINEL_OBSERVER,"start",
old_addr,master->addr);
releaseSentinelAddr(old_addr);
}
}
/* Update the state of the Sentinel. */
if (si) si->last_hello_time = mstime();
}
cleanup:
sdsfreesplitres(token,numtokens);
}
When another sentinel finds that its recorded config epoch for the master is lower than the broadcast config epoch, and the master's ip and port have changed, it resets its view of the master, again using the sentinelResetMasterAndChangeAddress method analyzed above. With that the last mystery is solved: the other sentinels' monitoring state gets updated as well. Note from the code that the master name is essential: it remains unchanged even after a slave is promoted.
At this point the vast majority of sentinel's failover has been analyzed, and the basic flow has been tied together.
Summary
1. Confirming that a node is down is split into subjective down and objective down.
2. Subjective down means the probing `ping` commands have returned invalid replies for a period of time. The subjective-down probe works the same way for every node; the period is configurable, per master.
3. Objective down applies only to master nodes: a sentinel asks the other sentinels about the master with the `SENTINEL is-master-down-by-addr` command, and objective down is the state they reach consensus on.
4. A leader is elected to perform the failover: when some sentinel in the cluster finds that a master meets the objective-down condition (the number of sentinels judging it down reaches the configured `quorum`), leader election is triggered.
5. Every sentinel node may stand for election and vote, but in a given epoch each node has exactly one vote. The whole election is time-limited (10s by default, or the failover timeout if that is configured below 10s); if no leader is elected in time, the epoch is bumped and a new round of election begins.
6. Sentinel nodes also canvass votes via the `SENTINEL is-master-down-by-addr` command, so that command serves two purposes during a failover.
7. The first node to obtain votes from more than half of the cluster's nodes becomes the leader.
8. Sentinel uses a state machine to drive the failover. Each state is handled asynchronously, in different cycles.
9. The leader selects the most suitable of the slaves to become the new master and sends it `slaveof no one` to promote it, learning of the new master's conversion through the `info` command.
10. The leader reconfigures the master of the other slave nodes by sending them `slaveof <new master address>`, and again detects through `info` whether each node's reconfiguration has completed.
11. Once the slave nodes have been rebuilt, the leader updates the master structure: it rebuilds the slaves dict and resets the master's sentinelRedisInstance, while the master name is kept unchanged throughout.
12. After the leader completes the failover, the other sentinel nodes, being subscribed to the hello channel of the same set of nodes, receive the leader's broadcast hello msg and update their own master structures.
13. Finally, Redis's sentinel solution is a good stepping stone to understanding the Raft algorithm.