Last week I attended a lecture on ES given by our director, and it stayed with me for days; I got very curious about the node failover logic in it, so over the past couple of days I tinkered with it myself, setting up a very simple scenario to see which path ES actually takes. I read quite a few articles and quite a bit of code, and am recording what I learned here.
This post discusses the following simple question:
- The client first sends document A to Node1, whose primary shard indexes it (for simplicity, assume the client sends straight to the primary), and document A is replicated normally to all replicas.
- The client then sends document B to Node1, which indexes it; at this point Node1 replicates it to Node2 and Node3.
- Suppose the link from Node1 to Node3 times out, and then Node1 goes down.
As everyone knows, the cluster turns red at this point, because shard0's primary is gone. But which of the following will ES do?
- Pick an arbitrary replica and promote it to primary; if the data it holds is not the latest, the replicas first synchronize among themselves (during the INITIALIZING phase).
- ES can figure out that the R0 on Node2 holds the latest data, and directly promotes Node2's R0 to primary.
The answer: in Elasticsearch 5.x it is the latter (I haven't read the code of other versions, so I can't speak for them). ES marks the cluster state RED, promotes Node2's R0 to primary, and marks the remaining R0 as unassigned; when Node1 rejoins the cluster, that unassigned R0 will be allocated to it.
Now let's walk through the code!
Prerequisites
Let me first point to two blog posts by Jasper, which describe ES's Gateway module and Allocation module in detail; this post simply follows Jasper's thread and reads through the rest of the code.
Once you've read them, you will basically know when ES performs an allocation. Here is a quick summary of the classes that run through the whole process:
- ShardStateAction: the logical entry point for handling shard-state change events.
- AllocationService: the main allocation logic class. It answers the question of how an index's shards should be distributed across nodes, and it wraps RoutingNodes, GatewayAllocator, and others; the method we care about here is naturally applyFailedShards().
- Allocator: has many subclasses. Each concrete subclass implements its strategy in makeAllocationDecision() and produces a Decision on whether to allocate; for example, PrimaryShardAllocator decides whether a primary shard should live on a given node, and the Decision class holds a pile of enum outcomes (a toy sketch of this decision flow follows this list).
- RoutingNodes: in the words of the source code, it represents the information of the clusterState object, exposing many of the properties needed for the current cluster-state change event, such as nodesToShards, unassignedShards, and assignedShards.
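To make those roles concrete, here is a minimal toy sketch of the decision flow, assuming nothing beyond what was said above; every type and name in it (Decision, ShardCopy, ToyPrimaryShardAllocator) is an invented stand-in, not the real ES API:

import java.util.List;

// Toy stand-ins for the classes summarized above; not the real Elasticsearch types.
enum Decision { YES, NO, THROTTLE }

record ShardCopy(String allocationId, String nodeId, boolean primary) {}

class ToyPrimaryShardAllocator {
    private final List<String> inSyncAllocationIds; // copies known to hold the latest data

    ToyPrimaryShardAllocator(List<String> inSyncAllocationIds) {
        this.inSyncAllocationIds = inSyncAllocationIds;
    }

    // Mirrors the idea of makeAllocationDecision(): only an in-sync copy may host the primary.
    Decision makeAllocationDecision(ShardCopy copy) {
        return inSyncAllocationIds.contains(copy.allocationId()) ? Decision.YES : Decision.NO;
    }
}

public class AllocatorSketch {
    public static void main(String[] args) {
        ToyPrimaryShardAllocator allocator =
            new ToyPrimaryShardAllocator(List.of("alloc-p0-node1", "alloc-r0-node2"));
        // A copy that missed a write and fell out of the in-sync set is rejected:
        System.out.println(allocator.makeAllocationDecision(
            new ShardCopy("alloc-r0-node3", "node3", false))); // NO
        System.out.println(allocator.makeAllocationDecision(
            new ShardCopy("alloc-r0-node2", "node2", false))); // YES
    }
}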
Source Code Analysis
From Jasper's posts you should already understand how a freshly initialized cluster allocates all the shards of an index, so let me get back to my scenario: what happens when Node1's P0 dies?
Let's start from the topmost entry point: ShardStateAction::ShardFailedClusterStateTaskExecutor. It has one logic method, execute(), part of which first classifies the shards in the current ClusterState:
ShardRouting matched = currentState.getRoutingTable().getByAllocationId(task.shardId, task.allocationId);
if (matched == null) {
    Set<String> inSyncAllocationIds = indexMetaData.inSyncAllocationIds(task.shardId.id());
    // mark shard copies without routing entries that are in in-sync allocations set only as stale if the reason why
    // they were failed is because a write made it into the primary but not to this copy (which corresponds to
    // the check "primaryTerm > 0").
    if (task.primaryTerm > 0 && inSyncAllocationIds.contains(task.allocationId)) {
        logger.debug("{} marking shard {} as stale (shard failed task: [{}])", task.shardId, task.allocationId, task);
        tasksToBeApplied.add(task);
        staleShardsToBeApplied.add(new StaleShard(task.shardId, task.allocationId));
    } else {
        // tasks that correspond to non-existent shards are marked as successful
        logger.debug("{} ignoring shard failed task [{}] (shard does not exist anymore)", task.shardId, task);
        batchResultBuilder.success(task);
    }
} else {
    // failing a shard also possibly marks it as stale (see IndexMetaDataUpdater)
    logger.debug("{} failing shard {} (shard failed task: [{}])", task.shardId, matched, task);
    tasksToBeApplied.add(task);
    failedShardsToBeApplied.add(new FailedShard(matched, task.message, task.failure));
}
The code above classifies all the failed shards and picks out every copy that is already stale. For anyone coming from ES < 5 these are all new concepts, so let me walk through them one by one:
- AllocationId: an identifier the master assigns when allocating a shard of a concrete index; it uniquely identifies one physical copy of shard data. In the code, getByAllocationId(shardId, allocationId) resolves to exactly one ShardRouting. In our example, the original P0 and both R0s each have their own allocation ID, and in the end the copy with Node1's allocation ID will be marked unassigned.
- staleShards: copies marked as not holding the latest data; in ES 5 such a copy will never be promoted to primary.
- failedShards: the shards that failed in the current cluster-state change event.
- indexMetaData.inSyncAllocationIds: ES 5's cluster state maintains such a Set; only the allocation IDs in this set are considered to hold the latest data.
Played out with my example: Node1's P0 is a failedShard, and Node3's R0 is a staleShard.
For more detail, see the following Elasticsearch blog post, which introduces this new ES 5 feature of maintaining the set of shard-copy IDs that hold the latest data:
Elasticsearch Internals - Tracking in-sync shard copies
(https://www.elastic.co/blog/tracking-in-sync-shard-copies)
Allocation IDs are assigned by the master during shard allocation and are stored on disk by the data nodes, right next to the actual shard data. The master is responsible for tracking the subset of copies that contain the most recent data. This set of copies, known as in-sync allocation IDs, is stored in the cluster state, which is persisted on all master and data nodes. Changes to the cluster state are backed by Elasticsearch’s consensus implementation, called zen discovery. This ensures that there is a shared understanding in the cluster as to which shard copies are considered as in-sync, implicitly marking those shard copies that are not in the in-sync set as stale.
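To make the quoted mechanism concrete, here is a minimal toy model of the in-sync set bookkeeping; this is my own sketch of the semantics described in that post, not ES code:

import java.util.HashSet;
import java.util.Set;

// Toy model of the master's in-sync allocation ID tracking; illustrative names only.
public class InSyncTrackingSketch {
    private final Set<String> inSyncAllocationIds = new HashSet<>();

    public InSyncTrackingSketch(Set<String> initial) {
        inSyncAllocationIds.addAll(initial);
    }

    // A write made it into the primary but failed on this copy (the primaryTerm > 0
    // case in the executor code above): the copy becomes stale and loses its right
    // to ever be promoted to primary.
    public void markStale(String allocationId) {
        inSyncAllocationIds.remove(allocationId);
    }

    // Only copies still in the in-sync set are promotion candidates.
    public boolean eligibleForPrimary(String allocationId) {
        return inSyncAllocationIds.contains(allocationId);
    }

    public static void main(String[] args) {
        InSyncTrackingSketch tracker = new InSyncTrackingSketch(
            Set.of("p0-node1", "r0-node2", "r0-node3"));
        tracker.markStale("r0-node3");                              // doc B never reached Node3
        System.out.println(tracker.eligibleForPrimary("r0-node2")); // true
        System.out.println(tracker.eligibleForPrimary("r0-node3")); // false
    }
}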
With the failed shards classified, the executor then calls into allocationService:
try {
    maybeUpdatedState = applyFailedShards(currentState, failedShardsToBeApplied, staleShardsToBeApplied);
    batchResultBuilder.successes(tasksToBeApplied);
}

ClusterState applyFailedShards(ClusterState currentState, List<FailedShard> failedShards, List<StaleShard> staleShards) {
    return allocationService.applyFailedShards(currentState, failedShards, staleShards);
}
The first thing allocationService.applyFailedShards() does is exclude all the staleShards from being routed, producing a temporary clusterState, from which it then constructs a RoutingNodes object:
ClusterState tmpState = IndexMetaDataUpdater.removeStaleIdsWithoutRoutings(clusterState, staleShards);
RoutingNodes routingNodes = getMutableRoutingNodes(tmpState);
// shuffle the unassigned nodes, just so we won't have things like poison failed shards
routingNodes.unassigned().shuffle();
As introduced above, this routingNodes parses the clusterState once and works out all the current shard sets: assigned, unassigned, failed, stale, and so on. The constructor is therefore worth a careful read, and two things in it deserve special attention: first, at this point the routingTable no longer contains Node3's R0 at all; second, watch for the moment a shard gets marked as an assignedShard.
用我假設(shè)的例子來演繹的話就是:Node3的R0 不在assignedShardList 里。
Finally we reach the core: RoutingNodes.failShard(). All the earlier puzzles were answered in the methods above, so with those answers in hand this method is easy to follow; still, it is important from start to finish, so I won't skimp and will paste it whole:
/**
* Applies the relevant logic to handle a cancelled or failed shard.
*
* Moves the shard to unassigned or completely removes the shard (if relocation target).
*
* - If shard is a primary, this also fails initializing replicas.
* - If shard is an active primary, this also promotes an active replica to primary (if such a replica exists).
* - If shard is a relocating primary, this also removes the primary relocation target shard.
* - If shard is a relocating replica, this promotes the replica relocation target to a full initializing replica, removing the
* relocation source information. This is possible as peer recovery is always done from the primary.
* - If shard is a (primary or replica) relocation target, this also clears the relocation information on the source shard.
*
*/
public void failShard(Logger logger, ShardRouting failedShard, UnassignedInfo unassignedInfo, IndexMetaData indexMetaData,
                      RoutingChangesObserver routingChangesObserver) {
    ensureMutable();
    assert failedShard.assignedToNode() : "only assigned shards can be failed";
    assert indexMetaData.getIndex().equals(failedShard.index()) :
        "shard failed for unknown index (shard entry: " + failedShard + ")";
    assert getByAllocationId(failedShard.shardId(), failedShard.allocationId().getId()) == failedShard :
        "shard routing to fail does not exist in routing table, expected: " + failedShard + " but was: " +
        getByAllocationId(failedShard.shardId(), failedShard.allocationId().getId());

    logger.debug("{} failing shard {} with unassigned info ({})", failedShard.shardId(), failedShard, unassignedInfo.shortSummary());

    // if this is a primary, fail initializing replicas first (otherwise we move RoutingNodes into an inconsistent state)
    if (failedShard.primary()) {
        List<ShardRouting> assignedShards = assignedShards(failedShard.shardId());
        if (assignedShards.isEmpty() == false) {
            // copy list to prevent ConcurrentModificationException
            for (ShardRouting routing : new ArrayList<>(assignedShards)) {
                if (!routing.primary() && routing.initializing()) {
                    // re-resolve replica as earlier iteration could have changed source/target of replica relocation
                    ShardRouting replicaShard = getByAllocationId(routing.shardId(), routing.allocationId().getId());
                    assert replicaShard != null : "failed to re-resolve " + routing + " when failing replicas";
                    UnassignedInfo primaryFailedUnassignedInfo = new UnassignedInfo(UnassignedInfo.Reason.PRIMARY_FAILED,
                        "primary failed while replica initializing", null, 0, unassignedInfo.getUnassignedTimeInNanos(),
                        unassignedInfo.getUnassignedTimeInMillis(), false, AllocationStatus.NO_ATTEMPT);
                    failShard(logger, replicaShard, primaryFailedUnassignedInfo, indexMetaData, routingChangesObserver);
                }
            }
        }
    }

    if (failedShard.relocating()) {
        // find the shard that is initializing on the target node
        ShardRouting targetShard = getByAllocationId(failedShard.shardId(), failedShard.allocationId().getRelocationId());
        assert targetShard.isRelocationTargetOf(failedShard);
        if (failedShard.primary()) {
            logger.trace("{} is removed due to the failure/cancellation of the source shard", targetShard);
            // cancel and remove target shard
            remove(targetShard);
            routingChangesObserver.shardFailed(targetShard, unassignedInfo);
        } else {
            logger.trace("{}, relocation source failed / cancelled, mark as initializing without relocation source", targetShard);
            // promote to initializing shard without relocation source and ensure that removed relocation source
            // is not added back as unassigned shard
            removeRelocationSource(targetShard);
            routingChangesObserver.relocationSourceRemoved(targetShard);
        }
    }

    // fail actual shard
    if (failedShard.initializing()) {
        if (failedShard.relocatingNodeId() == null) {
            if (failedShard.primary()) {
                // promote active replica to primary if active replica exists (only the case for shadow replicas)
                ShardRouting activeReplica = activeReplicaWithHighestVersion(failedShard.shardId());
                if (activeReplica == null) {
                    moveToUnassigned(failedShard, unassignedInfo);
                } else {
                    movePrimaryToUnassignedAndDemoteToReplica(failedShard, unassignedInfo);
                    promoteReplicaToPrimary(activeReplica, indexMetaData, routingChangesObserver);
                }
            } else {
                // initializing shard that is not relocation target, just move to unassigned
                moveToUnassigned(failedShard, unassignedInfo);
            }
        } else {
            // The shard is a target of a relocating shard. In that case we only need to remove the target shard and cancel the source
            // relocation. No shard is left unassigned
            logger.trace("{} is a relocation target, resolving source to cancel relocation ({})", failedShard,
                unassignedInfo.shortSummary());
            ShardRouting sourceShard = getByAllocationId(failedShard.shardId(),
                failedShard.allocationId().getRelocationId());
            assert sourceShard.isRelocationSourceOf(failedShard);
            logger.trace("{}, resolved source to [{}]. canceling relocation ... ({})", failedShard.shardId(), sourceShard,
                unassignedInfo.shortSummary());
            cancelRelocation(sourceShard);
            remove(failedShard);
        }
        routingChangesObserver.shardFailed(failedShard, unassignedInfo);
    } else {
        assert failedShard.active();
        if (failedShard.primary()) {
            // promote active replica to primary if active replica exists
            ShardRouting activeReplica = activeReplicaWithHighestVersion(failedShard.shardId());
            if (activeReplica == null) {
                moveToUnassigned(failedShard, unassignedInfo);
            } else {
                movePrimaryToUnassignedAndDemoteToReplica(failedShard, unassignedInfo);
                promoteReplicaToPrimary(activeReplica, indexMetaData, routingChangesObserver);
            }
        } else {
            assert failedShard.primary() == false;
            if (failedShard.relocating()) {
                remove(failedShard);
            } else {
                moveToUnassigned(failedShard, unassignedInfo);
            }
        }
        routingChangesObserver.shardFailed(failedShard, unassignedInfo);
    }

    assert node(failedShard.currentNodeId()).getByShardId(failedShard.shardId()) == null : "failedShard " + failedShard +
        " was matched but wasn't removed";
}
The ifs above boil down to:
- If the failed shard is a primary, all of that shard's initializing replicas are failed as well.
- If an active primary fails, one of the replicas is promoted to primary. Here is our answer: as noted earlier, routingNodes' assignedShards has already had the stale copies weeded out, so any remaining active replica can safely be promoted (see the sketch after this list for how the code picks one).
- If a relocating primary fails, the relocation target is cleaned up as well (the source is gone, so the target is abandoned too).
- If a relocating replica fails, the relocation target is simply kept as a plain initializing shard and the source is removed.
These are all straightforward, so I won't dwell on them.
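One note on the second point: the promotion target is chosen by activeReplicaWithHighestVersion(), which in the 5.x code prefers, among the remaining active replicas, the one on the node with the highest version. A condensed sketch of that selection, where Replica is my simplified stand-in for ShardRouting plus its node's version:

import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Condensed sketch of the selection done by RoutingNodes.activeReplicaWithHighestVersion().
public class PromotionSketch {
    record Replica(String nodeId, boolean primary, boolean active, int nodeVersion) {}

    static Optional<Replica> activeReplicaWithHighestVersion(List<Replica> assignedShards) {
        return assignedShards.stream()
            .filter(r -> !r.primary() && r.active())             // only active replicas qualify
            .max(Comparator.comparingInt(Replica::nodeVersion)); // among several, the newest node wins
    }

    public static void main(String[] args) {
        // In our scenario the stale R0 is already gone, so Node2's R0 is the only candidate:
        List<Replica> assigned = List.of(
            new Replica("node1", true, false, 5),  // the failed primary, filtered out
            new Replica("node2", false, true, 5)); // the in-sync replica
        System.out.println(activeReplicaWithHighestVersion(assigned).get().nodeId()); // node2
    }
}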
With that, the example and conjecture from the start of this post have been played out and confirmed. One more remark: the reason only some replicas are guaranteed to hold the latest data lies in how ES indexes documents internally. When ES indexes a doc, after the write completes on the primary, only a quorum of shard copies needs to complete the write for the doc to count as written; replication is concurrent and asynchronous, which is exactly how some replicas can end up newer than others.
https://www.elastic.co/guide/en/elasticsearch/guide/current/distrib-write.html
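As a quick worked example of that quorum: the guide linked above gives the formula int((primary + number_of_replicas) / 2) + 1, so with 1 primary and 2 replicas, as in our scenario:

// Worked example of the quorum formula from the linked guide:
// quorum = int((primary + number_of_replicas) / 2) + 1
public class QuorumExample {
    static int quorum(int numberOfReplicas) {
        return (1 + numberOfReplicas) / 2 + 1; // one primary plus its replicas
    }

    public static void main(String[] args) {
        // 1 primary + 2 replicas => quorum = 2: the primary plus one replica suffice,
        // so the copy on Node3 can lag behind, which is exactly how it turns stale.
        System.out.println(quorum(2)); // 2
    }
}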
That's all for this post; corrections and discussion are welcome.