11. Spark Streaming Source Code Walkthrough: An In-Depth Study of the Architecture and Implementation of ReceiverTracker in the Driver

The previous article analyzed in detail how a Receiver continuously receives data. While it receives data, the Receiver sends the metadata of that data to ReceiverTracker.

This article analyzes the architecture and implementation of ReceiverTracker in detail.

I. Main functions of ReceiverTracker

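The main functions of ReceiverTracker are:

1. Start Receivers on the Executors.

2. Accept registrations from Receivers.

3. Manage the metadata of the data received by Receivers, with the help of ReceivedBlockTracker.

4. Accept the various messages sent by Receivers and handle each accordingly.

5. Update the rate at which Receivers receive data (that is, rate limiting).

6. Continuously monitor the running state of the Receivers and restart any Receiver that stops running. This is the fault-tolerance mechanism for Receivers.

7. Stop the Receivers.

8. Report the error messages sent by Receivers.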

II. ReceiverTracker's functions in detail

2.1 Starting receivers and managing the metadata of received data

First, ReceiverTracker holds a variable named endpoint, an RPC endpoint of type ReceiverTrackerEndpoint. It is used for message passing with the Receivers and with ReceiverTracker itself. This endpoint is initialized when ReceiverTracker starts.

When ReceiverTracker launches the Receivers, it sends a StartAllReceivers(receivers) message to this endpoint.

Once a Receiver has started, it registers with ReceiverTracker to report that it started successfully.

In the registration code, trackerEndpoint is a reference to the ReceiverTrackerEndpoint endpoint inside ReceiverTracker.
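
The startup handshake described above can be modelled in a few lines. The following is a minimal Python sketch and not Spark's Scala source: the class ToyTrackerEndpoint, the executor names, the simplified message fields, and the round-robin placement are all assumptions made for illustration. Only the idea of a StartAllReceivers message followed by a registration step comes from the text.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StartAllReceivers:
    receiver_ids: tuple  # stream ids of the receivers to launch


@dataclass(frozen=True)
class RegisterReceiver:
    # Simplified, hypothetical fields; the real message carries more data.
    stream_id: int
    executor: str


@dataclass
class ToyTrackerEndpoint:
    """Toy stand-in for the tracker's endpoint (hypothetical class)."""
    executors: list
    scheduled: dict = field(default_factory=dict)   # stream id -> chosen executor
    registered: dict = field(default_factory=dict)  # stream id -> registering executor

    def receive(self, msg):
        if isinstance(msg, StartAllReceivers):
            # Assumption: round-robin placement, purely for illustration.
            for i, stream_id in enumerate(msg.receiver_ids):
                self.scheduled[stream_id] = self.executors[i % len(self.executors)]
            return True
        if isinstance(msg, RegisterReceiver):
            # Accept a registration only for a receiver the tracker scheduled.
            if msg.stream_id not in self.scheduled:
                return False
            self.registered[msg.stream_id] = msg.executor
            return True
        raise ValueError(f"unknown message: {msg!r}")


endpoint = ToyTrackerEndpoint(executors=["exec-1", "exec-2"])
endpoint.receive(StartAllReceivers(receiver_ids=(0, 1, 2)))
ok = endpoint.receive(RegisterReceiver(stream_id=1, executor=endpoint.scheduled[1]))
rejected = endpoint.receive(RegisterReceiver(stream_id=9, executor="exec-1"))
```

In this toy model the endpoint is the single place where both kinds of message arrive, which is the property the article attributes to the tracker's endpoint.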

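The Receiver keeps packaging the data it receives into Blocks and pushes those Blocks to the BlockManager, which manages them. After the Blocks have been pushed to the BlockManager, ReceiverSupervisor sends the Block metadata to ReceiverTracker's endpoint.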

Specifically, ReceiverSupervisor sends an AddBlock(blockInfo) message to ReceiverTracker's endpoint.

When ReceiverTracker receives the AddBlock(blockInfo) message, it starts a thread to process it. That processing calls the addBlock(receivedBlockInfo) method.

This addBlock method in turn delegates to the addBlock method of receivedBlockTracker, a ReceivedBlockTracker object that is created when ReceiverTracker is instantiated.

ReceivedBlockTracker's addBlock method appends the block metadata to a queue. That queue lives in a HashMap named streamIdToUnallocatedBlockQueues, where the key is the streamId and the value is the queue of blocks for that streamId.
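
To make the data structure concrete, here is a small Python sketch of a map from stream id to a queue of unallocated blocks. It is an illustration under stated assumptions, not Spark's implementation: ToyReceivedBlockTracker, BlockInfo and its fields, and the block id strings are hypothetical, and the sketch leaves out anything the article does not describe.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockInfo:
    # Hypothetical, simplified metadata for one received block.
    stream_id: int
    block_id: str
    num_records: int


class ToyReceivedBlockTracker:
    def __init__(self):
        # stream id -> queue of blocks not yet assigned to any batch
        self.stream_id_to_unallocated_block_queues = defaultdict(deque)

    def add_block(self, info):
        # Append the metadata to the queue that belongs to the block's stream.
        self.stream_id_to_unallocated_block_queues[info.stream_id].append(info)
        return True


tracker = ToyReceivedBlockTracker()
tracker.add_block(BlockInfo(0, "input-0-1", 100))
tracker.add_block(BlockInfo(0, "input-0-2", 80))
tracker.add_block(BlockInfo(1, "input-1-1", 50))

pending = {
    stream_id: [block.block_id for block in queue]
    for stream_id, queue in tracker.stream_id_to_unallocated_block_queues.items()
}
```

Keeping one queue per stream preserves arrival order within each stream, which matters later when the pending blocks are handed to a batch.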

2.2 Allocating Blocks to a Batch

When a Spark Streaming application dynamically generates jobs, JobGenerator calls its generateJobs method, which allocates the Blocks received so far to the batch.

generateJobs calls the allocateBlocksToBatch method of the receiverTracker held by jobScheduler. This receiverTracker is the ReceiverTracker object, and its method ultimately calls ReceivedBlockTracker's allocateBlocksToBatch method.

For each streamId, this method takes the queue of received blocks out of streamIdToUnallocatedBlockQueues and wraps the streamId and its block queue in an AllocatedBlocks object. It then stores that AllocatedBlocks object in timeToAllocatedBlocks, a HashMap keyed by batchTime.

This completes the allocation of Blocks to the Batch.
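
The allocation step can be sketched as follows. This is a toy Python model under stated assumptions, not Spark's code: ToyBlockAllocator is a hypothetical class, blocks are represented by plain strings, batch times are plain integers, and a dict of lists stands in for the AllocatedBlocks object.

```python
from collections import defaultdict, deque


class ToyBlockAllocator:
    def __init__(self, stream_ids):
        self.stream_ids = list(stream_ids)
        self.unallocated = defaultdict(deque)   # stream id -> pending block ids
        self.time_to_allocated_blocks = {}      # batch time -> {stream id: [block ids]}

    def add_block(self, stream_id, block_id):
        self.unallocated[stream_id].append(block_id)

    def allocate_blocks_to_batch(self, batch_time):
        allocated = {}
        for stream_id in self.stream_ids:
            queue = self.unallocated[stream_id]
            # Drain everything received so far for this stream into the batch.
            allocated[stream_id] = list(queue)
            queue.clear()
        self.time_to_allocated_blocks[batch_time] = allocated
        return allocated


allocator = ToyBlockAllocator(stream_ids=[0, 1])
allocator.add_block(0, "a")
allocator.add_block(0, "b")
allocator.allocate_blocks_to_batch(1000)
allocator.add_block(0, "c")
allocator.allocate_blocks_to_batch(2000)
```

Because each queue is drained at allocation time, a block belongs to exactly one batch: blocks "a" and "b" go to the batch at time 1000, and "c", which arrived afterwards, goes to the batch at time 2000.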

2.3 Other messages handled by ReceiverTracker

The receive method of ReceiverTrackerEndpoint in ReceiverTracker defines how each kind of message is handled.

(1) On receiving the StartAllReceivers(receivers) message, ReceiverTracker assigns executors to the receivers and starts each receiver on its assigned executor.

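(2) When ReceiverTracker detects that a receiver has exited, it sends a RestartReceiver(receiver) message to ReceiverTrackerEndpoint. On receiving this message, the endpoint assigns an executor to the receiver again and restarts it: on the original executor if that executor is still running normally, otherwise on a newly scheduled executor.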
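
The placement rule in (2) can be written as a small function. This is only a sketch of the rule as the article states it: the function name, the executor names, and the fallback of picking the first alive executor in sorted order are assumptions, and Spark's actual scheduling policy is more involved.

```python
def choose_executor_for_restart(previous_executor, alive_executors):
    """Pick where to relaunch a receiver that has exited (toy policy)."""
    if not alive_executors:
        raise RuntimeError("no executor available to host the receiver")
    if previous_executor in alive_executors:
        # The original executor is still healthy, so reuse it.
        return previous_executor
    # Assumption: fall back to the first alive executor in sorted order.
    return sorted(alive_executors)[0]


same = choose_executor_for_restart("exec-2", {"exec-1", "exec-2"})
moved = choose_executor_for_restart("exec-3", {"exec-1", "exec-2"})
```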

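(3) When a Spark Streaming job finishes, JobScheduler calls the handleJobCompletion method, which eventually calls cleanupOldBlocksAndBatches. That method sends a CleanupOldBlocks message to the endpoint.

On receiving this message, the endpoint routes it to the Receivers, which clean up the old Blocks.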

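(4) The UpdateReceiverRateLimit message

On receiving the UpdateReceiverRateLimit message, the endpoint routes it to the corresponding receiver. When the receiver gets it, it calls BlockGenerator's updateRate method to apply the new rate limit.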
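
Messages (3) and (4) follow the same pattern: the tracker does not act on them itself but forwards them to receivers. The Python sketch below models that routing under stated assumptions. ToyTracker and ToyReceiver are hypothetical classes, the receiver-side message name UpdateRateLimit and all field names are illustrative, and treating a block as old when its time is strictly below the threshold is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CleanupOldBlocks:
    thresh_time: int


@dataclass(frozen=True)
class UpdateRateLimit:
    elements_per_second: int


@dataclass
class ToyReceiver:
    rate_limit: int = 1000
    blocks: dict = field(default_factory=dict)  # block id -> creation time

    def receive(self, msg):
        if isinstance(msg, CleanupOldBlocks):
            # Assumption: keep only blocks at or after the threshold time.
            self.blocks = {b: t for b, t in self.blocks.items() if t >= msg.thresh_time}
        elif isinstance(msg, UpdateRateLimit):
            self.rate_limit = msg.elements_per_second


class ToyTracker:
    def __init__(self, receivers):
        self.receivers = receivers  # stream id -> ToyReceiver

    def cleanup_old_blocks(self, thresh_time):
        # Cleanup is broadcast to every registered receiver.
        for receiver in self.receivers.values():
            receiver.receive(CleanupOldBlocks(thresh_time))

    def update_receiver_rate_limit(self, stream_id, new_rate):
        # A rate update targets one receiver; unknown stream ids are ignored.
        receiver = self.receivers.get(stream_id)
        if receiver is not None:
            receiver.receive(UpdateRateLimit(new_rate))


r0 = ToyReceiver(blocks={"old": 1000, "new": 3000})
r1 = ToyReceiver(blocks={"older": 500})
toy_tracker = ToyTracker({0: r0, 1: r1})
toy_tracker.cleanup_old_blocks(thresh_time=2000)
toy_tracker.update_receiver_rate_limit(stream_id=1, new_rate=250)
toy_tracker.update_receiver_rate_limit(stream_id=7, new_rate=1)
```

The cleanup is a broadcast while the rate update is addressed to a single stream, which is why the two routing methods differ in this model.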
