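The previous article walked through how a Receiver continuously receives data. While receiving, the Receiver sends the metadata of that data to the ReceiverTracker.
This article looks in detail at the architecture and implementation of ReceiverTracker.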
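I. Main responsibilities of ReceiverTracker
ReceiverTracker is responsible for the following:
1. Starting Receivers on Executors.
2. Accepting Receiver registrations.
3. Managing the metadata of received data with the help of ReceivedBlockTracker.
4. Receiving the various messages sent by Receivers and handling each accordingly.
5. Updating the rate at which Receivers ingest data (that is, rate limiting).
6. Continuously monitoring the running state of the Receivers and restarting any Receiver that stops. This is the Receiver fault-tolerance mechanism.
7. Stopping Receivers.
8. Reporting error messages sent by Receivers.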
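II. ReceiverTracker responsibilities in detail
2.1 Starting receivers and managing the metadata of received data
First, ReceiverTracker holds a variable named endpoint, which refers to a ReceiverTrackerEndpoint RPC endpoint. Both the Receivers and ReceiverTracker itself exchange messages through it. The ReceiverTrackerEndpoint is initialized when ReceiverTracker starts.
To start the Receivers, ReceiverTracker sends a StartAllReceivers(receivers) message to this endpoint.
Once a Receiver has started, it registers with ReceiverTracker to report that it started successfully.
The trackerEndpoint used for this registration is a reference to the ReceiverTrackerEndpoint held in ReceiverTracker's endpoint variable.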
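To make the start-then-register handshake concrete, here is a minimal sketch of an endpoint that dispatches on message type. It is a toy model, not Spark's code: `ToyTrackerEndpoint`, the simplified message fields, and the rule that only requested receivers may register are assumptions made for illustration.

```scala
import scala.collection.mutable

// Simplified stand-ins for the messages discussed above; not Spark's actual classes.
object TrackerMessagesSketch {
  sealed trait TrackerMessage
  final case class StartAllReceivers(streamIds: Seq[Int]) extends TrackerMessage
  final case class RegisterReceiver(streamId: Int, host: String) extends TrackerMessage

  // Toy endpoint: remembers which receivers were asked to start and which registered.
  final class ToyTrackerEndpoint {
    private val requested = mutable.LinkedHashSet.empty[Int]
    private val registered = mutable.LinkedHashMap.empty[Int, String]

    // Returns true when the message was accepted.
    def receive(msg: TrackerMessage): Boolean = msg match {
      case StartAllReceivers(ids) =>
        requested ++= ids
        true
      case RegisterReceiver(id, host) =>
        // Assumption of this sketch: reject registrations for unknown stream ids.
        if (requested.contains(id)) {
          registered(id) = host
          true
        } else {
          false
        }
    }

    def registeredReceivers: Map[Int, String] = registered.toMap
  }
}
```

In this model the tracker first sends itself `StartAllReceivers`, and each receiver later answers with `RegisterReceiver`; a registration for a stream id that was never requested is refused.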
A Receiver continuously packs the data it receives into Blocks and hands these Blocks to the BlockManager. After a Block has been pushed to the BlockManager, the ReceiverSupervisor sends the Block's metadata to ReceiverTracker's endpoint.
It does so by sending an AddBlock(blockInfo) message to that endpoint.
On receiving AddBlock(blockInfo), ReceiverTracker starts a thread to handle it.
The handler calls addBlock(receivedBlockInfo).
This in turn calls the addBlock method of receivedBlockTracker, a ReceivedBlockTracker object that is created when ReceiverTracker is instantiated.
ReceivedBlockTracker's addBlock method appends the block's metadata to a queue. That queue lives in a HashMap named streamIdToUnallocatedBlockQueues, whose key is the streamId and whose value is the queue of blocks belonging to that streamId.
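The per-stream queue structure described above can be sketched as follows. This is a simplified model under stated assumptions: `BlockInfo` is a hypothetical, trimmed-down stand-in for the real block metadata, and `ToyBlockTracker` and `unallocated` are illustrative names; only `streamIdToUnallocatedBlockQueues` and `addBlock` come from the text.

```scala
import scala.collection.mutable

object UnallocatedBlocksSketch {
  // Hypothetical, trimmed-down block metadata; the real class carries more fields.
  final case class BlockInfo(streamId: Int, blockId: String, numRecords: Long)

  final class ToyBlockTracker {
    // streamId -> queue of blocks that have not been assigned to a batch yet
    private val streamIdToUnallocatedBlockQueues =
      mutable.HashMap.empty[Int, mutable.Queue[BlockInfo]]

    // Append the block's metadata to the queue of its stream, creating the queue if needed.
    def addBlock(info: BlockInfo): Boolean = synchronized {
      streamIdToUnallocatedBlockQueues
        .getOrElseUpdate(info.streamId, mutable.Queue.empty[BlockInfo])
        .enqueue(info)
      true
    }

    // Snapshot of the blocks still waiting for a batch, in arrival order.
    def unallocated(streamId: Int): Seq[BlockInfo] = synchronized {
      streamIdToUnallocatedBlockQueues.get(streamId).map(_.toList).getOrElse(Nil)
    }
  }
}
```

Keeping one queue per streamId preserves arrival order within each input stream while letting streams be drained independently.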
2.2 Allocating Blocks to a batch
When a Spark Streaming application dynamically generates jobs, JobGenerator calls its generateJobs method, which allocates the Blocks received so far to the current batch.
It does this by calling the allocateBlocksToBatch method on the receiverTracker held by jobScheduler; this receiverTracker is the ReceiverTracker object.
That method ultimately delegates to ReceivedBlockTracker's allocateBlocksToBatch method.
For each streamId, it takes the queue of received blocks out of streamIdToUnallocatedBlockQueues and wraps the streamId-to-blocks mapping in an AllocatedBlocks object. It then stores that AllocatedBlocks object in timeToAllocatedBlocks, a HashMap keyed by batchTime.
This completes the allocation of Blocks to the batch.
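A minimal sketch of this allocation step, assuming batch times are plain milliseconds and using a simplified `BlockInfo`. `ToyBlockTracker`, `blocksOfBatch`, and the constructor parameter `streamIds` are hypothetical names introduced for the example; the real method also deals with concerns (such as write-ahead logging) that are left out here.

```scala
import scala.collection.mutable

object BatchAllocationSketch {
  final case class BlockInfo(streamId: Int, blockId: String)
  // Blocks assigned to one batch, grouped by stream.
  final case class AllocatedBlocks(streamIdToBlocks: Map[Int, Seq[BlockInfo]])

  final class ToyBlockTracker(streamIds: Seq[Int]) {
    private val streamIdToUnallocatedBlockQueues =
      mutable.HashMap.empty[Int, mutable.Queue[BlockInfo]]
    private val timeToAllocatedBlocks = mutable.HashMap.empty[Long, AllocatedBlocks]

    def addBlock(info: BlockInfo): Unit = synchronized {
      streamIdToUnallocatedBlockQueues
        .getOrElseUpdate(info.streamId, mutable.Queue.empty[BlockInfo])
        .enqueue(info)
      ()
    }

    // Drain every stream's queue and record the drained blocks under batchTime.
    def allocateBlocksToBatch(batchTime: Long): Unit = synchronized {
      val perStream = streamIds.map { id =>
        val queue = streamIdToUnallocatedBlockQueues
          .getOrElseUpdate(id, mutable.Queue.empty[BlockInfo])
        val drained = queue.toList // snapshot in arrival order
        queue.clear()
        id -> drained
      }.toMap
      timeToAllocatedBlocks(batchTime) = AllocatedBlocks(perStream)
    }

    def blocksOfBatch(batchTime: Long, streamId: Int): Seq[BlockInfo] = synchronized {
      timeToAllocatedBlocks.get(batchTime)
        .flatMap(_.streamIdToBlocks.get(streamId))
        .getOrElse(Nil)
    }
  }
}
```

Because each queue is emptied as it is drained, a block belongs to exactly one batch: whatever arrives after an allocation waits for the next batch time.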
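2.3 Other messages handled by ReceiverTracker
The receive method of ReceiverTrackerEndpoint inside ReceiverTracker defines how each kind of message is handled.
(1) On receiving StartAllReceivers(receivers), ReceiverTracker assigns executors to the receivers and starts each receiver on its assigned executor.
(2) When ReceiverTracker detects that a receiver has exited, it sends a RestartReceiver(receiver) message to ReceiverTrackerEndpoint. On receiving it, the endpoint assigns an executor to the receiver again and restarts it: if the original executor is still healthy, the receiver is restarted there; otherwise a new executor is scheduled.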
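The restart rule in (2) boils down to a small decision: prefer the previous executor, fall back to another one. The sketch below expresses only that decision. The function name `chooseExecutor`, the string executor ids, and the "take the first live executor" fallback are assumptions for illustration; the real scheduling policy is more involved.

```scala
object RestartSchedulingSketch {
  // Hypothetical policy: reuse the previous executor if it is still alive,
  // otherwise fall back to the first live executor (None if there is none).
  def chooseExecutor(previous: Option[String], liveExecutors: Seq[String]): Option[String] =
    previous
      .filter(e => liveExecutors.contains(e))
      .orElse(liveExecutors.headOption)
}
```

For example, `chooseExecutor(Some("exec-1"), Seq("exec-2", "exec-3"))` falls back to `exec-2` because `exec-1` is no longer in the live list.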
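(3) When a Spark Streaming job finishes, JobScheduler calls handleJobCompletion, which eventually leads to a call to cleanupOldBlocksAndBatches. That method sends a CleanupOldBlocks message to the endpoint.
On receiving it, the endpoint routes the message to the Receivers, which then clean up their old Blocks.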
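The routing in (3) can be pictured as a simple fan-out: the tracker forwards the same cleanup message to every receiver, and each receiver drops what is older than the threshold. This is a sketch only; `ToyReceiver`, `ToyTracker`, the `threshTime` field, and the "strictly older than the threshold" rule are assumptions of the example rather than facts taken from the Spark source.

```scala
import scala.collection.mutable

object CleanupRoutingSketch {
  // Stand-in for the cleanup message; the field name is an assumption.
  final case class CleanupOldBlocks(threshTime: Long)

  // Toy receiver that only remembers the timestamps (ms) of the blocks it stored.
  final class ToyReceiver(val streamId: Int) {
    private val blockTimes = mutable.TreeSet.empty[Long]

    def store(blockTime: Long): Unit = {
      blockTimes += blockTime
      ()
    }

    // Drop every block strictly older than the threshold.
    def handle(msg: CleanupOldBlocks): Unit = {
      val old = blockTimes.filter(_ < msg.threshTime).toList
      old.foreach(t => blockTimes.remove(t))
    }

    def remaining: List[Long] = blockTimes.toList
  }

  // Toy tracker endpoint: forwards the cleanup message to every receiver.
  final class ToyTracker(receivers: Seq[ToyReceiver]) {
    def receive(msg: CleanupOldBlocks): Unit = receivers.foreach(_.handle(msg))
  }
}
```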
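(4) The UpdateReceiverRateLimit message
On receiving UpdateReceiverRateLimit, the endpoint routes it to the corresponding receiver. When the receiver gets the message, it calls the BlockGenerator's rate-update method to change the rate at which Blocks are generated.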
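As a rough illustration of the receiver side of this flow, the sketch below forwards a new rate to every block generator. The names `ToyBlockGenerator`, `ToyReceiverSupervisor`, and the receiver-side message `UpdateRateLimit` are stand-ins, and the capping rule (ignore non-positive rates, never exceed a configured maximum) is an assumption of this example.

```scala
object RateLimitSketch {
  // Stand-in for the message that reaches the receiver side.
  final case class UpdateRateLimit(elementsPerSecond: Long)

  // Toy generator: holds the current rate, capped by maxRate when maxRate > 0.
  final class ToyBlockGenerator(maxRate: Long, initialRate: Long) {
    @volatile private var rate: Long = initialRate

    def currentRate: Long = rate

    // Ignore non-positive rates; otherwise apply the new rate, respecting the cap.
    def updateRate(newRate: Long): Unit = {
      if (newRate > 0) {
        rate = if (maxRate > 0) math.min(newRate, maxRate) else newRate
      }
    }
  }

  // Toy supervisor: applies the new rate to every generator it manages.
  final class ToyReceiverSupervisor(generators: Seq[ToyBlockGenerator]) {
    def receive(msg: UpdateRateLimit): Unit =
      generators.foreach(_.updateRate(msg.elementsPerSecond))
  }
}
```

With a generator created as `new ToyBlockGenerator(100L, 50L)`, a request for 80 elements per second is applied as is, while a request for 500 is clamped to the configured maximum of 100.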