Spark Streaming Source Code Walkthrough: A Thorough Study of the Full Lifecycle of Continuous Stream Data Reception

Spark Streaming applications have the following characteristics:

1. They receive data continuously.

2. The Receiver and the Driver are not on the same node.

A Spark Streaming application receives data, stores it, and reports the metadata of that data to the Driver. The reception pattern resembles MVC: the Driver is the Model, the Receiver is the View, and ReceiverSupervisorImpl is the Controller. ReceiverSupervisorImpl controls the startup of the Receiver, and the data a Receiver receives is handed to ReceiverSupervisorImpl for storage. The elements of an RDD must be serializable before the RDD can be serialized and sent to the Executor side, which is why Receiver implements the Serializable interface.


Code snippet from ReceiverTracker:

// Create the RDD using the scheduledLocations to run the receiver in a Spark job
val receiverRDD: RDD[Receiver[_]] =
  if (scheduledLocations.isEmpty) {
    ssc.sc.makeRDD(Seq(receiver), 1)
  } else {
    val preferredLocations = scheduledLocations.map(_.toString).distinct
    ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
  }

Code snippet from Receiver:

@DeveloperApi
abstract class Receiver[T](val storageLevel: StorageLevel) extends Serializable {

ReceiverSupervisorImpl handles the data received by the Receiver: it stores the data and reports it to the Driver. The Receiver itself receives data one record at a time.

/**
 * Concrete implementation of [[org.apache.spark.streaming.receiver.ReceiverSupervisor]]
 * which provides all the necessary functionality for handling the data received by
 * the receiver. Specifically, it creates a [[org.apache.spark.streaming.receiver.BlockGenerator]]
 * object that is used to divide the received data stream into blocks of data.
 */
private[streaming] class ReceiverSupervisorImpl(
    receiver: Receiver[_],
    env: SparkEnv,
    hadoopConf: Configuration,
    checkpointDirOption: Option[String]
  ) extends ReceiverSupervisor(receiver, env.conf) with Logging {

Reception is throttled by limiting the rate at which data is stored: records are merged into a buffer, and the buffer is then placed on the block queue. When ReceiverSupervisorImpl starts, it calls the start method of every registered BlockGenerator.

override protected def onStart() {
  registeredBlockGenerators.foreach { _.start() }
}
...
private val registeredBlockGenerators = new mutable.ArrayBuffer[BlockGenerator]
  with mutable.SynchronizedBuffer[BlockGenerator]

The source comments explain that BlockGenerator merges the data received by a Receiver into Blocks and then writes them to the BlockManager. The class runs two threads internally: one periodically starts a new batch of objects and wraps the previous batch into a Block, and the other writes the Blocks to the BlockManager.

private val defaultBlockGenerator = createBlockGenerator(defaultBlockGeneratorListener)

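BlockGenerator extends RateLimiter. This reflects the fact that we cannot limit the rate at which data arrives, but we can limit the rate at which it is stored, which in turn limits the rate at which data flows through the system.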
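The idea of throttling storage rather than arrival can be sketched without Spark. The following is a minimal illustration, not Spark's RateLimiter: `StorePacer` and `delayFor` are hypothetical names, and the clock is passed in explicitly so the behaviour is easy to follow. It spaces stored records at least 1000 / maxPerSecond milliseconds apart and tells the caller how long to wait.

```scala
object RateLimitSketch {
  // Hypothetical pacer (not Spark's RateLimiter): spaces stored records
  // at least 1000 / maxPerSecond milliseconds apart.
  // Integer division is used, so rates above 1000 per second are not throttled here.
  final class StorePacer(maxPerSecond: Int) {
    require(maxPerSecond > 0, "rate must be positive")
    private val spacingMs: Long = 1000L / maxPerSecond
    private var nextFreeMs: Long = 0L

    // Returns how long (in ms) the caller should wait before storing a record
    // that arrived at nowMs, and reserves the corresponding time slot.
    def delayFor(nowMs: Long): Long = synchronized {
      val startMs = math.max(nowMs, nextFreeMs)
      nextFreeMs = startMs + spacingMs
      startMs - nowMs
    }
  }

  def main(args: Array[String]): Unit = {
    val pacer = new StorePacer(maxPerSecond = 10) // one record per 100 ms
    // Three records arriving at the same instant are spread out in time.
    println(Seq(0L, 0L, 0L).map(pacer.delayFor))
  }
}
```

With a limit of 10 records per second, three records that arrive at time 0 get delays of 0, 100 and 200 ms. Arrival is not slowed down, only the moment at which each record is allowed to be stored.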

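BlockGenerator has a timer, which by default merges the received data into a block every 200 ms, and a thread, which writes blocks to the BlockManager. One Block is produced every 200 ms, so one second of data yields 5 partitions. If the interval is too small, each block holds too little data, each Task processes very little, and performance suffers. Practical experience suggests not going below 50 ms.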
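The arithmetic behind the "5 partitions per second" figure can be written down directly. This is a small sketch with a hypothetical helper `blocksPerBatch`; it assumes the batch interval is a multiple of the block interval and that the receiver gets data in every interval, since an empty buffer produces no block. In a real application the block interval is set through the `spark.streaming.blockInterval` property.

```scala
object BlockIntervalSketch {
  // Number of blocks (and therefore partitions and tasks) that one receiver
  // contributes to a single batch, under the assumptions stated above.
  def blocksPerBatch(batchIntervalMs: Long, blockIntervalMs: Long): Long = {
    require(blockIntervalMs > 0, "block interval must be positive")
    batchIntervalMs / blockIntervalMs
  }

  def main(args: Array[String]): Unit = {
    println(blocksPerBatch(1000L, 200L)) // 5, the figure quoted in the text
    println(blocksPerBatch(1000L, 50L))  // 20
  }
}
```

Shrinking the block interval increases the number of tasks per batch while each task gets less data, which is the trade-off described above.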

Code snippet from BlockGenerator:

private val blockIntervalTimer =
  new RecurringTimer(clock, blockIntervalMs, updateCurrentBuffer, "BlockGenerator")
...
private val blockPushingThread = new Thread() { override def run() { keepPushingBlocks() } }

So how is a BlockGenerator created?

private val defaultBlockGenerator = createBlockGenerator(defaultBlockGeneratorListener)
...
override def createBlockGenerator(
    blockGeneratorListener: BlockGeneratorListener): BlockGenerator = {
  // Cleanup BlockGenerators that have already been stopped
  registeredBlockGenerators --= registeredBlockGenerators.filter{ _.isStopped() }

  val newBlockGenerator = new BlockGenerator(blockGeneratorListener, streamId, env.conf)
  registeredBlockGenerators += newBlockGenerator
  newBlockGenerator
}

The timer in BlockGenerator calls back the updateCurrentBuffer method.

The Receiver receives data continuously. On each timer tick, BlockGenerator merges the records the Receiver has accumulated into a Block and puts that Block on the block queue.

/** Change the buffer to which single records are added to. */
private def updateCurrentBuffer(time: Long): Unit = {
  try {
    var newBlock: Block = null
    // currentBuffer is accessed from several threads, so it must be locked
    synchronized {
      // If the buffer is not empty, create a StreamBlockId,
      // call the listener's onGenerateBlock to signal that a Block has been generated,
      // and then instantiate the Block object.
      if (currentBuffer.nonEmpty) {
        val newBlockBuffer = currentBuffer
        currentBuffer = new ArrayBuffer[Any]
        val blockId = StreamBlockId(receiverId, time - blockIntervalMs)
        listener.onGenerateBlock(blockId)
        newBlock = new Block(blockId, newBlockBuffer)
      }
    }

    // Finally, put the Block object on the queue of blocks waiting to be pushed
    if (newBlock != null) {
      blocksForPushing.put(newBlock)  // put is blocking when queue is full
    }
  } catch {
    case ie: InterruptedException =>
      logInfo("Block updating timer thread was interrupted")
    case e: Exception =>
      reportError("Error in block updating thread", e)
  }
}

This function is called back every 200 ms by default. The interval is configurable, but in practice it should not be set below 50 ms.
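The buffer swap in updateCurrentBuffer can be reduced to a few lines. Below is a simplified sketch, not Spark's class: `MiniBlock` and `MiniBlockGenerator` are hypothetical names, the timer is replaced by direct calls, and the block id is simply the tick time. It keeps the two essential points: the swap happens under the lock, and the possibly blocking `put` happens outside it.

```scala
import java.util.concurrent.ArrayBlockingQueue
import scala.collection.mutable.ArrayBuffer

object BufferSwapSketch {
  // Hypothetical simplified block: an id plus the records collected in one interval.
  final case class MiniBlock(id: Long, records: Seq[Any])

  final class MiniBlockGenerator(queueSize: Int) {
    private var currentBuffer = new ArrayBuffer[Any]
    val blocksForPushing = new ArrayBlockingQueue[MiniBlock](queueSize)

    // Called by the receiving side for every record.
    def addData(record: Any): Unit = synchronized {
      currentBuffer += record
    }

    // Called on each timer tick: swap in a fresh buffer under the lock,
    // then enqueue the filled one outside the lock because put() may block.
    def updateCurrentBuffer(time: Long): Unit = {
      var newBlock: Option[MiniBlock] = None
      synchronized {
        if (currentBuffer.nonEmpty) {
          val filled = currentBuffer
          currentBuffer = new ArrayBuffer[Any]
          newBlock = Some(MiniBlock(time, filled.toList))
        }
      }
      newBlock.foreach(b => blocksForPushing.put(b))
    }
  }
}
```

A tick with an empty buffer enqueues nothing, so an idle receiver produces no blocks. Records added after the swap go into the new buffer and end up in the next block.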

ReceiverSupervisorImpl, which runs on the Executor side, needs to communicate with ReceiverTracker on the Driver side in order to pass metadata. ReceiverSupervisorImpl obtains a remote reference to ReceiverTracker by looking it up under its RPC name.

Code snippet from ReceiverSupervisorImpl:

/** Remote RpcEndpointRef for the ReceiverTracker */
private val trackerEndpoint = RpcUtils.makeDriverRef("ReceiverTracker", env.conf, env.rpcEnv)

When ReceiverTracker's start method is called, it creates an RPC endpoint registered under the name "ReceiverTracker". This is the endpoint that ReceiverSupervisorImpl exchanges messages with.

/** Start the endpoint and receiver execution thread. */
def start(): Unit = synchronized {
  if (isTrackerStarted) {
    throw new SparkException("ReceiverTracker already started")
  }

  if (!receiverInputStreams.isEmpty) {
    endpoint = ssc.env.rpcEnv.setupEndpoint(
      "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
    if (!skipReceiverLaunch) launchReceivers()
    logInfo("ReceiverTracker started")
    trackerState = Started
  }
}

When ReceiverTrackerEndpoint receives the registration message sent by ReceiverSupervisorImpl, it saves the RPC endpoint reference carried in that message.

override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  // Remote messages
  case RegisterReceiver(streamId, typ, host, executorId, receiverEndpoint) =>
    val successful =
      registerReceiver(streamId, typ, host, executorId, receiverEndpoint, context.senderAddress)
    context.reply(successful)
  case AddBlock(receivedBlockInfo) =>
    if (WriteAheadLogUtils.isBatchingEnabled(ssc.conf, isDriver = true)) {
      walBatchingThreadPool.execute(new Runnable {
        override def run(): Unit = Utils.tryLogNonFatalError {
          if (active) {
            context.reply(addBlock(receivedBlockInfo))
          } else {
            throw new IllegalStateException("ReceiverTracker RpcEndpoint shut down.")
          }
        }
      })
    } else {
      context.reply(addBlock(receivedBlockInfo))
    }
  case DeregisterReceiver(streamId, message, error) =>
    deregisterReceiver(streamId, message, error)
    context.reply(true)
  // Local messages
  case AllReceiverIds =>
    context.reply(receiverTrackingInfos.filter(_._2.state != ReceiverState.INACTIVE).keys.toSeq)
  case StopAllReceivers =>
    assert(isTrackerStopping || isTrackerStopped)
    stopReceivers()
    context.reply(true)
}
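The message handling above follows a common Scala pattern: messages are case classes and the handler is a pattern match that produces a reply. The sketch below illustrates only that pattern. The messages are cut-down, hypothetical versions of the real ones, `MiniTracker` is not a Spark class, and the acceptance rules (rejecting duplicate registrations and blocks from unregistered streams) are invented for the example.

```scala
import scala.collection.mutable

object TrackerMessagesSketch {
  // Hypothetical, simplified messages; Spark's real messages carry more fields.
  sealed trait TrackerMessage
  final case class RegisterReceiver(streamId: Int, host: String) extends TrackerMessage
  final case class AddBlock(streamId: Int, numRecords: Long) extends TrackerMessage
  final case class DeregisterReceiver(streamId: Int) extends TrackerMessage

  final class MiniTracker {
    private val hosts = mutable.Map.empty[Int, String]
    private val recordCounts = mutable.Map.empty[Int, Long]

    // Handles one message and returns the reply that would be sent back.
    def receiveAndReply(msg: TrackerMessage): Boolean = synchronized {
      msg match {
        case RegisterReceiver(streamId, host) =>
          // Illustrative rule: a stream id can only be registered once.
          if (hosts.contains(streamId)) false
          else { hosts(streamId) = host; true }
        case AddBlock(streamId, numRecords) =>
          // Illustrative rule: only registered streams may report blocks.
          if (hosts.contains(streamId)) {
            recordCounts(streamId) = recordCounts.getOrElse(streamId, 0L) + numRecords
            true
          } else false
        case DeregisterReceiver(streamId) =>
          hosts.remove(streamId)
          true
      }
    }

    def recordsFor(streamId: Int): Long =
      synchronized(recordCounts.getOrElse(streamId, 0L))
  }
}
```

Because the trait is sealed, the compiler can warn when a message type is left unhandled, which is one reason this style suits a fixed protocol between two components.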

Correspondingly, ReceiverSupervisorImpl on the Executor side also creates an RPC endpoint, which receives messages from ReceiverTracker on the Driver side.

/** RpcEndpointRef for receiving messages from the ReceiverTracker in the driver */
private val endpoint = env.rpcEnv.setupEndpoint(
  "Receiver-" + streamId + "-" + System.currentTimeMillis(), new ThreadSafeRpcEndpoint {
    override val rpcEnv: RpcEnv = env.rpcEnv

    override def receive: PartialFunction[Any, Unit] = {
      case StopReceiver =>
        logInfo("Received stop signal")
        ReceiverSupervisorImpl.this.stop("Stopped by driver", None)
      case CleanupOldBlocks(threshTime) =>
        logDebug("Received delete old batch signal")
        cleanupOldBlocks(threshTime)
      case UpdateRateLimit(eps) =>
        logInfo(s"Received a new rate limit: $eps.")
        registeredBlockGenerators.foreach { bg =>
          bg.updateRate(eps)
        }
    }
  })

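The pushing thread in BlockGenerator repeatedly polls the queue for a Block, waiting at most 10 ms per attempt, and writes each Block it obtains to the BlockManager.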

/** Keep pushing blocks to the BlockManager. */
private def keepPushingBlocks() {
  logInfo("Started block pushing thread")

  def areBlocksBeingGenerated: Boolean = synchronized {
    state != StoppedGeneratingBlocks
  }

  try {
    // While blocks are being generated, keep polling for to-be-pushed blocks and push them.
    while (areBlocksBeingGenerated) {
      Option(blocksForPushing.poll(10, TimeUnit.MILLISECONDS)) match {
        case Some(block) => pushBlock(block)
        case None =>
      }
    }

    // At this point, state is StoppedGeneratingBlock. So drain the queue of to-be-pushed blocks.
    logInfo("Pushing out the last " + blocksForPushing.size() + " blocks")
    while (!blocksForPushing.isEmpty) {
      val block = blocksForPushing.take()
      logDebug(s"Pushing block $block")
      pushBlock(block)
      logInfo("Blocks left to push " + blocksForPushing.size())
    }
    logInfo("Stopped block pushing thread")
  } catch {
    case ie: InterruptedException =>
      logInfo("Block pushing thread was interrupted")
    case e: Exception =>
      reportError("Error in block pushing thread", e)
  }
}
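The two phases of keepPushingBlocks, polling with a timeout while blocks are still being generated and then draining the queue after generation stops, can be tried out in isolation. This is a simplified sketch with hypothetical names (`MiniPusher`, `pushed`): blocks are plain strings, a thread-safe list stands in for the BlockManager, and logging and interrupt handling are left out.

```scala
import java.util.concurrent.{ArrayBlockingQueue, CopyOnWriteArrayList, TimeUnit}

object PushLoopSketch {
  final class MiniPusher(queueSize: Int) {
    val blocksForPushing = new ArrayBlockingQueue[String](queueSize)
    val pushed = new CopyOnWriteArrayList[String]() // stands in for the BlockManager
    @volatile private var generating = true

    private val thread = new Thread(new Runnable {
      override def run(): Unit = keepPushingBlocks()
    })

    private def keepPushingBlocks(): Unit = {
      // Phase 1: while blocks are still being generated, wait up to 10 ms per poll
      // so the loop can notice the stop flag even when the queue is empty.
      while (generating) {
        Option(blocksForPushing.poll(10, TimeUnit.MILLISECONDS)).foreach(b => pushed.add(b))
      }
      // Phase 2: generation has stopped, so drain whatever is still queued.
      while (!blocksForPushing.isEmpty) {
        pushed.add(blocksForPushing.take())
      }
    }

    def start(): Unit = thread.start()

    // Signal that no more blocks will arrive and wait for the drain to finish.
    def stop(): Unit = {
      generating = false
      thread.join()
    }
  }
}
```

The timed poll is what lets the thread shut down cleanly: a plain `take()` in phase 1 would block forever on an empty queue and never re-check the flag. The drain phase ensures that blocks queued before the stop signal are not lost.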

Code snippet from ReceiverSupervisorImpl:

/** Divides received data records into data blocks for pushing in BlockManager. */
private val defaultBlockGeneratorListener = new BlockGeneratorListener {
  def onAddData(data: Any, metadata: Any): Unit = { }

  def onGenerateBlock(blockId: StreamBlockId): Unit = { }

  def onError(message: String, throwable: Throwable) {
    reportError(message, throwable)
  }

  def onPushBlock(blockId: StreamBlockId, arrayBuffer: ArrayBuffer[_]) {
    pushArrayBuffer(arrayBuffer, None, Some(blockId))
  }
}
...
/** Store block and report it to driver */
def pushAndReportBlock(
    receivedBlock: ReceivedBlock,
    metadataOption: Option[Any],
    blockIdOption: Option[StreamBlockId]
  ) {
  val blockId = blockIdOption.getOrElse(nextBlockId)
  val time = System.currentTimeMillis
  val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
  logDebug(s"Pushed block $blockId in ${(System.currentTimeMillis - time)} ms")
  val numRecords = blockStoreResult.numRecords
  val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
  trackerEndpoint.askWithRetry[Boolean](AddBlock(blockInfo))
  logDebug(s"Reported block $blockId")
}

The data is stored in the BlockManager, and its metadata is reported to ReceiverTracker on the Driver side.

def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {
  var numRecords = None: Option[Long]

  val putResult: Seq[(BlockId, BlockStatus)] = block match {
    case ArrayBufferBlock(arrayBuffer) =>
      numRecords = Some(arrayBuffer.size.toLong)
      blockManager.putIterator(blockId, arrayBuffer.iterator, storageLevel,
        tellMaster = true)
    case IteratorBlock(iterator) =>
      val countIterator = new CountingIterator(iterator)
      // Write the data into the BlockManager
      val putResult = blockManager.putIterator(blockId, countIterator, storageLevel,
        tellMaster = true)
      numRecords = countIterator.count
      putResult
    case ByteBufferBlock(byteBuffer) =>
      blockManager.putBytes(blockId, byteBuffer, storageLevel, tellMaster = true)
    case o =>
      throw new SparkException(
        s"Could not store $blockId to block manager, unexpected block type ${o.getClass.getName}")
  }
  if (!putResult.map { _._1 }.contains(blockId)) {
    throw new SparkException(
      s"Could not store $blockId to block manager with storage level $storageLevel")
  }
  BlockManagerBasedStoreResult(blockId, numRecords)
}
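In the IteratorBlock branch the record count is read only after putIterator has run, because an iterator's length is unknown until it has been consumed. The wrapper below is a minimal sketch of that idea under hypothetical names (`CountingIter`, `recordCount`); it is written for illustration and is not a copy of Spark's CountingIterator.

```scala
object CountingIteratorSketch {
  // Wraps an iterator and counts the elements handed out through next().
  final class CountingIter[T](underlying: Iterator[T]) extends Iterator[T] {
    private var seen = 0L

    override def hasNext: Boolean = underlying.hasNext

    override def next(): T = {
      val value = underlying.next()
      seen += 1
      value
    }

    // The count is only reported once the wrapped iterator is exhausted;
    // before that the total number of records is still unknown.
    def recordCount: Option[Long] = if (underlying.hasNext) None else Some(seen)
  }
}
```

Whoever consumes the wrapper, a storage layer in the case above, does the counting as a side effect, so the data never has to be materialised just to learn its size.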


Notes:

Source: DT_大數據夢工廠 (DT Big Data Dream Factory), custom Spark distribution series.

For more content, follow the WeChat public account DT_Spark.

If you are interested in big data and Spark, you can attend the permanently free Spark open class given by Wang Jialin every evening at 20:00, in YY room 68917580.
