A Spark Streaming application has the following characteristics:
1. It receives data continuously.
2. The Receiver and the Driver do not run on the same node.
A Spark Streaming application receives data, stores it, and reports the data's metadata to the Driver. The data-receiving pattern resembles MVC: the Driver is the Model, the Receiver is the View, and ReceiverSupervisorImpl is the Controller. ReceiverSupervisorImpl controls the startup of the Receiver, and the Receiver hands the data it receives to ReceiverSupervisorImpl for storage. The elements of an RDD must be serializable so that the RDD can be shipped to the Executor, which is why Receiver implements the Serializable interface.
ReceiverTracker code snippet:
// Create the RDD using the scheduledLocations to run the receiver in a Spark job
val receiverRDD: RDD[Receiver[_]] =
  if (scheduledLocations.isEmpty) {
    ssc.sc.makeRDD(Seq(receiver), 1)
  } else {
    val preferredLocations = scheduledLocations.map(_.toString).distinct
    ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
  }
Receiver code snippet:
@DeveloperApi
abstract class Receiver[T](val storageLevel: StorageLevel) extends Serializable {
ReceiverSupervisorImpl processes the data that the Receiver receives, stores it, and reports it to the Driver. The Receiver receives data one record at a time.
/**
 * Concrete implementation of [[org.apache.spark.streaming.receiver.ReceiverSupervisor]]
 * which provides all the necessary functionality for handling the data received by
 * the receiver. Specifically, it creates a [[org.apache.spark.streaming.receiver.BlockGenerator]]
 * object that is used to divide the received data stream into blocks of data.
 */
private[streaming] class ReceiverSupervisorImpl(
    receiver: Receiver[_],
    env: SparkEnv,
    hadoopConf: Configuration,
    checkpointDirOption: Option[String]
  ) extends ReceiverSupervisor(receiver, env.conf) with Logging {
Rate limiting on the receive side is implemented by limiting the speed at which data is stored. Records are merged into a buffer, and the resulting blocks are placed on the block queue. When ReceiverSupervisorImpl starts, it calls the start method of every registered BlockGenerator.
override protected def onStart() {
  registeredBlockGenerators.foreach { _.start() }
...
private val registeredBlockGenerators = new mutable.ArrayBuffer[BlockGenerator]
  with mutable.SynchronizedBuffer[BlockGenerator]
The source comments explain that BlockGenerator merges the data received by one Receiver into Blocks and then writes them to the BlockManager. The class runs two threads internally: one periodically starts a new batch of data and wraps the previous batch into a Block, and the other writes the Blocks to the BlockManager.
private val defaultBlockGenerator = createBlockGenerator(defaultBlockGeneratorListener)
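BlockGenerator extends RateLimiter. This reflects the fact that we cannot limit the speed at which data arrives, but we can limit the speed at which it is stored, which in turn limits the rate at which data flows through the pipeline.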
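The idea of throttling the store path rather than the network can be sketched in a few lines. This is only an illustration under simplifying assumptions, not Spark's RateLimiter: the names SimpleRateLimiter, ThrottledBuffer, reserve and acquire are hypothetical, and permits are simply spaced evenly in time.

```scala
import java.util.concurrent.TimeUnit
import scala.collection.mutable.ArrayBuffer

object StoreRateLimitSketch {

  /** Hypothetical limiter that spaces permits evenly at `permitsPerSecond`.
    * The clock is injectable so the behavior can be checked without sleeping. */
  final class SimpleRateLimiter(
      permitsPerSecond: Double,
      nowNanos: () => Long = () => System.nanoTime()) {
    require(permitsPerSecond > 0, "rate must be positive")

    private val intervalNanos: Long = (1e9 / permitsPerSecond).toLong
    private var nextFreeNanos: Long = nowNanos()

    /** Reserves the next permit and returns how long the caller has to wait (ns). */
    def reserve(): Long = synchronized {
      val now = nowNanos()
      val grantedAt = math.max(now, nextFreeNanos)
      nextFreeNanos = grantedAt + intervalNanos
      grantedAt - now
    }

    /** Blocks the calling thread until its permit becomes available. */
    def acquire(): Unit = {
      val waitNanos = reserve()
      if (waitNanos > 0) TimeUnit.NANOSECONDS.sleep(waitNanos)
    }
  }

  /** Hypothetical stand-in for the store path: throttle first, then buffer. */
  final class ThrottledBuffer(limiter: SimpleRateLimiter) {
    private val buffer = ArrayBuffer.empty[Any]

    def addData(record: Any): Unit = {
      limiter.acquire()                 // the receiving thread is slowed down here
      synchronized { buffer += record }
    }

    def size: Int = synchronized(buffer.size)
  }
}
```

Because the thread that receives a record is the one that blocks in addData, slowing down storage indirectly slows down how fast the receiver can pull from its source.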
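BlockGenerator has a timer (which by default merges the received data into a Block every 200 ms) and a thread (which writes Blocks to the BlockManager). With one Block every 200 ms, up to 5 Blocks are generated per second, and each Block becomes one Partition of the batch's RDD. If the interval is too small, each Block holds too little data, so each Task processes very little and performance suffers. Practical experience suggests not going below 50 ms.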
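The arithmetic behind the "5 per second" figure can be written down directly. The helper below is a hypothetical illustration (blocksPerBatch is not a Spark API), and it gives an upper bound per receiver, since an empty buffer produces no Block.

```scala
object BlockIntervalMath {

  /** Upper bound on the number of Blocks (and hence Partitions) one receiver
    * contributes to a batch, given the batch and block intervals in ms. */
  def blocksPerBatch(batchIntervalMs: Long, blockIntervalMs: Long): Long = {
    require(blockIntervalMs > 0, "block interval must be positive")
    require(batchIntervalMs >= 0, "batch interval must not be negative")
    batchIntervalMs / blockIntervalMs
  }
}
```

For example, a 1000 ms batch with the default 200 ms block interval yields at most 5 Blocks per receiver, while lowering the block interval to 50 ms raises that to 20 smaller Blocks per second. The block interval is set through the spark.streaming.blockInterval configuration property.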
BlockGenerator code snippet:
private val blockIntervalTimer =
  new RecurringTimer(clock, blockIntervalMs, updateCurrentBuffer, "BlockGenerator")
...
private val blockPushingThread = new Thread() { override def run() { keepPushingBlocks() } }
So how is a BlockGenerator created?
private val defaultBlockGenerator = createBlockGenerator(defaultBlockGeneratorListener)
...
override def createBlockGenerator(
    blockGeneratorListener: BlockGeneratorListener): BlockGenerator = {
  // Cleanup BlockGenerators that have already been stopped
  registeredBlockGenerators --= registeredBlockGenerators.filter{ _.isStopped() }
  val newBlockGenerator = new BlockGenerator(blockGeneratorListener, streamId, env.conf)
  registeredBlockGenerators += newBlockGenerator
  newBlockGenerator
}
The timer in BlockGenerator calls back the updateCurrentBuffer method.
While the Receiver keeps receiving data, BlockGenerator uses this timer to merge the records received so far into a Block and then puts that Block on the block queue.
/** Change the buffer to which single records are added to. */
private def updateCurrentBuffer(time: Long): Unit = {
  try {
    var newBlock: Block = null
    // currentBuffer is accessed from several threads, so it must be locked
    synchronized {
      // If the buffer is not empty, create a StreamBlockId,
      // call the listener's onGenerateBlock to signal that a Block was generated,
      // and then instantiate the Block object.
      if (currentBuffer.nonEmpty) {
        val newBlockBuffer = currentBuffer
        currentBuffer = new ArrayBuffer[Any]
        val blockId = StreamBlockId(receiverId, time - blockIntervalMs)
        listener.onGenerateBlock(blockId)
        newBlock = new Block(blockId, newBlockBuffer)
      }
    }
    // Finally, put the Block object on the blocksForPushing queue
    if (newBlock != null) {
      blocksForPushing.put(newBlock)  // put is blocking when queue is full
    }
  } catch {
    case ie: InterruptedException =>
      logInfo("Block updating timer thread was interrupted")
    case e: Exception =>
      reportError("Error in block updating thread", e)
  }
}
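This function is called back every 200 ms by default. The interval is configurable through spark.streaming.blockInterval, but it should not be set below roughly 50 ms. That figure is a recommended minimum rather than a hard limit.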
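The buffer swap performed by the timer callback can be tried out without Spark. The sketch below is a simplified model under stated assumptions: MiniBlockGenerator and its Block type are hypothetical, the block id is reduced to a Long, there is no listener, and the caller invokes updateCurrentBuffer by hand instead of a timer doing so.

```scala
import java.util.concurrent.ArrayBlockingQueue
import scala.collection.mutable.ArrayBuffer

object BufferSwapSketch {

  /** Hypothetical, simplified block: an id derived from the timer tick plus its records. */
  final case class Block(id: Long, records: ArrayBuffer[Any])

  final class MiniBlockGenerator(blockIntervalMs: Long, queueSize: Int = 10) {
    private var currentBuffer = new ArrayBuffer[Any]
    val blocksForPushing = new ArrayBlockingQueue[Block](queueSize)

    /** Called by the receiving thread for every record. */
    def addData(record: Any): Unit = synchronized { currentBuffer += record }

    /** Called on every timer tick: swap in a fresh buffer and enqueue the old one. */
    def updateCurrentBuffer(time: Long): Unit = {
      // The lock covers only the swap, so addData is blocked as briefly as possible.
      val newBlock: Option[Block] = synchronized {
        if (currentBuffer.nonEmpty) {
          val filled = currentBuffer
          currentBuffer = new ArrayBuffer[Any]
          Some(Block(time - blockIntervalMs, filled))
        } else {
          None // an empty interval produces no block
        }
      }
      // put blocks when the queue is full, which applies back-pressure to the timer.
      newBlock.foreach(b => blocksForPushing.put(b))
    }
  }
}
```

Swapping the buffer reference instead of copying its contents keeps the critical section short, and enqueueing outside the lock means a full queue stalls only the timer thread, not the thread adding records.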
The ReceiverSupervisorImpl running on the Executor needs to communicate with the ReceiverTracker on the Driver in order to pass metadata. ReceiverSupervisorImpl obtains a remote reference to the ReceiverTracker by its RPC endpoint name.
ReceiverSupervisorImpl code snippet:
/** Remote RpcEndpointRef for the ReceiverTracker */
private val trackerEndpoint = RpcUtils.makeDriverRef("ReceiverTracker", env.conf, env.rpcEnv)
When ReceiverTracker's start method is called, it creates an RPC endpoint registered under the name ReceiverTracker. This is the endpoint that ReceiverSupervisorImpl exchanges messages with.
/** Start the endpoint and receiver execution thread. */
def start(): Unit = synchronized {
  if (isTrackerStarted) {
    throw new SparkException("ReceiverTracker already started")
  }
  if (!receiverInputStreams.isEmpty) {
    endpoint = ssc.env.rpcEnv.setupEndpoint(
      "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
    if (!skipReceiverLaunch) launchReceivers()
    logInfo("ReceiverTracker started")
    trackerState = Started
  }
}
When ReceiverTrackerEndpoint receives the registration message sent by ReceiverSupervisorImpl, it saves the sender's RpcEndpointRef.
override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  // Remote messages
  case RegisterReceiver(streamId, typ, host, executorId, receiverEndpoint) =>
    val successful =
      registerReceiver(streamId, typ, host, executorId, receiverEndpoint, context.senderAddress)
    context.reply(successful)
  case AddBlock(receivedBlockInfo) =>
    if (WriteAheadLogUtils.isBatchingEnabled(ssc.conf, isDriver = true)) {
      walBatchingThreadPool.execute(new Runnable {
        override def run(): Unit = Utils.tryLogNonFatalError {
          if (active) {
            context.reply(addBlock(receivedBlockInfo))
          } else {
            throw new IllegalStateException("ReceiverTracker RpcEndpoint shut down.")
          }
        }
      })
    } else {
      context.reply(addBlock(receivedBlockInfo))
    }
  case DeregisterReceiver(streamId, message, error) =>
    deregisterReceiver(streamId, message, error)
    context.reply(true)
  // Local messages
  case AllReceiverIds =>
    context.reply(receiverTrackingInfos.filter(_._2.state != ReceiverState.INACTIVE).keys.toSeq)
  case StopAllReceivers =>
    assert(isTrackerStopping || isTrackerStopped)
    stopReceivers()
    context.reply(true)
}
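The request/reply shape of this handler, a pattern match over message types with a Boolean answer, can be modelled in isolation. The sketch below is a toy under loud assumptions: MiniTracker, its message fields, and its acceptance rules (reject duplicate registrations, reject blocks from unregistered streams) are hypothetical simplifications and do not reproduce the real ReceiverTracker logic or Spark's RPC layer.

```scala
import scala.collection.mutable

object TrackerProtocolSketch {

  // Hypothetical, reduced message set; the real messages carry more fields.
  sealed trait TrackerMessage
  final case class RegisterReceiver(streamId: Int, host: String) extends TrackerMessage
  final case class AddBlock(streamId: Int, blockId: String, numRecords: Long) extends TrackerMessage
  final case class DeregisterReceiver(streamId: Int, reason: String) extends TrackerMessage

  final class MiniTracker {
    private val receivers = mutable.Map.empty[Int, String]      // streamId -> host
    private val blocks = mutable.Map.empty[Int, Vector[String]] // streamId -> block ids

    /** Handles one message and returns the Boolean that would be sent as the reply. */
    def receiveAndReply(msg: TrackerMessage): Boolean = synchronized {
      msg match {
        case RegisterReceiver(id, host) =>
          if (receivers.contains(id)) {
            false // assumption: a stream id can be registered only once
          } else {
            receivers(id) = host
            true
          }
        case AddBlock(id, blockId, _) =>
          if (receivers.contains(id)) {
            blocks(id) = blocks.getOrElse(id, Vector.empty) :+ blockId
            true
          } else {
            false // assumption: blocks from unknown streams are rejected
          }
        case DeregisterReceiver(id, _) =>
          receivers.remove(id)
          true
      }
    }

    /** Block ids recorded so far for one stream, in arrival order. */
    def blocksOf(streamId: Int): Vector[String] =
      synchronized(blocks.getOrElse(streamId, Vector.empty))
  }
}
```

Making the message type a sealed trait lets the compiler warn when a case is missing, which is a convenience the PartialFunction[Any, Unit] signature in the excerpt does not offer.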
Correspondingly, the ReceiverSupervisorImpl on the Executor also creates an RPC endpoint to receive messages from the ReceiverTracker on the Driver.
/** RpcEndpointRef for receiving messages from the ReceiverTracker in the driver */
private val endpoint = env.rpcEnv.setupEndpoint(
  "Receiver-" + streamId + "-" + System.currentTimeMillis(), new ThreadSafeRpcEndpoint {
    override val rpcEnv: RpcEnv = env.rpcEnv
    override def receive: PartialFunction[Any, Unit] = {
      case StopReceiver =>
        logInfo("Received stop signal")
        ReceiverSupervisorImpl.this.stop("Stopped by driver", None)
      case CleanupOldBlocks(threshTime) =>
        logDebug("Received delete old batch signal")
        cleanupOldBlocks(threshTime)
      case UpdateRateLimit(eps) =>
        logInfo(s"Received a new rate limit: $eps.")
        registeredBlockGenerators.foreach { bg =>
          bg.updateRate(eps)
        }
    }
  })
The pushing thread in BlockGenerator polls the queue with a 10 ms timeout: it takes a Block as soon as one is available, waits at most 10 ms otherwise, and writes each Block to the BlockManager.
/** Keep pushing blocks to the BlockManager. */
private def keepPushingBlocks() {
  logInfo("Started block pushing thread")
  def areBlocksBeingGenerated: Boolean = synchronized {
    state != StoppedGeneratingBlocks
  }
  try {
    // While blocks are being generated, keep polling for to-be-pushed blocks and push them.
    while (areBlocksBeingGenerated) {
      Option(blocksForPushing.poll(10, TimeUnit.MILLISECONDS)) match {
        case Some(block) => pushBlock(block)
        case None =>
      }
    }
    // At this point, state is StoppedGeneratingBlock. So drain the queue of to-be-pushed blocks.
    logInfo("Pushing out the last " + blocksForPushing.size() + " blocks")
    while (!blocksForPushing.isEmpty) {
      val block = blocksForPushing.take()
      logDebug(s"Pushing block $block")
      pushBlock(block)
      logInfo("Blocks left to push " + blocksForPushing.size())
    }
    logInfo("Stopped block pushing thread")
  } catch {
    case ie: InterruptedException =>
      logInfo("Block pushing thread was interrupted")
    case e: Exception =>
      reportError("Error in block pushing thread", e)
  }
}
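The poll-then-drain structure of this loop is what guarantees that no queued Block is lost on shutdown. A minimal stand-alone sketch follows. BlockPusher, its Block type and stopAndDrain are hypothetical names, and a plain volatile flag replaces the generator's state machine.

```scala
import java.util.concurrent.{ArrayBlockingQueue, TimeUnit}

object BlockPusherSketch {

  /** Hypothetical block carrying only an id. */
  final case class Block(id: Long)

  final class BlockPusher(queue: ArrayBlockingQueue[Block], push: Block => Unit) {
    @volatile private var generating = true

    private val thread = new Thread(new Runnable {
      override def run(): Unit = keepPushing()
    }, "block-pusher-sketch")

    private def keepPushing(): Unit = {
      // Phase 1: while blocks may still arrive, wait at most 10 ms per poll so
      // that a change of the flag is noticed promptly.
      while (generating) {
        Option(queue.poll(10, TimeUnit.MILLISECONDS)).foreach(push)
      }
      // Phase 2: nothing new will be produced, so empty whatever is still queued.
      var next = Option(queue.poll())
      while (next.isDefined) {
        next.foreach(push)
        next = Option(queue.poll())
      }
    }

    def start(): Unit = thread.start()

    /** Signals that no more blocks will be produced and waits for the drain to finish. */
    def stopAndDrain(): Unit = {
      generating = false
      thread.join()
    }
  }
}
```

The timeout on poll is not a period: a Block that is already queued is returned immediately, and the 10 ms only bounds how long the thread sleeps before re-checking whether it should stop.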
ReceiverSupervisorImpl code snippet:
/** Divides received data records into data blocks for pushing in BlockManager. */
private val defaultBlockGeneratorListener = new BlockGeneratorListener {
  def onAddData(data: Any, metadata: Any): Unit = { }
  def onGenerateBlock(blockId: StreamBlockId): Unit = { }
  def onError(message: String, throwable: Throwable) {
    reportError(message, throwable)
  }
  def onPushBlock(blockId: StreamBlockId, arrayBuffer: ArrayBuffer[_]) {
    pushArrayBuffer(arrayBuffer, None, Some(blockId))
  }
}
...
/** Store block and report it to driver */
def pushAndReportBlock(
    receivedBlock: ReceivedBlock,
    metadataOption: Option[Any],
    blockIdOption: Option[StreamBlockId]
  ) {
  val blockId = blockIdOption.getOrElse(nextBlockId)
  val time = System.currentTimeMillis
  val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
  logDebug(s"Pushed block $blockId in ${(System.currentTimeMillis - time)} ms")
  val numRecords = blockStoreResult.numRecords
  val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
  trackerEndpoint.askWithRetry[Boolean](AddBlock(blockInfo))
  logDebug(s"Reported block $blockId")
}
The data is stored in the BlockManager, and its metadata is reported to the ReceiverTracker on the Driver. The storage step is handled by storeBlock:
def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {
  var numRecords = None: Option[Long]
  val putResult: Seq[(BlockId, BlockStatus)] = block match {
    case ArrayBufferBlock(arrayBuffer) =>
      numRecords = Some(arrayBuffer.size.toLong)
      blockManager.putIterator(blockId, arrayBuffer.iterator, storageLevel,
        tellMaster = true)
    case IteratorBlock(iterator) =>
      val countIterator = new CountingIterator(iterator)
      // Write the data into the BlockManager
      val putResult = blockManager.putIterator(blockId, countIterator, storageLevel,
        tellMaster = true)
      numRecords = countIterator.count
      putResult
    case ByteBufferBlock(byteBuffer) =>
      blockManager.putBytes(blockId, byteBuffer, storageLevel, tellMaster = true)
    case o =>
      throw new SparkException(
        s"Could not store $blockId to block manager, unexpected block type ${o.getClass.getName}")
  }
  if (!putResult.map { _._1 }.contains(blockId)) {
    throw new SparkException(
      s"Could not store $blockId to block manager with storage level $storageLevel")
  }
  BlockManagerBasedStoreResult(blockId, numRecords)
}
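In the IteratorBlock branch the record count is read only after the iterator has been handed to the BlockManager, because an iterator's size is unknown until it has been consumed. The wrapper below sketches that idea under stated assumptions: it is a hypothetical re-creation rather than Spark's CountingIterator, and the accessor is named countIfDone to make clear it is not the real method.

```scala
object CountingIteratorSketch {

  /** Wraps an iterator and counts the elements that pass through it. */
  final class CountingIterator[T](underlying: Iterator[T]) extends Iterator[T] {
    private var seen = 0L

    override def hasNext: Boolean = underlying.hasNext

    override def next(): T = {
      val value = underlying.next()
      seen += 1 // counted only after the element was actually produced
      value
    }

    /** The total is trustworthy only once the underlying iterator is exhausted. */
    def countIfDone: Option[Long] =
      if (underlying.hasNext) None else Some(seen)
  }
}
```

Returning an Option forces the caller to handle the case where the consumer stopped early, which fits the Option[Long] type of numRecords in the excerpt.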
Notes:
Source: DT_大数据梦工厂 (Spark发行版本定制, "Spark Release Customization")