韓晶晶 嚴(yán)律 黃春超
Introduction
Spark Streaming is an extension of Spark Core: a real-time stream processing system built on top of it. Compared with other stream processing systems, its biggest advantage is that it sits inside the Spark stack, i.e. the streaming engine and the data processing engine share the same software stack. In Spark Streaming, data is received record by record but processed batch by batch, so its throughput can be 2–5x that of Storm, a popular record-at-a-time streaming engine.
Stream processing in Spark Streaming can roughly be divided into four steps: starting the stream processing engine, receiving and storing stream data, processing the stream data, and outputting the results. The runtime architecture is shown below:
[Figure: Spark Streaming runtime architecture]
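From the application's point of view, these four steps are driven by only a handful of API calls. The following minimal word-count sketch (the application name, host, and port are illustrative) maps each call onto the steps analyzed in the rest of this article:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    // Step 1: create the Driver-side StreamingContext with a 1-second batch interval
    val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Step 2: a socket receiver; the received records are stored as blocks on executors
    val lines = ssc.socketTextStream("localhost", 9999)

    // Step 3: transformations build up the DStreamGraph; a job is generated for every batch
    val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    // Step 4: an output operation; each batch's result is pushed out
    wordCounts.print()

    // start the engine (JobScheduler, ReceiverTracker, JobGenerator) and block until stopped
    ssc.start()
    ssc.awaitTermination()
  }
}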
Step 1: Starting the Stream Processing Engine
StreamingContext is the Spark Streaming context on the Driver side and the entry point of a Spark Streaming program. When this object is constructed, its internal components are initialized; the most important of these are DStreamGraph and JobScheduler.
class StreamingContext private[streaming] (
    _sc: SparkContext,
    _cp: Checkpoint,
    _batchDur: Duration
  ) extends Logging {
  ...
  private[streaming] val conf = sc.conf
  private[streaming] val env = sc.env
  private[streaming] val graph: DStreamGraph = {
    if (isCheckpointPresent) {
      _cp.graph.setContext(this)
      _cp.graph.restoreCheckpointData()
      _cp.graph
    } else {
      require(_batchDur != null, "Batch duration for StreamingContext cannot be null")
      val newGraph = new DStreamGraph()
      newGraph.setBatchDuration(_batchDur)
      newGraph
    }
  }
  ...
  private[streaming] val scheduler = new JobScheduler(this)
  ...
}
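The two branches of the graph initialization above correspond to the two usual ways a context is obtained in user code: restored from checkpoint data, or built fresh with a batch duration. A minimal sketch, assuming an illustrative checkpoint path:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/app-checkpoint"   // illustrative path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("CheckpointedApp")
  // fresh context: a new DStreamGraph is created with batchDuration = 2 seconds
  val ssc = new StreamingContext(conf, Seconds(2))
  ssc.checkpoint(checkpointDir)
  // ... define the DStream operations here ...
  ssc
}

// If checkpoint data exists, the constructor takes the isCheckpointPresent branch and restores
// the saved DStreamGraph; otherwise createContext() is invoked to build a new one.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)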
Job generation in Spark Streaming is similar to Spark core: the operations applied to DStreams build up dependencies between them, and DStreamGraph records these dependencies along with other information about the DStreams.
JobScheduler is the overall job scheduler of Spark Streaming. It has two very important members: JobGenerator and ReceiverTracker. JobGenerator maintains a timer and periodically generates an RDD DAG instance for each batch; ReceiverTracker launches and manages the receivers, as well as the data each receiver has received.
The stream processing engine is started by calling StreamingContext#start(). Inside start(), StreamingContext#validate() first checks the validity of the DStreamGraph and the checkpoint settings; a new thread is then started to configure the SparkContext and start the JobScheduler.
def start(): Unit = synchronized {
  ...
  validate()
  ThreadUtils.runInNewThread("streaming-start") {
    sparkContext.setCallSite(startSite.get)
    sparkContext.clearJobGroup()
    sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
    savedProperties.set(SerializationUtils.clone(sparkContext.localProperties.get()))
    scheduler.start()
  }
  state = StreamingContextState.ACTIVE
  StreamingContext.setActiveContext(this)
  ...
}
Step 2: Receiving and Storing Stream Data
When JobScheduler starts, it creates a new ReceiverTracker instance, receiverTracker, and calls its start() method. ReceiverTracker#start() initializes an endpoint, a ReceiverTrackerEndpoint object, which receives and handles the messages exchanged between the ReceiverTracker and the receivers. In addition, ReceiverTracker#start() calls launchReceivers() to distribute the receivers onto the executors.
def start(): Unit = synchronized {
  if (isTrackerStarted) {
    throw new SparkException("ReceiverTracker already started")
  }
  if (!receiverInputStreams.isEmpty) {
    endpoint = ssc.env.rpcEnv.setupEndpoint(
      "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
    if (!skipReceiverLaunch) launchReceivers()
    logInfo("ReceiverTracker started")
    trackerState = Started
  }
}
ReceiverTracker#launchReceivers() extracts the receivers, i.e. the data receivers, from DStreamGraph.inputStreams. Once the receivers are obtained, it sends a StartAllReceivers(receivers) message to the message-handling endpoint.
private def launchReceivers(): Unit = {
  val receivers = receiverInputStreams.map { nis =>
    val rcvr = nis.getReceiver()
    rcvr.setReceiverId(nis.id)
    rcvr
  }
  runDummySparkJob()
  logInfo("Starting " + receivers.length + " receivers")
  endpoint.send(StartAllReceivers(receivers))
}
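To make the receiver objects returned by getReceiver() more concrete, here is a sketch of a custom Receiver following the pattern in the Spark documentation (the class name and the socket source are illustrative). Its store() calls are what eventually hand data to the ReceiverSupervisor discussed below; such a receiver would be plugged in with ssc.receiverStream(new CustomLineReceiver("localhost", 9999)).

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Illustrative receiver: reads text lines from a socket and pushes them into Spark via store().
class CustomLineReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    // receive on a dedicated thread so that onStart() returns immediately
    new Thread("Custom Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  override def onStop(): Unit = {
    // nothing to do: the receiving thread exits once isStopped returns true or the socket closes
  }

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped && line != null) {
        store(line)   // handed over for block storage by the ReceiverSupervisor
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      restart("Trying to connect again")
    } catch {
      case t: Throwable => restart("Error receiving data", t)
    }
  }
}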
When the endpoint receives a message, it first dispatches on the message type and handles each type differently. For a StartAllReceivers message, it first computes the target executors for each receiver, following two principles: distribute the receivers as evenly as possible; and if a receiver declares a preferredLocation, that location takes precedence even if the resulting distribution is uneven. It then iterates over the receivers and calls startReceiver with the computed executors to launch them.
case StartAllReceivers(receivers) =>
  val scheduledLocations = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
  for (receiver <- receivers) {
    val executors = scheduledLocations(receiver.streamId)
    updateReceiverScheduledExecutors(receiver.streamId, executors)
    receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
    startReceiver(receiver, executors)
  }
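As a rough illustration of the "spread the receivers evenly" principle, the simplified round-robin sketch below assigns receiver stream ids to executor hosts in turn. The real ReceiverSchedulingPolicy is more involved (it also honors preferredLocation and balances per-executor load), so this is only a didactic approximation:

// simplified sketch: round-robin assignment of receiver stream ids to executor hosts
def roundRobinSchedule(streamIds: Seq[Int], executors: Seq[String]): Map[Int, String] =
  streamIds.zipWithIndex.map { case (streamId, i) =>
    streamId -> executors(i % executors.size)
  }.toMap

// roundRobinSchedule(Seq(0, 1, 2, 3), Seq("host-1", "host-2"))
//   => Map(0 -> "host-1", 1 -> "host-2", 2 -> "host-1", 3 -> "host-2")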
A ReceiverInputDStream instance has only one receiver, but the receiver may need to start threads on multiple workers to receive data, so startReceiver converts the receiver and its target executors into an RDD.
val receiverRDD: RDD[Receiver[_]] =
  if (scheduledLocations.isEmpty) {
    ssc.sc.makeRDD(Seq(receiver), 1)
  } else {
    val preferredLocations = scheduledLocations.map(_.toString).distinct
    ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
  }
After this conversion to an RDD, the computation to be performed by the receiver is defined as the function startReceiverFunc. Taking the receiver instance as its argument, this function constructs a ReceiverSupervisorImpl instance, supervisor; once constructed, the supervisor is started and the running thread blocks on it.
val supervisor = new ReceiverSupervisorImpl(
  receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
supervisor.start()
supervisor.awaitTermination()
Finally, receiverRDD and the function to be executed on it are submitted as a Job, which is what actually starts the Receiver on the executors. Once the Job is running, data is received continuously.
val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
  receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
As the Receiver keeps receiving real-time stream data, it decides how to handle it based on the data size: small records are aggregated into a block before block storage, while large pieces of data are stored as blocks directly. In either case the Receiver hands the data over to the ReceiverSupervisor, which performs the actual storage. The configuration parameter spark.streaming.receiver.writeAheadLog.enable determines whether a write-ahead log is used, and depending on its value a different type of receivedBlockHandler storage object is created.
private val receivedBlockHandler: ReceivedBlockHandler = {
  if (WriteAheadLogUtils.enableReceiverLog(env.conf)) {
    // write to the WAL first, then store in the executor's memory or on disk
    new WriteAheadLogBasedBlockHandler(env.blockManager, receiver.streamId,
      receiver.storageLevel, env.conf, hadoopConf, checkpointDirOption.get)
  } else {
    // store directly in the executor's memory or on disk
    new BlockManagerBasedBlockHandler(env.blockManager, receiver.storageLevel)
  }
}
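Whether the write-ahead-log branch above is taken is purely a configuration decision. A minimal sketch (the checkpoint path is illustrative); note that the WAL files are written under the checkpoint directory, so one must be set:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("WALExample")
  // persist received blocks to a write-ahead log before acknowledging them,
  // so they can be replayed after a driver failure
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")   // the WAL lives under this directory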
The block is stored through the receivedBlockHandler. After the block has been stored, a block description blockInfo: ReceivedBlockInfo is obtained, containing the streamId, the location of the data, the number of records, the data size, and so on. An AddBlock(blockInfo) message wrapping this information is then sent to the ReceiverTracker to notify it that a new block has been added.
// call receivedBlockHandler.storeBlock to store the block and obtain a blockStoreResult
val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
// use blockStoreResult to build a ReceivedBlockInfo instance
val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
// notify the ReceiverTracker that a new block has been added and stored
trackerEndpoint.askWithRetry[Boolean](AddBlock(blockInfo))
The ReceiverTracker then forwards this information to the ReceivedBlockTracker, which is responsible for managing the metadata of the received blocks.
private def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = {
  receivedBlockTracker.addBlock(receivedBlockInfo)
}
Step 3: Processing Stream Data
JobScheduler has two main members: one is the ReceiverTracker discussed above, and the other is the JobGenerator. When JobScheduler starts, it creates a new JobGenerator instance, jobGenerator, and calls its start() method. The primary constructor of JobGenerator creates a timer:
private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
  longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")
The timer is configured with the batch interval ssc.graph.batchDuration.milliseconds. Every time a batch interval elapses, eventLoop.post(GenerateJobs(new Time(longTime))) is executed to send a GenerateJobs(new Time(longTime)) message to the eventLoop; upon receiving it, the eventLoop generates the jobs for the data of the current batch and submits them for execution.
private def generateJobs(time: Time) {
  // Checkpoint all RDDs marked for checkpointing to ensure their lineages are
  // truncated periodically. Otherwise, we may run into stack overflows (SPARK-6847).
  ssc.sparkContext.setLocalProperty(RDD.CHECKPOINT_ALL_MARKED_ANCESTORS, "true")
  Try {
    // allocate received blocks to batch
    jobScheduler.receiverTracker.allocateBlocksToBatch(time)
    // generate jobs using allocated block
    graph.generateJobs(time)
  } match {
    case Success(jobs) =>
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
      PythonDStream.stopStreamingContextIfPythonProcessIsDead(e)
  }
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}
As the source shows, on receiving the GenerateJobs(new Time(longTime)) message the eventLoop first calls allocateBlocksToBatch() to allocate the blocks received so far to the batch. It then calls generateJobs() on the DStreamGraph to generate the sequence of jobs for that batch. Finally, the batch time, the job sequence Seq[Job], and the input metadata of this batch are wrapped into a JobSet and submitted through JobScheduler.submitJobSet(JobSet); the JobScheduler then sends these jobs to Spark core for processing.
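It is worth noting that DStreamGraph#generateJobs(time) walks the registered output streams, so each output operation on the graph yields roughly one job per batch. The sketch below (host, port, and output path are illustrative) therefore produces two jobs every second, one for print() and one for saveAsTextFiles():

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("TwoJobsPerBatch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))

val counts = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

// two output operations are registered on the DStreamGraph, hence two jobs per 1-second batch
counts.print()
counts.saveAsTextFiles("hdfs:///tmp/counts")   // illustrative path prefix

ssc.start()
ssc.awaitTermination()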
Step 4: Outputting the Results
Since the data processing itself is carried out by Spark core, the results are output from Spark core directly to external systems such as databases or file systems, where the output data can be consumed directly. Because real-time stream data keeps flowing in, Spark keeps computing round after round and accordingly keeps producing results.
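Continuing the word-count sketch from the introduction, a common way to push each batch's results into an external store is foreachRDD. The database calls below are hypothetical stand-ins (createConnection/insert are not real APIs), while saveAsTextFiles shows the built-in file-system output:

// sketch: pushing each batch's results to an external system
// createConnection() and insert() are hypothetical stand-ins for a real database client
wordCounts.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // val conn = createConnection()   // one connection per partition rather than per record
    partition.foreach { record =>
      // conn.insert(record)
      println(record)                  // placeholder side effect
    }
    // conn.close()
  }
}

// the built-in file-system output writes one directory per batch interval
wordCounts.saveAsTextFiles("hdfs:///tmp/wordcounts")   // illustrative prefix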