- This article uses a socket data source as an example and traces how a Job is generated through a WordCount computation.
The code is as follows:
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }
    val sparkConf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    // One batch per second
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    // Split each line into words on spaces
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
- Start from ssc.start(). The start method calls the scheduler's start() method; the scheduler here is the
JobScheduler. Let's look at the code of its start method:
def start(): Unit = synchronized {
  if (eventLoop != null) return // scheduler has already been started
  logDebug("Starting JobScheduler")
  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  // Start the JobScheduler's event loop
  eventLoop.start()
  // attach rate controllers of input streams to receive batch completion updates
  for {
    inputDStream <- ssc.graph.getInputStreams
    rateController <- inputDStream.rateController
  } ssc.addStreamingListener(rateController)
  listenerBus.start(ssc.sparkContext)
  receiverTracker = new ReceiverTracker(ssc)
  inputInfoTracker = new InputInfoTracker(ssc)
  // Start the ReceiverTracker; the data-receiving logic begins here
  receiverTracker.start()
  // Start the JobGenerator; job generation begins here
  jobGenerator.start()
  logInfo("Started JobScheduler")
}
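Spark Streaming is built from three major components: JobScheduler, ReceiverTracker, and JobGenerator. ReceiverTracker and
JobGenerator are contained in JobScheduler. The code above calls the start method of each of the three components in turn.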
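- Let's look at Job generation first, that is, the jobGenerator.start() method. What does JobGenerator's start method do? Read on.
It first starts an EventLoop whose onReceive calls back into processEvent. So when is that callback triggered? Let's look at the internal structure of EventLoop: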
private[spark] abstract class EventLoop[E](name: String) extends Logging {
  // Thread-safe blocking queue
  private val eventQueue: BlockingQueue[E] = new LinkedBlockingDeque[E]()
  private val stopped = new AtomicBoolean(false)
  private val eventThread = new Thread(name) {
    // Background (daemon) thread
    setDaemon(true)
    override def run(): Unit = {
      try {
        while (!stopped.get) {
          val event = eventQueue.take()
          try {
            // Call back the subclass's onReceive method, which holds the event-handling logic
            onReceive(event)
          } catch {
            case NonFatal(e) => {
              try {
                onError(e)
              } catch {
                case NonFatal(e) => logError("Unexpected error in " + name, e)
              }
            }
          }
        }
      } catch {
        case ie: InterruptedException => // exit even if eventQueue is not empty
        case NonFatal(e) => logError("Unexpected error in " + name, e)
      }
    }
  }
  def start(): Unit = {
    if (stopped.get) {
      throw new IllegalStateException(name + " has already been stopped")
    }
    // Call onStart before starting the event thread to make sure it happens before onReceive
    onStart()
    // Start the event loop thread
    eventThread.start()
  }
  def stop(): Unit = {
    // compareAndSet(false, true) checks that the flag is false and, if so, atomically sets it to true
    if (stopped.compareAndSet(false, true)) {
      eventThread.interrupt()
      var onStopCalled = false
      try {
        eventThread.join()
        // Call onStop after the event thread exits to make sure onReceive happens before onStop
        onStopCalled = true
        onStop()
      } catch {
        case ie: InterruptedException =>
          Thread.currentThread().interrupt()
          if (!onStopCalled) {
            // ie is thrown from `eventThread.join()`. Otherwise, we should not call `onStop` since
            // it's already called.
            onStop()
          }
      }
    } else {
      // Keep quiet to allow calling `stop` multiple times.
    }
  }
  def post(event: E): Unit = {
    eventQueue.put(event)
  }
  def isActive: Boolean = eventThread.isAlive
  protected def onStart(): Unit = {}
  protected def onStop(): Unit = {}
  protected def onReceive(event: E): Unit
  protected def onError(e: Throwable): Unit
}
Internally, EventLoop maintains a queue and starts a background thread that calls back the onReceive method of the implementing class.
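To make the queue-plus-thread pattern concrete, here is a minimal sketch of it outside Spark. MiniEventLoop and MiniEventLoopDemo are hypothetical names, the events are plain strings, and the onStart/onStop/onError hooks of the real class are left out, so treat it as an illustration of the idea rather than Spark's implementation.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingDeque, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean
import scala.collection.mutable.ArrayBuffer

// Hypothetical, simplified stand-in for Spark's EventLoop (no onStart/onStop/onError hooks).
abstract class MiniEventLoop[E](name: String) {
  private val queue = new LinkedBlockingDeque[E]()
  private val stopped = new AtomicBoolean(false)
  private val thread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      try {
        // take() blocks until post() has put an event into the queue
        while (!stopped.get) onReceive(queue.take())
      } catch {
        case _: InterruptedException => // stop() interrupts the blocking take()
      }
    }
  }
  def start(): Unit = thread.start()
  def stop(): Unit =
    if (stopped.compareAndSet(false, true)) { thread.interrupt(); thread.join() }
  def post(event: E): Unit = queue.put(event)
  protected def onReceive(event: E): Unit
}

object MiniEventLoopDemo {
  // Posts three events and returns them in the order the background thread handled them.
  def run(): List[String] = {
    val handled = ArrayBuffer.empty[String]
    val done = new CountDownLatch(3)
    val loop = new MiniEventLoop[String]("demo-loop") {
      override protected def onReceive(event: String): Unit = {
        handled.synchronized { handled += event }
        done.countDown()
      }
    }
    loop.start()
    Seq("GenerateJobs(1000)", "GenerateJobs(2000)", "DoCheckpoint(2000)").foreach(loop.post)
    done.await(5, TimeUnit.SECONDS)
    loop.stop()
    handled.synchronized { handled.toList }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Because a single thread drains the queue, events are handled one at a time in the order they were posted; the caller of post never runs the handler itself.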
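So when are events put into EventLoop's queue? To answer that we need to find the callers of EventLoop's post method. When JobGenerator is
instantiated, it creates a RecurringTimer, as follows: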
private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
  // Callback: eventLoop.post(GenerateJobs(new Time(longTime))) puts a GenerateJobs event into the event loop
  longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")
RecurringTimer is simply a timer. Here are its constructor parameters:
* @param clock the clock
* @param period the interval between callbacks
* @param callback the callback function
* @param name the name of the timer
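The parameters make the behavior clear: the callback is invoked periodically, at the interval supplied by the user (the batch duration). The callback is the one seen earlier:
longTime => eventLoop.post(GenerateJobs(new Time(longTime)))
It submits a GenerateJobs event to the EventLoop's queue. At this point the RecurringTimer has been created but not yet started.
Back in JobGenerator's start method, reading on: because this is the first run (there is no checkpoint to recover from), startFirstTime is called.
startFirstTime contains one key line, timer.start(startTime.milliseconds), and this is where the timer is finally started.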
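The timer's behavior can be approximated with a small stand-alone class. This is only a sketch under assumptions: MiniRecurringTimer is a hypothetical name, it reads the system clock directly instead of taking a clock parameter, and it is not the code of Spark's RecurringTimer. It shows a callback receiving a sequence of timestamps that are exactly one period apart.

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}
import scala.collection.mutable.ArrayBuffer

// Hypothetical, simplified timer; it illustrates the idea and is not Spark's RecurringTimer.
class MiniRecurringTimer(period: Long, callback: Long => Unit, name: String) {
  @volatile private var stopped = false
  @volatile private var nextTime = 0L
  private val thread = new Thread("MiniRecurringTimer - " + name) {
    setDaemon(true)
    override def run(): Unit = {
      try {
        while (!stopped) {
          val waitMs = nextTime - System.currentTimeMillis()
          if (waitMs > 0) Thread.sleep(waitMs)
          callback(nextTime) // the callback sees the scheduled time, not the wake-up time
          nextTime += period
        }
      } catch { case _: InterruptedException => }
    }
  }

  // First tick: the next multiple of `period` strictly after the current time.
  def getStartTime(): Long = (System.currentTimeMillis() / period + 1) * period

  def start(startTime: Long): Long = { nextTime = startTime; thread.start(); startTime }
  def stop(): Unit = { stopped = true; thread.interrupt(); thread.join() }
}

object MiniRecurringTimerDemo {
  // Collects the first `ticks` timestamps handed to the callback.
  def run(period: Long, ticks: Int): List[Long] = {
    val times = ArrayBuffer.empty[Long]
    val done = new CountDownLatch(ticks)
    val timer = new MiniRecurringTimer(period, t => {
      times.synchronized { if (times.size < ticks) times += t }
      done.countDown()
    }, "demo")
    timer.start(timer.getStartTime())
    done.await(5, TimeUnit.SECONDS)
    timer.stop()
    times.synchronized { times.toList }
  }

  def main(args: Array[String]): Unit = println(run(100L, 3))
}
```

Passing the scheduled time to the callback, rather than the time the thread actually woke up, is what keeps batch times aligned to exact multiples of the period even when the thread wakes a little late.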
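- Working back from the timer's start method: the timer periodically calls eventLoop.post, which sends a GenerateJobs event to the EventLoop's queue; the event loop then calls back processEvent, which in turn calls generateJobs(time).
The code of generateJobs is as follows: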
private def generateJobs(time: Time) {
  // Set the SparkEnv in this thread, so that job generation code can access the environment
  // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
  // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
  SparkEnv.set(ssc.env)
  Try {
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
    graph.generateJobs(time) // generate jobs using allocated block
  } match {
    case Success(jobs) =>
      // Fetch the input metadata for this batch
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      // Submit the JobSet
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}
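The Try { ... } match { case Success / case Failure } shape used in generateJobs can be tried in isolation. In the sketch below everything except scala.util.Try is hypothetical: a "job" is just a string label, and generate fails for a negative batch time only so that both branches can be exercised.

```scala
import scala.util.{Failure, Success, Try}

object GenerateJobsSketch {
  // Hypothetical stand-in for graph.generateJobs: fails for negative batch times.
  def generate(timeMs: Long): Seq[String] = {
    require(timeMs >= 0, "negative batch time")
    Seq(s"job-0@$timeMs", s"job-1@$timeMs")
  }

  // Any exception thrown while generating jobs is captured by Try and reported,
  // instead of propagating into the thread that called this method.
  def generateJobs(timeMs: Long): String =
    Try(generate(timeMs)) match {
      case Success(jobs) => s"submitted ${jobs.size} jobs for time $timeMs"
      case Failure(e)    => s"Error generating jobs for time $timeMs: ${e.getMessage}"
    }

  def main(args: Array[String]): Unit = {
    println(generateJobs(1000L))
    println(generateJobs(-1L))
  }
}
```

The point of the pattern is that a failure in one batch is turned into a reported error rather than an exception that would kill the event loop thread.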
Step into graph.generateJobs(time): it calls the generateJob method of every outputStream. The code of generateJob is as follows:
private[streaming] def generateJob(time: Time): Option[Job] = {
  getOrCompute(time) match {
    case Some(rdd) => {
      // jobFunc wraps the call to runJob
      val jobFunc = () => {
        val emptyFunc = { (iterator: Iterator[T]) => {} }
        context.sparkContext.runJob(rdd, emptyFunc)
      }
      Some(new Job(time, jobFunc))
    }
    case None => None
  }
}
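getOrCompute returns an RDD (how the RDD is generated will be covered later). A function jobFunc is then defined; as the code shows, its role is to submit the job.
jobFunc is wrapped in a Job object, which is returned. Nothing is executed at this point, because jobFunc has only been defined, not called.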
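The deferred nature of jobFunc is easy to miss, so here is a small sketch of the same shape without Spark. MiniJob and DeferredJobSketch are hypothetical names, and summing a Seq[Int] stands in for the real sparkContext.runJob call.

```scala
import scala.util.Try

// Hypothetical, simplified stand-in for Spark Streaming's Job class.
class MiniJob(val timeMs: Long, func: () => Any) {
  private var result: Try[Any] = null
  // Running the job is the only place where the wrapped function is invoked.
  def run(): Unit = { result = Try(func()) }
  // None until run() has been called.
  def outcome: Option[Try[Any]] = Option(result)
}

object DeferredJobSketch {
  // Mirrors the shape of generateJob: Some(job) when the batch has data, otherwise None.
  def generateJob(timeMs: Long, data: Option[Seq[Int]]): Option[MiniJob] =
    data match {
      case Some(values) =>
        val jobFunc = () => values.sum // defined here, computed only when the job runs
        Some(new MiniJob(timeMs, jobFunc))
      case None => None
    }

  def main(args: Array[String]): Unit = {
    val job = generateJob(1000L, Some(Seq(1, 2, 3))).get
    println(job.outcome) // no result yet
    job.run()
    println(job.outcome)
  }
}
```

Separating "describe the work" from "run the work" is what lets the generator thread build jobs quickly while a different thread executes them later.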
graph.generateJobs returns multiple jobs. Once the jobs have been generated successfully, the JobSet is submitted, as follows:
jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
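Each job is then submitted individually: the job is wrapped in a JobHandler (an implementation of Runnable) and handed to a thread pool, which executes
JobHandler's run method. That method calls job.run(), and Job's run method consists of a single line, Try(func()), where func() is the jobFunc
from the code above. At this point the whole path of Job generation and submission is connected end to end.
A flow chart of dynamic Job generation is attached below.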
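Independently of the diagram, the hand-off from a job set to a thread pool described above can be sketched in a few lines. The names here (JobHandlerSketch, MiniJob, poolSize) are hypothetical, and the sketch uses a plain java.util.concurrent fixed thread pool; it is meant to show the Runnable wrapping, not how JobScheduler configures its own pool.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.concurrent.TrieMap
import scala.util.Try

object JobHandlerSketch {
  // Hypothetical stand-in for a streaming Job: an id plus the deferred function to execute.
  final case class MiniJob(id: Int, func: () => Any)

  // Runs every job of a batch on a fixed-size pool and returns each job's outcome by id.
  def submitJobSet(jobs: Seq[MiniJob], poolSize: Int): Map[Int, Try[Any]] = {
    val results = TrieMap.empty[Int, Try[Any]]
    val executor = Executors.newFixedThreadPool(poolSize)
    jobs.foreach { job =>
      // Each job is wrapped in a Runnable handler; the pool picks the thread that runs it.
      executor.execute(new Runnable {
        override def run(): Unit = results.update(job.id, Try(job.func()))
      })
    }
    executor.shutdown()
    executor.awaitTermination(10, TimeUnit.SECONDS)
    results.toMap
  }

  def main(args: Array[String]): Unit = {
    val jobs = Seq(MiniJob(0, () => 1 + 1), MiniJob(1, () => sys.error("boom")))
    println(submitJobSet(jobs, poolSize = 2))
  }
}
```

Wrapping the call in Try inside the handler means a failing job records a Failure for itself without affecting the other jobs in the same batch.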
If anything above is wrong, corrections are welcome.