[Spark Streaming] Dynamically generating Jobs and submitting them for execution

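Preface

In Spark Streaming, Jobs are generated dynamically by JobGenerator once every batchDuration. Each batch is submitted as one JobSet rather than as a single Job, because one batch may have several output operations.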

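Overview of the flow:

1. A timer periodically posts a job-generation request to eventLoop
2. receiverTracker allocates blocks to the current batch
3. The Jobs for the current batch are generated
4. The Jobs are wrapped in a JobSet and submitted for execution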
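The flow above can be sketched as one function over toy types. This is only an illustrative model: `Block`, `ToyJob`, `ToyJobSet`, and `runBatch` are hypothetical names rather than Spark classes, and the real code paths are walked through in the sections that follow.

```scala
// Toy model of one batch: allocate blocks, generate jobs, wrap them in a job set.
// Every name here is a hypothetical stand-in, not part of Spark's API.
object BatchFlowSketch {
  final case class Block(streamId: Int, payload: String)
  final case class ToyJob(time: Long, run: () => Int)
  final case class ToyJobSet(time: Long, jobs: Seq[ToyJob])

  // Handles one batch time; the caller plays the role of the timer.
  def runBatch(time: Long, unallocated: Seq[Block], outputs: Int): ToyJobSet = {
    // Allocation: everything still pending is assigned to this batch.
    val allocated = unallocated
    // Generation: one job per output operation, each closing over the allocated blocks.
    val jobs = (1 to outputs).map(_ => ToyJob(time, () => allocated.size))
    // Wrapping: all jobs of the batch travel together as one job set.
    ToyJobSet(time, jobs)
  }

  def main(args: Array[String]): Unit = {
    val jobSet = runBatch(1000L, Seq(Block(0, "a"), Block(0, "b")), outputs = 2)
    // "Submitting" is reduced to running each job on the calling thread.
    jobSet.jobs.foreach(job => println(s"time=${job.time} records=${job.run()}"))
  }
}
```

Two output operations on the same batch give two jobs in one job set, which is the reason a batch maps to a JobSet and not to a single Job.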

Entry point

JobGenerator creates a timer when it is initialized:

private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
    longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")

Every batchDuration the timer posts a GenerateJobs(new Time(longTime)) message to eventLoop, and the eventLoop's event handler calls generateJobs(time) in response:

      case GenerateJobs(time) => generateJobs(time)

private def generateJobs(time: Time) {
    // Checkpoint all RDDs marked for checkpointing to ensure their lineages are
    // truncated periodically. Otherwise, we may run into stack overflows (SPARK-6847).
    ssc.sparkContext.setLocalProperty(RDD.CHECKPOINT_ALL_MARKED_ANCESTORS, "true")
    Try {
      jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
      graph.generateJobs(time) // generate jobs using allocated block
    } match {
      case Success(jobs) =>
        val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
        jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
      case Failure(e) =>
        jobScheduler.reportError("Error generating jobs for time " + time, e)
        PythonDStream.stopStreamingContextIfPythonProcessIsDead(e)
    }
    eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
  }
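The handoff from timer to event loop, and the Try-based error handling in generateJobs, can be modelled without Spark. The sketch below is an assumption-laden toy: the event names mirror the ones above, but `Loop`, `generate`, `drain`, and the string log are hypothetical, and the queue is drained on the calling thread, whereas the real event loop runs on its own thread.

```scala
import scala.collection.mutable
import scala.util.{Failure, Success, Try}

object EventLoopSketch {
  sealed trait Event
  final case class GenerateJobs(time: Long) extends Event
  final case class DoCheckpoint(time: Long) extends Event

  // generate is a stand-in for "allocate blocks, then generate jobs"; it may throw.
  final class Loop(generate: Long => Seq[String]) {
    private val queue = mutable.Queue.empty[Event]
    val log = mutable.ArrayBuffer.empty[String]

    // What the timer callback does on every tick.
    def post(e: Event): Unit = queue.enqueue(e)

    private def handle(e: Event): Unit = e match {
      case GenerateJobs(time) =>
        Try(generate(time)) match {
          case Success(jobs) => log += s"submitted ${jobs.mkString(",")}"
          case Failure(err)  => log += s"error at $time: ${err.getMessage}"
        }
        // Posted on both the success and the failure path, as in generateJobs.
        post(DoCheckpoint(time))
      case DoCheckpoint(time) =>
        log += s"checkpoint $time"
    }

    // Process events in FIFO order until the queue is empty.
    def drain(): Unit = while (queue.nonEmpty) handle(queue.dequeue())
  }

  def main(args: Array[String]): Unit = {
    val loop = new Loop(t =>
      if (t == 2000L) throw new IllegalStateException("boom") else Seq(s"job@$t"))
    Seq(1000L, 2000L).foreach(t => loop.post(GenerateJobs(t)))
    loop.drain()
    loop.log.foreach(println)
  }
}
```

A failure in one batch is reported and does not stop the loop: the checkpoint event for that batch is still posted, and later batches are handled normally.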

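Allocating Blocks to the current batchTime

generateJobs first calls receiverTracker.allocateBlocksToBatch(time) to allocate Blocks to the current batchTime. The call ends up in the allocateBlocksToBatch method of receivedBlockTracker, the component inside receiverTracker that manages Blocks: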

def allocateBlocksToBatch(batchTime: Time): Unit = synchronized {
    if (lastAllocatedBatchTime == null || batchTime > lastAllocatedBatchTime) {
      val streamIdToBlocks = streamIds.map { streamId =>
          (streamId, getReceivedBlockQueue(streamId).dequeueAll(x => true))
      }.toMap
      val allocatedBlocks = AllocatedBlocks(streamIdToBlocks)
      if (writeToLog(BatchAllocationEvent(batchTime, allocatedBlocks))) {
        timeToAllocatedBlocks.put(batchTime, allocatedBlocks)
        lastAllocatedBatchTime = batchTime
      } else {
        logInfo(s"Possibly processed batch $batchTime needs to be processed again in WAL recovery")
      }
    } else {
      logInfo(s"Possibly processed batch $batchTime needs to be processed again in WAL recovery")
    }
  }

private def getReceivedBlockQueue(streamId: Int): ReceivedBlockQueue = {
    streamIdToUnallocatedBlockQueues.getOrElseUpdate(streamId, new ReceivedBlockQueue)
  }

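As the code shows, the not-yet-allocated blocks of every streamId are taken from streamIdToUnallocatedBlockQueues. The entries in those queues are the Block infos that a supervisor reports to receiverTracker after it has stored a Block; see the article on ReceiverTracker data generation and storage for details.

Once the unallocated blockInfos of all streamIds have been collected, and provided writeToLog succeeds for the allocation event, they are put into timeToAllocatedBlocks: Map[Time, AllocatedBlocks], which is used later when the RDD is created. A batchTime that is not later than lastAllocatedBatchTime is skipped.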
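The allocation logic can be reproduced in a small standalone model. The sketch below makes several simplifying assumptions: blocks are plain strings, times are Long values, there is no write-ahead log, and `Tracker` and `addBlock` are hypothetical names chosen for illustration.

```scala
import scala.collection.mutable

object BlockAllocationSketch {
  final class Tracker(streamIds: Seq[Int]) {
    private val unallocated = mutable.HashMap.empty[Int, mutable.Queue[String]]
    private val timeToAllocatedBlocks = mutable.HashMap.empty[Long, Map[Int, Seq[String]]]
    private var lastAllocatedBatchTime: Option[Long] = None

    private def queueFor(streamId: Int): mutable.Queue[String] =
      unallocated.getOrElseUpdate(streamId, mutable.Queue.empty[String])

    // Called when a receiver reports a block it has stored.
    def addBlock(streamId: Int, block: String): Unit = synchronized {
      queueFor(streamId).enqueue(block)
    }

    // Drain every stream's queue into the batch, unless the batch time is not newer
    // than the last allocated one.
    def allocateBlocksToBatch(batchTime: Long): Unit = synchronized {
      if (lastAllocatedBatchTime.forall(batchTime > _)) {
        val streamIdToBlocks = streamIds.map { id =>
          id -> queueFor(id).dequeueAll(_ => true).toList
        }.toMap
        timeToAllocatedBlocks.put(batchTime, streamIdToBlocks)
        lastAllocatedBatchTime = Some(batchTime)
      }
    }

    def getBlocksOfBatch(batchTime: Long): Map[Int, Seq[String]] = synchronized {
      timeToAllocatedBlocks.getOrElse(batchTime, Map.empty)
    }
  }

  def main(args: Array[String]): Unit = {
    val tracker = new Tracker(Seq(0, 1))
    tracker.addBlock(0, "b1")
    tracker.addBlock(1, "b2")
    tracker.allocateBlocksToBatch(1000L)
    println(tracker.getBlocksOfBatch(1000L))
  }
}
```

Because dequeueAll empties the queues, a block that arrives after one allocation can only end up in a later batch, so each block belongs to exactly one batch.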

Generating Jobs for the current batchTime

DStreamGraph's generateJobs method is then called to generate the jobs for the current batchTime:

 def generateJobs(time: Time): Seq[Job] = {
    logDebug("Generating jobs for time " + time)
    val jobs = this.synchronized {
      outputStreams.flatMap { outputStream =>
        val jobOption = outputStream.generateJob(time)
        jobOption.foreach(_.setCallSite(outputStream.creationSite))
        jobOption
      }
    }
    logDebug("Generated " + jobs.length + " jobs for time " + time)
    jobs
  }

Each outputStream corresponds to one job. generateJobs iterates over all outputStreams and asks each one to generate its job:

// ForEachDStream
override def generateJob(time: Time): Option[Job] = {
    parent.getOrCompute(time) match {
      case Some(rdd) =>
        val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
          foreachFunc(rdd, time)
        }
        Some(new Job(time, jobFunc))
      case None => None
    }
  }

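generateJob first obtains the RDD for time and then passes it as an argument to foreachFunc. foreachFunc is supplied through the constructor; take the print() output operation as an example: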

def print(num: Int): Unit = ssc.withScope {
    def foreachFunc: (RDD[T], Time) => Unit = {
      (rdd: RDD[T], time: Time) => {
        val firstNum = rdd.take(num + 1)
        // scalastyle:off println
        println("-------------------------------------------")
        println(s"Time: $time")
        println("-------------------------------------------")
        firstNum.take(num).foreach(println)
        if (firstNum.length > num) println("...")
        println()
        // scalastyle:on println
      }
    }
    foreachRDD(context.sparkContext.clean(foreachFunc), displayInnerRDDOps = false)
  }

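The foreachFunc built here is the function that is ultimately submitted, together with the rdd, as the job: it calls take() on the rdd and prints the result. The action is actually triggered inside this function.

Now for how the rdd is obtained. Every DStream has a generatedRDDs: Map[Time, RDD[T]] variable that stores the RDD for each time; if no RDD is stored for the requested time, it is computed by the compute() method. For ReceiverInputDStream, which has to start a Receiver on an executor to receive data, compute() is: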

 override def compute(validTime: Time): Option[RDD[T]] = {
    val blockRDD = {

      if (validTime < graph.startTime) {
        // If this is called for any time before the start time of the context,
        // then this returns an empty RDD. This may happen when recovering from a
        // driver failure without any write ahead log to recover pre-failure data.
        new BlockRDD[T](ssc.sc, Array.empty)
      } else {
        // Otherwise, ask the tracker for all the blocks that have been allocated to this stream
        // for this batch
        val receiverTracker = ssc.scheduler.receiverTracker
        val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)

        // Register the input blocks information into InputInfoTracker
        val inputInfo = StreamInputInfo(id, blockInfos.flatMap(_.numRecords).sum)
        ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)

        // Create the BlockRDD
        createBlockRDD(validTime, blockInfos)
      }
    }
    Some(blockRDD)
  }

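compute() asks receiverTracker for the blocks of this batch. As analyzed earlier, the unallocated blocks of every streamId were allocated to the batch and stored in timeToAllocatedBlocks: Map[Time, AllocatedBlocks]; getBlocksOfBatch reads the blockInfos from that same map. createBlockRDD(validTime, blockInfos) then creates the RDD from the blockIds.

Finally, jobFunc is built from this RDD and foreachFunc, and a Job wrapping it is created and returned.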
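The interplay between the per-time RDD cache and deferred job functions can be sketched as follows. This is a simplified model under stated assumptions: an "RDD" is just a Seq[Int], times are Long values, and `ToyStream`, `computeCalls`, and this `generateJob` are hypothetical helpers, not the Spark implementations.

```scala
import scala.collection.mutable

object GetOrComputeSketch {
  // Stand-in for a DStream; computeCalls counts how often compute really runs.
  final class ToyStream(compute: Long => Option[Seq[Int]]) {
    private val generated = mutable.HashMap.empty[Long, Seq[Int]]
    var computeCalls = 0

    // Return the cached RDD for this time, computing and caching it on a miss.
    def getOrCompute(time: Long): Option[Seq[Int]] =
      generated.get(time).orElse {
        computeCalls += 1
        val result = compute(time)
        result.foreach(rdd => generated.put(time, rdd))
        result
      }
  }

  // Stand-in for an output stream's generateJob: the returned function closes over
  // the RDD, and nothing is executed until that function is called.
  def generateJob(parent: ToyStream, time: Long)(
      foreachFunc: (Seq[Int], Long) => Unit): Option[() => Unit] =
    parent.getOrCompute(time).map(rdd => () => foreachFunc(rdd, time))

  def main(args: Array[String]): Unit = {
    val stream = new ToyStream(t => if (t >= 1000L) Some(Seq(1, 2, 3)) else None)
    val job = generateJob(stream, 1000L)((rdd, t) => println(s"Time: $t -> ${rdd.take(2)}"))
    job.foreach(run => run()) // the "action" only happens here
  }
}
```

Creating the job does not run foreachFunc; the work happens only when the job function is invoked, which matches the point that the action is triggered inside the function and not at generation time.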

Wrapping the jobs in a JobSet and submitting it

Since each outputStream yields one Job, the result is a sequence of jobs. A JobSet is created for that sequence and submitted through jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos)):

jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))

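Each job is then run by jobExecutor, a thread pool whose size defaults to 1 and can be changed with spark.streaming.concurrentJobs; the size determines how many jobs, and therefore how many batches, can run at the same time.

The handler class JobHandler calls Job.run(), which executes the jobFunc built earlier.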
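The effect of the pool size can be illustrated with a plain java.util.concurrent fixed thread pool. The sketch is an assumption, not Spark's code: `ToyJob`, this `JobHandler`, and `runAll` are hypothetical, and the `concurrentJobs` parameter merely stands in for the configured value.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.mutable

object JobExecutorSketch {
  final case class ToyJob(id: String, func: () => Unit) {
    def run(): Unit = func()
  }

  // Stand-in for the handler class: a Runnable that runs exactly one job.
  final class JobHandler(job: ToyJob) extends Runnable {
    override def run(): Unit = job.run()
  }

  // Submit every job to a fixed-size pool, wait for completion, and return the job
  // ids in the order in which the jobs finished.
  def runAll(jobs: Seq[ToyJob], concurrentJobs: Int = 1): List[String] = {
    val finished = mutable.ArrayBuffer.empty[String]
    val pool = Executors.newFixedThreadPool(concurrentJobs)
    try {
      jobs.foreach { job =>
        val tracked = job.copy(func = () => {
          job.func()
          finished.synchronized { finished += job.id; () }
        })
        pool.execute(new JobHandler(tracked))
      }
    } finally {
      pool.shutdown() // accept no new work; already queued jobs still run
    }
    pool.awaitTermination(10, TimeUnit.SECONDS)
    finished.synchronized(finished.toList)
  }

  def main(args: Array[String]): Unit = {
    val jobs = Seq("batch-1", "batch-2").map(id => ToyJob(id, () => println(s"running $id")))
    println(runAll(jobs))
  }
}
```

With a single thread the jobs finish strictly in submission order, so batches are processed one after another; a larger pool lets jobs overlap, at the cost of that ordering guarantee.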
