韓晶晶 嚴(yán)律 黃春超
Introduction
Spark Streaming is an extension of Spark Core: a real-time stream processing system built on top of it. Compared with other stream processing systems, its biggest advantage is that it sits inside the Spark stack, i.e. the streaming engine and the data processing engine share the same software stack. In Spark Streaming, data is received record by record but processed batch by batch, so its throughput can be 2–5x that of Storm, a popular record-at-a-time streaming engine.
Stream processing in Spark Streaming can roughly be divided into four steps: starting the stream processing engine, receiving and storing stream data, processing the stream data, and outputting the results. The runtime architecture is shown below:
[Figure: Spark Streaming runtime architecture]
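From the application's point of view, these four steps are driven by only a handful of API calls. The following minimal word-count sketch (the application name, host, and port are illustrative) maps each call onto the steps analyzed in the rest of this article:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    // Step 1: create the Driver-side StreamingContext with a 1-second batch interval
    val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Step 2: a socket receiver; the received records are stored as blocks on executors
    val lines = ssc.socketTextStream("localhost", 9999)

    // Step 3: transformations build up the DStreamGraph; a job is generated for every batch
    val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    // Step 4: an output operation; each batch's result is pushed out
    wordCounts.print()

    // start the engine (JobScheduler, ReceiverTracker, JobGenerator) and block until stopped
    ssc.start()
    ssc.awaitTermination()
  }
}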
Step 1: Starting the Stream Processing Engine
StreamingContext is the Spark Streaming context on the Driver side and the entry point of a Spark Streaming program. When this object is constructed, its internal components are initialized; the most important of these are DStreamGraph and JobScheduler.
class StreamingContext private[streaming] (
    _sc: SparkContext,
    _cp: Checkpoint,
    _batchDur: Duration
  ) extends Logging {
  ...
  private[streaming] val conf = sc.conf
  private[streaming] val env = sc.env
  private[streaming] val graph: DStreamGraph = {
    if (isCheckpointPresent) {
      _cp.graph.setContext(this)
      _cp.graph.restoreCheckpointData()
      _cp.graph
    } else {
      require(_batchDur != null, "Batch duration for StreamingContext cannot be null")
      val newGraph = new DStreamGraph()
      newGraph.setBatchDuration(_batchDur)
      newGraph
    }
  }
  ...
  private[streaming] val scheduler = new JobScheduler(this)
  ...
}
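The two branches of the graph initialization above correspond to the two usual ways a context is obtained in user code: restored from checkpoint data, or built fresh with a batch duration. A minimal sketch, assuming an illustrative checkpoint path:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/app-checkpoint"   // illustrative path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("CheckpointedApp")
  // fresh context: a new DStreamGraph is created with batchDuration = 2 seconds
  val ssc = new StreamingContext(conf, Seconds(2))
  ssc.checkpoint(checkpointDir)
  // ... define the DStream operations here ...
  ssc
}

// If checkpoint data exists, the constructor takes the isCheckpointPresent branch and restores
// the saved DStreamGraph; otherwise createContext() is invoked to build a new one.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)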
Job generation in Spark Streaming is similar to Spark core: the operations applied to DStreams build up dependencies between them, and DStreamGraph records these dependencies along with other information about the DStreams.
JobScheduler is the overall job scheduler of Spark Streaming. It has two very important members: JobGenerator and ReceiverTracker. JobGenerator maintains a timer and periodically generates an RDD DAG instance for each batch; ReceiverTracker launches and manages the receivers, as well as the data each receiver has received.
The stream processing engine is started by calling StreamingContext#start(). Inside start(), StreamingContext#validate() first checks the validity of the DStreamGraph and the checkpoint settings; a new thread is then started to configure the SparkContext and start the JobScheduler.
def start(): Unit = synchronized {
  ...
  validate()
  ThreadUtils.runInNewThread("streaming-start") {
    sparkContext.setCallSite(startSite.get)
    sparkContext.clearJobGroup()
    sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
    savedProperties.set(SerializationUtils.clone(sparkContext.localProperties.get()))
    scheduler.start()
  }
  state = StreamingContextState.ACTIVE
  StreamingContext.setActiveContext(this)
  ...
}
Step 2: Receiving and Storing Stream Data
When JobScheduler starts, it creates a new ReceiverTracker instance, receiverTracker, and calls its start() method. ReceiverTracker#start() initializes an endpoint, a ReceiverTrackerEndpoint object, which receives and handles the messages exchanged between the ReceiverTracker and the receivers. In addition, ReceiverTracker#start() calls launchReceivers() to distribute the receivers onto the executors.
def start(): Unit = synchronized {
  if (isTrackerStarted) {
    throw new SparkException("ReceiverTracker already started")
  }
  if (!receiverInputStreams.isEmpty) {
    endpoint = ssc.env.rpcEnv.setupEndpoint(
      "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
    if (!skipReceiverLaunch) launchReceivers()
    logInfo("ReceiverTracker started")
    trackerState = Started
  }
}
ReceiverTracker#launchReceivers() extracts the receivers, i.e. the data receivers, from DStreamGraph.inputStreams. Once the receivers are obtained, it sends a StartAllReceivers(receivers) message to the message-handling endpoint.
private def launchReceivers(): Unit = {
  val receivers = receiverInputStreams.map { nis =>
    val rcvr = nis.getReceiver()
    rcvr.setReceiverId(nis.id)
    rcvr
  }
  runDummySparkJob()
  logInfo("Starting " + receivers.length + " receivers")
  endpoint.send(StartAllReceivers(receivers))
}
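To make the receiver objects returned by getReceiver() more concrete, here is a sketch of a custom Receiver following the pattern in the Spark documentation (the class name and the socket source are illustrative). Its store() calls are what eventually hand data to the ReceiverSupervisor discussed below; such a receiver would be plugged in with ssc.receiverStream(new CustomLineReceiver("localhost", 9999)).

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Illustrative receiver: reads text lines from a socket and pushes them into Spark via store().
class CustomLineReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    // receive on a dedicated thread so that onStart() returns immediately
    new Thread("Custom Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  override def onStop(): Unit = {
    // nothing to do: the receiving thread exits once isStopped returns true or the socket closes
  }

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped && line != null) {
        store(line)   // handed over for block storage by the ReceiverSupervisor
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      restart("Trying to connect again")
    } catch {
      case t: Throwable => restart("Error receiving data", t)
    }
  }
}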
When the endpoint receives a message, it first dispatches on the message type and handles each type differently. For a StartAllReceivers message, it first computes the target executors for each receiver, following two principles: distribute the receivers as evenly as possible; and if a receiver declares a preferredLocation, that location takes precedence even if the resulting distribution is uneven. It then iterates over the receivers and calls startReceiver with the computed executors to launch them.
case StartAllReceivers(receivers) =>
  val scheduledLocations = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
  for (receiver <- receivers) {
    val executors = scheduledLocations(receiver.streamId)
    updateReceiverScheduledExecutors(receiver.streamId, executors)
    receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
    startReceiver(receiver, executors)
  }
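As a rough illustration of the "spread the receivers evenly" principle, the simplified round-robin sketch below assigns receiver stream ids to executor hosts in turn. The real ReceiverSchedulingPolicy is more involved (it also honors preferredLocation and balances per-executor load), so this is only a didactic approximation:

// simplified sketch: round-robin assignment of receiver stream ids to executor hosts
def roundRobinSchedule(streamIds: Seq[Int], executors: Seq[String]): Map[Int, String] =
  streamIds.zipWithIndex.map { case (streamId, i) =>
    streamId -> executors(i % executors.size)
  }.toMap

// roundRobinSchedule(Seq(0, 1, 2, 3), Seq("host-1", "host-2"))
//   => Map(0 -> "host-1", 1 -> "host-2", 2 -> "host-1", 3 -> "host-2")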
A ReceiverInputDStream instance has only one receiver, but the receiver may need to start threads on multiple workers to receive data, so startReceiver converts the receiver and its target executors into an RDD.
val receiverRDD: RDD[Receiver[_]] =
  if (scheduledLocations.isEmpty) {
    ssc.sc.makeRDD(Seq(receiver), 1)
  } else {
    val preferredLocations = scheduledLocations.map(_.toString).distinct
    ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
  }
After this conversion to an RDD, the computation to be performed by the receiver is defined as the function startReceiverFunc. Taking the receiver instance as its argument, this function constructs a ReceiverSupervisorImpl instance, supervisor; once constructed, the supervisor is started and the running thread blocks on it.
val supervisor = new ReceiverSupervisorImpl(
  receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
supervisor.start()
supervisor.awaitTermination()
Finally, receiverRDD and the function to be executed on it are submitted as a Job, which is what actually starts the Receiver on the executors. Once the Job is running, data is received continuously.
val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
  receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
As the Receiver keeps receiving real-time stream data, it decides how to handle it based on the data size: small records are aggregated into a block before block storage, while large pieces of data are stored as blocks directly. In either case the Receiver hands the data over to the ReceiverSupervisor, which performs the actual storage. The configuration parameter spark.streaming.receiver.writeAheadLog.enable determines whether a write-ahead log is used, and depending on its value a different type of receivedBlockHandler storage object is created.
private val receivedBlockHandler: ReceivedBlockHandler = {
  if (WriteAheadLogUtils.enableReceiverLog(env.conf)) {
    // write to the WAL first, then store in the executor's memory or on disk
    new WriteAheadLogBasedBlockHandler(env.blockManager, receiver.streamId,
      receiver.storageLevel, env.conf, hadoopConf, checkpointDirOption.get)
  } else {
    // store directly in the executor's memory or on disk
    new BlockManagerBasedBlockHandler(env.blockManager, receiver.storageLevel)
  }
}
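Whether the write-ahead-log branch above is taken is purely a configuration decision. A minimal sketch (the checkpoint path is illustrative); note that the WAL files are written under the checkpoint directory, so one must be set:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("WALExample")
  // persist received blocks to a write-ahead log before acknowledging them,
  // so they can be replayed after a driver failure
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")   // the WAL lives under this directory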
The block is stored through the receivedBlockHandler. After the block has been stored, a block description blockInfo: ReceivedBlockInfo is obtained, containing the streamId, the location of the data, the number of records, the data size, and so on. An AddBlock(blockInfo) message wrapping this information is then sent to the ReceiverTracker to notify it that a new block has been added.
// call receivedBlockHandler.storeBlock to store the block and obtain a blockStoreResult
val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
// use blockStoreResult to build a ReceivedBlockInfo instance
val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
// notify the ReceiverTracker that a new block has been added and stored
trackerEndpoint.askWithRetry[Boolean](AddBlock(blockInfo))
The ReceiverTracker then forwards this information to the ReceivedBlockTracker, which is responsible for managing the metadata of the received blocks.
private def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = {
  receivedBlockTracker.addBlock(receivedBlockInfo)
}
Step 3: Processing Stream Data
JobScheduler has two main members: one is the ReceiverTracker discussed above, and the other is the JobGenerator. When JobScheduler starts, it creates a new JobGenerator instance, jobGenerator, and calls its start() method. The primary constructor of JobGenerator creates a timer:
private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
  longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")
The timer is configured with the batch interval ssc.graph.batchDuration.milliseconds. Every time a batch interval elapses, eventLoop.post(GenerateJobs(new Time(longTime))) is executed to send a GenerateJobs(new Time(longTime)) message to the eventLoop; upon receiving it, the eventLoop generates the jobs for the data of the current batch and submits them for execution.
private def generateJobs(time: Time) {
  // Checkpoint all RDDs marked for checkpointing to ensure their lineages are
  // truncated periodically. Otherwise, we may run into stack overflows (SPARK-6847).
  ssc.sparkContext.setLocalProperty(RDD.CHECKPOINT_ALL_MARKED_ANCESTORS, "true")
  Try {
    // allocate received blocks to batch
    jobScheduler.receiverTracker.allocateBlocksToBatch(time)
    // generate jobs using allocated block
    graph.generateJobs(time)
  } match {
    case Success(jobs) =>
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
      PythonDStream.stopStreamingContextIfPythonProcessIsDead(e)
  }
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}
As the source shows, on receiving the GenerateJobs(new Time(longTime)) message the eventLoop first calls allocateBlocksToBatch() to allocate the blocks received so far to the batch. It then calls generateJobs() on the DStreamGraph to generate the sequence of jobs for that batch. Finally, the batch time, the job sequence Seq[Job], and the input metadata of this batch are wrapped into a JobSet and submitted through JobScheduler.submitJobSet(JobSet); the JobScheduler then sends these jobs to Spark core for processing.
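It is worth noting that DStreamGraph#generateJobs(time) walks the registered output streams, so each output operation on the graph yields roughly one job per batch. The sketch below (host, port, and output path are illustrative) therefore produces two jobs every second, one for print() and one for saveAsTextFiles():

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("TwoJobsPerBatch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))

val counts = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

// two output operations are registered on the DStreamGraph, hence two jobs per 1-second batch
counts.print()
counts.saveAsTextFiles("hdfs:///tmp/counts")   // illustrative path prefix

ssc.start()
ssc.awaitTermination()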
Step 4: Outputting the Results
Since the data processing itself is carried out by Spark core, the results are output from Spark core directly to external systems such as databases or file systems, where the output data can be consumed directly. Because real-time stream data keeps flowing in, Spark keeps computing round after round and accordingly keeps producing results.
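Continuing the word-count sketch from the introduction, a common way to push each batch's results into an external store is foreachRDD. The database calls below are hypothetical stand-ins (createConnection/insert are not real APIs), while saveAsTextFiles shows the built-in file-system output:

// sketch: pushing each batch's results to an external system
// createConnection() and insert() are hypothetical stand-ins for a real database client
wordCounts.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // val conn = createConnection()   // one connection per partition rather than per record
    partition.foreach { record =>
      // conn.insert(record)
      println(record)                  // placeholder side effect
    }
    // conn.close()
  }
}

// the built-in file-system output writes one directory per batch interval
wordCounts.saveAsTextFiles("hdfs:///tmp/wordcounts")   // illustrative prefix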