10 Spark Streaming Source Code Walkthrough: A Thorough Study of the Full Lifecycle of Continuous Data Reception

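The previous article covered how the Receiver is handled on the Driver. This article covers how the Receiver starts on the Executor, and how it receives and stores data.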

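  1. Start from ReceiverTracker's start method, which calls launchReceivers(). That method sends a message to endpoint with endpoint.send(StartAllReceivers(receivers)). Here endpoint is the ReceiverTrackerEndpoint, so the tracker is effectively sending a message to its own RPC endpoint. The handler for that message is: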
case StartAllReceivers(receivers) =>
  val scheduledLocations = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
  // Start the receivers one by one
  for (receiver <- receivers) {
    val executors = scheduledLocations(receiver.streamId)
    updateReceiverScheduledExecutors(receiver.streamId, executors)
    receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
    // Start this receiver
    startReceiver(receiver, executors)
  }

startReceiver(receiver, executors) is called in a loop, so each receiver is launched as its own job. The code of startReceiver is:

private def startReceiver(
        receiver: Receiver[_],
        scheduledLocations: Seq[TaskLocation]): Unit = {
      def shouldStartReceiver: Boolean = {
        // It's okay to start when trackerState is Initialized or Started
        !(isTrackerStopping || isTrackerStopped)
      }

      val receiverId = receiver.streamId
      if (!shouldStartReceiver) {
        onReceiverJobFinish(receiverId)
        return
      }

      val checkpointDirOption = Option(ssc.checkpointDir)
      val serializableHadoopConf = new SerializableConfiguration(ssc.sparkContext.hadoopConfiguration)

      // Function to start the receiver on the worker node
      // (this is the function that the job's action runs on the worker)
      val startReceiverFunc: Iterator[Receiver[_]] => Unit =
        (iterator: Iterator[Receiver[_]]) => {
          if (!iterator.hasNext) {
            throw new SparkException("Could not start receiver as object not found.")
          }
          // Only the first attempt (attemptNumber == 0) starts the receiver; a retried task skips this block
          if (TaskContext.get().attemptNumber() == 0) {
            val receiver = iterator.next()
            assert(iterator.hasNext == false)
            // Create the receiver supervisor; its start() method starts the receiver, which then receives data
            val supervisor = new ReceiverSupervisorImpl(receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
            supervisor.start()
            supervisor.awaitTermination()
          } else {
            // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
          }
        }

      // Create the RDD using the scheduledLocations to run the receiver in a Spark job
      // (the RDD built here holds the receiver as its only element)
      val receiverRDD: RDD[Receiver[_]] =
        if (scheduledLocations.isEmpty) {
          // No scheduled locations: a single partition with no location preference
          ssc.sc.makeRDD(Seq(receiver), 1)
        } else {
          // Build receiverRDD with preferred locations for data locality
          val preferredLocations = scheduledLocations.map(_.toString).distinct
          ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
        }
      // Configure the job: RDD name, job description, call site
      receiverRDD.setName(s"Receiver $receiverId")
      ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
      ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))
      // This is where receiverRDD is submitted to the cluster
      val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
      // We will keep restarting the receiver job until ReceiverTracker is stopped
      future.onComplete {
        case Success(_) =>
          if (!shouldStartReceiver) {
            onReceiverJobFinish(receiverId)
          } else {
            // Restart the receiver
            logInfo(s"Restarting Receiver $receiverId")
            self.send(RestartReceiver(receiver))
          }
        case Failure(e) =>
          if (!shouldStartReceiver) {
            onReceiverJobFinish(receiverId)
          } else {
            logError("Receiver has been stopped. Try to restart it.", e)
            logInfo(s"Restarting Receiver $receiverId")
            // Restart the receiver
            self.send(RestartReceiver(receiver))
          }
      }(submitJobThreadPool)
      logInfo(s"Receiver ${receiver.streamId} started")
}

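startReceiverFunc takes one record from the iterator, which is the receiver, instantiates a ReceiverSupervisorImpl with it, and calls the supervisor's start method. Nothing is started at this point: the function only defines the operation, and it actually runs on the Executor.
The receiverRDD is then submitted to the cluster with: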

val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
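
The restart behaviour wired into future.onComplete can be illustrated without Spark. The following is only a minimal sketch: runUntilStopped, job, and shouldRun are hypothetical names, and a plain scala.concurrent.Future stands in for the submitted Spark job. It resubmits the work after both success and failure, as long as the stop condition has not been reached.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.concurrent.duration._
import scala.util.{Failure, Success}

object RestartLoopSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  /** Runs `job` again and again until `shouldRun` returns false.
    * The returned future completes with the number of runs that were started. */
  def runUntilStopped(job: () => Unit, shouldRun: () => Boolean): Future[Int] = {
    val runs = new AtomicInteger(0)
    val done = Promise[Int]()

    def submit(): Unit = {
      if (!shouldRun()) {
        // Plays the role of onReceiverJobFinish: nothing is resubmitted
        done.trySuccess(runs.get)
      } else {
        Future { runs.incrementAndGet(); job() }.onComplete {
          case Success(_) => submit() // finished normally: restart
          case Failure(_) => submit() // failed: restart as well
        }
      }
    }

    submit()
    done.future
  }
}
```

In this sketch a failing run does not end the loop; only the stop condition does, which mirrors the "keep restarting the receiver job until ReceiverTracker is stopped" comment in the excerpt.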
  1. Follow startReceiverFunc to see how ReceiverSupervisorImpl runs on the Executor.
    Begin with supervisor.start(), whose code is:
def start() {
  onStart()
  startReceiver()
}

The onStart method is:

/** 
 * Called when supervisor is started.
 * Note that this must be called before the receiver.onStart() is called to ensure 
 * things like [[BlockGenerator]]s are started before the receiver starts sending data.
 */
protected def onStart() { }

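The key point is the comment on onStart: it must be called before receiver.onStart() so that the BlockGenerators are already running when the receiver starts sending data, which ensures that received data can be stored. The subclass implementation of onStart is: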

  override protected def onStart() {
    registeredBlockGenerators.foreach { _.start() }
  }
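
The ordering guarantee comes from start() itself, which always calls the onStart hook before startReceiver. A minimal sketch of that template-method shape, using hypothetical classes (Supervisor and RecordingSupervisor are not Spark types) that only record the order of the calls:

```scala
import scala.collection.mutable.ArrayBuffer

object StartOrderSketch {
  // Hypothetical stand-in for a supervisor base class:
  // start() fixes the order, subclasses only fill in the hooks.
  abstract class Supervisor {
    val events: ArrayBuffer[String] = ArrayBuffer.empty[String]

    protected def onStart(): Unit = ()    // hook: start helpers such as block generators
    protected def startReceiver(): Unit   // hook: start the receiver itself

    final def start(): Unit = {
      onStart()
      startReceiver()
    }
  }

  class RecordingSupervisor extends Supervisor {
    override protected def onStart(): Unit = { events += "generators started" }
    override protected def startReceiver(): Unit = { events += "receiver started" }
  }
}
```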

registeredBlockGenerators is created when ReceiverSupervisorImpl is instantiated:

private val registeredBlockGenerators = new mutable.ArrayBuffer[BlockGenerator]

BlockGenerators are added to registeredBlockGenerators in the createBlockGenerator method:

override def createBlockGenerator(blockGeneratorListener: BlockGeneratorListener): BlockGenerator = {
    // Cleanup BlockGenerators that have already been stopped
    registeredBlockGenerators --= registeredBlockGenerators.filter{ _.isStopped() }

    // One BlockGenerator is created per receiver, since streamId maps one-to-one to a receiver
    val newBlockGenerator = new BlockGenerator(blockGeneratorListener, streamId, env.conf)
    registeredBlockGenerators += newBlockGenerator
    newBlockGenerator
}

So when is createBlockGenerator called? Look at this line:

private val defaultBlockGenerator = createBlockGenerator(defaultBlockGeneratorListener)

registeredBlockGenerators now contains a BlockGenerator. Next, look at BlockGenerator's start() method:

def start(): Unit = synchronized {
    if (state == Initialized) {
      state = Active
      blockIntervalTimer.start()
      blockPushingThread.start()
      logInfo("Started BlockGenerator")
    } else {
      throw new SparkException(
        s"Cannot start BlockGenerator as its not in the Initialized state [state = $state]")
    }
}

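This starts blockIntervalTimer and blockPushingThread. blockIntervalTimer is a timer that calls updateCurrentBuffer every 200 ms by default. The interval is set with spark.streaming.blockInterval, which is also a performance tuning parameter: too short an interval produces many small blocks, while too long an interval can produce oversized blocks, so the right value depends on the workload. updateCurrentBuffer wraps the received data into a block for storage; its code is covered later. blockPushingThread repeatedly polls the blocksForPushing queue for blocks, stores them, and reports them to the ReceiverTrackerEndpoint; its code is also covered later.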
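
To make the tuning trade-off concrete, the arithmetic can be sketched as follows. blocksPerBatch is a hypothetical helper, not a Spark API, and it assumes data arrives during every block interval. Since each block becomes one partition of the batch's RDD, this figure is also roughly the number of tasks per receiver per batch.

```scala
object BlockIntervalSketch {
  /** Approximate number of blocks one receiver produces per batch,
    * assuming data arrives during every block interval. */
  def blocksPerBatch(batchIntervalMs: Long, blockIntervalMs: Long = 200L): Long = {
    require(blockIntervalMs > 0, "block interval must be positive")
    batchIntervalMs / blockIntervalMs
  }
}
```

With a 2-second batch, the default 200 ms interval gives about 10 blocks per receiver, while a 50 ms interval gives about 40 smaller ones.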

  1. With the BlockGenerator started, continue with the startReceiver() call in supervisor.start(). The code of startReceiver() is:
def startReceiver(): Unit = synchronized {
    try {
      if (onReceiverStart()) {
        logInfo("Starting receiver")
        receiverState = Started
        receiver.onStart()
        logInfo("Called receiver onStart")
      } else {
        // The driver refused us
        stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
      }
    } catch {
      case NonFatal(t) =>
        stop("Error starting receiver " + streamId, Some(t))
    }
}

It first checks the return value of onReceiverStart(). The subclass implementation of onReceiverStart() is:

override protected def onReceiverStart(): Boolean = {
    val msg = RegisterReceiver(
      streamId, receiver.getClass.getSimpleName, host, executorId, endpoint)
    trackerEndpoint.askWithRetry[Boolean](msg)
}

onReceiverStart sends a RegisterReceiver message to trackerEndpoint. On receiving it, trackerEndpoint wraps the registration details in a ReceiverTrackingInfo case class and puts it into receiverTrackingInfos as a key-value pair keyed by streamId, which again shows that one input DStream corresponds to one receiver.
Back in startReceiver, when the call returns true, receiverState is set to Started and the receiver's onStart method is called.
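
The bookkeeping on the tracker side can be pictured as a map keyed by streamId. The sketch below is a simplified illustration only: TrackingInfo, Registry, and the stopping flag are hypothetical stand-ins, and the real tracker applies more checks before accepting a registration.

```scala
import scala.collection.mutable

object ReceiverRegistrySketch {
  // Hypothetical, trimmed-down stand-in for ReceiverTrackingInfo
  final case class TrackingInfo(streamId: Int, name: String, host: String, executorId: String)

  class Registry {
    private val infos = mutable.HashMap.empty[Int, TrackingInfo]
    @volatile var stopping: Boolean = false

    /** Returns true when the receiver was registered, false when it is refused. */
    def register(info: TrackingInfo): Boolean = synchronized {
      if (stopping) {
        false
      } else {
        infos.put(info.streamId, info) // one entry per streamId, so one per receiver
        true
      }
    }

    def lookup(streamId: Int): Option[TrackingInfo] = synchronized { infos.get(streamId) }
  }
}
```

The Boolean result plays the same role as the reply to RegisterReceiver: true lets the caller go on to start the receiver, false makes it stop.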

  1. Taking SocketReceiver as an example, its onStart method starts a daemon thread that calls receive() to receive data:
def onStart() {
    // Start the thread that receives data over a connection
    new Thread("Socket Receiver") {
      setDaemon(true)
      override def run() { receive() }
    }.start()
}

Next, the receive() method:

def receive() {
    var socket: Socket = null
    try {
      logInfo("Connecting to " + host + ":" + port)
      socket = new Socket(host, port)
      logInfo("Connected to " + host + ":" + port)
      val iterator = bytesToObjects(socket.getInputStream())
      while(!isStopped && iterator.hasNext) {
        store(iterator.next)
      }
      if (!isStopped()) {
        restart("Socket data stream had no more data")
      } else {
        logInfo("Stopped receiving")
      }
    } catch {
      case e: java.net.ConnectException =>
        restart("Error connecting to " + host + ":" + port, e)
      case NonFatal(e) =>
        logWarning("Error receiving data", e)
        restart("Error receiving data", e)
    } finally {
      if (socket != null) {
        socket.close()
        logInfo("Closed socket to " + host + ":" + port)
      }
    }
}
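
The core of this loop can be tried out without a socket or Spark. The sketch below is a simplified illustration: receiveLines is a hypothetical function, any InputStream stands in for socket.getInputStream(), and the store callback stands in for the receiver's store method. Reconnecting and restarting are left out.

```scala
import java.io.{BufferedReader, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets

object ReceiveLoopSketch {
  /** Reads lines from `in` and hands each one to `store` until the input ends
    * or `isStopped` returns true. Returns the number of lines stored. */
  def receiveLines(in: InputStream, store: String => Unit, isStopped: () => Boolean): Int = {
    val reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))
    var count = 0
    try {
      var line = reader.readLine()
      while (!isStopped() && line != null) {
        store(line)
        count += 1
        line = reader.readLine()
      }
    } finally {
      reader.close() // mirrors closing the socket in the finally block
    }
    count
  }
}
```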

The receive method is straightforward: it opens a socket and, each time it receives a line, calls store to save it. The store method is:

def store(dataItem: T) {
  supervisor.pushSingle(dataItem)
}

This calls pushSingle on supervisor, which is an instance of ReceiverSupervisorImpl, the implementation class of ReceiverSupervisor:

def pushSingle(data: Any) {
  defaultBlockGenerator.addData(data)
}

As mentioned above, defaultBlockGenerator is a member of ReceiverSupervisorImpl. Its addData method is:

def addData(data: Any): Unit = {
    if (state == Active) {
      waitToPush()
      synchronized {
        if (state == Active) {
          currentBuffer += data
        } else {
          throw new SparkException(
            "Cannot add data as BlockGenerator has not been started or has been stopped")
        }
      }
    } else {
      throw new SparkException(
        "Cannot add data as BlockGenerator has not been started or has been stopped")
    }
}

currentBuffer += data keeps appending data to currentBuffer. So how does the data in currentBuffer get stored? This is where blockIntervalTimer and blockPushingThread, introduced earlier, come in.

  1. First, the updateCurrentBuffer() method that blockIntervalTimer calls periodically:
private def updateCurrentBuffer(time: Long): Unit = {
   try {
     var newBlock: Block = null
     synchronized {
       if (currentBuffer.nonEmpty) {
         val newBlockBuffer = currentBuffer
         currentBuffer = new ArrayBuffer[Any]
         val blockId = StreamBlockId(receiverId, time - blockIntervalMs)
         listener.onGenerateBlock(blockId)
         newBlock = new Block(blockId, newBlockBuffer)
       }
     }

     if (newBlock != null) {
       blocksForPushing.put(newBlock)  // put is blocking when queue is full
     }
   } catch {
     case ie: InterruptedException =>
       logInfo("Block updating timer thread was interrupted")
     case e: Exception =>
       reportError("Error in block updating thread", e)
   }
}

It hands currentBuffer over to newBlockBuffer, assigns a fresh empty ArrayBuffer to currentBuffer, wraps newBlockBuffer in a new Block, and finally puts newBlock into the blocksForPushing queue.
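
The buffer swap can be exercised on its own. The following is a minimal sketch under stated assumptions: MiniBlockGenerator and its Block are hypothetical simplifications, the timer callback is invoked by hand instead of by a timer thread, and the rate limiting and state checks of the real addData are omitted.

```scala
import java.util.concurrent.ArrayBlockingQueue
import scala.collection.mutable.ArrayBuffer

object BufferSwapSketch {
  final case class Block(id: Long, data: ArrayBuffer[Any])

  class MiniBlockGenerator(blockIntervalMs: Long, queueSize: Int) {
    private var currentBuffer = new ArrayBuffer[Any]
    val blocksForPushing = new ArrayBlockingQueue[Block](queueSize)

    // Appends under the same lock that the swap uses
    def addData(data: Any): Unit = synchronized { currentBuffer += data }

    /** What the timer callback does: swap the buffer out, then enqueue it as a block. */
    def updateCurrentBuffer(time: Long): Unit = {
      var newBlock: Option[Block] = None
      synchronized {
        if (currentBuffer.nonEmpty) {
          val full = currentBuffer
          currentBuffer = new ArrayBuffer[Any] // writers continue on a fresh buffer
          newBlock = Some(Block(time - blockIntervalMs, full))
        }
      }
      // Enqueue outside the lock: put blocks when the queue is full
      newBlock.foreach(b => blocksForPushing.put(b))
    }
  }
}
```

Keeping the blocking put outside the synchronized section means a full queue stalls only the timer callback, not the thread calling addData. An empty buffer produces no block at all.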

  1. The rest is the job of blockPushingThread, which runs the keepPushingBlocks method:
private def keepPushingBlocks() {
    logInfo("Started block pushing thread")

    def areBlocksBeingGenerated: Boolean = synchronized {
      state != StoppedGeneratingBlocks
    }

    try {
      // While blocks are being generated, keep polling for to-be-pushed blocks and push them.
      while (areBlocksBeingGenerated) {
        Option(blocksForPushing.poll(10, TimeUnit.MILLISECONDS)) match {
          case Some(block) => pushBlock(block)
          case None =>
        }
      }

      // At this point, state is StoppedGeneratingBlock. So drain the queue of to-be-pushed blocks.
      logInfo("Pushing out the last " + blocksForPushing.size() + " blocks")
      while (!blocksForPushing.isEmpty) {
        val block = blocksForPushing.take()
        logDebug(s"Pushing block $block")
        pushBlock(block)
        logInfo("Blocks left to push " + blocksForPushing.size())
      }
      logInfo("Stopped block pushing thread")
    } catch {
      case ie: InterruptedException =>
        logInfo("Block pushing thread was interrupted")
      case e: Exception =>
        reportError("Error in block pushing thread", e)
    }
}

It polls the blocksForPushing queue, waiting up to 10 ms each time, and calls pushBlock on every block it gets:

Option(blocksForPushing.poll(10, TimeUnit.MILLISECONDS)) match {
       case Some(block) => pushBlock(block)
       case None =>
}
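
The two phases of keepPushingBlocks, polling while blocks are still being generated and then draining whatever is left, can be sketched with plain java.util.concurrent. Pusher and stopGenerating are hypothetical names, and strings stand in for blocks.

```scala
import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}

object PusherSketch {
  class Pusher(queue: LinkedBlockingQueue[String], push: String => Unit) {
    @volatile private var generating = true

    def stopGenerating(): Unit = { generating = false }

    def keepPushing(): Unit = {
      // Phase 1: while blocks may still arrive, poll with a short timeout
      while (generating) {
        Option(queue.poll(10, TimeUnit.MILLISECONDS)).foreach(push)
      }
      // Phase 2: nothing new will arrive, so drain what is left
      while (!queue.isEmpty) {
        push(queue.take())
      }
    }
  }
}
```

The timeout on poll is what lets the loop notice the state change promptly; a plain take() in phase 1 could block forever once no more blocks are produced.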

Next, the pushBlock(block) method, whose body is:

listener.onPushBlock(block.id, block.buffer)

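This calls the listener's onPushBlock method. Where does listener come from? It is passed in when the BlockGenerator is instantiated, and that happens in createBlockGenerator, which receives the listener as a parameter and forwards it to BlockGenerator. Tracing the call to createBlockGenerator leads to where defaultBlockGeneratorListener is instantiated: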

private val defaultBlockGeneratorListener = new BlockGeneratorListener {
    def onAddData(data: Any, metadata: Any): Unit = { }

    def onGenerateBlock(blockId: StreamBlockId): Unit = { }

    def onError(message: String, throwable: Throwable) {
      reportError(message, throwable)
    }

    def onPushBlock(blockId: StreamBlockId, arrayBuffer: ArrayBuffer[_]) {
      pushArrayBuffer(arrayBuffer, None, Some(blockId))
    }
}

So this is where onPushBlock lives. It calls pushArrayBuffer, whose code is:

def pushArrayBuffer(
      arrayBuffer: ArrayBuffer[_],
      metadataOption: Option[Any],
      blockIdOption: Option[StreamBlockId]
    ) {
    pushAndReportBlock(ArrayBufferBlock(arrayBuffer), metadataOption, blockIdOption)
}

The key line is pushAndReportBlock(ArrayBufferBlock(arrayBuffer), metadataOption, blockIdOption). Its code is:

def pushAndReportBlock(
      receivedBlock: ReceivedBlock,
      metadataOption: Option[Any],
      blockIdOption: Option[StreamBlockId]
    ) {
    val blockId = blockIdOption.getOrElse(nextBlockId)
    val time = System.currentTimeMillis
    val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
    logDebug(s"Pushed block $blockId in ${(System.currentTimeMillis - time)} ms")
    val numRecords = blockStoreResult.numRecords
    val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
    trackerEndpoint.askWithRetry[Boolean](AddBlock(blockInfo))
    logDebug(s"Reported block $blockId")
}

This does two things. First, it calls receivedBlockHandler to store the block.
Second, it reports the storage result, blockInfo, to trackerEndpoint.
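
The store-then-report sequence can be sketched with the two collaborators passed in as functions. Everything here is a hypothetical simplification: StoreResult and BlockInfo are reduced stand-ins for the Spark classes, and store and report replace the block handler and the RPC call to the tracker.

```scala
object PushAndReportSketch {
  final case class StoreResult(blockId: String, numRecords: Option[Long])
  final case class BlockInfo(streamId: Int, numRecords: Option[Long], result: StoreResult)

  def pushAndReport(
      streamId: Int,
      blockId: String,
      records: Seq[Any],
      store: (String, Seq[Any]) => StoreResult,
      report: BlockInfo => Boolean): Boolean = {
    // 1. Store first; if this throws, nothing is reported
    val result = store(blockId, records)
    // 2. Report only metadata about the stored block, never the data itself
    val info = BlockInfo(streamId, result.numRecords, result)
    report(info)
  }
}
```

The order matters: the report only ever describes a block that has already been stored, and it carries metadata rather than the records.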

  1. receivedBlockHandler is created when ReceiverSupervisorImpl is instantiated:
private val receivedBlockHandler: ReceivedBlockHandler = {
    if (WriteAheadLogUtils.enableReceiverLog(env.conf)) {
      if (checkpointDirOption.isEmpty) {
        throw new SparkException(
          "Cannot enable receiver write-ahead log without checkpoint directory set. " +
            "Please use streamingContext.checkpoint() to set the checkpoint directory. " +
            "See documentation for more details.")
      }
      new WriteAheadLogBasedBlockHandler(env.blockManager, receiver.streamId,
        receiver.storageLevel, env.conf, hadoopConf, checkpointDirOption.get)
    } else {
      new BlockManagerBasedBlockHandler(env.blockManager, receiver.storageLevel)
    }
}

There are two kinds of handler: a WAL-based one and a plain one. The WAL-based one is left for later; here is BlockManagerBasedBlockHandler:

private[streaming] class BlockManagerBasedBlockHandler(
    blockManager: BlockManager, storageLevel: StorageLevel)
  extends ReceivedBlockHandler with Logging {

  def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {

    var numRecords = None: Option[Long]

    val putResult: Seq[(BlockId, BlockStatus)] = block match {
      case ArrayBufferBlock(arrayBuffer) =>
        numRecords = Some(arrayBuffer.size.toLong)
        blockManager.putIterator(blockId, arrayBuffer.iterator, storageLevel,tellMaster = true)
      case IteratorBlock(iterator) =>
        val countIterator = new CountingIterator(iterator)
        val putResult = blockManager.putIterator(blockId, countIterator, storageLevel,tellMaster = true)
        numRecords = countIterator.count
        putResult
      case ByteBufferBlock(byteBuffer) =>
        blockManager.putBytes(blockId, byteBuffer, storageLevel, tellMaster = true)
      case o =>
        throw new SparkException(
          s"Could not store $blockId to block manager, unexpected block type ${o.getClass.getName}")
    }
    if (!putResult.map { _._1 }.contains(blockId)) {
      throw new SparkException(s"Could not store $blockId to block manager with storage level $storageLevel")
    }
    BlockManagerBasedStoreResult(blockId, numRecords)
  }

  def cleanupOldBlocks(threshTime: Long) {
    // this is not used as blocks inserted into the BlockManager are cleared by DStream's clearing
    // of BlockRDDs.
  }
}

This uses the BlockManager to store the block and returns the block's storage metadata. That completes the walkthrough of how a receiver receives and stores data.
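
The way storeBlock derives the record count differs by block type: an ArrayBuffer knows its size up front, while an iterator has to be counted as it is consumed. A self-contained sketch of that idea, where InMemoryStore is a hypothetical stand-in for the BlockManager and the block types are simplified copies rather than the Spark classes:

```scala
import scala.collection.mutable

object StoreBlockSketch {
  sealed trait ReceivedBlock
  final case class ArrayBufferBlock(buffer: mutable.ArrayBuffer[Any]) extends ReceivedBlock
  final case class IteratorBlock(iterator: Iterator[Any]) extends ReceivedBlock

  /** Wraps an iterator and counts elements as they are consumed. */
  class CountingIterator[T](underlying: Iterator[T]) extends Iterator[T] {
    private var n = 0L
    def hasNext: Boolean = underlying.hasNext
    def next(): T = {
      val item = underlying.next()
      n += 1
      item
    }
    /** The count is only known once the iterator has been fully consumed. */
    def count: Option[Long] = if (hasNext) None else Some(n)
  }

  // Hypothetical in-memory stand-in for the BlockManager
  class InMemoryStore {
    private val blocks = mutable.HashMap.empty[String, Vector[Any]]
    def put(id: String, it: Iterator[Any]): Unit = { blocks(id) = it.toVector }
    def get(id: String): Option[Vector[Any]] = blocks.get(id)
  }

  /** Stores the block and returns the number of records, when it is known. */
  def storeBlock(store: InMemoryStore, id: String, block: ReceivedBlock): Option[Long] =
    block match {
      case ArrayBufferBlock(buf) =>
        store.put(id, buf.iterator)
        Some(buf.size.toLong)
      case IteratorBlock(it) =>
        val counting = new CountingIterator(it)
        store.put(id, counting) // consuming the iterator is what produces the count
        counting.count
    }
}
```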

  1. The whole flow is quite clear. A flowchart would make it even better; one will be added later. Thanks.