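Introduction

The previous part, "TaskScheduler Source and Task Submission: A Brief Analysis, Part 2", described how the Driver side divides a job into Stages, hands out tasks according to which Executors are idle, and finally has the DriverActor send the task message to the executorActor.
To understand how an Executor runs tasks, we first need to understand how the Executor is registered on the Driver side, so this article starts with the registration of the Application and of the Executor.

1. The Task class and related classes

1.1 The Task class

Spark divides the Tasks run by an Executor into two kinds, ShuffleMapTask and ResultTask; their source lives in the scheduler package.
Task is the interface that sits between DAGScheduler and TaskScheduler: DAGScheduler wraps every partition of every stage in the DAG into a task, and finally submits the resulting TaskSet to TaskScheduler.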

/**
 * A unit of execution. We have two kinds of Task's in Spark:
 * - [[org.apache.spark.scheduler.ShuffleMapTask]]
 * - [[org.apache.spark.scheduler.ResultTask]]
 *
 * A Spark job consists of one or more stages. The very last stage in a job consists of multiple
 * ResultTasks, while earlier stages consist of ShuffleMapTasks. A ResultTask executes the task
 * and sends the task output back to the driver application. A ShuffleMapTask executes the task
 * and divides the task output to multiple buckets (based on the task's partitioner).
 *
 * @param stageId id of the stage this task belongs to
 * @param partitionId index of the number in the RDD
 */
private[spark] abstract class Task[T](val stageId: Int, var partitionId: Int) extends Serializable

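A Task corresponds to one stageId and one partitionId.
It provides methods such as runTask() and kill().
It provides fields such as killed, a TaskMetrics and a TaskContext.

Beyond these basic methods and fields, Task's companion object provides methods for serializing and deserializing the jars that the application depends on. A worker node must have the other dependencies that the Task needs, registered under the SparkContext, so the companion object offers methods to write those dependencies to a stream and read them back.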
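
To make the shape of this interface concrete, here is a minimal, self-contained sketch. It is not Spark's code: SimpleTask, SumTask and run are hypothetical names chosen for illustration, and TaskMetrics and TaskContext are left out entirely.

```scala
object TaskSketch {
  // Hypothetical, simplified stand-in for Spark's Task[T]; not the real class.
  abstract class SimpleTask[T](val stageId: Int, val partitionId: Int) extends Serializable {
    // Set by kill(); a long-running runTask() could poll it to stop early.
    @volatile private var _killed = false
    def killed: Boolean = _killed
    def kill(): Unit = { _killed = true }

    // Subclasses implement the actual work for one partition.
    def runTask(): T

    // Entry point an executor would call: skip the work if already killed.
    final def run(): Option[T] = if (killed) None else Some(runTask())
  }

  // Example subclass (an assumption for this sketch): sums one partition's records.
  class SumTask(stageId: Int, partitionId: Int, data: Seq[Int])
      extends SimpleTask[Int](stageId, partitionId) {
    override def runTask(): Int = data.sum
  }
}
```

A SumTask built over Seq(1, 2, 3) returns its sum from run() until kill() is called, after which run() yields None.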

1.2 ShuffleMapTask

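It corresponds to a ShuffleMap stage, whose output serves as the input of other stages.
ShuffleMapTask, whose result type is MapStatus, overrides the methods that write it out and read it back, because what it writes out comprises stageId, rdd, dep, partitionId, epoch and split (one partition). Of these, stageId, rdd and dep go through a common serialization and deserialization step whose result is cached in memory before being written to the ObjectOutput. The serialized bytes are compressed with Gzip and kept in serializedInfoCache = new HashMap[Int, Array[Byte]]. The reason for serializing and caching this part is that stageId, rdd and dep are what really describe the shuffle task, so the serialized result is cached to lighten the load on the master node. Note that this description matches an older form of the class: in the version quoted below, the serialized RDD and ShuffleDependency arrive as the broadcast variable taskBinary.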

/**
 * A ShuffleMapTask divides the elements of an RDD into multiple buckets (based on a partitioner
 * specified in the ShuffleDependency).
 *
 * See [[org.apache.spark.scheduler.Task]] for more information.
 *
 * @param stageId id of the stage this task belongs to
 * @param taskBinary broadcast version of the RDD and the ShuffleDependency. Once deserialized,
 *                   the type should be (RDD[_], ShuffleDependency[_, _, _]).
 * @param partition partition of the RDD this task is associated with
 * @param locs preferred task execution locations for locality scheduling
 */
private[spark] class ShuffleMapTask(
    stageId: Int,
    taskBinary: Broadcast[Array[Byte]],
    partition: Partition,
    @transient private var locs: Seq[TaskLocation])
  extends Task[MapStatus](stageId, partition.index) with Logging {
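
The caching scheme described in the prose above (serialize once per stage, Gzip the bytes, keep them in a map keyed by stageId) can be sketched as follows. This is only an illustration under stated assumptions: SerializedInfoCacheSketch, serializeInfo, deserializeInfo and the serializations counter are hypothetical names, and plain Java serialization stands in for whatever serializer Spark is configured with.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}
import scala.collection.mutable

object SerializedInfoCacheSketch {
  // stageId -> Gzip-compressed, Java-serialized payload (hypothetical cache).
  private val cache = new mutable.HashMap[Int, Array[Byte]]
  // Counts real serializations so that cache hits can be observed.
  var serializations = 0

  // Serialize and compress `info` once per stage; later calls reuse the cached bytes.
  def serializeInfo(stageId: Int, info: java.io.Serializable): Array[Byte] = synchronized {
    cache.getOrElseUpdate(stageId, {
      serializations += 1
      val bytes = new ByteArrayOutputStream()
      val out = new ObjectOutputStream(new GZIPOutputStream(bytes))
      out.writeObject(info)
      out.close() // closing also finishes the Gzip stream
      bytes.toByteArray
    })
  }

  // Reverse of serializeInfo: decompress, then deserialize.
  def deserializeInfo(data: Array[Byte]): AnyRef = {
    val in = new ObjectInputStream(new GZIPInputStream(new ByteArrayInputStream(data)))
    try in.readObject() finally in.close()
  }
}
```

Every task of the same stage would then share one byte array instead of repeating the serialization for each partition.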

1.3 ResultTask

It corresponds to the Result stage and produces the result directly.

/**
 * A task that sends back the output to the driver application.
 *
 * See [[Task]] for more information.
 *
 * @param stageId id of the stage this task belongs to
 * @param taskBinary broadcasted version of the serialized RDD and the function to apply on each
 *                   partition of the given RDD. Once deserialized, the type should be
 *                   (RDD[T], (TaskContext, Iterator[T]) => U).
 * @param partition partition of the RDD this task is associated with
 * @param locs preferred task execution locations for locality scheduling
 * @param outputId index of the task in this job (a job can launch tasks on only a subset of the
 *                 input RDD's partitions).
 */
private[spark] class ResultTask[T, U](
    stageId: Int,
    taskBinary: Broadcast[Array[Byte]],
    partition: Partition,
    @transient locs: Seq[TaskLocation],
    val outputId: Int)
  extends Task[U](stageId, partition.index) with Serializable {

1.4 TaskSet

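TaskSet is a data structure that wraps all the tasks of one stage so that they can be submitted to TaskScheduler.
A TaskSet is a group of tasks that can be pipelined and that share exactly the same processing logic; they differ only in the data they process, with each task responsible for one partition. Pipelining can be called a cornerstone of big-data processing: only when the data is processed as a pipeline can the work be put on a cluster to run. A single task starts from the data source and then executes the operations one after another in topological order (in practice by calling the RDD's compute).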

/**
 * A set of tasks submitted together to the low-level TaskScheduler, usually representing
 * missing partitions of a particular stage.
 */
private[spark] class TaskSet(
    val tasks: Array[Task[_]],
    val stageId: Int,
    val attempt: Int,
    val priority: Int,
    val properties: Properties) {
  val id: String = stageId + "." + attempt

  override def toString: String = "TaskSet " + id
}
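
As a rough illustration of "one task per partition, all wrapped in one TaskSet", the sketch below builds such a set from a list of partition indices. PartitionTask, SimpleTaskSet and buildTaskSet are hypothetical simplifications introduced here, not Spark's classes, and the properties field is omitted.

```scala
object TaskSetSketch {
  // Hypothetical stand-in for a task: only remembers its stage and partition.
  final case class PartitionTask(stageId: Int, partitionId: Int)

  // Simplified TaskSet: same id and toString scheme as the class quoted above.
  final case class SimpleTaskSet(
      tasks: Array[PartitionTask],
      stageId: Int,
      attempt: Int,
      priority: Int) {
    val id: String = s"$stageId.$attempt"
    override def toString: String = "TaskSet " + id
  }

  // One task per partition that still has to be computed for the stage.
  def buildTaskSet(
      stageId: Int,
      attempt: Int,
      priority: Int,
      missingPartitions: Seq[Int]): SimpleTaskSet = {
    val tasks = missingPartitions.map(p => PartitionTask(stageId, p)).toArray
    SimpleTaskSet(tasks, stageId, attempt, priority)
  }
}
```

All tasks in the set carry the same stageId and differ only in partitionId, which mirrors the "same logic, different data" description.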

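2. Registering an Executor with the Driver

The Driver sends a LaunchTask message, the Executor side receives it, and the message is handled with launchTask.
Before that can happen, however, the Executor must have registered with the Driver; otherwise, even if a LaunchTask command arrives, no task is processed. So we first need to understand how an Executor registers on the Driver side.

2.1 Application registration

Executor registration happens as part of Application registration. Taking Standalone mode as the example:

SparkContext creates schedulerBackend and taskScheduler, with schedulerBackend held as a member of the TaskScheduler object
--> when start is called on the TaskScheduler object, it in fact calls backend.start()
--> backend.start() starts an AppClient; one of AppClient's parameters, ApplicationDescription, wraps the command that runs CoarseGrainedExecutorBackend
--> AppClient starts a ClientActor internally; once started, the ClientActor tries to register an Application by sending the Master the message actor ! RegisterApplication(appDescription)

Below is the part of SparkDeploySchedulerBackend's start function that registers the Application: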

    // Start executors with a few necessary configs for registering with the scheduler
    val sparkJavaOpts = Utils.sparkJavaOpts(conf, SparkConf.isExecutorStartupConf)
    val javaOpts = sparkJavaOpts ++ extraJavaOpts
    val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
      args, sc.executorEnvs, classPathEntries ++ testingClassPath, libraryPathEntries, javaOpts)
    val appUIAddress = sc.ui.map(_.appUIAddress).getOrElse("")
    val appDesc = new ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
      appUIAddress, sc.eventLogDir, sc.eventLogCodec)

    client = new AppClient(sc.env.actorSystem, masters, appDesc, this, conf)
    client.start()

AppClient submits the Application to the Master
AppClient is the interface through which the Application talks to the Master. It contains a member variable actor of type org.apache.spark.deploy.client.AppClient.ClientActor, which handles all interaction with the Master. The call chain involved in submitting the Application is:
ClientActor's preStart() --> registerWithMaster() --> tryRegisterAllMasters --> actor ! RegisterApplication(appDescription) --> Master's receiveWithLogging handles the RegisterApplication message.

Below is the code that handles the RegisterApplication(appDescription) message (part of receiveWithLogging in Master.scala):

    case RegisterApplication(description) => {
      if (state == RecoveryState.STANDBY) {
        // ignore, don't send response
      } else {
        logInfo("Registering app " + description.name)
        val app = createApplication(description, sender)
        registerApplication(app)
        logInfo("Registered app " + description.name + " with ID " + app.id)
        persistenceEngine.addApplication(app)
        sender ! RegisteredApplication(app.id, masterUrl)
        // Allocate resources to applications that are waiting for them. schedule() is called
        // whenever a new application or new resources are added.
        schedule()
      }
    }

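This code does the following:

createApplication builds an ApplicationInfo, the data structure that describes this app
registerApplication registers the Application, updates the corresponding mappings, and adds it to the waiting queue
persistenceEngine persists the Application information; by default nothing is saved, and the two other options store it in files or in ZooKeeper
The sender is notified that registration succeeded
Scheduling starts (schedule() allocates resources to applications that are waiting for them, and is called whenever a new Application or new resources are added)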
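
The steps listed above can be modelled without actors as a plain method that returns the reply for the sender. Everything in this sketch is a hypothetical simplification: SimpleMaster, AppDescription, AppInfo, the app id format and the scheduleCalls counter are assumptions made for illustration, and the persistence engine is reduced to an in-memory buffer.

```scala
import scala.collection.mutable

object MasterRegistrationSketch {
  sealed trait RecoveryState
  case object Standby extends RecoveryState
  case object Alive extends RecoveryState

  final case class AppDescription(name: String, maxCores: Int)
  final case class AppInfo(id: String, desc: AppDescription)
  final case class RegisteredApplication(appId: String, masterUrl: String)

  class SimpleMaster(val masterUrl: String, var state: RecoveryState) {
    val waitingApps = mutable.ArrayBuffer.empty[AppInfo]
    val persisted = mutable.ArrayBuffer.empty[AppInfo] // stand-in for persistenceEngine
    var scheduleCalls = 0
    private var nextAppNumber = 0

    // Real resource allocation is omitted; the sketch only counts invocations.
    private def schedule(): Unit = { scheduleCalls += 1 }

    // Returns the reply for the sender, or None when a standby master ignores the request.
    def handleRegisterApplication(desc: AppDescription): Option[RegisteredApplication] =
      if (state == Standby) {
        None
      } else {
        val app = AppInfo(f"app-$nextAppNumber%04d", desc) // id format is an assumption
        nextAppNumber += 1
        waitingApps += app // register: add to the waiting queue
        persisted += app   // persist the application
        val reply = RegisteredApplication(app.id, masterUrl)
        schedule()         // try to allocate resources right away
        Some(reply)
      }
  }
}
```

A standby master returns None and records nothing, which corresponds to the "ignore, don't send response" branch in the quoted handler.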

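2.2 The schedule function in Master

schedule() allocates resources to applications that are waiting for them, and it is called whenever a new Application or new resources are added. There are currently two strategies for choosing the workers (executors) for an Application:

Spread out (strategy 1): distribute one Application over as many different nodes as possible. This is controlled by spark.deploy.spreadOut, whose default value is true, that is, spread out as much as possible.

Pack (strategy 2): place one Application on as few nodes as possible.

For a given Application, a worker can hold only one of its executors; that executor may of course have more than one core. With strategy 1, deploying tasks is slower than with strategy 2, but GC takes less time.

The source of the schedule function follows, with explanations in the added comments: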

  /*
   * Schedule the currently available resources among waiting apps. This method will be called
   * every time a new app joins or resource availability changes.
   */
  private def schedule() {
    if (state != RecoveryState.ALIVE) { return }

    // First schedule drivers, they take strict precedence over applications
    // Randomization helps balance drivers
    val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
    val numWorkersAlive = shuffledAliveWorkers.size
    var curPos = 0

    for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
      // We assign workers to each waiting driver in a round-robin fashion. For each driver, we
      // start from the last worker that was assigned a driver, and continue onwards until we have
      // explored all alive workers.
      var launched = false
      var numWorkersVisited = 0
      while (numWorkersVisited < numWorkersAlive && !launched) {
        val worker = shuffledAliveWorkers(curPos)
        numWorkersVisited += 1
        if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
          launchDriver(worker, driver)
          waitingDrivers -= driver
          launched = true
        }
        curPos = (curPos + 1) % numWorkersAlive
      }
    }

    // Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app
    // in the queue, then the second app, etc.
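    if (spreadOutApps) { // Spread the load out: cores are handed out one at a time across workers
      // Try to spread out each app among all the nodes, until it has all its cores
      for (app <- waitingApps if app.coresLeft > 0) {
        // A worker is usable if its state is ALIVE, it does not yet run an executor of this
        // application, and it has enough free memory.
        // Among the usable workers, those with more free cores are preferred.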
        val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
          .filter(canUse(app, _)).sortBy(_.coresFree).reverse
        val numUsable = usableWorkers.length
        val assigned = new Array[Int](numUsable) // Number of cores to give on each node
        var toAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)
        var pos = 0
        while (toAssign > 0) {
          if (usableWorkers(pos).coresFree - assigned(pos) > 0) {
            toAssign -= 1
            assigned(pos) += 1
          }
          pos = (pos + 1) % numUsable
        }
        // Now that we've decided how many cores to give on each node, let's actually give them
        for (pos <- 0 until numUsable) {
          if (assigned(pos) > 0) {
            val exec = app.addExecutor(usableWorkers(pos), assigned(pos))
            launchExecutor(usableWorkers(pos), exec)
            app.state = ApplicationState.RUNNING
          }
        }
      }
    } else { // Pack: use as many cores as possible on each worker
      // Pack each app into as few nodes as possible until we've assigned all its cores
      for (worker <- workers if worker.coresFree > 0 && worker.state == WorkerState.ALIVE) {
        for (app <- waitingApps if app.coresLeft > 0) {
          if (canUse(app, worker)) {
            val coresToUse = math.min(worker.coresFree, app.coresLeft)
            if (coresToUse > 0) {
              val exec = app.addExecutor(worker, coresToUse)
              launchExecutor(worker, exec)
              app.state = ApplicationState.RUNNING
            }
          }
        }
      }
    }
  }
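
The two placement strategies in schedule() can be isolated from the Master's bookkeeping as two pure functions over the free-core counts of the usable workers. This is a sketch for a single application; CoreAllocationSketch, spreadOut and pack are names invented here, and the memory check done by canUse is assumed to have happened already.

```scala
object CoreAllocationSketch {
  // Spread out: hand out one core at a time, round-robin over the workers.
  // coresFree(i) is the number of free cores on worker i.
  def spreadOut(coresFree: Array[Int], coresWanted: Int): Array[Int] = {
    val assigned = new Array[Int](coresFree.length)
    var toAssign = math.min(coresWanted, coresFree.sum)
    var pos = 0
    while (toAssign > 0) {
      if (coresFree(pos) - assigned(pos) > 0) {
        toAssign -= 1
        assigned(pos) += 1
      }
      pos = (pos + 1) % coresFree.length
    }
    assigned
  }

  // Pack: take as many cores as possible from each worker before moving on.
  def pack(coresFree: Array[Int], coresWanted: Int): Array[Int] = {
    var left = coresWanted
    coresFree.map { free =>
      val use = math.min(free, left)
      left -= use
      use
    }
  }
}
```

For workers with 4, 4 and 2 free cores and an application that still wants 5 cores, spreadOut yields 2, 2 and 1 cores, while pack yields 4, 1 and 0.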

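2.3 The launchExecutor function

Once a worker has been chosen and the number of CPU cores for its executor decided, the Master calls launchExecutor(worker: WorkerInfo, exec: ExecutorDesc) to send the request to the Worker and to tell the AppClient that an executor has been added. It also updates the worker information that the Master holds: it adds the executor and reduces the available CPU cores and memory. The Master does not wait until the executor has actually started on the worker before updating this information. If the worker fails to start the executor, it sends a FAILED message to the Master, and the Master simply updates the worker information again when it receives that message.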

  def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc) {
    logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
    worker.addExecutor(exec)
    worker.actor ! LaunchExecutor(masterUrl,
      exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory)
    exec.application.driver ! ExecutorAdded(
      exec.id, worker.id, worker.hostPort, exec.cores, exec.memory)
  }
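
The optimistic bookkeeping described above (reserve the resources at launch time, give them back if the worker later reports FAILED) can be sketched with a small model of the Master's view of one worker. WorkerView and ExecDesc are hypothetical names for this illustration and do not reproduce Spark's WorkerInfo or ExecutorDesc.

```scala
import scala.collection.mutable

object WorkerBookkeepingSketch {
  final case class ExecDesc(id: Int, cores: Int, memory: Int)

  // The Master's view of one worker: free resources are derived from the
  // executors it believes are placed there.
  class WorkerView(val totalCores: Int, val totalMemory: Int) {
    private val executors = mutable.Map.empty[Int, ExecDesc]

    def coresFree: Int = totalCores - executors.values.map(_.cores).sum
    def memoryFree: Int = totalMemory - executors.values.map(_.memory).sum

    // Called at launch time, before the worker has confirmed anything.
    def addExecutor(exec: ExecDesc): Unit = { executors.update(exec.id, exec) }

    // Called when the worker later reports FAILED for this executor.
    def removeExecutor(execId: Int): Unit = { executors.remove(execId); () }
  }
}
```

Because the free counts are derived from the executor map, removing the failed executor is all that is needed to restore them.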

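2.4 Creating the Executor

The following call chain shows what happens after the Worker receives the LaunchExecutor message from the Master:
The LaunchExecutor handler creates an ExecutorRunner
--> ExecutorRunner starts, as a process, the ApplicationDescription that was prepared in SparkDeploySchedulerBackend
--> this launches the CoarseGrainedExecutorBackend carried in the ApplicationDescription
--> once started, CoarseGrainedExecutorBackend uses the driverUrl argument it was given to send a RegisterExecutor message to the DriverActor in CoarseGrainedSchedulerBackend
--> DriverActor replies with RegisteredExecutor
--> CoarseGrainedExecutorBackend creates an Executor
--> the Executor is now created.

After CoarseGrainedExecutorBackend starts, its preStart function does the following: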

  override def preStart() {
    logInfo("Connecting to driver: " + driverUrl)
    driver = context.actorSelection(driverUrl)
    driver ! RegisterExecutor(executorId, hostPort, cores, extractLogUrls)
    context.system.eventStream.subscribe(self, classOf[RemotingLifecycleEvent])
  }

After CoarseGrainedExecutorBackend receives the RegisteredExecutor message, it creates the Executor:

  override def receiveWithLogging = {
    case RegisteredExecutor =>
      logInfo("Successfully registered with driver")
      val (hostname, _) = Utils.parseHostPort(hostPort)
      executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)

    ......
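
The dependency between registration and task launch can be shown with a small actor-free model of the backend. It is a sketch under assumptions: SimpleBackend, SimpleExecutor and the dropped buffer are hypothetical, and recording rejected task ids is a simplification rather than a description of how the real backend reacts to an early LaunchTask.

```scala
import scala.collection.mutable

object ExecutorHandshakeSketch {
  sealed trait Message
  case object RegisteredExecutor extends Message
  final case class LaunchTask(taskId: Long) extends Message

  class SimpleExecutor {
    val launched = mutable.ArrayBuffer.empty[Long]
    def launchTask(taskId: Long): Unit = { launched += taskId; () }
  }

  class SimpleBackend {
    // Stays None until the driver confirms the registration.
    var executor: Option[SimpleExecutor] = None
    // Task ids that arrived before registration completed.
    val dropped = mutable.ArrayBuffer.empty[Long]

    def receive(msg: Message): Unit = msg match {
      case RegisteredExecutor =>
        executor = Some(new SimpleExecutor)
      case LaunchTask(id) =>
        executor match {
          case Some(e) => e.launchTask(id)
          case None => dropped += id; ()
        }
    }
  }
}
```

A LaunchTask delivered before RegisteredExecutor never reaches an executor, while one delivered afterwards is passed to launchTask.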


Please credit the author, Jason Ding, and the source when reposting
GitCafe blog (http://jasonding1354.gitcafe.io/)
GitHub blog (http://jasonding1354.github.io/)
CSDN blog (http://blog.csdn.net/jasonding1354)
Jianshu page (http://www.lxweimin.com/users/2bd9b48f6ea8/latest_articles)
Search Google for jasonding1354 to reach my blog

[Spark Core] A Brief Analysis of the Task Execution Mechanism and the Task Source, Part 1
