Spark Stage

  • Concept

A stage is a set of parallel tasks ① all computing the same function that need to run as part of a Spark job, where all the tasks have the same shuffle dependencies. Each DAG of tasks run by the scheduler is split up into stages at the boundaries where shuffle occurs, and then the DAGScheduler runs these stages in topological order.
Each Stage can ② either be a shuffle map stage, in which case its tasks' results are input for other stage(s), or a result stage, in which case its tasks directly compute a Spark action (e.g. count(), save(), etc.) by running a function on an RDD. ③ For shuffle map stages, we also track the nodes that each output partition is on.
Each Stage also has a firstJobId, identifying the job that first submitted the stage. When FIFO scheduling is used, this ④ allows Stages from earlier jobs to be computed first or recovered faster on failure. Finally, a single stage can be re-executed in multiple attempts due to fault recovery. In that case, the Stage object will track multiple StageInfo objects to pass to listeners or the web UI. The latest one will be accessible through latestInfo.
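
To make the shuffle-boundary splitting concrete, here is a minimal sketch (the README.md input path and the local master are illustrative assumptions): flatMap/filter/map are pipelined into one stage, reduceByKey draws the shuffle boundary, and count() runs as the result stage.

import org.apache.spark.{SparkConf, SparkContext}

object StageBoundaryDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stage-demo").setMaster("local[2]"))

    // flatMap/filter/map are pipelined into one ShuffleMapStage; reduceByKey
    // introduces a shuffle, so count() below runs as a separate ResultStage.
    val counts = sc.textFile("README.md")        // hypothetical input file
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(w => (w, 1))
      .reduceByKey(_ + _)

    println(counts.toDebugString)   // lineage; indentation marks the shuffle boundary
    println(counts.count())         // triggers DAGScheduler.handleJobSubmitted
    sc.stop()
  }
}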

  • Code walkthrough

  1. [DAGScheduler]->private[scheduler] def handleJobSubmitted
{
    var finalStage: ResultStage = null
    try {
    /**
        ② There are only two kinds of stage: a shuffle map stage and a
        result stage, and the result stage is always the stage containing
        the RDD on which the action was called. Parameter meaning: func is
        the operation applied to each partition and differs per action; for
        count, for example, func computes the size of each partition, and
        the final results are collected by the JobWaiter (see submitJob),
        which sums them and returns the total.
    **/
      finalStage = newResultStage(finalRDD, func, partitions, jobId,
                    callSite)
    } catch {
      case e: Exception =>
        logWarning("Creating new stage failed due to exception - job: " + jobId, e)
        listener.jobFailed(e)
        return
    }
                                  . . .
     /**
        [1] The first stage submitted is always finalStage, i.e. the
        ResultStage. submitStage then recursively looks up the stage's
        parent dependencies until it finds a stage with no missing parents,
        for which a TaskSet is generated and submitted.
        [2] While recursively looking up parent stages, any stage that still
        has missing parents is put into the waiting queue so that it can be
        scheduled later.
     **/
     submitStage(finalStage)
     submitWaitingStages()
}
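
As a hedged illustration of the func/JobWaiter comment above (assuming sc is a live SparkContext): an action hands runJob a per-partition function, the ResultStage computes one value per partition, and the driver combines them — for count, simply by summing.

val rdd = sc.parallelize(1 to 1000, numSlices = 4)

// One result per partition, computed by the ResultStage's tasks.
val perPartitionSizes: Array[Long] =
  sc.runJob(rdd, (iter: Iterator[Int]) => iter.size.toLong)

// The driver-side combine step: equivalent to rdd.count().
val total = perPartitionSizes.sum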
  2. [DAGScheduler]->private def submitStage(stage: Stage)
{
                                      ...
       //[1]
       val missing = getMissingParentStages(stage).sortBy(_.id)
        logDebug("missing: " + missing)
        if (missing.isEmpty) {
          logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        //[1]
          submitMissingTasks(stage, jobId.get)
        } else {
          for (parent <- missing) {
            submitStage(parent)
          }
         //[2]
          waitingStages += stage
        }
      ...
}
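
A toy model of the recursion above, using purely hypothetical ToyStage/waitingStages names rather than the real scheduler classes: a stage with no missing parents is submitted right away [1], otherwise its parents are submitted first and the stage itself joins the waiting set [2].

import scala.collection.mutable

// Hypothetical simplified stage: an id, its parents, and whether its outputs exist.
case class ToyStage(id: Int, parents: Seq[ToyStage], available: Boolean = false)

val waitingStages = mutable.HashSet[ToyStage]()
val runningStages = mutable.HashSet[ToyStage]()

def submit(stage: ToyStage): Unit = {
  val missing = stage.parents.filterNot(_.available).sortBy(_.id)   // [1]
  if (missing.isEmpty) {
    runningStages += stage          // stands in for submitMissingTasks(stage, jobId)
  } else {
    missing.foreach(submit)         // submit parents first (recursion)
    waitingStages += stage          // [2] revisited once its parents complete
  }
}

// Example: a two-stage DAG whose parent outputs are not yet available.
val shuffleMap = ToyStage(0, Nil)
val result     = ToyStage(1, Seq(shuffleMap))
submit(result)   // shuffleMap becomes "running", result goes to the waiting set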
  3. [DAGScheduler]->getMissingParentStages(stage: Stage): List[Stage]
    Stages are split according to whether each dependency is a ShuffleDependency (wide) or a NarrowDependency.
{
                                       . . . 

  for (dep <- rdd.dependencies) {
            dep match {
              case shufDep: ShuffleDependency[_, _, _] =>
                val mapStage = getShuffleMapStage(shufDep, stage.firstJobId)
                if (!mapStage.isAvailable) {
                  missing += mapStage
                }
              case narrowDep: NarrowDependency[_] =>
                //[2]
                waitingForVisit.push(narrowDep.rdd)
            }
          }
                                        . . .
}
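
The same match can be observed directly on an RDD's dependencies list. In this sketch (sc is assumed to be an existing SparkContext), map yields a NarrowDependency and stays in its parent's stage, while reduceByKey yields a ShuffleDependency and forces a shuffle map stage boundary:

import org.apache.spark.{NarrowDependency, ShuffleDependency}

val pairs   = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3))
val mapped  = pairs.map { case (k, v) => (k, v * 2) }
val reduced = mapped.reduceByKey(_ + _)

// Narrow dependency: no stage boundary, map is pipelined with its parent.
assert(mapped.dependencies.head.isInstanceOf[NarrowDependency[_]])

// Shuffle dependency: getMissingParentStages creates/looks up a ShuffleMapStage here.
assert(reduced.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]])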

ShuffleMapStage

  • Concept

ShuffleMapStages are intermediate stages in the execution DAG that produce data for a shuffle. ⑤ They occur right before each shuffle operation, and might contain multiple pipelined operations before that (e.g. map and filter). When executed, ⑥ they save map output files that can later be fetched by reduce tasks. The shuffleDep field describes the shuffle each stage is part of, and ⑧ variables like outputLocs and numAvailableOutputs track how many map outputs are ready. ShuffleMapStages can also be submitted independently as jobs with DAGScheduler.submitMapStage. For such stages, the ActiveJobs that submitted them are tracked in mapStageJobs. ⑨ Note that there can be multiple ActiveJobs trying to compute the same shuffle map stage.
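
Because the map output files remain registered after the stage completes, a second action over the same shuffled RDD does not re-run the map side; the web UI shows the ShuffleMapStage as skipped. A minimal sketch, assuming sc is an existing SparkContext:

val reduced = sc.parallelize(1 to 100000, 8)
  .map(x => (x % 10, 1))
  .reduceByKey(_ + _)

reduced.count()    // runs the ShuffleMapStage (map side) plus a ResultStage
reduced.collect()  // reuses the saved map outputs: only a new ResultStage runs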

  • Code walkthrough

⑤ When stages are split, a shuffle map stage contains all of the non-shuffle (pipelined) operations that follow the previous shuffle, such as map and filter.
⑥ Output information is maintained for each partition:

/**
   List of [[MapStatus]] for each partition. The index of the array
   is the map partition id, and each value in the array is the list of
   possible [[MapStatus]] for a partition (a single task might run
   multiple times).
   ③⑧ The location and status information of the current RDD's output:
   which executor each partition runs on and produces its output. The
   DAGScheduler uses this information when scheduling tasks.
**/
  private[this] val outputLocs = Array.fill[List[MapStatus]](numPartitions)(Nil)
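
The bookkeeping can be pictured with a toy model (hypothetical Toy* names, not Spark's private classes): one slot per map partition holding the registered output locations, and the stage counts as available only when every slot is non-empty.

// Stand-in for MapStatus: just records where a map output landed.
case class ToyMapStatus(executorHost: String)

class ToyShuffleMapStage(numPartitions: Int) {
  // One list of possible outputs per map partition (a task may run several times).
  private val outputLocs = Array.fill[List[ToyMapStatus]](numPartitions)(Nil)

  def addOutputLoc(partition: Int, status: ToyMapStatus): Unit =
    outputLocs(partition) = status :: outputLocs(partition)

  def numAvailableOutputs: Int = outputLocs.count(_.nonEmpty)

  def isAvailable: Boolean = numAvailableOutputs == numPartitions
}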

[DAGScheduler]->submitMissingTasks(stage: Stage, jobId: Int)

                               ... 
val partitionsToCompute: Seq[Int] = stage.findMissingPartitions() 
                               ...
    val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
      stage match {
        case s: ShuffleMapStage =>
          partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
        case s: ResultStage =>
          val job = s.activeJob.get
          partitionsToCompute.map { id =>
            val p = s.partitions(id)
            (id, getPreferredLocs(stage.rdd, p))
          }.toMap
      }
    }
                                       ...
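
getPreferredLocs ultimately falls back to the RDD's own location preferences (and, for shuffled RDDs, to where the map outputs live). A small sketch that attaches explicit preferences so they can be inspected; the hostnames are placeholders and sc is assumed to exist:

// makeRDD accepts explicit location preferences per element.
val located = sc.makeRDD(Seq(
  ("a", Seq("host1")),   // "host1"/"host2" are placeholder hostnames
  ("b", Seq("host2"))
))

located.partitions.foreach { p =>
  println(s"partition ${p.index} prefers ${located.preferredLocations(p)}")
}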

⑦ shuffleDep describes the entire shuffle: each stage's shuffleDep field identifies which shuffle the stage belongs to and what operations should be performed. This information is needed when the stage is submitted for execution.

class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient private val _rdd: RDD[_ <: Product2[K, V]],
    val partitioner: Partitioner,
    val serializer: Serializer = SparkEnv.get.serializer,
    val keyOrdering: Option[Ordering[K]] = None,
    val aggregator: Option[Aggregator[K, V, C]] = None,
    val mapSideCombine: Boolean = false)
  extends Dependency[Product2[K, V]] {
                                     ...
}
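
To see these fields on an actual shuffle (a sketch assuming sc is an existing SparkContext): reduceByKey attaches a ShuffleDependency to the resulting RDD, whose partitioner, mapSideCombine flag, and aggregator describe how the shuffle should run.

import org.apache.spark.ShuffleDependency

val reduced = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3)).reduceByKey(_ + _)

val dep = reduced.dependencies.head.asInstanceOf[ShuffleDependency[String, Int, Int]]
println(dep.partitioner)           // e.g. a HashPartitioner with defaultParallelism partitions
println(dep.mapSideCombine)        // true: reduceByKey pre-aggregates on the map side
println(dep.aggregator.isDefined)  // true: the combine function (_ + _) travels with the shuffle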