- Concept
A stage is a set of parallel tasks ① all computing the same function that need to run as part of a Spark job, where all the tasks have the same shuffle dependencies. Each DAG of tasks run by the scheduler is split up into stages at the boundaries where shuffle occurs, and then the DAGScheduler runs these stages in topological order.
Each Stage can ② either be a shuffle map stage, in which case its tasks' results are input for other stage(s), or a result stage, in which case its tasks directly compute a Spark action (e.g. count(), save(), etc) by running a function on an RDD. ③For shuffle map stages, we also track the nodes that each output partition is on.
Each Stage also has a firstJobId, identifying the job that first submitted the stage. When FIFO scheduling is used, this ④ allows Stages from earlier jobs to be computed first or recovered faster on failure. Finally, a single stage can be re-executed in multiple attempts due to fault recovery. In that case, the Stage object will track multiple StageInfo objects to pass to listeners or the web UI. The latest one will be accessible through latestInfo.
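For a concrete picture of these boundaries, here is a minimal sketch (the data, app name and local[2] master are illustrative, not from the original text): map and filter are pipelined into one ShuffleMapStage, reduceByKey introduces the shuffle boundary, and count() runs the final ResultStage.

import org.apache.spark.{SparkConf, SparkContext}

object TwoStageJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("two-stage-demo").setMaster("local[2]"))
    val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 2)
    val pairs  = words.map(w => (w, 1)).filter(_._1 != "c") // pipelined into the ShuffleMapStage
    val counts = pairs.reduceByKey(_ + _)                   // shuffle boundary: a new stage starts here
    println(counts.count())                                 // action: triggers the ResultStage
    sc.stop()
  }
}

Running this should show two stages in the web UI: a shuffle map stage for the pipelined map/filter and a result stage for count().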
- Code walkthrough
- [DAGScheduler]->private[scheduler] def handleJobSubmitted
{
var finalStage: ResultStage = null
try {
/**
② There are only two kinds of stage: shuffle map stage and result
stage, and the result stage is always the stage containing the RDD
on which the action is invoked. Parameter meaning: func is the
operation applied to each partition and varies with the action;
e.g. when the action is count(), func computes the size of each
partition, and the final result is collected by the JobWaiter
(see submitJob), which sums the per-partition results and returns the total.
**/
finalStage = newResultStage(finalRDD, func, partitions, jobId,
callSite)
} catch {
  case e: Exception =>
    logWarning("Creating new stage failed due to exception - job: " + jobId, e)
    listener.jobFailed(e)
    return
}
...
/**
[1] The first stage submitted is always finalStage, i.e. the ResultStage.
submitStage then recursively walks the stage's dependencies until it
finds a stage with no missing parents, for which a TaskSet is generated
and submitted.
**/
submitStage(finalStage)
/**
[2] While recursively looking up parent stages, any stage that still has
missing parents is put on the waiting queue so it can be scheduled later.
**/
submitWaitingStages()
}
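The comment above notes that for count() the per-partition func measures each partition's size and the results are summed on the driver. The following sketch reproduces that idea through the public runJob API (names, data and master setting are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object CountViaRunJob {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("count-sketch").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 1000, numSlices = 4)
    // func applied to every partition: compute its size ...
    val perPartitionSizes = sc.runJob(rdd, (iter: Iterator[Int]) => iter.size.toLong)
    // ... and the driver-side step (handled via the JobWaiter for count()) is a simple sum
    val total = perPartitionSizes.sum
    assert(total == rdd.count())
    println(s"per-partition sizes: ${perPartitionSizes.mkString(",")}, total = $total")
    sc.stop()
  }
}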
- [DAGScheduler]->private def submitStage(stage: Stage)
{
...
//[1]
val missing = getMissingParentStages(stage).sortBy(_.id)
logDebug("missing: " + missing)
if (missing.isEmpty) {
logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents”)
//[1]
submitMissingTasks(stage, jobId.get)
} else {
for (parent <- missing) {
submitStage(parent)
}
//[2]
waitingStages += stage
}
...
}
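To make the control flow easier to follow, here is a simplified stand-alone model of the recursion above; Stage here is a hypothetical case class, not Spark's class, and submitting tasks is reduced to a println:

import scala.collection.mutable

object StageSubmissionModel {
  // Hypothetical stand-in for a Spark stage: an id, its parent stages, and whether its output is ready.
  case class Stage(id: Int, parents: List[Stage], available: Boolean = false)

  val waitingStages = mutable.LinkedHashSet[Stage]() // stages whose parents are not ready yet

  // Mirrors the shape of submitStage: submit missing parents first, park the stage until they finish.
  def submitStage(stage: Stage): Unit = {
    val missing = stage.parents.filterNot(_.available).sortBy(_.id)
    if (missing.isEmpty) {
      println(s"submitting tasks for stage ${stage.id}") // submitMissingTasks in the real scheduler
    } else {
      missing.foreach(submitStage)
      waitingStages += stage                             // submitWaitingStages resubmits these later
    }
  }

  def main(args: Array[String]): Unit = {
    val shuffleMap = Stage(0, Nil)
    val result     = Stage(1, List(shuffleMap))
    submitStage(result) // stage 0 is submitted first; stage 1 goes on the waiting queue
    println(s"waiting stages: ${waitingStages.map(_.id).mkString(",")}")
  }
}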
- [DAGScheduler]->getMissingParentStages(stage: Stage): List[Stage]
Stages are split according to whether each dependency is a shuffle (wide) dependency or a narrow dependency.
{
...
for (dep <- rdd.dependencies) {
dep match {
case shufDep: ShuffleDependency[_, _, _] =>
val mapStage = getShuffleMapStage(shufDep, stage.firstJobId)
if (!mapStage.isAvailable) {
missing += mapStage
}
case narrowDep: NarrowDependency[_] =>
//[2]
waitingForVisit.push(narrowDep.rdd)
}
}
...
}
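The wide/narrow split that drives this traversal can be observed directly on RDDs. In this hedged check (data and master setting are illustrative), mapValues keeps a narrow dependency while groupByKey creates the ShuffleDependency that would cut a new stage:

import org.apache.spark.{NarrowDependency, ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object DependencyKinds {
  def main(args: Array[String]): Unit = {
    val sc      = new SparkContext(new SparkConf().setAppName("dep-kinds").setMaster("local[2]"))
    val pairs   = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3), numSlices = 2)
    val mapped  = pairs.mapValues(_ * 2) // narrow: stays inside the current stage
    val grouped = mapped.groupByKey()    // wide: getMissingParentStages would cut a stage here

    val rdds: Seq[(String, RDD[_])] = Seq("mapped" -> mapped, "grouped" -> grouped)
    for ((name, rdd) <- rdds; dep <- rdd.dependencies) dep match {
      case _: ShuffleDependency[_, _, _] => println(s"$name: shuffle (wide) dependency")
      case _: NarrowDependency[_]        => println(s"$name: narrow dependency")
      case other                         => println(s"$name: ${other.getClass.getSimpleName}")
    }
    sc.stop()
  }
}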
ShuffleMapStage
- Concept
ShuffleMapStages are intermediate stages in the execution DAG that produce data for a shuffle. ⑤ They occur right before each shuffle operation, and might contain multiple pipelined operations before that (e.g. map and filter). When executed, ⑥ they save map output files that can later be fetched by reduce tasks. ⑦ The shuffleDep field describes the shuffle each stage is part of, and ⑧ variables like outputLocs and numAvailableOutputs track how many map outputs are ready. ShuffleMapStages can also be submitted independently as jobs with DAGScheduler.submitMapStage. For such stages, the ActiveJobs that submitted them are tracked in mapStageJobs. ⑨ Note that there can be multiple ActiveJobs trying to compute the same shuffle map stage.
- Code walkthrough
⑤ When the stages are split, a shuffle map stage contains all of the non-shuffle operations that follow the previous shuffle, such as map and filter.
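One quick way to see which operations are pipelined into the shuffle map stage is toDebugString, which prints the RDD lineage with a new indentation level at each shuffle boundary (a usage sketch; the data is made up):

import org.apache.spark.{SparkConf, SparkContext}

object LineageDebug {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lineage-demo").setMaster("local[2]"))
    val counts = sc.parallelize(Seq("a", "b", "a"), numSlices = 2)
      .map(w => (w, 1))   // map and filter stay in the same shuffle map stage
      .filter(_._2 > 0)
      .reduceByKey(_ + _) // the shuffle: the printed lineage indents from here
    println(counts.toDebugString)
    sc.stop()
  }
}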
⑥ Output information is maintained for each partition:
/**
List of [[MapStatus]] for each partition. The index of the array
is the map partition id, and each value in the array is the list of
possible [[MapStatus]] for a partition (a single task might run
multiple times).
③⑧ Location and status information for the current RDD: which
executor each partition will run on and produce its output. The
DAGScheduler uses this information when scheduling tasks.
**/
private[this] val outputLocs = Array.fill[List[MapStatus]](numPartitions)(Nil)
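A toy model (not Spark's ShuffleMapStage) of the bookkeeping described in ③⑧: each map partition keeps the list of locations where a successful task wrote its output, and the stage counts as available once every partition has at least one location.

object OutputLocsModel {
  // Hypothetical stand-in for MapStatus: just the executor that holds the map output.
  final case class MapOutput(location: String)

  class ShuffleMapBookkeeping(numPartitions: Int) {
    // Same shape as the outputLocs field above: one list of outputs per map partition.
    private val outputLocs = Array.fill[List[MapOutput]](numPartitions)(Nil)

    def addOutput(partition: Int, out: MapOutput): Unit =
      outputLocs(partition) = out :: outputLocs(partition) // a single task might run multiple times

    def numAvailableOutputs: Int = outputLocs.count(_.nonEmpty)
    def isAvailable: Boolean     = numAvailableOutputs == numPartitions
  }

  def main(args: Array[String]): Unit = {
    val stage = new ShuffleMapBookkeeping(numPartitions = 2)
    stage.addOutput(0, MapOutput("executor-1"))
    println(s"ready outputs: ${stage.numAvailableOutputs}, stage available: ${stage.isAvailable}")
    stage.addOutput(1, MapOutput("executor-2"))
    println(s"ready outputs: ${stage.numAvailableOutputs}, stage available: ${stage.isAvailable}")
  }
}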
- [DAGScheduler]->submitMissingTasks(stage: Stage, jobId: Int)
...
val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
...
val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
  stage match {
    case s: ShuffleMapStage =>
      partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id)) }.toMap
    case s: ResultStage =>
      val job = s.activeJob.get
      partitionsToCompute.map { id =>
        val p = s.partitions(id)
        (id, getPreferredLocs(stage.rdd, p))
      }.toMap
  }
}
...
⑦ shuffleDep describes the whole shuffle. Each stage's shuffleDep field identifies which shuffle the stage belongs to and what operations it should perform; this information is needed when the stage is submitted for execution.
class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient private val _rdd: RDD[_ <: Product2[K, V]],
    val partitioner: Partitioner,
    val serializer: Serializer = SparkEnv.get.serializer,
    val keyOrdering: Option[Ordering[K]] = None,
    val aggregator: Option[Aggregator[K, V, C]] = None,
    val mapSideCombine: Boolean = false)
  extends Dependency[Product2[K, V]] {
  ...
}
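To connect these constructor parameters to a concrete shuffle, the sketch below builds a reduceByKey result and inspects the ShuffleDependency it carries; the data and local[2] master are illustrative:

import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}

object InspectShuffleDependency {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-dep-fields").setMaster("local[2]"))
    val reduced = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3), numSlices = 2).reduceByKey(_ + _)

    reduced.dependencies.collect { case dep: ShuffleDependency[_, _, _] =>
      // reduceByKey supplies an aggregator and enables map-side combine
      println(s"shuffleId      = ${dep.shuffleId}")
      println(s"numPartitions  = ${dep.partitioner.numPartitions}")
      println(s"mapSideCombine = ${dep.mapSideCombine}")
      println(s"aggregator?    = ${dep.aggregator.isDefined}")
    }
    sc.stop()
  }
}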