This article is based on the Spark 2.11 source.
1. Introduction
1.1 Basic Concepts
RDD
There are already many articles about RDDs; for background, see for example 理解Spark的核心RDD (Understanding Spark's Core: RDD).
Dependencies
Dependencies come in two kinds, narrow and wide. The figure below illustrates both (image from the post spark窄依賴和寬依賴).
[Figure: narrow vs. wide dependencies]
As the figure shows, in a narrow dependency each partition of the upstream RDD is depended on by exactly one downstream partition, whereas in a wide dependency a partition of the upstream RDD may be depended on by multiple downstream partitions. A wide dependency usually implies a shuffle.

Shuffle

In a narrow dependency the relationship is n-to-1 (n >= 1), so the data in a partition can be processed like a pipeline, applying one computation after another, and most of the time no data needs to be moved. In a wide dependency, also called a shuffle dependency, a downstream partition contains data from multiple upstream partitions, so the data has to be moved to the corresponding partitions; this process is called a shuffle.
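As a quick illustration, you can inspect the dependency types directly. This is a minimal sketch, assuming an existing SparkContext named sc:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 4)
val mapped = pairs.mapValues(_ + 1)      // narrow: each child partition reads exactly one parent partition
val reduced = mapped.reduceByKey(_ + _)  // wide: a child partition reads from many parent partitions

println(mapped.dependencies)   // e.g. List(org.apache.spark.OneToOneDependency@...)
println(reduced.dependencies)  // e.g. List(org.apache.spark.ShuffleDependency@...)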
Stage

An RDD DAG statically describes the data transformations and their dependencies. When an action triggers a job submission, the RDD DAG is first divided into stages. The boundaries of this division are the wide dependencies: the RDDs placed inside one stage are connected only by narrow dependencies, while stages are connected to each other by wide dependencies.
There are two kinds of stage: ShuffleMapStage and ResultStage. A job has exactly one ResultStage; it is the last phase of the job and collects the results.

Task
A task defines a unit of computation, one task per partition; tasks are generated from the stages after division.
There are two types of task, ShuffleMapTask and ResultTask, corresponding to the two kinds of stage.
Spark offers two kinds of operations on RDDs, transformations and actions (see the Spark programming guide). A transformation (map, reduceByKey, and the like) turns an RDD into a new RDD; an action (foreach, top, and the like) triggers the submission of a new job and produces and collects the job's results.
Consider the following code:
def main(args: Array[String]) {
  val sparkConf = new SparkConf().setAppName("Log Query")
  val sc = new SparkContext(sparkConf)
  val lines = sc.textFile("README.md", 3)
  val words = lines.flatMap(line => line.split(" "))
  val wordOne = words.map(word => (word, 1))
  val wordCount = wordOne.reduceByKey(_ + _, 3)
  wordCount.foreach(println)
  val resultAsArray = wordCount.collect()
}
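Before walking through the submission path, it helps to see where the shuffle boundary sits in this lineage. RDD.toDebugString prints it directly (illustrative output; the exact ids and call sites vary):

println(wordCount.toDebugString)
// (3) ShuffledRDD[4] at reduceByKey ... []
//  +-(3) MapPartitionsRDD[3] at map ... []
//     |  MapPartitionsRDD[2] at flatMap ... []
//     |  README.md MapPartitionsRDD[1] at textFile ... []
//     |  README.md HadoopRDD[0] at textFile ... []
// The indentation break at "+-" marks the shuffle dependency, i.e. the stage boundary.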
The code above contains two actions, foreach and collect, so two jobs are submitted, and the two jobs share several RDDs. Submitting a job involves the following steps (a simplified call-chain sketch follows the list):
- DAGScheduler divides the job into stages. There are two kinds of stage, ShuffleMapStage and ResultStage. The former is bounded by ShuffleDependency; during creation the scheduler traces the RDD lineage back to its source and then creates stages from the source downward. The latter is the last phase of a job, collecting results after all the tasks (ShuffleMapTask) of its upstream ShuffleMapStages have completed.
- DAGScheduler submits the stages. Although the ResultStage is nominally submitted first, submission walks upstream looking for unfinished stages until it finds every stage that either depends on no other stage or whose dependencies have all completed, and submits those. This means independent stages can be submitted in parallel.
- DAGScheduler creates tasks from the stages. This step is actually part of stage submission. A stage is a static concept; what ultimately runs on the cluster are tasks, and matching the two kinds of stage there are two kinds of task: ShuffleMapTask and ResultTask. A ShuffleMapStage is made up of several narrowly dependent RDDs, and each RDD is made up of several partitions; one task runs on each partition, so a single ShuffleMapStage yields multiple tasks.
- The tasks are submitted. After the ShuffleMapTasks have been created from a ShuffleMapStage, TaskScheduler is invoked to schedule them.
- TaskScheduler schedules the tasks. TaskScheduler picks a suitable executor according to each task's preferred locations, wraps the task information into a LaunchTask message, and sends it to the executor.
- The executor runs the tasks. Executors run on workers; on receiving TaskScheduler's LaunchTask message, an executor starts executing the task.
- Task status is reported back. Task status flows to TaskScheduler, which reports it to DAGScheduler; DAGScheduler reacts accordingly (for example, submitting child stages once all tasks of a stage complete, or resubmitting a stage when reading upstream data fails).
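For orientation, the call chain from an action down to the event loop looks roughly like this in Spark 2.x (simplified; arguments omitted):

rdd.foreach(f)
  -> SparkContext.runJob(...)
    -> DAGScheduler.runJob(...)
      -> DAGScheduler.submitJob(...)               // posts a JobSubmitted event
        -> DAGSchedulerEventProcessLoop.post(...)  // asynchronous hand-off
          -> DAGScheduler.handleJobSubmitted(...)  // divides and submits stages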
2. Dividing Stages
An action on an RDD triggers job submission, and stage division completes before the job is submitted. A stage may contain a whole chain of RDD transformations; the stage boundary is the shuffle dependency between two RDDs. Taking the code above as an example, wordOne.reduceByKey creates a shuffle dependency between wordCount and wordOne. The figure below shows the RDD DAG produced by that code after stage division.

[Figure: the RDD DAG of the word-count code after stage division]
An action like wordCount.foreach(println) triggers job submission and, through a series of calls, ends up in the following DAGScheduler method:
def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  // Check to make sure we are not launching a task on a partition that does not exist.
  val maxPartitions = rdd.partitions.length
  partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
    throw new IllegalArgumentException(
      "Attempting to access a non-existent partition: " + p + ". " +
        "Total number of partitions: " + maxPartitions)
  }

  // Each action triggers one job, and each job gets a unique id.
  val jobId = nextJobId.getAndIncrement()
  if (partitions.size == 0) {
    // Return immediately if the job is running 0 tasks
    return new JobWaiter[U](this, jobId, 0, resultHandler)
  }

  assert(partitions.size > 0)
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
  // Post a JobSubmitted message to the event loop; the message carries
  // the rdd, jobId, partitions, and so on.
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    SerializationUtils.clone(properties)))
  waiter
}
The parameters:
- The parameter rdd is the RDD corresponding to val wordCount in the code above.
- The parameter func is wrapped from println; as discussed below (see the sketch after this list), this func is applied only in the ResultStage.
- partitions: the partitions contained in the rdd.
- resultHandler: the callback invoked when results are returned.
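To see where that func comes from: foreach wraps its argument into a function over a partition's iterator, and SparkContext.runJob then wraps that into a (TaskContext, Iterator[T]) => U closure. Roughly, paraphrasing the Spark 2.x source (simplified signatures):

// RDD.foreach wraps the passed function into an iterator function.
def foreach(f: T => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}

// SparkContext.runJob adds the TaskContext parameter; the result is the
// func handed down to DAGScheduler.submitJob.
def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
  runJob(rdd, (ctx: TaskContext, it: Iterator[T]) => func(it), 0 until rdd.partitions.length)
}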
Note that the method above does not submit the job itself; it posts a JobSubmitted message to eventProcessLoop, which submits the job asynchronously. Below is the logic with which DAGSchedulerEventProcessLoop, the class behind eventProcessLoop, handles incoming messages:
private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
  case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
    dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
  // other cases omitted
  case ...
}
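DAGSchedulerEventProcessLoop extends Spark's internal EventLoop, which is essentially a blocking queue drained by a single dispatch thread. A minimal sketch of that pattern (an illustration, not the actual Spark class):

import java.util.concurrent.LinkedBlockingDeque

abstract class MiniEventLoop[E](name: String) {
  private val eventQueue = new LinkedBlockingDeque[E]()
  private val eventThread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      while (!Thread.currentThread().isInterrupted) {
        onReceive(eventQueue.take()) // block until an event arrives, then dispatch it
      }
    }
  }
  def start(): Unit = eventThread.start()
  def post(event: E): Unit = eventQueue.put(event) // callers return immediately
  protected def onReceive(event: E): Unit
}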
On receiving the JobSubmitted message, dagScheduler.handleJobSubmitted is called to submit the job. Its core code:
private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
  var finalStage: ResultStage = null
  try {
    // New stage creation may throw an exception if, for example, jobs are run on a
    // HadoopRDD whose underlying HDFS files have been deleted.
    finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
  } catch {
    ...
    return
  }

  val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
  ...
  logInfo("Parents of final stage: " + finalStage.parents)
  logInfo("Missing parents: " + getMissingParentStages(finalStage))

  val jobSubmissionTime = clock.getTimeMillis()
  jobIdToActiveJob(jobId) = job
  activeJobs += job
  finalStage.setActiveJob(job)
  val stageIds = jobIdToStageIds(jobId).toArray
  val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
  listenerBus.post(
    SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
  submitStage(finalStage)
}
The method finally calls submitStage to submit the stages. Before that, the stages have to be created, and there are two places in the method above where stages may be created:
- createResultStage
- getMissingParentStages
As mentioned earlier, there are two kinds of stage, ResultStage and ShuffleMapStage. For a job, the ResultStage is its final phase and collects the job's results; the stage division of a job's RDD DAG contains exactly one ResultStage and any number of ShuffleMapStages.
createResultStage, the first of the two, creates the ResultStage. During creation it checks whether the current stage depends on upstream stages; if so, it keeps tracing upstream and then creates the stages from the top down. Each stage has an id, and created stages are cached by id to avoid creating the same stage twice.
getMissingParentStages, the second, recursively creates all upstream stages whenever the current stage has upstream dependencies.
The diagram below shows how these two methods call into the stage-creation helpers:

createResultStage        getMissingParentStages
        |                          |
        +------------+-------------+
                     |
                     v
       getOrCreateParentStages <-------------------+
                     |                             |
                     v                             |
       getOrCreateShuffleMapStage                  |
         (the current stage and every missing      |
          upstream stage must be created)          |
                     |                             |
                     v                             |
       createShuffleMapStage ----------------------+
         (first tries to create the upstream
          stages, then creates the stage itself)
- createResultStage
private def createResultStage(
    rdd: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    jobId: Int,
    callSite: CallSite): ResultStage = {
  val parents = getOrCreateParentStages(rdd, jobId)
  val id = nextStageId.getAndIncrement()
  val stage = new ResultStage(id, rdd, func, partitions, parents, jobId, callSite)
  stageIdToStage(id) = stage
  updateJobIdStageIdMaps(jobId, stage)
  stage
}
- The code first calls getOrCreateParentStages to recursively create all upstream stages.
- The id here is the ResultStage's stage id, allocated from an increasing counter; as described below, a ShuffleMapStage gets its stage id the same way but is additionally cached under its shuffle id.
- getOrCreateParentStages
Its code is as follows:
private def getOrCreateParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {
  getShuffleDependencies(rdd).map { shuffleDep =>
    getOrCreateShuffleMapStage(shuffleDep, firstJobId)
  }.toList
}
getShuffleDependencies(rdd) traces the rdd's dependencies until it reaches dependencies of type ShuffleDependency; the method implements a breadth-first traversal (a sketch of it appears after the example below). Note that it returns only the rdd's immediate parent shuffle dependencies; ancestor shuffle dependencies are not returned. The following example illustrates this:
E <------ A <------ B <------ C
                              |
          D <-----------------+
Assuming all of the arrows above are shuffle dependencies, getShuffleDependencies(C) returns only the dependencies on B and D.
Back to the method itself: for each of C's shuffle dependencies, on B and on D, getOrCreateShuffleMapStage is called to create the corresponding stage.
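Here is roughly what getShuffleDependencies looks like, paraphrased from the Spark 2.x source; it walks through narrow dependencies and stops at the first ShuffleDependency on each path:

import scala.collection.mutable

private def getShuffleDependencies(rdd: RDD[_]): mutable.HashSet[ShuffleDependency[_, _, _]] = {
  val parents = new mutable.HashSet[ShuffleDependency[_, _, _]]
  val visited = new mutable.HashSet[RDD[_]]
  val waitingForVisit = new mutable.Stack[RDD[_]]
  waitingForVisit.push(rdd)
  while (waitingForVisit.nonEmpty) {
    val toVisit = waitingForVisit.pop()
    if (!visited(toVisit)) {
      visited += toVisit
      toVisit.dependencies.foreach {
        case shuffleDep: ShuffleDependency[_, _, _] =>
          parents += shuffleDep                  // stop here: an immediate parent shuffle
        case dependency =>
          waitingForVisit.push(dependency.rdd)   // narrow: keep tracing upstream
      }
    }
  }
  parents
}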
- getOrCreateShuffleMapStage
Suppose the stage for B is created first; the code is as follows:
private def getOrCreateShuffleMapStage(
    shuffleDep: ShuffleDependency[_, _, _],
    firstJobId: Int): ShuffleMapStage = {
  shuffleIdToMapStage.get(shuffleDep.shuffleId) match {
    case Some(stage) =>
      stage

    case None =>
      // Create stages for all missing ancestor shuffle dependencies.
      getMissingAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
        // Even though getMissingAncestorShuffleDependencies only returns shuffle dependencies
        // that were not already in shuffleIdToMapStage, it's possible that by the time we
        // get to a particular dependency in the foreach loop, it's been added to
        // shuffleIdToMapStage by the stage creation process for an earlier dependency. See
        // SPARK-13902 for more information.
        if (!shuffleIdToMapStage.contains(dep.shuffleId)) {
          createShuffleMapStage(dep, firstJobId)
        }
      }
      // Finally, create a stage for the given shuffle dependency.
      createShuffleMapStage(shuffleDep, firstJobId)
  }
}
- The parameter shuffleDep is C's dependency on B, so shuffleDep.rdd is B.
- getMissingAncestorShuffleDependencies(B) returns all of B's ancestor shuffle dependencies, that is, B's dependency on A and A's dependency on E (a paraphrase of this helper follows the list).
- In createShuffleMapStage (next item), every ShuffleMapStage is registered under a shuffle id after creation; if a newly created ShuffleMapStage computes B, the shuffle id it is registered under is that of the shuffle between B and C. The method can therefore probe shuffleIdToMapStage by shuffleId and avoid creating the same stage twice.
- For any ShuffleMapStage that does not exist yet, createShuffleMapStage is called to create it.
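For completeness, getMissingAncestorShuffleDependencies is essentially the same traversal as getShuffleDependencies, except it keeps walking past shuffle dependencies and collects every ancestor shuffle dependency whose shuffle id is not yet registered. Roughly (a paraphrase of the Spark 2.x source, not the exact code):

import scala.collection.mutable

private def getMissingAncestorShuffleDependencies(
    rdd: RDD[_]): mutable.Stack[ShuffleDependency[_, _, _]] = {
  val ancestors = new mutable.Stack[ShuffleDependency[_, _, _]]
  val visited = new mutable.HashSet[RDD[_]]
  val waitingForVisit = new mutable.Stack[RDD[_]]
  waitingForVisit.push(rdd)
  while (waitingForVisit.nonEmpty) {
    val toVisit = waitingForVisit.pop()
    if (!visited(toVisit)) {
      visited += toVisit
      getShuffleDependencies(toVisit).foreach { shuffleDep =>
        if (!shuffleIdToMapStage.contains(shuffleDep.shuffleId)) {
          // Not registered yet: record it and keep tracing past this shuffle.
          ancestors.push(shuffleDep)
          waitingForVisit.push(shuffleDep.rdd)
        }
      }
    }
  }
  ancestors
}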
- createShuffleMapStage
For B, the stage for A is created first. The code:

def createShuffleMapStage(shuffleDep: ShuffleDependency[_, _, _], jobId: Int): ShuffleMapStage = {
  val rdd = shuffleDep.rdd
  val numTasks = rdd.partitions.length
  val parents = getOrCreateParentStages(rdd, jobId)
  val id = nextStageId.getAndIncrement()
  val stage = new ShuffleMapStage(id, rdd, numTasks, parents, jobId, rdd.creationSite, shuffleDep)

  stageIdToStage(id) = stage
  shuffleIdToMapStage(shuffleDep.shuffleId) = stage
  updateJobIdStageIdMaps(jobId, stage)

  if (mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
    // A previously run stage generated partitions for this shuffle, so for each output
    // that's still available, copy information about that output location to the new stage
    // (so we don't unnecessarily re-compute that data).
    val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
    val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
    (0 until locs.length).foreach { i =>
      if (locs(i) ne null) {
        // locs(i) will be null if missing
        stage.addOutputLoc(i, locs(i))
      }
    }
  } else {
    // Kind of ugly: need to register RDDs with the cache and map output tracker here
    // since we can't do it in the RDD constructor because # of partitions is unknown
    logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")
    mapOutputTracker.registerShuffle(shuffleDep.shuffleId, rdd.partitions.length)
  }
  stage
}
- Here shuffleDep is B's dependency on A, so shuffleDep.rdd is A.
- The code calls getOrCreateParentStages(A) again to create A's upstream stages. A's upstream is E, and E has no upstream of its own, so the ShuffleMapStage containing E is created here; it is registered under the shuffle id of A's dependency on E, and the stage computes E.
- Once all upstream stages exist, the ShuffleMapStage for the current shuffleDep is created, and the mapping from its shuffleId to the ShuffleMapStage is recorded.
- A shuffleId never changes once the RDD DAG has been built, while stage ids change with every job submission. A job may be resubmitted because some step failed, yet the output of a ShuffleMapStage from the failed run may still be intact and reusable. The if branch uses mapOutputTracker to check whether the data of the given shuffle is still available; if so it is reused, avoiding recomputation.
- After creating the stage, the method returns to its caller.
The diagram below shows the dependency graph of the stages finally created for the E/A/B/C/D example (one ShuffleMapStage per shuffle, plus the ResultStage):

Stage(E) ----> Stage(A) ----> Stage(B) ----+
                                           +---> ResultStage(C)
Stage(D) ----------------------------------+
3. Submitting Stages
As shown at the start of Section 2, once all stages have been created, handleJobSubmitted calls submitStage(finalStage) to submit them. The code:
private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      if (missing.isEmpty) {
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        submitMissingTasks(stage, jobId.get)
      } else {
        for (parent <- missing) {
          submitStage(parent)
        }
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}
1. waitingStages, runningStages, and failedStages record, respectively, stages that still have unfinished upstream stages, stages currently running on executors, and stages that have failed.
2. getMissingParentStages returns the upstream stages that have not been submitted or have not completed, creating any that do not exist yet.
3. submitMissingTasks submits the current stage when it has no unfinished upstream stages.
4. If some upstream stage is unfinished, the current stage is added to the waitingStages set (a worked example of this recursion follows).
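To make the recursion concrete, take the E/A/B/C/D graph again: submitStage on C's ResultStage finds the stages for B and D missing, recursively calls submitStage on each, and parks the ResultStage in waitingStages. B's stage in turn finds A's stage missing, which finds E's stage missing; only the stages with no missing parents, those for E and D, are actually handed to submitMissingTasks. As each stage's tasks finish, DAGScheduler submits the waiting child stages, until finally the ResultStage runs.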
In the submitStage code above, the method that actually creates tasks from a stage and submits them is submitMissingTasks. Its code:
private def submitMissingTasks(stage: Stage, jobId: Int) {
  // A stage may be submitted more than once, and some partitions of the stage's RDD
  // may already have been computed in earlier runs, so first find the partitions
  // that still lack results.
  val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
  ...
  // One task per partition: work out where each task should run.
  val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
    stage match {
      case s: ShuffleMapStage =>
        partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id)) }.toMap
      case s: ResultStage =>
        partitionsToCompute.map { id =>
          val p = s.partitions(id)
          (id, getPreferredLocs(stage.rdd, p))
        }.toMap
    }
  } catch {
    ...
    return
  }

  // A stage has a single stageId, but every submission of the stage gets a new attemptId.
  stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
  listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))

  // Serialize the task information according to the stage type. For a ShuffleMapStage
  // (ShuffleMapTask) the key pieces are the RDD the task runs on and the shuffle
  // dependency; for a ResultStage (ResultTask) they are the RDD and func (the func
  // derived from the foreach action's argument in the wordCount code earlier;
  // only a ResultStage has one).
  var taskBinary: Broadcast[Array[Byte]] = null
  try {
    val taskBinaryBytes: Array[Byte] = stage match {
      case stage: ShuffleMapStage =>
        JavaUtils.bufferToArray(
          closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
      case stage: ResultStage =>
        JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
    }

    // Ship the task information as a broadcast variable; executors fetch it from the
    // broadcast when running the task. With many tasks, broadcasting noticeably
    // reduces the load on the driver.
    taskBinary = sc.broadcast(taskBinaryBytes)
  } catch {
    ...
  }

  // Create the tasks, one per partition.
  val tasks: Seq[Task[_]] = try {
    val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
    stage match {
      case stage: ShuffleMapStage =>
        stage.pendingPartitions.clear()
        partitionsToCompute.map { id =>
          val locs = taskIdToLocations(id)
          val part = stage.rdd.partitions(id)
          stage.pendingPartitions += id
          new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
            Option(sc.applicationId), sc.applicationAttemptId)
        }

      case stage: ResultStage =>
        partitionsToCompute.map { id =>
          val p: Int = stage.partitions(id)
          val part = stage.rdd.partitions(p)
          val locs = taskIdToLocations(id)
          new ResultTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, id, properties, serializedTaskMetrics,
            Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
        }
    }
  } catch {
    case NonFatal(e) =>
      abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }

  if (tasks.size > 0) {
    logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +
      s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
    // Hand the tasks to TaskScheduler as a TaskSet.
    taskScheduler.submitTasks(new TaskSet(
      tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
    stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
  } else {
    // Because we posted SparkListenerStageSubmitted earlier, we should mark
    // the stage as completed here in case there are no tasks to run
    markStageAsFinished(stage, None)

    val debugString = stage match {
      case stage: ShuffleMapStage =>
        s"Stage ${stage} is actually done; " +
          s"(available: ${stage.isAvailable}," +
          s"available outputs: ${stage.numAvailableOutputs}," +
          s"partitions: ${stage.numPartitions})"
      case stage: ResultStage =>
        s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"
    }
    logDebug(debugString)

    submitWaitingChildStages(stage)
  }
}
The comments in the code above walk through how a stage is turned into tasks and how those tasks are then submitted via TaskScheduler.
For how TaskScheduler manages and submits tasks, see the companion article [Spark 任務調度-TaskScheduler]().