Job submission mainly does the following: it divides stages by walking the dependency chain of the last RDD in the job, converts those stages into tasks, and the driver then ships the tasks one by one to the worker side, where they are finally handed to CoarseGrainedExecutorBackend and executed by the executors. Next we will walk through this entire flow from the source code.
Each heading below names the class in which the method is executed; if a method is not annotated, it lives in the same class as the previous one. The code also carries plenty of comments, so read carefully.
SparkContext: an RDD action implicitly triggers SparkContext's runJob() method, passing in the RDD, the function to run on it, and the partitions.
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: Iterator[T] => U,
    partitions: Seq[Int]): Array[U] = {
  val cleanedFunc = clean(func)
  runJob(rdd, (ctx: TaskContext, it: Iterator[T]) => cleanedFunc(it), partitions)
}
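For context, here is how an action reaches runJob in the first place. Below is a trimmed sketch of RDD.collect() (simplified from the RDD source; details may vary across Spark versions): the action hands the RDD and a per-partition function to sc.runJob, and concatenates the per-partition results on the driver.
def collect(): Array[T] = withScope {
  // run a job that materializes every partition into an array on the executors
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  // concatenate the per-partition arrays back on the driver
  Array.concat(results: _*)
}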
DAGScheduler: next, DAGScheduler's runJob method is called.
Inside it, submitJob is invoked, and the calling thread then blocks until the job's result is returned.
def runJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): Unit = {
  val start = System.nanoTime
  // submit the job, then block here until the result comes back
  val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
  ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)
  waiter.completionFuture.value.get match {
    case scala.util.Success(_) =>
      logInfo("Job %d finished: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
    case scala.util.Failure(exception) =>
      logInfo("Job %d failed: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      // SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.
      val callerStackTrace = Thread.currentThread().getStackTrace.tail
      exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
      throw exception
  }
}
In submitJob, the partition ids are validated first; a JobSubmitted event is then posted to DAGSchedulerEventProcessLoop, an inner class of DAGScheduler. Its message-handling method onReceive pattern-matches on JobSubmitted and dispatches back into DAGScheduler's handleJobSubmitted method.
def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  /**
   * Check to make sure we are not launching a task on a partition that does not exist.
   */
  val maxPartitions = rdd.partitions.length
  partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
    throw new IllegalArgumentException(
      "Attempting to access a non-existent partition: " + p + ". " +
      "Total number of partitions: " + maxPartitions)
  }
  val jobId = nextJobId.getAndIncrement()
  if (partitions.size == 0) {
    // Return immediately if the job is running 0 tasks
    return new JobWaiter[U](this, jobId, 0, resultHandler)
  }
  assert(partitions.size > 0)
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
  // post the JobSubmitted event to the event loop
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    SerializationUtils.clone(properties)))
  waiter
}
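For reference, here is a simplified sketch of DAGSchedulerEventProcessLoop (condensed from the Spark source; timing/metrics code and the other event cases are omitted), showing how the JobSubmitted event posted above gets dispatched back into DAGScheduler:
private[scheduler] class DAGSchedulerEventProcessLoop(dagScheduler: DAGScheduler)
  extends EventLoop[DAGSchedulerEvent]("dag-scheduler-event-loop") {

  override def onReceive(event: DAGSchedulerEvent): Unit = doOnReceive(event)

  private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
    // the JobSubmitted event posted by submitJob lands here
    case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
      dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite,
        listener, properties)
    // ... other events (MapStageSubmitted, StageCancelled, ...) omitted
  }
}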
handleJobSubmitted is a truly important method: from here we begin the most important step, stage division. The entry point is createResultStage, whose return value is the job's final stage. Let's step inside to find out how stages are actually divided.
private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
  var finalStage: ResultStage = null
  try {
    // New stage creation may throw an exception if, for example, jobs are run on a
    // HadoopRDD whose underlying HDFS files have been deleted.
    // This obtains the final stage, built from the final RDD passed in earlier.
    finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
  } catch {
    case e: Exception =>
      logWarning("Creating new stage failed due to exception - job: " + jobId, e)
      listener.jobFailed(e)
      return
  }
  // Create the job from the final stage: one finalStage per job,
  // so one action operator corresponds to one job.
  val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
  clearCacheLocs()
  // log the job information
  logInfo("Got job %s (%s) with %d output partitions".format(
    job.jobId, callSite.shortForm, partitions.length))
  logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
  logInfo("Parents of final stage: " + finalStage.parents)
  logInfo("Missing parents: " + getMissingParentStages(finalStage))
  // record the job's submission time
  val jobSubmissionTime = clock.getTimeMillis()
  jobIdToActiveJob(jobId) = job
  // keep track of the active job
  activeJobs += job
  finalStage.setActiveJob(job)
  // all the stage ids this job was divided into
  val stageIds = jobIdToStageIds(jobId).toArray
  val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
  listenerBus.post(
    SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
  // submit the final stage
  submitStage(finalStage)
}
First, let's be clear about one thing: a stage is a scheduling phase of the job, and the division criterion is whether an RDD dependency is wide (shuffle) or narrow. Since RDDs carry lineage, the stages must also be built with lineage so we can keep tracing backwards. A quick illustration of the idea follows; after that, the method to focus on is
getOrCreateParentStages(rdd, jobId), to see how it creates the stages.
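As that illustration, consider a hypothetical word-count job (not from the source): every shuffle boundary starts a new stage, and the action produces the final ResultStage.
// Hypothetical word-count job: reduceByKey introduces a ShuffleDependency,
// so the job is divided into two stages.
val counts = sc.textFile("input.txt")  // hypothetical path
  .flatMap(_.split(" "))               // narrow dependency: same stage
  .map((_, 1))                         // narrow dependency: same stage
  .reduceByKey(_ + _)                  // wide dependency: shuffle boundary
counts.collect()                       // action triggers the job
// Stage 0: ShuffleMapStage (textFile -> flatMap -> map)
// Stage 1: ResultStage     (reduceByKey output -> collect)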
private def createResultStage(
    rdd: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    jobId: Int,
    callSite: CallSite): ResultStage = {
  // Divide stages by walking the dependencies of the final RDD; the parent
  // stages are kept in a List because an operation like join gives the RDD
  // more than one parent, and therefore more than one parent stage.
  val parents: List[Stage] = getOrCreateParentStages(rdd, jobId)
  val id = nextStageId.getAndIncrement()
  // Create the final stage from the final RDD, wrapping up the parent-stage
  // list. The final stage is always a ResultStage; all the others are
  // ShuffleMapStages.
  val stage = new ResultStage(id, rdd, func, partitions, parents, jobId, callSite)
  // register the final stage!
  stageIdToStage(id) = stage
  updateJobIdStageIdMaps(jobId, stage)
  stage
}
/**
 * Get or create the list of parent stages for a given RDD. The new Stages will be created with
 * the provided firstJobId.
 */
private def getOrCreateParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {
  // getShuffleDependencies returns a HashSet of the current RDD's parent
  // ShuffleDependencies; a set is needed because the RDD may have several
  // parents, e.g. after a join.
  getShuffleDependencies(rdd).map { shuffleDep =>
    // create (or fetch) the parent stage for each dependency -- the key step
    getOrCreateShuffleMapStage(shuffleDep, firstJobId)
  }.toList
}
getOrCreateParentStages() is one of the most critical methods in stage division. It first calls
getShuffleDependencies(rdd) on the RDD passed in, which returns the set of the current RDD's parent shuffle (wide) dependencies.
Since this method matters a lot, let's step into it.
A key data structure is used here: a stack. RDDs are iterated by repeatedly pushing and popping until their ShuffleDependencies are found. Concretely: rdd.dependencies returns the Dependency objects linking an RDD to its parents (wide or narrow unknown at this point); each dependency is pattern-matched. If it is a ShuffleDependency it is saved; if it is narrow, the dependency's rdd field gives the parent RDD, which is pushed onto the stack so the traversal continues. What is finally returned is the set of shuffle dependencies. Notice that a Dependency behaves much like a pointer; we will keep using this pointer idea below.
private[scheduler] def getShuffleDependencies(
    rdd: RDD[_]): HashSet[ShuffleDependency[_, _, _]] = {
  val parents = new HashSet[ShuffleDependency[_, _, _]]
  val visited = new HashSet[RDD[_]]
  val waitingForVisit = new Stack[RDD[_]]
  waitingForVisit.push(rdd)
  while (waitingForVisit.nonEmpty) {
    val toVisit = waitingForVisit.pop()
    if (!visited(toVisit)) {
      visited += toVisit
      toVisit.dependencies.foreach {
        case shuffleDep: ShuffleDependency[_, _, _] =>
          parents += shuffleDep
        case dependency =>
          waitingForVisit.push(dependency.rdd)
      }
    }
  }
  parents
}
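As a hypothetical illustration of why a set is returned: an RDD produced by join (with the default partitioning) has a shuffle on each input, so the traversal collects two ShuffleDependencies.
// Hypothetical two-parent example (default partitioning assumed).
val rddA = sc.parallelize(1 to 10).map(x => (x, x))      // narrow lineage
val rddB = sc.parallelize(1 to 10).map(x => (x, x * 2))  // narrow lineage
val joined = rddA.join(rddB)                             // shuffles both inputs
// getShuffleDependencies(joined) would typically contain two
// ShuffleDependencies, one per input -- hence the HashSet, and hence the
// List of parent stages in getOrCreateParentStages.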
Leaving getShuffleDependencies, we come to getOrCreateShuffleMapStage(shuffleDep, firstJobId). This method is also hugely important in stage division: it turns the dependencies we obtained above into stages (the rule: if a stage already exists, just fetch it; otherwise create one from the dependency). Stepping inside, we see that the shuffleId of the dependency is used to look up the corresponding stage. Suppose this is the first call and no stage has been created yet: we fall into case None and enter getMissingAncestorShuffleDependencies(shuffleDep.rdd). My comments on that method should make its importance obvious. So let's go into the most important method in stage division and see how all the stages get divided!!!
private def getOrCreateShuffleMapStage(
    shuffleDep: ShuffleDependency[_, _, _],
    firstJobId: Int): ShuffleMapStage = {
  // look the stage up via the shuffleId of the shuffle dependency
  shuffleIdToMapStage.get(shuffleDep.shuffleId) match {
    case Some(stage) =>
      // if it already exists, just return it and we're done
      stage
    case None =>
      // Create stages for all missing ancestor shuffle dependencies.
      // Note: it is actually this method that collects all the ancestor
      // shuffle dependencies of the current RDD......
      getMissingAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
        // Even though getMissingAncestorShuffleDependencies only returns shuffle dependencies
        // that were not already in shuffleIdToMapStage, it's possible that by the time we
        // get to a particular dependency in the foreach loop, it's been added to
        // shuffleIdToMapStage by the stage creation process for an earlier dependency. See
        // SPARK-13902 for more information.
        if (!shuffleIdToMapStage.contains(dep.shuffleId)) {
          // create a stage from the ancestor dependency and register it;
          // inside, a recursive call picks up the entire lineage!!!!!!
          createShuffleMapStage(dep, firstJobId)
        }
      }
      // Finally, create a stage for the given shuffle dependency.
      createShuffleMapStage(shuffleDep, firstJobId)
  }
}
getMissingAncestorShuffleDependencies(shuffleDep.rdd) once again uses the stack data structure, and calls getShuffleDependencies(toVisit) on every RDD it visits; what it ultimately returns is the whole chain of shuffle dependencies. Back outside, we foreach over those dependencies, and createShuffleMapStage creates a stage from each shuffle dependency and registers it.
private def getMissingAncestorShuffleDependencies(
    rdd: RDD[_]): Stack[ShuffleDependency[_, _, _]] = {
  val ancestors = new Stack[ShuffleDependency[_, _, _]]
  val visited = new HashSet[RDD[_]]
  // We are manually maintaining a stack here to prevent StackOverflowError
  // caused by recursively visiting
  val waitingForVisit = new Stack[RDD[_]]
  waitingForVisit.push(rdd)
  while (waitingForVisit.nonEmpty) {
    val toVisit = waitingForVisit.pop()
    if (!visited(toVisit)) {
      visited += toVisit
      // here is getShuffleDependencies again, applied to every RDD we visit
      getShuffleDependencies(toVisit).foreach { shuffleDep =>
        if (!shuffleIdToMapStage.contains(shuffleDep.shuffleId)) {
          ancestors.push(shuffleDep)
          // push the dependency's RDD: the push/pop cycle is what walks the
          // traversal back through all of the ancestor dependencies
          waitingForVisit.push(shuffleDep.rdd)
        } // Otherwise, the dependency and its ancestors have already been registered.
      }
    }
  }
  // returns a stack holding the whole chain of shuffle dependencies
  ancestors
}
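To make the traversal concrete, consider a hypothetical lineage with two shuffles chained one after the other:
// Hypothetical lineage: rdd1 --shuffle A--> rdd2 --shuffle B--> rdd3
val rdd1 = sc.parallelize(1 to 100).map(x => (x % 10, x))             // narrow
val rdd2 = rdd1.reduceByKey(_ + _)                                    // shuffle A
val rdd3 = rdd2.map { case (k, v) => (v % 3, k) }.reduceByKey(_ + _)  // shuffle B
// getShuffleDependencies(rdd3)                 => { B }
// getMissingAncestorShuffleDependencies(B.rdd) => stack containing A
// so createShuffleMapStage(A, ...) runs first, then createShuffleMapStage(B, ...)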
In createShuffleMapStage all the ShuffleMapStages are created and registered. Remember!!!! the stages created here are always ShuffleMapStages.
Finally, control returns to getOrCreateShuffleMapStage, which returns a single stage. Don't get confused here: although all the ancestor stages were registered along the way, only the parent stage corresponding to the current RDD's dependency is returned.
def createShuffleMapStage(shuffleDep: ShuffleDependency[_, _, _], jobId: Int): ShuffleMapStage = {
  // shuffleDep here is one of the parent shuffle dependencies
  val rdd = shuffleDep.rdd
  // one task runs per partition, so the number of tasks equals the number of partitions
  val numTasks = rdd.partitions.length
  // !!!!! the recursive call happens right here
  val parents = getOrCreateParentStages(rdd, jobId)
  val id = nextStageId.getAndIncrement()
  // create the stage from its parent dependency
  val stage = new ShuffleMapStage(id, rdd, numTasks, parents, jobId, rdd.creationSite, shuffleDep)
  // Crucial: register the stage here, under both its shuffleId and the
  // auto-incremented internal stage id.
  stageIdToStage(id) = stage
  shuffleIdToMapStage(shuffleDep.shuffleId) = stage
  updateJobIdStageIdMaps(jobId, stage)
  if (mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
    // A previously run stage generated partitions for this shuffle, so for each output
    // that's still available, copy information about that output location to the new stage
    // (so we don't unnecessarily re-compute that data).
    val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
    val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
    (0 until locs.length).foreach { i =>
      if (locs(i) ne null) {
        // locs(i) will be null if missing
        stage.addOutputLoc(i, locs(i))
      }
    }
  } else {
    // Kind of ugly: need to register RDDs with the cache and map output tracker here
    // since we can't do it in the RDD constructor because # of partitions is unknown
    logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")
    mapOutputTracker.registerShuffle(shuffleDep.shuffleId, rdd.partitions.length)
  }
  stage
}
With the code above, stage division is complete.
The calls then unwind back to createResultStage, which builds the final ResultStage from the parent-stage list obtained earlier and returns it. Lastly, back in handleJobSubmitted, submitStage() is called to submit that ResultStage.
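To recap, here is the stage-building call graph walked through above (method names as in the source); the recursive call inside createShuffleMapStage is what lets the traversal cover the entire lineage:
// createResultStage(finalRDD)
//   └─ getOrCreateParentStages(finalRDD)
//        ├─ getShuffleDependencies(finalRDD)          // nearest shuffle deps
//        └─ getOrCreateShuffleMapStage(dep)           // per dependency
//             ├─ getMissingAncestorShuffleDependencies(dep.rdd)
//             └─ createShuffleMapStage(dep)
//                  └─ getOrCreateParentStages(dep.rdd)  // recursion!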