This article uses Spark 1.6 in Standalone mode as an example to walk through how a Job executes after a user submits it. The overall flow is shown in the figure below.
After the user submits a Job, a SparkContext object is created. The SparkContext requests Executor resources from the Cluster Manager (the Spark Master in Standalone mode), splits the Job into a series of tasks that can run in parallel, and distributes those tasks to different Executors; when a task finishes, its Executor returns the result to the SparkContext.
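For orientation, here is a minimal, self-contained driver sketch against a Standalone cluster; the master URL, input path, and application name are placeholders rather than anything from this article. Creating the SparkContext is what kicks off the resource-negotiation flow described below.
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // Creating the SparkContext triggers the flow below: the application is
    // registered with the Master, which launches Executors on the Workers.
    val conf = new SparkConf()
      .setAppName("WordCountSketch")            // placeholder name
      .setMaster("spark://master-host:7077")    // placeholder Standalone Master URL
    val sc = new SparkContext(conf)

    // A simple Job: its stages are split into tasks that run on the Executors,
    // and the results come back to the driver through the SparkContext.
    sc.textFile("hdfs:///path/to/input")        // placeholder input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .take(10)
      .foreach(println)

    sc.stop()
  }
}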
The following first describes how the SparkContext requests Executor resources; the whole process is shown in the figure below.
The process consists of eight steps:
- The SparkContext creates the TaskSchedulerImpl, the SparkDeploySchedulerBackend, and the DAGScheduler
// Create and start the scheduler
val (sched, ts) = SparkContext.createTaskScheduler(this, master)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)
The DAGScheduler is responsible for splitting the Job into Stages, producing within each Stage a set of tasks that can run in parallel, and then handing those tasks to the TaskSchedulerImpl for scheduling. That process is discussed in detail later.
The TaskSchedulerImpl schedules the tasks through the SparkDeploySchedulerBackend; FIFO and Fair scheduling are currently implemented. Note that in YARN mode, scheduling goes through the YarnSchedulerBackend instead.
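As a side note, which of the two policies the TaskSchedulerImpl uses is controlled by the spark.scheduler.mode setting (FIFO by default). A minimal configuration sketch, where the application name and master URL are placeholders:
val conf = new SparkConf()
  .setAppName("fair-scheduling-sketch")       // placeholder name
  .setMaster("spark://master-host:7077")      // placeholder master URL
  .set("spark.scheduler.mode", "FAIR")        // "FIFO" (default) or "FAIR"
val sc = new SparkContext(conf)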
- The SparkDeploySchedulerBackend creates an AppClient and learns about Executors through a set of callbacks
client = new AppClient(sc.env.actorSystem, masters, appDesc, this, conf)
client.start()
The callbacks between the SparkDeploySchedulerBackend and the AppClient are declared as follows:
private[spark] trait AppClientListener {
  def connected(appId: String): Unit

  /** Disconnection may be a temporary state, as we fail over to a new Master. */
  def disconnected(): Unit

  /** An application death is an unrecoverable failure condition. */
  def dead(reason: String): Unit

  def executorAdded(fullId: String, workerId: String, hostPort: String, cores: Int, memory: Int)

  def executorRemoved(fullId: String, message: String, exitStatus: Option[Int]): Unit
}
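To make the contract concrete, here is an illustration-only listener that simply logs each event (the trait is private[spark], so user code would not actually register one; SparkDeploySchedulerBackend is the real implementation):
// Illustration-only implementation of AppClientListener; it just logs each
// callback so the contract above is easier to read.
class LoggingAppClientListener extends AppClientListener {
  override def connected(appId: String): Unit =
    println(s"Connected to master, application ID: $appId")

  override def disconnected(): Unit =
    println("Disconnected from master (may be transient during Master fail-over)")

  override def dead(reason: String): Unit =
    println(s"Application died: $reason")

  override def executorAdded(fullId: String, workerId: String, hostPort: String,
      cores: Int, memory: Int): Unit =
    println(s"Executor $fullId added on $hostPort ($cores cores, $memory MB)")

  override def executorRemoved(fullId: String, message: String, exitStatus: Option[Int]): Unit =
    println(s"Executor $fullId removed: $message (exit status: $exitStatus)")
}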
- The AppClient registers the Application with the Master
try {
  registerWithMaster()
} catch {
  case e: Exception =>
    logWarning("Failed to connect to master", e)
    markDisconnected()
    context.stop(self)
}
The AppClient exchanges messages with the Master over Akka to obtain information about the Executors and the Master, and relays it to the SparkDeploySchedulerBackend through the callbacks above.
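Paraphrased rather than quoted from the 1.6 source, the AppClient's message handling looks roughly like this; each message from the Master is turned into one of the listener callbacks:
// Rough paraphrase of AppClient's Master-message handling (not the verbatim
// Spark 1.6 source): messages are forwarded to the AppClientListener.
case RegisteredApplication(appId_, masterUrl) =>
  appId = appId_
  registered = true
  listener.connected(appId)

case ExecutorAdded(id, workerId, hostPort, cores, memory) =>
  val fullId = appId + "/" + id
  listener.executorAdded(fullId, workerId, hostPort, cores, memory)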
- The Master sends a LaunchExecutor message to the Worker and an ExecutorAdded message to the AppClient
After receiving the RegisterApplication message, the Master starts allocating Executor resources. There are currently two strategies (quoting the source comment): "There are two modes of launching executors. The first attempts to spread out an application's executors on as many workers as possible, while the second does the opposite (i.e. launch them on as few workers as possible). The former is usually better for data locality purposes and is the default."
case RegisterApplication(description) => {
  if (state == RecoveryState.STANDBY) {
    // ignore, don't send response
  } else {
    logInfo("Registering app " + description.name)
    val app = createApplication(description, sender)
    registerApplication(app)
    logInfo("Registered app " + description.name + " with ID " + app.id)
    persistenceEngine.addApplication(app)
    sender ! RegisteredApplication(app.id, masterUrl)
    schedule() // allocate Executor resources
  }
}
It then sends a LaunchExecutor message to the Worker:
private def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): Unit = {
  logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
  worker.addExecutor(exec)
  worker.actor ! LaunchExecutor(masterUrl,
    exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory)
  exec.application.driver ! ExecutorAdded(
    exec.id, worker.id, worker.hostPort, exec.cores, exec.memory)
}
Note: The number of cores assigned to each executor is configurable. When this is explicitly set, multiple executors from the same application may be launched on the same worker if the worker has enough cores and memory. Otherwise, each executor grabs all the cores available on the worker by default, in which case only one executor may be launched on each worker.
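Which of the two strategies the Master uses is controlled by spark.deploy.spreadOut (true, i.e. spread out, by default). Purely as an illustration of the idea, not the Master's actual scheduling code, the difference can be sketched like this:
// Illustration-only sketch of the two allocation modes; free(w) stands for the
// number of cores still available on worker w. This is not the real Master code.
def assignCores(workers: Seq[String], free: Map[String, Int],
    coresNeeded: Int, spreadOut: Boolean): Map[String, Int] = {
  val assigned = scala.collection.mutable.Map(workers.map(_ -> 0): _*)
  var remaining = coresNeeded
  if (spreadOut) {
    // Spread out: hand out one core at a time, round-robin across the workers,
    // so the application's executors land on as many workers as possible.
    var i = 0
    while (remaining > 0 && workers.exists(w => free(w) - assigned(w) > 0)) {
      val w = workers(i % workers.size)
      if (free(w) - assigned(w) > 0) { assigned(w) += 1; remaining -= 1 }
      i += 1
    }
  } else {
    // Consolidate: fill up each worker before moving on to the next,
    // so as few workers as possible are used.
    for (w <- workers if remaining > 0) {
      val take = math.min(free(w), remaining)
      assigned(w) += take
      remaining -= take
    }
  }
  assigned.toMap
}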
- The Worker creates an ExecutorRunner and sends an ExecutorStateChanged message to the Master
val manager = new ExecutorRunner(
  appId,
  execId,
  appDesc.copy(command = Worker.maybeUpdateSSLSettings(appDesc.command, conf)),
  cores_,
  memory_,
  self,
  workerId,
  host,
  webUi.boundPort,
  publicAddress,
  sparkHome,
  executorDir,
  akkaUrl,
  conf,
  appLocalDirs, ExecutorState.LOADING)
executors(appId + "/" + execId) = manager
manager.start()
coresUsed += cores_
memoryUsed += memory_
master ! ExecutorStateChanged(appId, execId, manager.state, None, None)
- The ExecutorRunner launches a CoarseGrainedExecutorBackend process
In the fetchAndRunExecutor method:
val builder = CommandUtils.buildProcessBuilder(appDesc.command, memory,
  sparkHome.getAbsolutePath, substituteVariables)
where appDesc.command is defined in the SparkDeploySchedulerBackend as:
val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
  args, sc.executorEnvs, classPathEntries ++ testingClassPath, libraryPathEntries, javaOpts)
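In effect, buildProcessBuilder turns that Command into a separate JVM per executor. The sketch below shows the shape of what gets launched; the class path, memory settings, flag values, and working directory are placeholders and approximations, not Spark's exact output:
// Illustration-only sketch of the child process the Worker ends up starting
// for each executor; arguments and paths are placeholders/approximations.
import scala.collection.JavaConverters._

val commandLine = Seq(
  "java",
  "-cp", "/path/to/spark/conf:/path/to/spark/lib/*",   // placeholder class path
  "-Xms1024M", "-Xmx1024M",                             // executor memory
  "org.apache.spark.executor.CoarseGrainedExecutorBackend",
  "--driver-url", "spark://CoarseGrainedScheduler@driver-host:port", // placeholder
  "--executor-id", "0",
  "--hostname", "worker-host",
  "--cores", "2",
  "--app-id", "app-20160101000000-0000")

val builder = new ProcessBuilder(commandLine.asJava)
builder.directory(new java.io.File("/path/to/executor/work/dir"))   // placeholder working dir
val process = builder.start()   // ExecutorRunner then monitors this process and
                                // reports its state back to the Worker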
- The CoarseGrainedExecutorBackend sends a RegisterExecutor message to the SparkDeploySchedulerBackend
override def onStart() {
  logInfo("Connecting to driver: " + driverUrl)
  rpcEnv.asyncSetupEndpointRefByURI(driverUrl).flatMap { ref =>
    // This is a very fast action so we can use "ThreadUtils.sameThread"
    driver = Some(ref)
    ref.ask[RegisteredExecutor.type](
      RegisterExecutor(executorId, self, hostPort, cores, extractLogUrls))
  }(ThreadUtils.sameThread).onComplete {
    // This is a very fast action so we can use "ThreadUtils.sameThread"
    case Success(msg) => Utils.tryLogNonFatalError {
      Option(self).foreach(_.send(msg)) // msg must be RegisteredExecutor
    }
    case Failure(e) => logError(s"Cannot register with driver: $driverUrl", e)
  }(ThreadUtils.sameThread)
}
- After receiving the RegisteredExecutor message from the SparkDeploySchedulerBackend, the CoarseGrainedExecutorBackend creates the Executor
override def receive: PartialFunction[Any, Unit] = {
  case RegisteredExecutor =>
    logInfo("Successfully registered with driver")
    val (hostname, _) = Utils.parseHostPort(hostPort)
    executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
At this point the Executor has been created successfully. :)