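Introduction
The previous installment, 《TaskScheduler源碼與任務提交原理淺析2》 ("A Brief Analysis of the TaskScheduler Source Code and Task Submission, Part 2"), described how the Driver splits a job into Stages, dispatches tasks according to which Executors are idle, and finally sends the task message from the DriverActor to the executorActor.
To understand how an Executor runs tasks, we first need to understand how an Executor registers with the Driver, so this article looks at the registration of the Application and of the Executor.
1. The Task class and related classes
1.1 The Task class
Spark divides the Tasks executed by Executors into two kinds, ShuffleMapTask and ResultTask; their source lives in the scheduler package.
Task is the interface between DAGScheduler and TaskScheduler: the DAGScheduler wraps every partition of every stage in the DAG as a task and finally submits the resulting TaskSet to the TaskScheduler.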
/**
 * A unit of execution. We have two kinds of Task's in Spark:
 * - [[org.apache.spark.scheduler.ShuffleMapTask]]
 * - [[org.apache.spark.scheduler.ResultTask]]
 *
 * A Spark job consists of one or more stages. The very last stage in a job consists of multiple
 * ResultTasks, while earlier stages consist of ShuffleMapTasks. A ResultTask executes the task
 * and sends the task output back to the driver application. A ShuffleMapTask executes the task
 * and divides the task output to multiple buckets (based on the task's partitioner).
 *
 * @param stageId id of the stage this task belongs to
 * @param partitionId index of the number in the RDD
 */
private[spark] abstract class Task[T](val stageId: Int, var partitionId: Int) extends Serializable
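A Task corresponds to one stageId and one partitionId.
It provides methods such as runTask() and kill().
It holds fields such as killed, TaskMetrics and TaskContext.
Besides these basic methods and fields, Task's companion object provides methods for serializing and deserializing the jars the application depends on. A Task has to make sure the worker node has the other dependencies it needs, which are registered with the SparkContext, so the companion object provides methods that write those dependencies to a stream and read them back.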
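To make the idea of shipping dependencies together with the task concrete, here is a minimal sketch. It is not Spark's actual implementation: the object name TaskDependencySketch, the two method signatures and the wire format (a jar count, then name and timestamp pairs, then the task bytes) are assumptions chosen for illustration.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Hypothetical, simplified stand-in for the helpers on Task's companion object.
object TaskDependencySketch {

  /** Writes the jar list (name -> timestamp) followed by the already-serialized task bytes. */
  def serializeWithDependencies(jars: Map[String, Long], taskBytes: Array[Byte]): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new DataOutputStream(buffer)
    out.writeInt(jars.size)
    for ((name, timestamp) <- jars) {
      out.writeUTF(name)
      out.writeLong(timestamp)
    }
    out.writeInt(taskBytes.length)
    out.write(taskBytes)
    out.flush()
    buffer.toByteArray
  }

  /** Reads the jar list and the task bytes back in the same order they were written. */
  def deserializeWithDependencies(bytes: Array[Byte]): (Map[String, Long], Array[Byte]) = {
    val in = new DataInputStream(new ByteArrayInputStream(bytes))
    val numJars = in.readInt()
    val jars = (0 until numJars).map { _ =>
      val name = in.readUTF()
      val timestamp = in.readLong()
      name -> timestamp
    }.toMap
    val taskBytes = new Array[Byte](in.readInt())
    in.readFully(taskBytes)
    (jars, taskBytes)
  }
}
```

With a layout like this, the receiving side can compare each jar's timestamp with what it already has and fetch only what is missing before it deserializes the task itself.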
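1.2 ShuffleMapTask
It corresponds to a ShuffleMap Stage, and its output serves as the input of other stages.
ShuffleMapTask (a Task[MapStatus]) overrides the methods for writing itself out and reading itself back in. What is written includes stageId, rdd, dep, partitionId, epoch and split (one partition). For stageId, rdd and dep there is a shared serialize/deserialize routine whose result is cached in memory before being written to the ObjectOutput. The serialized bytes are compressed with Gzip and kept in serializedInfoCache = new HashMap[Int, Array[Byte]]. This part is serialized and cached because stageId, rdd and dep are what really describe this shuffle task, and caching the serialized result reduces the load on the master node. Note that this describes an older code path; in the class signature shown below, the RDD and the ShuffleDependency arrive as a broadcast taskBinary instead.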
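The caching idea described above can be sketched as follows. This is an illustration only: the object SerializedInfoCacheSketch, its method names and the serializeCount counter are assumptions, and plain Java serialization stands in for whatever serializer Spark is configured with.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}
import scala.collection.mutable.HashMap

// Hypothetical sketch of a per-stage cache of Gzip-compressed serialized task info.
object SerializedInfoCacheSketch {
  // stageId -> Gzip-compressed serialized bytes
  private val serializedInfoCache = new HashMap[Int, Array[Byte]]

  // Counts real serializations, only so the effect of the cache can be observed.
  var serializeCount = 0

  /** Serializes and compresses `info` once per stage; later calls return the cached bytes. */
  def serializeInfo(stageId: Int, info: java.io.Serializable): Array[Byte] = synchronized {
    serializedInfoCache.getOrElseUpdate(stageId, {
      serializeCount += 1
      val buffer = new ByteArrayOutputStream()
      val objOut = new ObjectOutputStream(new GZIPOutputStream(buffer))
      objOut.writeObject(info)
      objOut.close() // closing also finishes the Gzip stream
      buffer.toByteArray
    })
  }

  /** Decompresses and deserializes bytes produced by serializeInfo. */
  def deserializeInfo(bytes: Array[Byte]): AnyRef = {
    val objIn = new ObjectInputStream(new GZIPInputStream(new ByteArrayInputStream(bytes)))
    try objIn.readObject() finally objIn.close()
  }
}
```

Because every task of a stage shares the same stageId, rdd and dep, the expensive serialization runs once per stage rather than once per task.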
/**
 * A ShuffleMapTask divides the elements of an RDD into multiple buckets (based on a partitioner
 * specified in the ShuffleDependency).
 *
 * See [[org.apache.spark.scheduler.Task]] for more information.
 *
 * @param stageId id of the stage this task belongs to
 * @param taskBinary broadcast version of the RDD and the ShuffleDependency. Once deserialized,
 *                   the type should be (RDD[_], ShuffleDependency[_, _, _]).
 * @param partition partition of the RDD this task is associated with
 * @param locs preferred task execution locations for locality scheduling
 */
private[spark] class ShuffleMapTask(
    stageId: Int,
    taskBinary: Broadcast[Array[Byte]],
    partition: Partition,
    @transient private var locs: Seq[TaskLocation])
  extends Task[MapStatus](stageId, partition.index) with Logging {
1.3 ResultTask
It corresponds to a Result Stage and produces the result directly.
/**
 * A task that sends back the output to the driver application.
 *
 * See [[Task]] for more information.
 *
 * @param stageId id of the stage this task belongs to
 * @param taskBinary broadcasted version of the serialized RDD and the function to apply on each
 *                   partition of the given RDD. Once deserialized, the type should be
 *                   (RDD[T], (TaskContext, Iterator[T]) => U).
 * @param partition partition of the RDD this task is associated with
 * @param locs preferred task execution locations for locality scheduling
 * @param outputId index of the task in this job (a job can launch tasks on only a subset of the
 *                 input RDD's partitions).
 */
private[spark] class ResultTask[T, U](
    stageId: Int,
    taskBinary: Broadcast[Array[Byte]],
    partition: Partition,
    @transient locs: Seq[TaskLocation],
    val outputId: Int)
  extends Task[U](stageId, partition.index) with Serializable {
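1.4 TaskSet
TaskSet is a data structure that wraps all the tasks of one stage so they can be submitted to the TaskScheduler.
A TaskSet is a group of identical tasks that can be pipelined: every task has exactly the same processing logic and differs only in the data it handles, each task being responsible for one partition. Pipelining can be called the cornerstone of big-data processing; only when the data is processed as a pipeline can the work be handed to a cluster to run. A task obtains its logic from the data source and then executes it in topological order (in practice, by calling the RDD's compute).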
/**
 * A set of tasks submitted together to the low-level TaskScheduler, usually representing
 * missing partitions of a particular stage.
 */
private[spark] class TaskSet(
    val tasks: Array[Task[_]],
    val stageId: Int,
    val attempt: Int,
    val priority: Int,
    val properties: Properties) {
  val id: String = stageId + "." + attempt
  override def toString: String = "TaskSet " + id
}
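As a rough illustration of how one task per partition ends up in a single TaskSet, consider the sketch below. SimpleTask, SimpleTaskSet and buildTaskSet are hypothetical names, the properties field is left out, and using the job id as the priority is an assumption of this sketch.

```scala
object TaskSetSketch {
  // Hypothetical stand-in for a Task: one instance per partition of a stage.
  final case class SimpleTask(stageId: Int, partitionId: Int)

  // Mirrors the TaskSet fields that matter here; properties is omitted for brevity.
  final class SimpleTaskSet(
      val tasks: Array[SimpleTask],
      val stageId: Int,
      val attempt: Int,
      val priority: Int) {
    val id: String = s"$stageId.$attempt"
    override def toString: String = "TaskSet " + id
  }

  /** Creates one task per missing partition and groups them into a single task set. */
  def buildTaskSet(stageId: Int, attempt: Int, jobId: Int, missingPartitions: Seq[Int]): SimpleTaskSet = {
    val tasks = missingPartitions.map(p => SimpleTask(stageId, p)).toArray
    // Assumption: earlier jobs get smaller priority values, so the job id doubles as priority.
    new SimpleTaskSet(tasks, stageId, attempt, priority = jobId)
  }
}
```

For example, TaskSetSketch.buildTaskSet(3, 0, 7, Seq(0, 2, 5)) yields a task set whose id is "3.0" and which holds three tasks that differ only in their partitionId.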
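2. Registering the Executor with the Driver
The Driver sends a LaunchTask message; the Executor receives it and handles it with launchTask.
Before that, however, we need to know that an Executor that has not registered with the Driver will not process any task even if it receives a LaunchTask instruction. So we first have to work out how an Executor registers on the Driver side.
2.1 Application registration
Executor registration happens as part of Application registration. Taking Standalone mode as an example:
SparkContext creates the schedulerBackend and the taskScheduler, with schedulerBackend held as a member of the TaskScheduler object
--> when start is called on the TaskScheduler object, it in fact calls backend.start()
--> backend.start() starts an AppClient; one of AppClient's parameters, the ApplicationDescription, wraps the command that runs CoarseGrainedExecutorBackend
--> AppClient internally starts a ClientActor which, once started, tries to register an Application by sending the Master the message actor ! RegisterApplication(appDescription)
Below is part of the Application registration code from the start function of SparkDeploySchedulerBackend: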
// Start executors with a few necessary configs for registering with the scheduler
val sparkJavaOpts = Utils.sparkJavaOpts(conf, SparkConf.isExecutorStartupConf)
val javaOpts = sparkJavaOpts ++ extraJavaOpts
val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
  args, sc.executorEnvs, classPathEntries ++ testingClassPath, libraryPathEntries, javaOpts)
val appUIAddress = sc.ui.map(_.appUIAddress).getOrElse("")
val appDesc = new ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
  appUIAddress, sc.eventLogDir, sc.eventLogCodec)
client = new AppClient(sc.env.actorSystem, masters, appDesc, this, conf)
client.start()
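AppClient submits the Application to the Master
AppClient is the interface through which the Application talks to the Master. It contains a member variable actor of type org.apache.spark.deploy.client.AppClient.ClientActor, which handles all interaction with the Master. The function calls involved in submitting an Application are:
ClientActor's preStart()
--> calls registerWithMaster()
--> calls tryRegisterAllMasters
--> actor ! RegisterApplication(appDescription)
--> the Master's receiveWithLogging function handles the RegisterApplication message.
Below is the code that handles the RegisterApplication(appDescription) message (part of receiveWithLogging in Master.scala):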
case RegisterApplication(description) => {
  if (state == RecoveryState.STANDBY) {
    // ignore, don't send response
  } else {
    logInfo("Registering app " + description.name)
    val app = createApplication(description, sender)
    registerApplication(app)
    logInfo("Registered app " + description.name + " with ID " + app.id)
    persistenceEngine.addApplication(app)
    sender ! RegisteredApplication(app.id, masterUrl)
    // Allocate resources to the Applications waiting for them; schedule() is called every
    // time a new Application joins or new resources become available
    schedule()
  }
}
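This code does the following:
- createApplication builds an ApplicationInfo, the data structure that describes the app
- It registers the Application, updates the corresponding mappings, and adds it to the waiting queue
- It persists the Application information with persistenceEngine; by default nothing is saved, and the two other options are saving to a file or to ZooKeeper
- It notifies the sender that registration succeeded
- It starts scheduling (allocating resources to the Applications waiting for them; schedule is called every time a new Application joins or new resources become available)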
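The steps listed above can be mimicked without any actor machinery. The sketch below is a toy model, not the Master's real code: MasterSketch, AppInfo, the persisted buffer and the generated application ids are all hypothetical, and the reply to the sender is modelled as the method's return value.

```scala
import scala.collection.mutable.{ArrayBuffer, HashMap}

object MasterRegistrationSketch {
  sealed trait RecoveryState
  case object Standby extends RecoveryState
  case object Alive extends RecoveryState

  // Hypothetical, minimal stand-in for ApplicationInfo.
  final case class AppInfo(id: String, name: String)

  class MasterSketch(var state: RecoveryState) {
    val idToApp = new HashMap[String, AppInfo]
    val waitingApps = new ArrayBuffer[AppInfo]
    val persisted = new ArrayBuffer[AppInfo] // stand-in for persistenceEngine
    var scheduleCalls = 0
    private var nextAppNumber = 0

    // Placeholder: the real schedule() hands out workers and cores.
    private def schedule(): Unit = scheduleCalls += 1

    /** Returns Some(appId) as the reply to the sender, or None when the message is ignored. */
    def registerApplication(name: String): Option[String] = {
      if (state == Standby) {
        None // a standby master ignores the request and sends no response
      } else {
        val app = AppInfo(f"app-$nextAppNumber%04d", name) // build the app record
        nextAppNumber += 1
        idToApp(app.id) = app // update the mappings
        waitingApps += app    // queue the app for resources
        persisted += app      // persist it
        schedule()            // try to allocate resources right away
        Some(app.id)          // reply: registration succeeded
      }
    }
  }
}
```

A standby master leaves its state untouched, while an alive master records the app, queues it, and triggers exactly one scheduling pass per registration.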
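2.2 The schedule function in Master
schedule() allocates resources to the Applications that are waiting for them. It is called every time a new Application joins or new resources become available. When choosing workers (executors) for an Application, there are currently two strategies:
- Spread out: distribute an Application over as many different nodes as possible. This is controlled by spark.deploy.spreadOut, which defaults to true, that is, spread out as much as possible.
- Consolidate: place an Application on as few nodes as possible.
For a given Application, there can be only one executor per worker; that executor may of course have more than one core. With strategy 1, task deployment is slower than with strategy 2, but GC time is shorter.
The source of schedule follows, with explanations in the inline comments: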
/*
 * Schedule the currently available resources among waiting apps. This method will be called
 * every time a new app joins or resource availability changes.
 */
private def schedule() {
  if (state != RecoveryState.ALIVE) { return }
  // First schedule drivers, they take strict precedence over applications
  // Randomization helps balance drivers
  val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
  val numWorkersAlive = shuffledAliveWorkers.size
  var curPos = 0
  for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
    // We assign workers to each waiting driver in a round-robin fashion. For each driver, we
    // start from the last worker that was assigned a driver, and continue onwards until we have
    // explored all alive workers.
    var launched = false
    var numWorkersVisited = 0
    while (numWorkersVisited < numWorkersAlive && !launched) {
      val worker = shuffledAliveWorkers(curPos)
      numWorkersVisited += 1
      if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
        launchDriver(worker, driver)
        waitingDrivers -= driver
        launched = true
      }
      curPos = (curPos + 1) % numWorkersAlive
    }
  }
  // Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app
  // in the queue, then the second app, etc.
  if (spreadOutApps) { // spread the load out: cores are handed out one at a time across workers
    // Try to spread out each app among all the nodes, until it has all its cores
    for (app <- waitingApps if app.coresLeft > 0) {
      // A worker is usable if its state is ALIVE, it does not yet run an executor of this
      // Application, and it has enough free memory.
      // Among usable workers, those with more free cores are preferred.
      val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
        .filter(canUse(app, _)).sortBy(_.coresFree).reverse
      val numUsable = usableWorkers.length
      val assigned = new Array[Int](numUsable) // Number of cores to give on each node
      var toAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)
      var pos = 0
      while (toAssign > 0) {
        if (usableWorkers(pos).coresFree - assigned(pos) > 0) {
          toAssign -= 1
          assigned(pos) += 1
        }
        pos = (pos + 1) % numUsable
      }
      // Now that we've decided how many cores to give on each node, let's actually give them
      for (pos <- 0 until numUsable) {
        if (assigned(pos) > 0) {
          val exec = app.addExecutor(usableWorkers(pos), assigned(pos))
          launchExecutor(usableWorkers(pos), exec)
          app.state = ApplicationState.RUNNING
        }
      }
    }
  } else { // use as many of each worker's cores as possible
    // Pack each app into as few nodes as possible until we've assigned all its cores
    for (worker <- workers if worker.coresFree > 0 && worker.state == WorkerState.ALIVE) {
      for (app <- waitingApps if app.coresLeft > 0) {
        if (canUse(app, worker)) {
          val coresToUse = math.min(worker.coresFree, app.coresLeft)
          if (coresToUse > 0) {
            val exec = app.addExecutor(worker, coresToUse)
            launchExecutor(worker, exec)
            app.state = ApplicationState.RUNNING
          }
        }
      }
    }
  }
}
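To see how the two strategies differ on the same input, the core-assignment arithmetic of schedule() can be isolated into two pure functions. This is a sketch for a single application only: CoreAssignmentSketch, assignSpreadOut and assignPacked are hypothetical names, and coresFree is assumed to list only usable workers, already sorted by free cores in descending order.

```scala
object CoreAssignmentSketch {

  /** Spread-out strategy: hand out cores one at a time, round-robin over the usable workers. */
  def assignSpreadOut(coresLeft: Int, coresFree: Array[Int]): Array[Int] = {
    val numUsable = coresFree.length
    val assigned = new Array[Int](numUsable) // number of cores to give on each worker
    var toAssign = math.min(coresLeft, coresFree.sum)
    var pos = 0
    while (toAssign > 0) {
      if (coresFree(pos) - assigned(pos) > 0) {
        toAssign -= 1
        assigned(pos) += 1
      }
      pos = (pos + 1) % numUsable
    }
    assigned
  }

  /** Packed strategy: take as many cores as possible from each worker before moving on. */
  def assignPacked(coresLeft: Int, coresFree: Array[Int]): Array[Int] = {
    var remaining = coresLeft
    coresFree.map { free =>
      val coresToUse = math.min(free, remaining)
      remaining -= coresToUse
      coresToUse
    }
  }
}
```

For an application that still needs 5 cores and three workers with 4, 2 and 1 free cores, assignSpreadOut returns (2, 2, 1), touching every worker, while assignPacked returns (4, 1, 0), leaving the third worker unused.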
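2.3 The launchExecutor function
Once a worker has been chosen and the number of CPU cores its executor needs has been decided, the Master calls launchExecutor(worker: WorkerInfo, exec: ExecutorDesc), which sends a launch request to the Worker and tells the AppClient that an executor has been added. It also updates the worker information kept by the Master: the executor is added, and the available CPU cores and memory are reduced. The Master does not wait for the executor to actually start on the worker before updating this information. If the worker fails to start the executor, it sends a FAILED message to the Master, and the Master simply updates the worker information again when that message arrives.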
def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc) {
  logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
  worker.addExecutor(exec)
  worker.actor ! LaunchExecutor(masterUrl,
    exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory)
  exec.application.driver ! ExecutorAdded(
    exec.id, worker.id, worker.hostPort, exec.cores, exec.memory)
}
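The optimistic bookkeeping described in this section (reserve resources first, undo later if the launch fails) might look roughly like the sketch below. WorkerRecord and ExecDesc are hypothetical, trimmed-down stand-ins for the Master's per-worker record and executor description, not the real classes.

```scala
import scala.collection.mutable.HashMap

object WorkerBookkeepingSketch {
  // Hypothetical, minimal executor description.
  final case class ExecDesc(fullId: String, cores: Int, memory: Int)

  // Hypothetical, trimmed-down counterpart of the Master's per-worker record.
  class WorkerRecord(val cores: Int, val memory: Int) {
    val executors = new HashMap[String, ExecDesc]
    var coresUsed = 0
    var memoryUsed = 0

    def coresFree: Int = cores - coresUsed
    def memoryFree: Int = memory - memoryUsed

    /** Called at launch time, before the worker has confirmed anything. */
    def addExecutor(exec: ExecDesc): Unit = {
      executors(exec.fullId) = exec
      coresUsed += exec.cores
      memoryUsed += exec.memory
    }

    /** Called when the worker later reports that the executor failed; safe to call twice. */
    def removeExecutor(exec: ExecDesc): Unit = {
      if (executors.remove(exec.fullId).isDefined) {
        coresUsed -= exec.cores
        memoryUsed -= exec.memory
      }
    }
  }
}
```

Reserving up front keeps a second scheduling pass from handing the same cores to another application while the first launch is still in flight; the cost is that a failed launch has to be rolled back explicitly.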
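2.4 Creating the Executor
The call chain below shows what happens after the Worker receives the LaunchExecutor message from the Master:
The LaunchExecutor handler creates an ExecutorRunner
--> ExecutorRunner launches, as a process, the command described by the ApplicationDescription that was prepared in SparkDeploySchedulerBackend
--> this starts the CoarseGrainedExecutorBackend carried in the ApplicationDescription
--> once started, CoarseGrainedExecutorBackend first uses the driverUrl parameter passed to it to send a RegisterExecutor message to CoarseGrainedSchedulerBackend::DriverActor
--> DriverActor replies with RegisteredExecutor
--> CoarseGrainedExecutorBackend creates an Executor
--> the Executor has been created.
After CoarseGrainedExecutorBackend starts, its preStart function does the following: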
override def preStart() {
  logInfo("Connecting to driver: " + driverUrl)
  driver = context.actorSelection(driverUrl)
  driver ! RegisterExecutor(executorId, hostPort, cores, extractLogUrls)
  context.system.eventStream.subscribe(self, classOf[RemotingLifecycleEvent])
}
After receiving the RegisteredExecutor message, CoarseGrainedExecutorBackend creates the Executor:
override def receiveWithLogging = {
  case RegisteredExecutor =>
    logInfo("Successfully registered with driver")
    val (hostname, _) = Utils.parseHostPort(hostPort)
    executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
    ......
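The registration handshake, and the earlier point that an unregistered Executor does not run tasks, can be modelled with plain method calls instead of actors. Everything in this sketch is hypothetical: DriverSketch, BackendSketch, the trimmed message fields and the duplicate-id reply are illustrative, and an early LaunchTask is simply dropped here, which is a simplification.

```scala
import scala.collection.mutable.{ArrayBuffer, HashSet}

object ExecutorHandshakeSketch {
  sealed trait Message
  final case class RegisterExecutor(executorId: String, hostPort: String, cores: Int) extends Message
  case object RegisteredExecutor extends Message
  final case class RegisterExecutorFailed(reason: String) extends Message
  final case class LaunchTask(taskId: Long) extends Message

  // Driver side: remembers which executors have registered and answers each request.
  class DriverSketch {
    private val registered = new HashSet[String]

    def receive(msg: RegisterExecutor): Message =
      if (registered.add(msg.executorId)) RegisteredExecutor
      else RegisterExecutorFailed("Duplicate executor ID: " + msg.executorId)
  }

  // Executor side: tasks are accepted only after RegisteredExecutor has arrived.
  class BackendSketch(executorId: String, hostPort: String, cores: Int) {
    private var executorCreated = false
    val launched = new ArrayBuffer[Long]

    /** Mirrors preStart: register with the driver and process its reply. */
    def preStart(driver: DriverSketch): Unit =
      receive(driver.receive(RegisterExecutor(executorId, hostPort, cores)))

    def receive(msg: Message): Unit = msg match {
      case RegisteredExecutor => executorCreated = true // the Executor would be created here
      case LaunchTask(taskId) if executorCreated => launched += taskId
      case LaunchTask(_) => () // not registered yet: the task is not run
      case _ => ()
    }
  }
}
```

A backend that receives LaunchTask before preStart has completed launches nothing, and the same message sent after the handshake is accepted.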
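References
Spark Core Source Code Analysis: The Spark Task Model (original title: Spark Core源碼分析: Spark任務模型)
Spark Internals: Executor Allocation in Detail (original title: Spark技術內幕:Executor分配詳解). This post is highly recommended; its author walks through Executor allocation in great detail with reference to the Spark source.
Please credit the author, Jason Ding, and the source when reposting.
GitCafe blog (http://jasonding1354.gitcafe.io/)
GitHub blog (http://jasonding1354.github.io/)
CSDN blog (http://blog.csdn.net/jasonding1354)
Jianshu page (http://www.lxweimin.com/users/2bd9b48f6ea8/latest_articles)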