
Spark Internals: Executor Allocation Explained
Source: http://www.tuicool.com/articles/VVFnIfq

When a user application calls new SparkContext, the cluster allocates executors for it on the Workers. How does that process work? This article walks through it in detail, using Standalone cluster mode as the example. The sequence diagram is shown below:

[Figure: sequence diagram of the executor allocation process]

  1. SparkContext creates the TaskScheduler and the DAGScheduler
    SparkContext is the main interface between a user application and the Spark cluster, and a user application generally creates it first. If you use the Spark shell, you do not need to create it explicitly; the system automatically creates an instance named sc. Besides setting some conf values, part of creating a SparkContext instance is determining, for example, how much memory each executor will use: if spark.executor.memory is present in the configuration it is used; otherwise the environment variables are consulted; if neither is set, the default of 512 MB applies. That default is quite conservative, especially now that memory is no longer so expensive.
    private[spark] val executorMemory = conf.getOption("spark.executor.memory")
      .orElse(Option(System.getenv("SPARK_EXECUTOR_MEMORY")))
      .orElse(Option(System.getenv("SPARK_MEM")).map(warnSparkMem))
      .map(Utils.memoryStringToMb)
      .getOrElse(512)
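    From the user's side, a minimal sketch of the usual way to set this value explicitly, so the 512 MB fallback never applies (the app name and master URL below are placeholders, not taken from the article):

    import org.apache.spark.{SparkConf, SparkContext}

    object ExecutorMemoryExample {
      def main(args: Array[String]): Unit = {
        // spark.executor.memory set here takes precedence over the
        // SPARK_EXECUTOR_MEMORY / SPARK_MEM environment variables and over the 512 MB default
        val conf = new SparkConf()
          .setAppName("executor-memory-example")  // placeholder app name
          .setMaster("spark://master:7077")       // placeholder Standalone master URL
          .set("spark.executor.memory", "2g")
        val sc = new SparkContext(conf)
        // ... application logic ...
        sc.stop()
      }
    }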
    Besides loading these cluster parameters, it also creates the TaskScheduler and the DAGScheduler:
    // Create and start the scheduler
    private[spark] var taskScheduler = SparkContext.createTaskScheduler(this, master)
    private val heartbeatReceiver = env.actorSystem.actorOf(
      Props(new HeartbeatReceiver(taskScheduler)), "HeartbeatReceiver")
    @volatile private[spark] var dagScheduler: DAGScheduler = _
    try {
      dagScheduler = new DAGScheduler(this)
    } catch {
      case e: Exception =>
        throw new SparkException("DAGScheduler cannot be initialized due to %s".format(e.getMessage))
    }

    // start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
    // constructor
    taskScheduler.start()
    The TaskScheduler schedules and manages tasks through different SchedulerBackends. Its responsibilities cover resource allocation and task scheduling: it implements FIFO and FAIR scheduling, which determine the order in which different jobs are scheduled, and it manages tasks, including submitting and killing them and launching speculative backup tasks for straggling tasks.
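    Which of FIFO and FAIR is used between jobs is selected through a configuration key; a minimal sketch (the SparkConf here is the same kind of object as in the earlier example):

    import org.apache.spark.SparkConf

    // FIFO is the default; FAIR makes the TaskScheduler interleave tasks from different jobs
    val fairConf = new SparkConf().set("spark.scheduler.mode", "FAIR")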
    Different cluster deployments, including local mode, provide their specific behavior through different implementations of SchedulerBackend. The class diagram of this module is shown below:

    [Figure: class diagram of TaskScheduler and the SchedulerBackend implementations]

  2. The TaskScheduler creates an AppClient through the SchedulerBackend
    SparkDeploySchedulerBackend is the SchedulerBackend for Standalone mode. By creating an AppClient, it can register the Application with the Standalone Master; the Master then uses the Application's information to allocate Workers to it, including how many CPU cores to use on each worker.
    private[spark] class SparkDeploySchedulerBackend(
        scheduler: TaskSchedulerImpl,
        sc: SparkContext,
        masters: Array[String])
      extends CoarseGrainedSchedulerBackend(scheduler, sc.env.actorSystem)
      with AppClientListener
      with Logging {

      var client: AppClient = null  // Note: the Application's interface to the Master
      // Note: the maximum total number of CPU cores the Application may use across the cluster
      val maxCores = conf.getOption("spark.cores.max").map(_.toInt)

      override def start() {
        super.start()

        // The endpoint for executors to talk to us
        val driverUrl = "akka.tcp://%s@%s:%s/user/%s".format(
          SparkEnv.driverActorSystemName,
          conf.get("spark.driver.host"),
          conf.get("spark.driver.port"),
          CoarseGrainedSchedulerBackend.ACTOR_NAME)
        // Note: no executors have been requested yet, so everything about them is still unknown.
        // These placeholders are substituted with real values when
        // org.apache.spark.deploy.worker.ExecutorRunner launches the ExecutorBackend.
        val args = Seq(driverUrl, "{{EXECUTOR_ID}}", "{{HOSTNAME}}", "{{CORES}}", "{{WORKER_URL}}")
        // Note: prepare the environment the executor needs at runtime
        val extraJavaOpts = sc.conf.getOption("spark.executor.extraJavaOptions")
          .map(Utils.splitCommandString).getOrElse(Seq.empty)
        val classPathEntries = sc.conf.getOption("spark.executor.extraClassPath").toSeq.flatMap { cp =>
          cp.split(java.io.File.pathSeparator)
        }
        val libraryPathEntries = sc.conf.getOption("spark.executor.extraLibraryPath").toSeq.flatMap { cp =>
          cp.split(java.io.File.pathSeparator)
        }

        // Start executors with a few necessary configs for registering with the scheduler
        val sparkJavaOpts = Utils.sparkJavaOpts(conf, SparkConf.isExecutorStartupConf)
        val javaOpts = sparkJavaOpts ++ extraJavaOpts
        // Note: on the Worker, org.apache.spark.deploy.worker.ExecutorRunner will launch
        // org.apache.spark.executor.CoarseGrainedExecutorBackend; this is the command it will run.
        val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
          args, sc.executorEnvs, classPathEntries, libraryPathEntries, javaOpts)
        // Note: org.apache.spark.deploy.ApplicationDescription carries everything needed to
        // register this Application.
        val appDesc = new ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
          sc.ui.appUIAddress, sc.eventLogger.map(_.logDir))
        client = new AppClient(sc.env.actorSystem, masters, appDesc, this, conf)
        client.start()
        // Note: after the Master replies that the Application was registered successfully,
        // AppClient calls back this class's connected(), which completes the registration.
        waitForRegistration()
      }
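    To relate the code above to user-facing settings, a minimal sketch of the configuration keys it reads and where each one ends up (the values are placeholders):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.cores.max", "8")                                  // read as maxCores for the ApplicationDescription
      .set("spark.executor.memory", "2g")                           // becomes sc.executorMemory, the per-executor memory
      .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")       // merged into javaOpts for the executor command
      .set("spark.executor.extraClassPath", "/opt/libs/extra.jar")  // split into classPathEntries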
    org.apache.spark.deploy.client.AppClientListener is a trait that exists mainly for callbacks between the SchedulerBackend and the AppClient. The AppClient calls back into the SchedulerBackend in the following five situations:
    The Application has been successfully registered with the Master, i.e. the connection to the cluster has been established;
    The connection has been lost. If SparkDeploySchedulerBackend::stop == false at that moment, the original Master has probably failed; once a new Master is ready, the connection will be re-established;
    The Application has stopped because of an unrecoverable error, in which case the failed TaskSets need to be resubmitted;
    An Executor has been added; the implementation here only logs the event and has no additional logic;
    An Executor has been removed. There are two possible causes: the Executor itself exited, in which case its exit code is available, or the Worker hosting it exited and took the Executor down with it; the two cases need to be handled differently.

private[spark] trait AppClientListener {
  def connected(appId: String): Unit

  /** Disconnection may be a temporary state, as we fail over to a new Master. */
  def disconnected(): Unit

  /** An application death is an unrecoverable failure condition. */
  def dead(reason: String): Unit

  def executorAdded(fullId: String, workerId: String, hostPort: String, cores: Int, memory: Int)

  def executorRemoved(fullId: String, message: String, exitStatus: Option[Int]): Unit
}
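A purely illustrative sketch of what an implementation of this trait could look like; it is not part of Spark (and the trait is private[spark], so this only compiles inside the spark package). The real listener is SparkDeploySchedulerBackend, which also tracks executor state and handles failover:

import org.apache.spark.Logging

// Illustrative listener that only logs each callback
class LoggingAppClientListener extends AppClientListener with Logging {
  def connected(appId: String): Unit =
    logInfo("Registered with master, application ID " + appId)
  def disconnected(): Unit =
    logWarning("Lost connection to master, waiting for failover")
  def dead(reason: String): Unit =
    logError("Application marked dead: " + reason)
  def executorAdded(fullId: String, workerId: String, hostPort: String, cores: Int, memory: Int): Unit =
    logInfo("Executor " + fullId + " added on " + hostPort + ": " + cores + " cores, " + memory + " MB")
  def executorRemoved(fullId: String, message: String, exitStatus: Option[Int]): Unit =
    logInfo("Executor " + fullId + " removed: " + message + ", exit status " + exitStatus)
}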
Summary: after SparkDeploySchedulerBackend has assembled the parameters required to launch Executors, it creates an AppClient and learns about Executors, connections and so on through the callbacks above; it then communicates with the ExecutorBackends through org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.DriverActor.

  3. The AppClient submits the Application to the Master
    AppClient is the interface through which the Application talks to the Master. It has a member variable actor of type org.apache.spark.deploy.client.AppClient.ClientActor, which is responsible for all interaction with the Master. The actor first registers the Application with the Master; if no success message arrives within 20 seconds, it registers again, and if registration still has not succeeded after 3 retries, the submission fails.
    def tryRegisterAllMasters() {
      for (masterUrl <- masterUrls) {
        logInfo("Connecting to master " + masterUrl + "...")
        val actor = context.actorSelection(Master.toAkkaUrl(masterUrl))
        actor ! RegisterApplication(appDescription)  // register the Application with the Master
      }
    }

    def registerWithMaster() {
      tryRegisterAllMasters()
      import context.dispatcher
      var retries = 0
      registrationRetryTimer = Some {
        // If no success message has arrived within 20 seconds, register again
        context.system.scheduler.schedule(REGISTRATION_TIMEOUT, REGISTRATION_TIMEOUT) {
          Utils.tryOrExit {
            retries += 1
            if (registered) {
              // Registration succeeded: cancel all further retries
              registrationRetryTimer.foreach(_.cancel())
            } else if (retries >= REGISTRATION_RETRIES) {
              // Retried more than the allowed number of times (3): give up on this cluster
              markDead("All masters are unresponsive! Giving up.")
            } else {
              // Start another round of registration attempts
              tryRegisterAllMasters()
            }
          }
        }
      }
    }
    The main messages handled by the actor are the following (a schematic dispatch sketch follows the list):
    RegisteredApplication(appId, masterUrl) => // from the Master: the Application has been registered successfully
    ApplicationRemoved(message) => // from the Master: the Application has been removed; whether it finished successfully or failed, it is eventually removed
    ExecutorAdded(id: Int, workerId: String, hostPort: String, cores: Int, memory: Int) => // from the Master: an Executor has been added
    ExecutorUpdated(id, state, message, exitStatus) => // from the Master: an Executor's state has changed; if the Executor has finished, the SchedulerBackend's executorRemoved callback is invoked
    MasterChanged(masterUrl, masterWebUiUrl) => // from a newly elected Master. The Master can use ZooKeeper for HA and persist the cluster's metadata there, so after becoming leader it restores the persisted Application, Driver and Worker information
    StopAppClient => // from AppClient::stop()
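    To summarize how these messages map onto the AppClientListener callbacks, here is a small self-contained toy model of the dispatch. The case classes and the Listener trait below are local stand-ins that only mirror the names of the real DeployMessages and of AppClientListener; this is not the actual ClientActor code:

    object ClientDispatchSketch {
      case class RegisteredApplication(appId: String, masterUrl: String)
      case class ApplicationRemoved(message: String)
      case class ExecutorAdded(id: Int, workerId: String, hostPort: String, cores: Int, memory: Int)
      case class ExecutorUpdated(id: Int, state: String, message: Option[String], exitStatus: Option[Int])
      case class MasterChanged(masterUrl: String, masterWebUiUrl: String)
      case object StopAppClient

      trait Listener {  // stands in for AppClientListener
        def connected(appId: String): Unit
        def dead(reason: String): Unit
        def executorAdded(fullId: String, workerId: String, hostPort: String, cores: Int, memory: Int): Unit
        def executorRemoved(fullId: String, message: String, exitStatus: Option[Int]): Unit
      }

      def dispatch(msg: Any, appId: String, listener: Listener): Unit = msg match {
        case RegisteredApplication(id, _) =>
          listener.connected(id)                       // registration confirmed, stop retrying
        case ApplicationRemoved(message) =>
          listener.dead("Master removed our application: " + message)
        case ExecutorAdded(id, workerId, hostPort, cores, memory) =>
          listener.executorAdded(appId + "/" + id, workerId, hostPort, cores, memory)
        case ExecutorUpdated(id, state, message, exitStatus) if state == "FINISHED" =>
          // Only a finished executor triggers the executorRemoved callback
          listener.executorRemoved(appId + "/" + id, message.getOrElse(""), exitStatus)
        case MasterChanged(_, _) => ()                 // reconnect to the new leader (omitted here)
        case StopAppClient       => ()                 // stop the actor (omitted here)
        case _                   => ()
      }
    }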

  4. The Master selects Workers based on the AppClient's submission
    After the Master receives the RegisterApplication request from the AppClient, it processes it as follows:
    case RegisterApplication(description) => {
      if (state == RecoveryState.STANDBY) {
        // ignore, don't send response
        // Note: the AppClient has its own 20-second timeout and will retry
      } else {
        logInfo("Registering app " + description.name)
        // app is ApplicationInfo(now, newApplicationId(date), desc, date, driver, defaultCores),
        // where driver is the AppClient's actor
        val app = createApplication(description, sender)
        // Store the app in the member variables maintained by the Master, e.g.
        //   apps += app; idToApp(app.id) = app
        //   actorToApp(app.driver) = app; addressToApp(appAddress) = app
        //   waitingApps += app
        registerApplication(app)
        logInfo("Registered app " + description.name + " with ID " + app.id)
        // Persist the app's metadata; this may go to ZooKeeper, the local file system, or nowhere
        persistenceEngine.addApplication(app)
        sender ! RegisteredApplication(app.id, masterUrl)
        // Allocate resources to the Applications waiting for them. schedule() is called whenever
        // a new Application joins or new resources become available.
        schedule()
      }
    }
    schedule() allocates resources to the Applications that are waiting for them; it is called whenever a new Application joins or new resources become available. When choosing workers (executors) for an Application there are currently two strategies:
    Spread out as much as possible, i.e. distribute an Application across as many different nodes as possible. This is controlled by spark.deploy.spreadOut, whose default value is true (spread out).
    Pack as tightly as possible, i.e. place an Application on as few nodes as possible.

For the same Application, a worker can host only one of its executors; that executor may, of course, have more than one core. The main logic is as follows:
if (spreadOutApps) {
  // Spread the load out as much as possible: if possible, assign each executor one core at a time
  // Try to spread out each app among all the nodes, until it has all its cores
  for (app <- waitingApps if app.coresLeft > 0) {  // allocate resources to waiting apps in FIFO order
    // A usable worker is ALIVE, does not already run an executor of this Application,
    // and has enough free memory. Among usable workers, prefer those with more free cores.
    val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
      .filter(canUse(app, _)).sortBy(_.coresFree).reverse
    val numUsable = usableWorkers.length
    // Number of cores to give on each node (pre-assigned core count per node)
    val assigned = new Array[Int](numUsable)
    var toAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)
    var pos = 0
    while (toAssign > 0) {
      if (usableWorkers(pos).coresFree - assigned(pos) > 0) {
        toAssign -= 1
        assigned(pos) += 1
      }
      pos = (pos + 1) % numUsable
    }
    // Now that we've decided how many cores to give on each node, let's actually give them
    for (pos <- 0 until numUsable) {
      if (assigned(pos) > 0) {
        val exec = app.addExecutor(usableWorkers(pos), assigned(pos))
        launchExecutor(usableWorkers(pos), exec)
        app.state = ApplicationState.RUNNING
      }
    }
  }
} else {
  // Use each worker's cores as fully as possible
  // Pack each app into as few nodes as possible until we've assigned all its cores
  for (worker <- workers if worker.coresFree > 0 && worker.state == WorkerState.ALIVE) {
    for (app <- waitingApps if app.coresLeft > 0) {
      if (canUse(app, worker)) {
        val coresToUse = math.min(worker.coresFree, app.coresLeft)
        if (coresToUse > 0) {
          val exec = app.addExecutor(worker, coresToUse)
          launchExecutor(worker, exec)
          app.state = ApplicationState.RUNNING
        }
      }
    }
  }
}
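To make the spread-out branch concrete, here is a small self-contained simulation of the round-robin core assignment above; the worker core counts (8, 6, 4) and the 10 requested cores are made-up numbers:

object SpreadOutSimulation {
  def main(args: Array[String]): Unit = {
    // Free cores on three hypothetical usable workers, already sorted by coresFree descending
    val coresFree = Array(8, 6, 4)
    val assigned = new Array[Int](coresFree.length)
    val coresLeft = 10  // cores the app still wants
    var toAssign = math.min(coresLeft, coresFree.sum)
    var pos = 0
    while (toAssign > 0) {
      // Skip workers whose free cores are already fully pre-assigned
      if (coresFree(pos) - assigned(pos) > 0) {
        toAssign -= 1
        assigned(pos) += 1
      }
      pos = (pos + 1) % coresFree.length
    }
    // With spreadOut, the 10 cores end up as 4, 3, 3 across the three workers
    println(assigned.mkString(", "))
  }
}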
After the workers have been chosen and the number of CPU cores for the executor on each worker has been determined, the Master calls launchExecutor(worker: WorkerInfo, exec: ExecutorInfo), sending a launch request to the Worker and an "executor added" message to the AppClient. At the same time it updates its own record of the worker, adding the executor and subtracting the allocated CPU cores and memory from the worker's available amounts. The Master does not wait until the executor has actually started successfully on the worker before updating that record; if the worker fails to start the executor, it sends a FAILED message to the Master, and the Master simply updates the worker's record again when it receives it. This keeps the logic simple.
def launchExecutor(worker: WorkerInfo, exec: ExecutorInfo) {
  logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
  // Update the worker's record: subtract the cores and memory taken by this executor
  worker.addExecutor(exec)
  // Ask the Worker to launch the executor
  worker.actor ! LaunchExecutor(masterUrl,
    exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory)
  // Tell the AppClient that the executor has been added
  exec.application.driver ! ExecutorAdded(
    exec.id, worker.id, worker.hostPort, exec.cores, exec.memory)
}
Summary: the current allocation scheme is still fairly coarse. For example, it does not take a node's overall load into account. The allocation of executors across nodes may look even, and judged purely by the statically assigned CPU cores and memory the load is balanced; in practice, however, some executors consume far more resources than others, which can leave the cluster unbalanced. Feedback from production workloads is needed to refine the allocation policy and achieve better resource utilization.

  5. The Worker creates the Executor according to the Master's resource allocation
    After the Worker receives the LaunchExecutor message from the Master, it creates an org.apache.spark.deploy.worker.ExecutorRunner. The Worker keeps track of its own resource usage, including the CPU cores and memory already in use, but this bookkeeping exists only for display in the web UI; the Master maintains the Worker's resource usage itself, so the Worker does not need to report it. The heartbeat between Worker and Master is only a liveness signal and carries no other information.
    The ExecutorRunner launches, as a separate process, the command prepared in org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend and carried by org.apache.spark.deploy.ApplicationDescription. At the time that command was assembled, the following arguments were still unknown:
    val args = Seq(driverUrl, "{{EXECUTOR_ID}}", "{{HOSTNAME}}", "{{CORES}}", "{{WORKER_URL}}")
    The ExecutorRunner replaces them with the actual values that have now been assigned:
    /** Replace variables such as {{EXECUTOR_ID}} and {{CORES}} in a command argument passed to us */
    def substituteVariables(argument: String): String = argument match {
      case "{{WORKER_URL}}" => workerUrl
      case "{{EXECUTOR_ID}}" => execId.toString
      case "{{HOSTNAME}}" => host
      case "{{CORES}}" => cores.toString
      case other => other
    }
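    A tiny self-contained illustration of this substitution applied to the argument list prepared earlier; the worker-side values below are invented for the example:

    object SubstitutionExample {
      // Hypothetical values the Worker would know for one executor
      val workerUrl = "akka.tcp://sparkWorker@worker-1:34567/user/Worker"
      val execId = 3
      val host = "worker-1"
      val cores = 4

      def substituteVariables(argument: String): String = argument match {
        case "{{WORKER_URL}}" => workerUrl
        case "{{EXECUTOR_ID}}" => execId.toString
        case "{{HOSTNAME}}" => host
        case "{{CORES}}" => cores.toString
        case other => other
      }

      def main(args: Array[String]): Unit = {
        val template = Seq("<driverUrl>", "{{EXECUTOR_ID}}", "{{HOSTNAME}}", "{{CORES}}", "{{WORKER_URL}}")
        // Prints: List(<driverUrl>, 3, worker-1, 4, akka.tcp://sparkWorker@worker-1:34567/user/Worker)
        println(template.map(substituteVariables))
      }
    }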
    Next it starts the org.apache.spark.executor.CoarseGrainedExecutorBackend carried by org.apache.spark.deploy.ApplicationDescription:

    def fetchAndRunExecutor() {
      try {
        // Create the executor's working directory
        val executorDir = new File(workDir, appId + "/" + execId)
        if (!executorDir.mkdirs()) {
          throw new IOException("Failed to create directory " + executorDir)
        }

        // Launch the process
        val command = getCommandSeq
        logInfo("Launch command: " + command.mkString("\"", "\" \"", "\""))
        val builder = new ProcessBuilder(command: _*).directory(executorDir)
        val env = builder.environment()
        for ((key, value) <- appDesc.command.environment) {
          env.put(key, value)
        }
        // In case we are running this from within the Spark Shell, avoid creating a "scala"
        // parent process for the executor command
        env.put("SPARK_LAUNCH_WITH_SCALA", "0")
        process = builder.start()
    After CoarseGrainedExecutorBackend starts, it first uses the driverUrl argument passed to it to send RegisterExecutor(executorId, hostPort, cores) to org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend::DriverActor. The DriverActor replies with RegisteredExecutor, at which point CoarseGrainedExecutorBackend creates an org.apache.spark.executor.Executor. The Executor is now fully created. The Executor itself is identical under Mesos, YARN and the standalone scheduler; only the way resources are allocated and managed differs.