First, there are five scenarios to consider:
1. The Spark master process is down
2. The Spark master dies during execution
3. All Spark workers are down before the task is submitted
4. A Spark worker dies while the application is running
5. All Spark workers die while the application is running
1. The Spark master process is down
No application can be submitted in the first place, so there is no application recovery to worry about.
2. The Spark master dies during execution
A running application is not affected, because execution takes place on the workers, and results are returned directly by the workers without going through the master.
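Master failure before submission can also be mitigated with standby masters. A minimal sketch of ZooKeeper-based recovery for standalone mode, using Spark's documented `spark.deploy.*` properties (the ZooKeeper addresses below are placeholders):

```properties
# Standalone-mode master HA via ZooKeeper (hostnames are placeholders)
spark.deploy.recoveryMode=ZOOKEEPER
spark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181
spark.deploy.zookeeper.dir=/spark
```

With this configured on every master, a standby master takes over leadership when the active one dies, and registered workers and applications fail over to it.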
3. All Spark workers are down before the task is submitted
The error is as follows; after the workers are started, the application recovers.
17/01/04 19:31:13 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
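The behavior in this scenario can be illustrated with a toy scheduler (plain Python, not Spark internals): submitted tasks stay pending while no workers are registered, and they run as soon as a worker registers.

```python
class ToyScheduler:
    """Toy model of a scheduler: tasks wait until a worker registers."""

    def __init__(self):
        self.workers = []
        self.pending = []
        self.completed = []

    def submit(self, task):
        self.pending.append(task)
        self._dispatch()

    def register_worker(self, worker_id):
        self.workers.append(worker_id)
        self._dispatch()

    def _dispatch(self):
        if not self.workers:
            # Mirrors the WARN above: the job holds until resources appear.
            print("WARN: Initial job has not accepted any resources; "
                  "check that workers are registered")
            return
        while self.pending:
            task = self.pending.pop(0)
            self.completed.append((task, self.workers[0]))

sched = ToyScheduler()
sched.submit("stage-0/task-0")     # all workers down: task stays pending
sched.register_worker("worker-1")  # worker comes back: task now runs
```

This is why restarting the workers is enough: the application was never lost, only parked.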
4. A Spark worker dies while the application is running
The error is as follows:
17/01/04 19:41:50 ERROR TaskSchedulerImpl: Lost executor 0 on 192.168.91.128: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
The RPC client is removed, and the scheduler checks whether the DAG has lost part of its lineage, recomputing it if so. The lost executor (executor 0 here) is deleted from BlockManagerMaster, and its work is re-dispatched to executors on other workers.
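The lineage-based recomputation described above can be sketched in plain Python (a toy model, not Spark code): each partition keeps the function that produced it, so a partition lost with its executor is rebuilt from lineage rather than restarting the whole job.

```python
class Partition:
    """Toy model of lineage-based recovery (illustrative only)."""

    def __init__(self, compute):
        self.compute = compute   # lineage: how to (re)build this partition
        self.data = compute()    # cached result, lost if the executor dies

    def lose(self):
        self.data = None         # simulate the hosting executor dying

    def get(self):
        if self.data is None:    # data lost: recompute from lineage
            self.data = self.compute()
        return self.data

part = Partition(lambda: [x * x for x in range(5)])
part.lose()                      # executor holding this partition dies
recovered = part.get()           # rebuilt from lineage, not from a restart
```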
5. All Spark workers die while the application is running
The error is as follows:
17/01/04 19:34:16 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.91.128: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
The application stalls, waiting for workers to register.
After the workers are restarted, the executors re-register: the unavailable executors are removed, and new ones are registered.
CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 0
CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (192.168.91.128:55126) with ID 1
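The two log lines above correspond to two driver-side steps, modeled here as a toy sketch (plain Python, not the real CoarseGrainedSchedulerBackend): removing an executor that is no longer tracked is a no-op that only logs a message, while the fresh executor registers under a new ID.

```python
class ToyDriverEndpoint:
    """Toy model of executor bookkeeping on the driver (not Spark code)."""

    def __init__(self):
        self.executors = {}  # executor ID -> address
        self.log = []

    def remove_executor(self, exec_id):
        if exec_id not in self.executors:
            # Matches the first log line: stale removal is harmless.
            self.log.append(f"Asked to remove non-existent executor {exec_id}")
            return
        del self.executors[exec_id]

    def register_executor(self, exec_id, address):
        self.executors[exec_id] = address
        self.log.append(f"Registered executor at {address} with ID {exec_id}")

driver = ToyDriverEndpoint()
driver.remove_executor(0)                            # dead executor: no-op
driver.register_executor(1, "192.168.91.128:55126")  # restarted worker's executor
```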
After the workers start, the application recovers and returns results normally, but the following error is still reported (this bug is fixed in versions after 2.1.0).
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler
The next post will cover setting up a Spark source-code debugging environment.