Spark on Yarn

Version: spark-2.3.0-bin-hadoop2.6
http://spark.apache.org/docs/latest/running-on-yarn.html

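Launching Spark on YARN

Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory that contains the (client-side) configuration files for the Hadoop cluster. These configs are used to write to HDFS and to connect to the YARN ResourceManager. The configuration files in this directory are distributed to the YARN cluster so that all containers used by the application share the same configuration. If the configuration references Java system properties or environment variables that are not managed by YARN, they should also be set in the Spark application's configuration (driver, executors, and the AM when running in client mode).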
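As a rough illustration of this prerequisite, the Python sketch below shows how a launcher wrapper might check for the configuration directory before calling spark-submit. The helper name `resolve_hadoop_conf_dir` and the order in which the two variables are checked are assumptions made for this example; they are not taken from Spark.

```python
import os


def resolve_hadoop_conf_dir(env=None):
    """Return (variable name, directory) for the first configured variable.

    Hypothetical helper: checking HADOOP_CONF_DIR before YARN_CONF_DIR is an
    assumption of this sketch, not a rule documented by Spark.
    """
    env = os.environ if env is None else env
    for name in ("HADOOP_CONF_DIR", "YARN_CONF_DIR"):
        value = env.get(name, "").strip()
        if value:
            return name, value
    raise RuntimeError(
        "Set HADOOP_CONF_DIR or YARN_CONF_DIR before submitting to YARN"
    )
```

Passing a plain dictionary instead of the real environment keeps the check easy to exercise in isolation.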

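There are two deploy modes for Spark on YARN. In cluster mode, the Spark driver runs inside an application master process that is managed by YARN, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.

Unlike the other cluster managers supported by Spark, where the master's address is specified in the --master parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration. Thus, the --master parameter is yarn.

To launch a Spark application in cluster mode: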

$ ./bin/spark-submit --class path.to.your.Class \
           --master yarn \
           --deploy-mode cluster \
           [options] <app jar> [app options]

For example:

$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue thequeue \
    examples/jars/spark-examples*.jar \
    10

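The example above starts a YARN client program, which starts the default Application Master. SparkPi then runs as a child thread of the Application Master. The client periodically polls the Application Master for status updates and displays them in the console, and it exits once your application has finished running. See the "Debugging your Application" section below for how to view driver and executor logs.

To launch a Spark application in client mode, do the same, but replace cluster with client. The following shows how to run spark-shell in client mode: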

$ ./bin/spark-shell --master yarn --deploy-mode client

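Adding Other JARs

In cluster mode, the driver runs on a different machine than the client, so SparkContext.addJar won't work out of the box with files that are local to the client. To make files on the client available to SparkContext.addJar, include them with the --jars option in the launch command.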

$ ./bin/spark-submit --class my.main.Class \
    --master yarn \
    --deploy-mode cluster \
    --jars my-other-jar.jar,my-other-other-jar.jar \
    my-main-jar.jar \
    app_arg1 app_arg2
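The launch commands shown so far all follow one pattern, which the Python sketch below assembles as an argument list, for example to hand to `subprocess.run`. The function `spark_submit_args` and its parameters are hypothetical names introduced for this illustration; only the spark-submit flags themselves come from the commands above.

```python
def spark_submit_args(app_jar, main_class, deploy_mode="cluster",
                      jars=(), app_args=()):
    """Assemble a spark-submit argument list for YARN (hypothetical helper)."""
    if deploy_mode not in ("cluster", "client"):
        raise ValueError("deploy_mode must be 'cluster' or 'client'")
    args = [
        "./bin/spark-submit",
        "--class", main_class,
        # On YARN the master is the literal string "yarn"; the ResourceManager
        # address is read from the Hadoop configuration instead.
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
    ]
    if jars:
        # --jars takes a single comma-separated value.
        args += ["--jars", ",".join(jars)]
    args.append(app_jar)
    args.extend(app_args)
    return args
```

Building a list rather than a single string avoids shell quoting issues when application arguments contain spaces.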

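Preparations

Running Spark on YARN requires a binary distribution of Spark that is built with YARN support. Binary distributions can be downloaded from the downloads page of the project website. To build Spark yourself, refer to Building Spark.

To make the Spark runtime jars accessible from the YARN side, you can specify spark.yarn.archive or spark.yarn.jars. For details, refer to Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is specified, Spark creates a zip file with all jars under $SPARK_HOME/jars and uploads it to the distributed cache.

Configuration

Most of the configs are the same for Spark on YARN as for other deployment modes. See the configuration page for more information on those. The configs that are specific to Spark on YARN are listed in the properties table further down.

Debugging your Application

In YARN terminology, executors and application masters run inside "containers". YARN has two modes for handling container logs after an application has completed. If log aggregation is turned on (with the yarn.log-aggregation-enable config), container logs are copied to HDFS and deleted from the local machine. These logs can then be viewed from anywhere on the cluster with the yarn logs command.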

yarn logs -applicationId <app ID>

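This command prints the contents of all log files from all containers of the given application. You can also view the container log files directly in HDFS using the HDFS shell or API. The directory where they are located can be found in your YARN configs (yarn.nodemanager.remote-app-log-dir and yarn.nodemanager.remote-app-log-dir-suffix). The logs are also available on the Spark Web UI under the Executors tab. For that, both the Spark history server and the MapReduce history server must be running, and yarn.log.server.url must be configured properly in yarn-site.xml. The log URL on the Spark history server UI redirects you to the MapReduce history server to show the aggregated logs.

When log aggregation is not turned on, logs are retained locally on each machine under YARN_APP_LOGS_DIR, which is usually configured as /tmp/logs or $HADOOP_HOME/logs/userlogs depending on the Hadoop version and installation. Viewing the logs for a container requires going to the host that contains them and looking in this directory. Subdirectories organize log files by application ID and container ID. The logs are also available on the Spark Web UI under the Executors tab, and this does not require running the MapReduce history server.

To review the per-container launch environment, increase yarn.nodemanager.delete.debug-delay-sec to a large value (e.g. 36000), and then access the application cache through yarn.nodemanager.local-dirs on the nodes on which containers are launched. This directory contains the launch script, JARs, and all environment variables used for launching each container. This process is particularly useful for debugging classpath problems. (Note that enabling this requires admin privileges on cluster settings and a restart of all node managers. Thus, it is not applicable to hosted clusters.)

To use a custom log4j configuration for the application master or executors, the following options are available:

Upload a custom log4j.properties using spark-submit, by adding it to the --files list of files to be uploaded with the application.
Add -Dlog4j.configuration=<location of configuration file> to spark.driver.extraJavaOptions (for the driver) or spark.executor.extraJavaOptions (for executors). Note that if using a file, the file: protocol should be explicitly provided, and the file needs to exist locally on all the nodes.
Update the $SPARK_CONF_DIR/log4j.properties file; it will be uploaded automatically along with the other configurations. Note that the other 2 options have higher priority than this option if multiple options are specified.

Note that for the first option, both the executors and the application master share the same log4j configuration, which may cause issues when they run on the same node (e.g. trying to write to the same log file).

If you need a reference to the proper location to put log files in YARN so that YARN can properly display and aggregate them, use spark.yarn.app.container.log.dir in your log4j.properties. For example, log4j.appender.file_appender.File=${spark.yarn.app.container.log.dir}/spark.log. For streaming applications, configuring RollingFileAppender and setting the file location to YARN's log directory avoids disk overflow caused by large log files, and the logs can be accessed using YARN's log utility.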
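To make the RollingFileAppender advice concrete, here is a small Python sketch that renders a log4j 1.x properties snippet writing into the container log directory. The function name, the appender name `file_appender` (borrowed from the example above), and the size, backup count, and pattern values are illustrative assumptions, not recommended settings.

```python
def container_log4j_properties(file_name="spark.log", max_size="50MB",
                               backups=5):
    """Render a log4j 1.x properties snippet (values are illustrative)."""
    lines = [
        "log4j.rootLogger=INFO, file_appender",
        "log4j.appender.file_appender=org.apache.log4j.RollingFileAppender",
        # log4j resolves ${...} from the spark.yarn.app.container.log.dir
        # system property named in the text above.
        "log4j.appender.file_appender.File="
        "${spark.yarn.app.container.log.dir}/" + file_name,
        # Rolling by size bounds the disk space one container can consume.
        "log4j.appender.file_appender.MaxFileSize=" + max_size,
        "log4j.appender.file_appender.MaxBackupIndex=" + str(backups),
        "log4j.appender.file_appender.layout=org.apache.log4j.PatternLayout",
        "log4j.appender.file_appender.layout.ConversionPattern="
        "%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n",
    ]
    return "\n".join(lines) + "\n"
```

The returned text could be written to a log4j.properties file and shipped with --files, as in the first option listed above.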

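To use a custom metrics.properties for the application master and executors, update the $SPARK_CONF_DIR/metrics.properties file. It will be uploaded automatically with the other configurations, so you don't need to specify it manually with --files.

Spark Properties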

Property Name	Default	Meaning
`spark.yarn.am.memory`	`512m`	Amount of memory to use for the YARN Application Master in client mode, in the same format as JVM memory strings (e.g. `512m`, `2g`). In cluster mode, use `spark.driver.memory` instead. Use lower-case suffixes, e.g. `k`, `m`, `g`, `t`, and `p`, for kibi-, mebi-, gibi-, tebi-, and pebibytes, respectively.
`spark.yarn.am.cores`	`1`	Number of cores to use for the YARN Application Master in client mode. In cluster mode, use `spark.driver.cores` instead.
`spark.yarn.am.waitTime`	`100s`	In `cluster` mode, time for the YARN Application Master to wait for the SparkContext to be initialized. In `client` mode, time for the YARN Application Master to wait for the driver to connect to it.
`spark.yarn.submit.file.replication`	The default HDFS replication (usually `3`)	HDFS replication level for the files uploaded into HDFS for the application. These include things like the Spark jar, the app jar, and any distributed cache files/archives.
`spark.yarn.stagingDir`	Current user's home directory in the filesystem	Staging directory used while submitting applications.
`spark.yarn.preserve.staging.files`	`false`	Set to `true` to preserve the staged files (Spark jar, app jar, distributed cache files) at the end of the job rather than delete them.
`spark.yarn.scheduler.heartbeat.interval-ms`	`3000`	The interval in ms in which the Spark application master heartbeats into the YARN ResourceManager. The value is capped at half the value of YARN's configuration for the expiry interval, i.e. `yarn.am.liveness-monitor.expiry-interval-ms`.
`spark.yarn.scheduler.initial-allocation.interval`	`200ms`	The initial interval in which the Spark application master eagerly heartbeats to the YARN ResourceManager when there are pending container allocation requests. It should be no larger than `spark.yarn.scheduler.heartbeat.interval-ms`. The allocation interval will be doubled on successive eager heartbeats if pending containers still exist, until `spark.yarn.scheduler.heartbeat.interval-ms` is reached.
`spark.yarn.max.executor.failures`	numExecutors * 2, with minimum of 3	The maximum number of executor failures before failing the application.
`spark.yarn.historyServer.address`	(none)	The address of the Spark history server, e.g. `host.com:18080`. The address should not contain a scheme (`http://`). Defaults to not being set since the history server is an optional service. This address is given to the YARN ResourceManager when the Spark application finishes to link the application from the ResourceManager UI to the Spark history server UI. For this property, YARN properties can be used as variables, and these are substituted by Spark at runtime. For example, if the Spark history server runs on the same node as the YARN ResourceManager, it can be set to `${hadoopconf-yarn.resourcemanager.hostname}:18080`.
`spark.yarn.dist.archives`	(none)	Comma-separated list of archives to be extracted into the working directory of each executor.
`spark.yarn.dist.files`	(none)	Comma-separated list of files to be placed in the working directory of each executor.
`spark.yarn.dist.jars`	(none)	Comma-separated list of jars to be placed in the working directory of each executor.
`spark.yarn.dist.forceDownloadSchemes`	(none)	Comma-separated list of schemes for which files will be downloaded to the local disk prior to being added to YARN's distributed cache. For use in cases where the YARN service does not support schemes that are supported by Spark, like http, https and ftp.
`spark.executor.instances`	`2`	The number of executors for static allocation. With `spark.dynamicAllocation.enabled`, the initial set of executors will be at least this large.
`spark.yarn.am.memoryOverhead`	AM memory * 0.10, with minimum of 384	Same as `spark.driver.memoryOverhead`, but for the YARN Application Master in client mode.
`spark.yarn.queue`	`default`	The name of the YARN queue to which the application is submitted.
`spark.yarn.jars`	(none)	List of libraries containing Spark code to distribute to YARN containers. By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to jars on HDFS, for example, set this configuration to `hdfs:///some/path`. Globs are allowed.
`spark.yarn.archive`	(none)	An archive containing needed Spark jars for distribution to the YARN cache. If set, this configuration replaces `spark.yarn.jars` and the archive is used in all the application's containers. The archive should contain jar files in its root directory. Like with the previous option, the archive can also be hosted on HDFS to speed up file distribution.
`spark.yarn.access.hadoopFileSystems`	(none)	A comma-separated list of secure Hadoop filesystems your Spark application is going to access. For example, `spark.yarn.access.hadoopFileSystems=hdfs://nn1.com:8032,hdfs://nn2.com:8032, webhdfs://nn3.com:50070`. The Spark application must have access to the filesystems listed and Kerberos must be properly configured to be able to access them (either in the same realm or in a trusted realm). Spark acquires security tokens for each of the filesystems so that the Spark application can access those remote Hadoop filesystems. `spark.yarn.access.namenodes` is deprecated, please use this instead.
`spark.yarn.appMasterEnv.[EnvironmentVariableName]`	(none)	Add the environment variable specified by `EnvironmentVariableName` to the Application Master process launched on YARN. The user can specify multiple of these to set multiple environment variables. In `cluster` mode this controls the environment of the Spark driver and in `client` mode it only controls the environment of the executor launcher.
`spark.yarn.containerLauncherMaxThreads`	`25`	The maximum number of threads to use in the YARN Application Master for launching executor containers.
`spark.yarn.am.extraJavaOptions`	(none)	A string of extra JVM options to pass to the YARN Application Master in client mode. In cluster mode, use `spark.driver.extraJavaOptions` instead. Note that it is illegal to set maximum heap size (-Xmx) settings with this option. Maximum heap size settings can be set with `spark.yarn.am.memory`.
`spark.yarn.am.extraLibraryPath`	(none)	Set a special library path to use when launching the YARN Application Master in client mode.
`spark.yarn.maxAppAttempts`	`yarn.resourcemanager.am.max-attempts` in YARN	The maximum number of attempts that will be made to submit the application. It should be no larger than the global number of max attempts in the YARN configuration.
`spark.yarn.am.attemptFailuresValidityInterval`	(none)	Defines the validity interval for AM failure tracking. If the AM has been running for at least the defined interval, the AM failure count will be reset. This feature is not enabled if not configured.
`spark.yarn.executor.failuresValidityInterval`	(none)	Defines the validity interval for executor failure tracking. Executor failures which are older than the validity interval will be ignored.
`spark.yarn.submit.waitAppCompletion`	`true`	In YARN cluster mode, controls whether the client waits to exit until the application completes. If set to `true`, the client process will stay alive reporting the application's status. Otherwise, the client process will exit after submission.
`spark.yarn.am.nodeLabelExpression`	(none)	A YARN node label expression that restricts the set of nodes AM will be scheduled on. Only versions of YARN greater than or equal to 2.6 support node label expressions, so when running against earlier versions, this property will be ignored.
`spark.yarn.executor.nodeLabelExpression`	(none)	A YARN node label expression that restricts the set of nodes executors will be scheduled on. Only versions of YARN greater than or equal to 2.6 support node label expressions, so when running against earlier versions, this property will be ignored.
`spark.yarn.tags`	(none)	Comma-separated list of strings to pass through as YARN application tags appearing in YARN ApplicationReports, which can be used for filtering when querying YARN apps.
`spark.yarn.keytab`	(none)	The full path to the file that contains the keytab for the principal specified above. This keytab will be copied to the node running the YARN Application Master via the Secure Distributed Cache, for renewing the login tickets and the delegation tokens periodically. (Works also with the "local" master)
`spark.yarn.principal`	(none)	Principal to be used to login to KDC, while running on secure HDFS. (Works also with the "local" master)
`spark.yarn.kerberos.relogin.period`	1m	How often to check whether the kerberos TGT should be renewed. This should be set to a value that is shorter than the TGT renewal period (or the TGT lifetime if TGT renewal is not enabled). The default value should be enough for most deployments.
`spark.yarn.config.gatewayPath`	(none)	A path that is valid on the gateway host (the host where a Spark application is started) but may differ for paths for the same resource in other nodes in the cluster. Coupled with `spark.yarn.config.replacementPath`, this is used to support clusters with heterogeneous configurations, so that Spark can correctly launch remote processes. The replacement path normally will contain a reference to some environment variable exported by YARN (and, thus, visible to Spark containers). For example, if the gateway node has Hadoop libraries installed on `/disk1/hadoop`, and the location of the Hadoop install is exported by YARN as the `HADOOP_HOME` environment variable, setting this value to `/disk1/hadoop` and the replacement path to `$HADOOP_HOME` will make sure that paths used to launch remote processes properly reference the local YARN configuration.
`spark.yarn.config.replacementPath`	(none)	See `spark.yarn.config.gatewayPath`.
`spark.security.credentials.${service}.enabled`	`true`	Controls whether to obtain credentials for services when security is enabled. By default, credentials for all supported services are retrieved when those services are configured, but it's possible to disable that behavior if it somehow conflicts with the application being run. For further details, see Running in a Secure Cluster.
`spark.yarn.rolledLog.includePattern`	(none)	Java Regex to filter the log files which match the defined include pattern; those log files will be aggregated in a rolling fashion. This is used with YARN's rolling log aggregation. To enable this feature on the YARN side, `yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds` should be configured in yarn-site.xml. This feature can only be used with Hadoop 2.6.4+. The Spark log4j appender needs to be changed to use FileAppender or another appender that can handle the files being removed while it is running. Based on the file name configured in the log4j configuration (like spark.log), the user should set the regex (spark*) to include all the log files that need to be aggregated.
`spark.yarn.rolledLog.excludePattern`	(none)	Java Regex to filter the log files which match the defined exclude pattern and those log files will not be aggregated in a rolling fashion. If the log file name matches both the include and the exclude pattern, this file will be excluded eventually.
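Several defaults in the table are rules rather than fixed values. The Python sketch below restates three of them: the AM memory overhead, the executor failure limit, and the doubling of the eager allocation interval. The function names are hypothetical, and the sketch assumes memory values are whole MiB and that the 10% overhead is truncated to an integer; it mirrors the wording of the table, not Spark's source code.

```python
def am_memory_overhead_mb(am_memory_mb):
    """AM memory * 0.10 with a minimum of 384 (assumed to be MiB)."""
    # Integer arithmetic avoids floating-point rounding; truncation is an
    # assumption of this sketch.
    return max(am_memory_mb * 10 // 100, 384)


def max_executor_failures(num_executors):
    """numExecutors * 2 with a minimum of 3."""
    return max(num_executors * 2, 3)


def allocation_intervals_ms(initial_ms=200, heartbeat_ms=3000):
    """Yield eager heartbeat intervals, doubling up to the regular interval."""
    if initial_ms <= 0 or heartbeat_ms <= 0:
        raise ValueError("intervals must be positive")
    interval = min(initial_ms, heartbeat_ms)
    while interval < heartbeat_ms:
        yield interval
        interval = min(interval * 2, heartbeat_ms)
    # From here on the regular heartbeat interval applies.
    yield heartbeat_ms
```

With the defaults from the table, `list(allocation_intervals_ms())` gives `[200, 400, 800, 1600, 3000]`, and a 512 MiB AM gets the 384 floor because 10% of 512 is below it.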

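Important notes

Whether core requests are honored in scheduling decisions depends on which scheduler is in use and how it is configured.
In cluster mode, the local directories used by the Spark executors and the Spark driver are the local directories configured for YARN (Hadoop YARN config yarn.nodemanager.local-dirs). If the user specifies spark.local.dir, it is ignored. In client mode, the Spark executors use the local directories configured for YARN, while the Spark driver uses those defined in spark.local.dir. This is because the Spark driver does not run on the YARN cluster in client mode; only the Spark executors do.
The --files and --archives options support specifying file names with the # syntax, similar to Hadoop. For example, you can specify --files localtest.txt#appSees.txt. This uploads the file you have locally named localtest.txt into HDFS, but it is linked to by the name appSees.txt, and your application should use the name appSees.txt to reference it when running on YARN.
The --jars option allows the SparkContext.addJar function to work if you are using it with local files and running in cluster mode. It does not need to be used if you are using it with HDFS, HTTP, HTTPS, or FTP files.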
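The `#` naming convention for --files and --archives can be illustrated with a short Python sketch that separates the local path from the name the application sees. `split_file_alias` is a hypothetical helper, and falling back to the file's base name when no alias is given is an assumption of this example.

```python
import posixpath


def split_file_alias(spec):
    """Split 'path#alias' into (local path, name seen by the application)."""
    path, sep, alias = spec.partition("#")
    if not path:
        raise ValueError("missing file path in %r" % spec)
    # Without an explicit alias, this sketch assumes the base name is kept.
    name = alias if sep and alias else posixpath.basename(path)
    return path, name
```

For the example in the notes, `split_file_alias("localtest.txt#appSees.txt")` returns `("localtest.txt", "appSees.txt")`, so the application opens appSees.txt rather than the original file name.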

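Running in a Secure Cluster

As covered in Security, Kerberos is used in a secure Hadoop cluster to authenticate principals associated with services and clients. This allows clients to make requests of these authenticated services, and the services to grant rights to the authenticated principals.

Hadoop services issue Hadoop tokens to grant access to the services and data. Clients must first acquire tokens for the services they will access and pass them along with their application as it is launched in the YARN cluster.

For a Spark application to interact with any Hadoop filesystem (for example hdfs, webhdfs, etc.), HBase, and Hive, it must acquire the relevant tokens using the Kerberos credentials of the user launching the application; that is, the principal whose identity will become that of the launched Spark application.

This is normally done at launch time: in a secure cluster, Spark automatically obtains a token for the cluster's default Hadoop filesystem, and potentially for HBase and Hive.

An HBase token will be obtained if HBase is on the classpath, the HBase configuration declares the application is secure (i.e. hbase-site.xml sets hbase.security.authentication to kerberos), and spark.security.credentials.hbase.enabled is not set to false.

Similarly, a Hive token will be obtained if Hive is on the classpath, its configuration includes a URI of the metadata store in hive.metastore.uris, and spark.security.credentials.hive.enabled is not set to false.

If an application needs to interact with other secure Hadoop filesystems, the tokens needed to access these clusters must be explicitly requested at launch time. This is done by listing them in the spark.yarn.access.hadoopFileSystems property.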

spark.yarn.access.hadoopFileSystems hdfs://ireland.example.org:8020/,webhdfs://frankfurt.example.org:50070/

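Spark supports integrating with other security-aware services through the Java Services mechanism (see java.util.ServiceLoader). To do so, implementations of org.apache.spark.deploy.yarn.security.ServiceCredentialProvider should be made available to Spark by listing their names in the corresponding file in the jar's META-INF/services directory. These plug-ins can be disabled by setting spark.security.credentials.{service}.enabled to false, where {service} is the name of the credential provider.

Configuring the External Shuffle Service

To start the Spark Shuffle Service on each NodeManager in your YARN cluster, follow these instructions:

Build Spark with the YARN profile. Skip this step if you are using a pre-packaged distribution.
Locate the spark-<version>-yarn-shuffle.jar. This should be under $SPARK_HOME/common/network-yarn/target/scala-<version> if you are building Spark yourself, and under yarn if you are using a distribution.
Add this jar to the classpath of all NodeManagers in your cluster.
In the yarn-site.xml on each node, add spark_shuffle to yarn.nodemanager.aux-services, then set yarn.nodemanager.aux-services.spark_shuffle.class to org.apache.spark.network.yarn.YarnShuffleService.
Increase the NodeManager's heap size by setting YARN_HEAPSIZE (1000 by default) in etc/hadoop/yarn-env.sh to avoid garbage collection issues during shuffle.
Restart all NodeManagers in your cluster.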
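The yarn-site.xml change in the steps above can be sketched in Python as follows. The helper names are hypothetical, and `mapreduce_shuffle` in the usage note is only an assumed example of an auxiliary service that may already be configured; the two property names and the class name come from the instructions.

```python
import xml.etree.ElementTree as ET

SHUFFLE_SERVICE = "spark_shuffle"
SHUFFLE_CLASS = "org.apache.spark.network.yarn.YarnShuffleService"


def shuffle_service_properties(existing_aux_services=""):
    """Return the yarn-site.xml properties for the Spark shuffle service."""
    services = [s.strip() for s in existing_aux_services.split(",") if s.strip()]
    # Append rather than replace, so services already configured keep working.
    if SHUFFLE_SERVICE not in services:
        services.append(SHUFFLE_SERVICE)
    return {
        "yarn.nodemanager.aux-services": ",".join(services),
        "yarn.nodemanager.aux-services.spark_shuffle.class": SHUFFLE_CLASS,
    }


def to_property_xml(properties):
    """Render each name/value pair as a Hadoop-style <property> element."""
    rendered = []
    for name, value in properties.items():
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
        rendered.append(ET.tostring(prop, encoding="unicode"))
    return "\n".join(rendered)
```

Calling `to_property_xml(shuffle_service_properties("mapreduce_shuffle"))` produces two `<property>` elements that could be merged into the `<configuration>` section of yarn-site.xml on each node.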

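The following extra configuration options are available when the shuffle service is running on YARN:

Launching your application with Apache Oozie

Apache Oozie can launch Spark applications as part of a workflow. In a secure cluster, the launched application will need the relevant tokens to access the cluster's services. If Spark is launched with a keytab, this is automatic. However, if Spark is to be launched without a keytab, the responsibility for setting up security must be handed over to Oozie.

The details of configuring Oozie for secure clusters and obtaining credentials for a job can be found on the Oozie web site, in the "Authentication" section of the specific release's documentation.

For Spark applications, the Oozie workflow must be set up for Oozie to request all tokens which the application needs, including:

The YARN resource manager.
The local Hadoop filesystem.
Any remote Hadoop filesystems used as a source or destination of I/O.
Hive, if used.
HBase, if used.
The YARN timeline server, if the application interacts with it.

To avoid Spark attempting, and then failing, to obtain Hive, HBase, and remote HDFS tokens, the Spark configuration must be set to disable token collection for these services.

The Spark configuration must include the lines:

spark.security.credentials.hive.enabled   false
spark.security.credentials.hbase.enabled  false

The configuration option spark.yarn.access.hadoopFileSystems must be unset.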
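The three requirements above can be expressed as one small transformation over a configuration mapping. This Python sketch uses a hypothetical helper, `oozie_launch_conf`, and represents the Spark configuration as a plain dictionary purely for illustration.

```python
def oozie_launch_conf(base_conf):
    """Return a copy of a Spark conf dict adjusted for an Oozie launch
    without a keytab (hypothetical helper)."""
    conf = dict(base_conf)
    # Oozie supplies the tokens, so Spark must not try to collect them.
    conf["spark.security.credentials.hive.enabled"] = "false"
    conf["spark.security.credentials.hbase.enabled"] = "false"
    # The filesystem list must be unset, not merely left empty.
    conf.pop("spark.yarn.access.hadoopFileSystems", None)
    return conf
```

Working on a copy leaves the caller's configuration untouched, which helps when the same base settings are reused for launches that do use a keytab.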

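Troubleshooting Kerberos

Debugging Hadoop/Kerberos problems can be "difficult". One useful technique is to enable extra logging of Kerberos operations in Hadoop by setting the HADOOP_JAAS_DEBUG environment variable.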

export HADOOP_JAAS_DEBUG=true

The JDK classes can be configured to enable extra logging of their Kerberos and SPNEGO/REST authentication via the system properties sun.security.krb5.debug and sun.security.spnego.debug=true

-Dsun.security.krb5.debug=true -Dsun.security.spnego.debug=true

All these options can be enabled in the Application Master:

spark.yarn.appMasterEnv.HADOOP_JAAS_DEBUG true
spark.yarn.am.extraJavaOptions -Dsun.security.krb5.debug=true -Dsun.security.spnego.debug=true
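When these settings are passed on the command line instead of a properties file, each one becomes a `--conf key=value` pair for spark-submit. The Python sketch below shows that conversion; `conf_flags` and `KERBEROS_DEBUG_CONF` are hypothetical names chosen for this example.

```python
KERBEROS_DEBUG_CONF = {
    "spark.yarn.appMasterEnv.HADOOP_JAAS_DEBUG": "true",
    "spark.yarn.am.extraJavaOptions":
        "-Dsun.security.krb5.debug=true -Dsun.security.spnego.debug=true",
}


def conf_flags(conf):
    """Turn a dict of Spark properties into repeated --conf arguments."""
    flags = []
    for key, value in conf.items():
        # Each pair stays one argument, so spaces in the value are preserved
        # when the list is passed to a process without going through a shell.
        flags += ["--conf", "%s=%s" % (key, value)]
    return flags
```

The resulting list can be spliced into a spark-submit argument list ahead of the application jar.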

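Finally, if the log level for org.apache.spark.deploy.yarn.Client is set to DEBUG, the log will include a list of all tokens obtained and their expiry details.

Using the Spark History Server to replace the Spark Web UI

When the application UI is disabled, the Spark History Server application page can be used as the tracking URL for running applications. This may be desirable on secure clusters, or to reduce the memory usage of the Spark driver. To set up tracking through the Spark History Server, do the following:

On the application side, set spark.yarn.historyServer.allowTracking=true in Spark's configuration. This tells Spark to use the history server's URL as the tracking URL if the application's UI is disabled.
On the Spark History Server, add org.apache.spark.deploy.yarn.YarnProxyRedirectFilter to the list of filters in the spark.ui.filters configuration.

Be aware that the history server information may not be up to date with the application's state.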
