Spark Storage Level

RDD Persistence

MEMORY_ONLY

Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.

MEMORY_AND_DISK

Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.

MEMORY_ONLY_SER (Java and Scala)

Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.

MEMORY_AND_DISK_SER (Java and Scala)

Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.

DISK_ONLY

Store the RDD partitions only on disk.

MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.

Same as the levels above, but replicate each partition on two cluster nodes.

OFF_HEAP (experimental)

Similar to MEMORY_ONLY_SER, but store the data in off-heap memory. This requires off-heap memory to be enabled.

Usage Example

Parameters of the StorageLevel constructor:

_useDisk: Boolean

_useMemory: Boolean

_useOffHeap: Boolean

_deserialized: Boolean

_replication: Int = 1
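
To make the five parameters concrete, here is a minimal sketch in plain Python that models them as a small record and maps the named levels onto flag combinations. The class StorageLevelFlags, the LEVELS table, and the describe helper are hypothetical names used only for illustration; they are not part of Spark. The flag values follow what Spark 2.x and later define for each named level, as far as I know; older releases defined OFF_HEAP differently, so check the StorageLevel source for your version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageLevelFlags:
    """Illustrative stand-in for the five constructor parameters (not Spark's class)."""
    use_disk: bool
    use_memory: bool
    use_off_heap: bool
    deserialized: bool
    replication: int = 1


# Assumed flag combinations for the named levels (Spark 2.x and later).
LEVELS = {
    "DISK_ONLY":           StorageLevelFlags(True,  False, False, False),
    "MEMORY_ONLY":         StorageLevelFlags(False, True,  False, True),
    "MEMORY_ONLY_SER":     StorageLevelFlags(False, True,  False, False),
    "MEMORY_AND_DISK":     StorageLevelFlags(True,  True,  False, True),
    "MEMORY_AND_DISK_SER": StorageLevelFlags(True,  True,  False, False),
    "MEMORY_ONLY_2":       StorageLevelFlags(False, True,  False, True, 2),
    "MEMORY_AND_DISK_2":   StorageLevelFlags(True,  True,  False, True, 2),
    "OFF_HEAP":            StorageLevelFlags(True,  True,  True,  False),
}


def describe(name: str) -> str:
    """Summarise where a level keeps data, in what form, and how many copies."""
    flags = LEVELS[name]
    places = [
        label
        for enabled, label in (
            (flags.use_memory, "memory"),
            (flags.use_disk, "disk"),
            (flags.use_off_heap, "off-heap"),
        )
        if enabled
    ]
    form = "deserialized" if flags.deserialized else "serialized"
    return f"{name}: {'+'.join(places)}, {form}, replication={flags.replication}"


if __name__ == "__main__":
    for level_name in LEVELS:
        print(describe(level_name))
```

Read this way, the "_SER" suffix corresponds to _deserialized being false, and the "_2" suffix to _replication being 2.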

Which Storage Level to Choose?

Spark’s storage levels are meant to provide different trade-offs between memory usage and CPU efficiency. We recommend going through the following process to select one:

If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.

If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access. (Java and Scala)

Don’t spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.

Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.
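
The selection process above can be sketched as a small decision function. This is a plain-Python illustration, not Spark code: the function name choose_storage_level and its four boolean inputs are hypothetical, and picking MEMORY_AND_DISK_SER (rather than MEMORY_AND_DISK) for the spill case is my own assumption, since the text does not say which disk-backed level to prefer.

```python
def choose_storage_level(
    fits_deserialized: bool,
    fits_serialized: bool,
    recompute_is_expensive: bool,
    need_fast_recovery: bool,
) -> str:
    """Return the name of a storage level following the four rules of thumb."""
    if fits_deserialized:
        # Rule 1: the default level is the most CPU-efficient when data fits.
        level = "MEMORY_ONLY"
    elif fits_serialized or not recompute_is_expensive:
        # Rule 2: serialize to save space. Rule 3: if recomputation is cheap,
        # avoid disk and let partitions that do not fit be recomputed.
        level = "MEMORY_ONLY_SER"
    else:
        # Rule 3: spill to disk only when recomputation is expensive
        # (assumption: use the serialized disk-backed level).
        level = "MEMORY_AND_DISK_SER"

    if need_fast_recovery:
        # Rule 4: replicated variants carry a "_2" suffix.
        level += "_2"
    return level


if __name__ == "__main__":
    print(choose_storage_level(True, True, False, False))
    print(choose_storage_level(False, False, True, True))
```

The returned string is only a name; in a real job it would be mapped to the corresponding StorageLevel constant and passed to the RDD's persist method.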
