Spark Fundamentals, Lecture 1: Resilient Distributed Datasets (RDDs)

Introduction

Any concept is introduced to solve some problem, and RDDs are no exception. Before diving in, let us pose a few questions about RDDs.

Why was the concept of an RDD introduced?

What exactly is an RDD?

How are RDDs implemented?

Below we work through these three questions one by one.

Part 1: Why introduce RDDs?

The best way to answer this question is to read the discussion in the original paper, which can be downloaded from the link below.

https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf

The opening of the paper explains why a model like the RDD was designed:

Although current frameworks provide numerous abstractions for accessing a cluster's computational resources, they lack abstractions for leveraging distributed memory. This makes them inefficient for an important class of emerging applications: those that reuse intermediate results across multiple computations.

Data reuse is common in many iterative machine learning and graph algorithms, including PageRank, K-means clustering, and logistic regression. Another compelling use case is interactive data mining, where a user runs multiple ad-hoc queries on the same subset of the data.

Unfortunately, in most current frameworks, the only way to reuse data between computations (e.g., between two MapReduce jobs) is to write it to an external stable storage system, e.g., a distributed file system. This incurs substantial overheads due to data replication, disk I/O, and serialization, which can dominate application execution times.

As this passage describes, the core problem to solve is that the frameworks of the time (MapReduce and Dryad) lacked abstractions for leveraging distributed memory. Applications that reuse intermediate results across multiple computations could not make good use of memory; instead they had to read from and write to disk repeatedly, with generally poor performance.

Distributed programming frameworks of the time were all based on an acyclic data flow model: records are loaded from stable physical storage (such as a distributed file system), passed through a DAG (Directed Acyclic Graph) of deterministic operators, and written back to stable storage. Having to write data out to disk and then reload it for every query incurs considerable overhead. RDDs let users explicitly cache working sets in memory across multiple queries, so that later queries can reuse the working set, which speeds queries up enormously.

Second, abstractions based on shared memory had not solved fault tolerance well: when a data block was lost, their remedy was to replicate data across nodes or to log the updates to the data across machines.

Existing abstractions for in-memory storage on clusters, such as distributed shared memory [24], key value stores [25], databases, and Piccolo [27], offer an interface based on fine-grained updates to mutable state (e.g., cells in a table). With this interface, the only ways to provide fault tolerance are to replicate the data across machines or to log updates across machines. Both approaches are expensive for data-intensive workloads, as they require copying large amounts of data over the cluster network, whose bandwidth is far lower than that of RAM, and they incur substantial storage overhead.

RDDs solve the fault-tolerance problem neatly through coarse-grained transformations on datasets and lineage tracking: a lost data block is rebuilt by replaying the dependency chain, rather than by copying the data directly.

Part 2: What is an RDD?

2.1 Basic concept

Let us see how the paper puts it:

In this paper, we propose a new abstraction called resilient distributed datasets (RDDs) that enables efficient data reuse in a broad range of applications. RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.

First, it emphasizes efficient reuse;

Second, fault tolerance, parallelism, and explicitly persisting intermediate results;

Third, optimizing data placement by controlling partitioning;

Fourth, a rich set of operators for manipulating the data.

The description you will see most often goes like this: a Resilient Distributed Dataset (RDD) provides a highly restricted shared-memory model. An RDD is a read-only, partitioned collection of records that can only be created through deterministic transformations (such as map, join, and group by) on other RDDs. These restrictions, however, make fault tolerance cheap to implement.

2.2 Fault tolerance

A major challenge in designing RDDs was fault tolerance. As Matei Zaharia puts it in the paper:

The main challenge in designing RDDs is defining a programming interface that can provide fault tolerance efficiently.

How do RDDs approach fault-tolerance design?

First, look at how a typical distributed shared memory system handles it:

Distributed shared memory systems: costly checkpoint-and-rollback mechanisms (heavy on I/O and storage).

RDD fault tolerance takes a completely different approach.

RDDs rebuild lost partitions through lineage, also described as logging updates: an RDD contains the information needed to derive it from other RDDs (its lineage). Updates take the form of coarse-grained transformations: rather than logging fine-grained changes to individual records, only the operation applied to the dataset as a whole is recorded, and the sequence of transformations that produced an RDD is stored. When data is lost, that transformation sequence (the lineage) is replayed to recompute and recover the lost partitions, which achieves fault tolerance.

In practice, when the lineage grows too long, rebuilding a partition becomes expensive as well. In that case a checkpoint should be created to persist some of the intermediate results. So the traditional approach has not been discarded entirely.
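
To make lineage-based recovery concrete, here is a minimal plain-Python sketch (not Spark code; the class and method names are invented for illustration): each dataset records only its parent and the coarse-grained transformation that produced it, so a lost partition can be recomputed by replaying the chain.

```python
# Minimal lineage sketch: an RDD-like object records its parent and the
# coarse-grained transformation that produced it, never per-record updates.
class SketchRDD:
    def __init__(self, partitions=None, parent=None, transform=None):
        self._partitions = partitions   # cached data; may have been lost
        self.parent = parent            # lineage: which RDD we came from
        self.transform = transform      # lineage: how we came from it

    def map(self, f):
        # Record only the operation; nothing is computed yet.
        return SketchRDD(parent=self,
                         transform=lambda part: [f(x) for x in part])

    def compute(self, i):
        # Serve partition i from cache if present, else replay the lineage.
        if self._partitions is not None:
            return self._partitions[i]
        return self.transform(self.parent.compute(i))

base = SketchRDD(partitions=[[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2).map(lambda x: x + 1)

# Neither map result was ever materialized; partition 0 is rebuilt on demand.
print(doubled.compute(0))  # -> [3, 5]
```

The point is the asymmetry: storing the two tiny transformation records above is far cheaper than replicating every intermediate partition over the network, which is exactly the trade the paper describes.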

2.3 Every RDD has five main properties

- A list of partitions (scalability)

- A function for computing each partition (transformations, actions)

- Dependencies on other RDDs (automatic fault tolerance)

- Optional: a partitioner for key-value RDDs (a custom partitioning function, usually hash-based)

- Optional: a list of preferred locations for each partition (locality-aware placement)

2.3.1 RDD element one: partitions

Questions

1. How is the number of partitions determined?

2. How many partitions are appropriate?

3. How are partitions assigned to executors?

How is the number of partitions determined?

The defaults

1. By default, when Spark reads a file from HDFS it creates one partition per input split, i.e., one HDFS split corresponds to one RDD partition, typically 64 MB or 128 MB in size. This happens automatically with no manual intervention, but the split boundaries between partitions follow lines (records) rather than raw byte blocks.

2. With the default file-reading call, the partition count can come out low. It is usually the number of HDFS blocks, but if a single line is very long (longer than the block size), the partition count shrinks.

3. A user-specified partition count only takes effect for uncompressed files. A compressed file can only get a single partition, and the only way to change the count afterwards is repartition.

Changing the partition count manually

Users can specify a partition count when reading data, and it carries through the subsequent transformations and actions:

sc.textFile(path, minPartitions)

The repartition transformation

repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]

repartition is simply coalesce with the given numPartitions and shuffle enabled.

The coalesce transformation

coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null): RDD[T]

The coalesce transformation is used to change the number of partitions. It can trigger RDD shuffling depending on the shuffle flag (disabled by default, i.e. false).

Getting partition information

def getPartitions: Array[Partition]

How many partitions are appropriate?

The number of tasks assigned to an executor matches the number of CPU cores allocated to that executor, and the number of tasks matches the number of partitions, so the partition count must be balanced against the CPU resources requested.

In general, small partition files mean a larger partition count, so the work can spread across more nodes and finish faster. Oversized partitions mean a smaller count, and if the count falls far below the number of allocated CPUs, many CPUs sit idle and the job runs slowly.

Spark runs at most one concurrent task per RDD partition, up to the total number of CPU cores in the cluster. The common guideline is that one CPU core can handle 2-3 tasks, so the ideal total partition count is roughly 2-3 times the number of allocated cores.

The maximum partition size is bounded by the executor's available memory: if partitions are too large and several big partitions land on the same executor, exceeding the shuffle memory, the job fails with an out-of-memory error.
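
The sizing guidelines above can be turned into a quick back-of-the-envelope calculation. This is a plain-Python sketch with illustrative numbers, not a Spark API:

```python
# Back-of-the-envelope partition sizing, following the guidelines above:
# aim for roughly 2-3 tasks (partitions) per allocated CPU core, and keep
# each partition comfortably below the memory available for it.
def suggest_partitions(total_cores, data_size_mb, max_partition_mb=128):
    by_cores = total_cores * 3                      # upper end of the 2-3x rule
    by_size = -(-data_size_mb // max_partition_mb)  # ceil division: cap partition size
    return max(by_cores, by_size)

# 10 GB of data on a 16-core allocation:
print(suggest_partitions(16, 10 * 1024))  # -> 80 partitions
```

Here the size cap dominates (80 > 48), so the data volume, not the core count, decides the partition count.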

How are partitions assigned to executors?

First, an answer quoted from Quora:

RDD is a dataset which is distributed, that is, it is divided into "partitions". Each of these partitions can be present in the memory or disk of different machines. If you want Spark to process the RDD, then Spark needs to launch one task per partition of the RDD. It's best that each task be sent to the machine having the partition that task is supposed to process. In that case, the task will be able to read the data of the partition from the local machine. Otherwise, the task would have to pull the partition data over the network from a different machine, which is less efficient. This scheduling of tasks (that is, allocation of tasks to machines) such that the tasks can read data "locally" is known as "locality aware scheduling".

Assigning partition tasks

As most of us know, co-locating computation with data is a central idea in distributed computing: when the data and the computation are separated, data has to be pulled from other nodes, which costs network I/O.

Task assignment therefore follows the principle of minimizing network transfer: Spark reads data into the RDD from the nearest node. Spark first assesses the partition layout: once tasks are assigned there are a number of executors, while the partitions sit on a number of workers, so Spark weighs the overall network transfer cost and assigns the different partitions to the appropriate executors.

Before dispatching tasks, the TaskSetManager computes data locality. The priority order is:

PROCESS_LOCAL: data is in the same JVM as the running code. This is the best locality possible.

NODE_LOCAL: data is on the same node. Examples might be in HDFS on the same node, or in another executor on the same node. This is a little slower than PROCESS_LOCAL because the data has to travel between processes.

NO_PREF: data is accessed equally quickly from anywhere and has no locality preference.

RACK_LOCAL: data is on the same rack of servers. Data is on a different server on the same rack so needs to be sent over the network, typically through a single switch.

ANY: data is elsewhere on the network and not in the same rack.

- Tasks are dispatched only after locality has been computed.

- When the driver has just started, the executors are not yet fully initialized, so the locality of some tasks is set to NO_PREF; a ShuffledRDD's locality is always NO_PREF, and such tasks may be preferentially assigned to non-local nodes.

- How this is implemented will be covered in the next installment, on the Spark computation framework.

Parallelism

In Standalone or YARN cluster mode, the default parallelism is the larger of the total number of cores in the cluster and 2 (see SparkDeploySchedulerBackend.defaultParallelism(); this backend extends CoarseGrainedSchedulerBackend):

override def defaultParallelism(): Int = {
  conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))
}

Repartitioning can also be driven by setting the parallelism via configuration:

[Figure: Spark parallelism settings]

2.3.2 RDD element two: functions

These fall into two classes:

- transformations

- actions

Input: while a Spark program runs, data enters Spark from external data space (e.g., distributed storage: textFile reading HDFS, or the parallelize method taking a Scala collection or local data). It enters Spark's runtime data space, becomes Spark data blocks, and is managed by the BlockManager.

Execution: once the input data has formed an RDD, transformation operators such as filter can operate on the data and turn the RDD into a new RDD; an action operator then triggers Spark to submit the job.

If the data needs to be reused, the cache operator can keep it in memory.

Output: when the program finishes, the data leaves the Spark runtime space and is stored in distributed storage (e.g., saveAsTextFile writing to HDFS) or returned as Scala data or collections (collect returns a Scala collection; count returns a Scala Int).


[Figure: Spark RDD operators]

Transformations and actions

Transformations

The following table lists some of the common transformations supported by Spark. Refer to the RDD API doc (Scala, Java, Python, R) and pair RDD functions doc (Scala, Java) for details.

map(func): Return a new distributed dataset formed by passing each element of the source through a function func.

filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.

flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

mapPartitions(func): Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.

mapPartitionsWithIndex(func): Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.

sample(withReplacement, fraction, seed): Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed.

union(otherDataset): Return a new dataset that contains the union of the elements in the source dataset and the argument.

intersection(otherDataset): Return a new RDD that contains the intersection of elements in the source dataset and the argument.

distinct([numTasks]): Return a new dataset that contains the distinct elements of the source dataset.

groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.

Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance.

Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.

reduceByKey(func, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

sortByKey([ascending], [numTasks]): When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.

join(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

cogroup(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.

cartesian(otherDataset): When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).

pipe(command, [envVars]): Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.

coalesce(numPartitions): Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.

repartition(numPartitions): Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.

repartitionAndSortWithinPartitions(partitioner): Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.

Actions

The following table lists some of the common actions supported by Spark. Refer to the RDD API doc (Scala, Java, Python, R) and pair RDD functions doc (Scala, Java) for details.

reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

count(): Return the number of elements in the dataset.

first(): Return the first element of the dataset (similar to take(1)).

take(n): Return an array with the first n elements of the dataset.

takeSample(withReplacement, num, [seed]): Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.

takeOrdered(n, [ordering]): Return the first n elements of the RDD using either their natural order or a custom comparator.

saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.

saveAsSequenceFile(path) (Java and Scala): Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).

saveAsObjectFile(path) (Java and Scala): Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().

countByKey(): Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.

foreach(func): Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.

Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details.

The Spark RDD API also exposes asynchronous versions of some actions, like foreachAsync for foreach, which immediately return a FutureAction to the caller instead of blocking on completion of the action. This can be used to manage or wait for the asynchronous execution of the action.

2.3.3 RDD element three: dependencies (automatic fault tolerance)

As a data structure, an RDD is essentially a read-only, partitioned collection of records.

You can understand an RDD by analogy with a database view (a DataFrame is actually a closer match).

An RDD can contain multiple partitions, each of which is a fragment of the dataset; a database table, likewise, can be split into many partitions.

Some SQL statements query a single table (say, a simple filter), while others join several tables. RDDs are the same: some operations work on a single collection, others relate several. At bottom it is the same situation, and this is where the concept of dependencies comes in.

Simply put, an operation on an RDD that needs only one RDD creates a narrow dependency, but an operation over multiple RDDs is not necessarily a wide dependency, and that is the part people find hard to grasp.

The reason reality is more complicated has to do with Spark's shuffle mechanism. Whether a shuffle occurs is, in essence, the criterion that separates wide from narrow dependencies, and an operation over multiple RDDs does not necessarily shuffle: each partition is processed by one executor, and if the matching data of both RDDs already sits in the same partition, the result is produced entirely inside that executor with no shuffle. The details follow.

The definition of dependencies

Start with a diagram:

[Figure: Dependencies]

As the diagram shows, dependencies divide into NarrowDependency and ShuffleDependency.

Narrow dependencies: NarrowDependency

The basic rule for deciding a narrow dependency:

Each partition of the parent RDD is used by at most one partition of the child RDD.

In other words, the data inside one parent partition cannot be split up; it must be handed over whole to a single partition of the child RDD. Splitting it would involve a shuffle, and once a shuffle is involved, the dependency becomes wide, which is why wide dependencies are also called shuffle dependencies.

Narrow dependencies come in two implementations. The first is the one-to-one dependency, OneToOneDependency:

@DeveloperApi
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] = List(partitionId)
}

As its getParents method shows, under a OneToOneDependency each partition of the child RDD depends only on the partition with the same index in its single parent RDD.

The other narrow dependency implementation is RangeDependency, which is used only by UnionRDD. UnionRDD concatenates multiple RDDs into one; its getParents is implemented as follows:

override def getParents(partitionId: Int): List[Int] = {
  if (partitionId >= outStart && partitionId < outStart + length) {
    List(partitionId - outStart + inStart)
  } else {
    Nil
  }
}
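
The Scala logic above can be mirrored in a few lines of plain Python to show how a union maps a child partition index back to a parent partition (an illustrative sketch, not Spark's actual code):

```python
# Mirror of RangeDependency.getParents: a child partition in the range
# [out_start, out_start + length) maps to exactly one parent partition.
def get_parents(partition_id, in_start, out_start, length):
    if out_start <= partition_id < out_start + length:
        return [partition_id - out_start + in_start]
    return []

# A union of rdd1 (3 partitions) and rdd2 (2 partitions): child partitions
# 0-2 come from rdd1, 3-4 from rdd2.
print(get_parents(1, in_start=0, out_start=0, length=3))  # -> [1] (from rdd1)
print(get_parents(4, in_start=0, out_start=3, length=2))  # -> [1] (from rdd2)
```

Each child partition depends on exactly one parent partition, which is why union is narrow despite taking multiple parent RDDs.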

As illustrated below:

[Figure: narrow dependencies]

Which operations produce narrow dependencies?

- map

- flatMap

- filter

- union

- some joins

- sample

Wide dependencies: ShuffleDependency

In a wide (shuffle) dependency:

A partition of the parent RDD is used by multiple partitions of the child RDD.

Because the data inside one parent partition is split up and sent to all partitions of the child RDD, a shuffle dependency also means there is a shuffle between the parent and child RDDs.

[Figure: wide dependencies]

Which operations produce wide dependencies?

- some joins

- groupByKey

- reduceByKey

- groupWith

- cartesian

Dependencies in transformations

First: a dependency relates exactly two RDDs.

Second: if a transformation has multiple parent RDDs, it may contain narrow and shuffle dependencies at the same time.

In the join operation shown in the figure:

RDD a and RDD c use the same partitioner; no partition of parent RDD a is split, so the dependency between them is narrow.

RDD b uses a different partitioner than RDD c; partitions of parent RDD b are split, so the dependency between them is a shuffle dependency.
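
This distinction can be demonstrated without Spark. In the plain-Python sketch below (names are invented for illustration), two key-value datasets are partitioned with the same hash partitioner, so matching keys are guaranteed to sit in same-index partitions and the join never moves data between partitions, i.e., it behaves like a narrow dependency:

```python
# If two key-value datasets are partitioned with the SAME partitioner,
# equal keys land in same-index partitions, so the join is narrow:
# each output partition reads one partition of each parent, no shuffle.
def hash_partition(pairs, n):
    parts = [[] for _ in range(n)]
    for k, v in pairs:
        parts[hash(k) % n].append((k, v))
    return parts

a = hash_partition([("x", 1), ("y", 2), ("z", 3)], 4)
b = hash_partition([("x", 10), ("y", 20)], 4)

# Narrow join: zip same-index partitions and join each pair locally.
joined = []
for pa, pb in zip(a, b):
    lookup = dict(pb)
    joined += [(k, (v, lookup[k])) for k, v in pa if k in lookup]

print(sorted(joined))  # -> [('x', (1, 10)), ('y', (2, 20))]
```

Had b been partitioned differently (or not at all), key "x" in a and key "x" in b could sit in different partition indices, and the join would first have to re-partition b, which is exactly the shuffle dependency case.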


What narrow and wide dependencies are used for

Concept 1: the computing chain

Concept 2: the stage

Concept 3: the DAG (directed acyclic graph)

Treat the computation of each RDD partition's data as one parallel task. Each parallel task consists of a computing chain, and each chain is handed to one CPU core to execute; together, the cluster's cores compute all the partitions of the RDD.

When intermediate results need not be kept

Breaking the chain: shuffle dependencies

A shuffle dependency needs all the partitions of the parent RDD.

So the parent RDD is required in full at every shuffle; if it is not saved, it has to be recomputed each time it is used.

Saving it is the better choice, so the computing chain is broken at each shuffle and divided into stages. Shuffle dependencies sit at the boundaries between stages, and the stages together form the DAG.

[Figure: the DAG of a job]

2.3.4 RDD element four: the partitioner

HashPartitioner and RangePartitioner

Because the partitioner indirectly determines the number of partitions in the downstream RDD and how many records land in each, choosing an appropriate partitioner can noticeably improve parallel performance.

The hash partitioner is simple to implement and fast, but it has one obvious weakness: since it pays no attention to the distribution of key values, how keys scatter across partitions depends entirely on the data, and in some cases a few partitions receive much more data while others receive little.

The range partitioner avoids this problem to a large degree: it tries to give every partition roughly the same amount of data, and the upper bounds of all partitions' key ranges are ordered.

HashPartitioner

- Function: takes the hashCode of a record's key modulo the partition count to obtain that key's partition id in the downstream RDD. Null keys are supported and map to partition 0. This partitioner is suitable for essentially all RDD data types.

- Limitation example: to assign all URLs of the same host to one node, partitioning directly on the raw URL will not work; a custom partitioner (e.g., one that hashes only the hostname) is needed.
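
The behavior just described boils down to a non-negative modulo over the key's hash code. A plain-Python approximation (hash stands in for Scala's hashCode; this is a sketch, not Spark's code):

```python
# Sketch of HashPartitioner: the key's hash code modulo the partition
# count, with None keys pinned to partition 0, as described above.
def get_partition(key, num_partitions):
    if key is None:
        return 0
    # Python's % with a positive modulus is already non-negative,
    # even for negative hash values.
    return hash(key) % num_partitions

print(get_partition(None, 8))               # -> 0
print(0 <= get_partition("spark", 8) < 8)   # -> True
```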



RangePartitioner

- Used mainly by the sorting-related RDD APIs; for example, sortByKey uses a RangePartitioner underneath. It works in two steps.

- Step 1: draw a sample from the whole RDD, sort the sample, and compute the largest key of each partition, producing an Array[KEY] variable rangeBounds.

- Step 2: determine which interval of rangeBounds a key falls into, which gives that key's partition id in the downstream RDD. This partitioner requires the RDD's key type to be sortable.

- Example: to sort, partition the data by range; once each partition is sorted locally, concatenating the partitions in order yields the correct global ordering.
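
The two steps can be sketched in plain Python: sample and sort to get rangeBounds, then binary-search each key into its range. This is illustrative only; Spark's real RangePartitioner uses weighted reservoir sampling and other refinements.

```python
import bisect
import random

# Step 1: sample the keys, sort the sample, and pick num_partitions - 1
# upper bounds -> rangeBounds.
def compute_range_bounds(keys, num_partitions, sample_size=100):
    sample = sorted(random.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_partitions
    return [sample[int(step * (i + 1)) - 1] for i in range(num_partitions - 1)]

# Step 2: a key's partition id is the index of the first bound >= key.
def get_partition(key, range_bounds):
    return bisect.bisect_left(range_bounds, key)

keys = list(range(1000))
bounds = compute_range_bounds(keys, num_partitions=4)
parts = [get_partition(k, bounds) for k in keys]

# Partition ids are non-decreasing in key order, so sorting each partition
# locally and concatenating the partitions gives a global sort.
print(parts == sorted(parts))  # -> True
```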


2.3.5 RDD element five: locality optimization

Data locality is an important factor in Spark job performance. If the data and the code that processes it are together, computation is fast. What if they are not? One of them has to move so that the two end up together, and it is usually more efficient to move the code than the data, since code is small. Spark schedules tasks around exactly this principle: data locality (move the code, not the data).

- PROCESS_LOCAL: data and code are in the same JVM. This is the ideal locality.

- NODE_LOCAL: data and code are on the same node, for example on the same HDFS storage node, or in two different executors on the same node. This locality level is a little slower than PROCESS_LOCAL, because the data has to cross process boundaries.

- NO_PREF: the data can be accessed equally quickly from anywhere; all locations are treated the same.

- RACK_LOCAL: the data is on a different machine in the same rack, so it must travel over the network, typically through a single switch.

- ANY: the data is elsewhere on the network, not in the same rack.

- Spark prefers the best available locality level, but when an idle executor holds no unprocessed data, Spark degrades the level. Two options are involved: (1) wait for a busy CPU to free up and then start the task there, so the data does not have to move; (2) immediately start a task on a remote node and ship the data over.

- Spark's policy is to wait a short while (spark.locality.wait); if a busy CPU frees up within the wait, it takes option 1, otherwise option 2.
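
This wait-then-degrade policy is a form of delay scheduling. The decision can be sketched as follows (plain Python; a deliberately simplified model of the scheduler, with a per-level wait budget standing in for spark.locality.wait):

```python
# Delay-scheduling sketch: prefer launching a task at its best locality
# level; if no slot frees up within that level's wait budget, degrade
# one level and try again.
LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]

def pick_level(free_slot_at, wait_per_level=3.0):
    """free_slot_at: seconds until a slot opens at each level, by name."""
    waited = 0.0
    for level in LEVELS:
        if free_slot_at.get(level, float("inf")) <= waited + wait_per_level:
            return level          # option 1: wait, keep the data local
        waited += wait_per_level  # budget exhausted: degrade a level
    return "ANY"                  # option 2: run remotely, ship the data

# A process-local slot frees in 10 s, a node-local slot in 2 s:
print(pick_level({"PROCESS_LOCAL": 10.0, "NODE_LOCAL": 2.0}))  # -> NODE_LOCAL
```

Waiting 3 seconds for a process-local slot would not pay off here, but a node-local slot opens within the budget, so the scheduler settles for NODE_LOCAL instead of shipping the data anywhere.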

Part 3: How are RDDs implemented under the hood?

How RDDs are implemented

First: an RDD is a distributed dataset, so its parts are stored across multiple machines.

Second: each part takes the form of a Block.

Third: each executor starts a BlockManagerSlave to manage the Blocks it operates on.

Fourth: Block metadata is held by the BlockManagerMaster on the driver node; after a BlockManagerSlave creates a Block, it registers the Block with the BlockManagerMaster.

Fifth: the BlockManagerMaster maintains the mapping between RDDs and Blocks; when an RDD no longer needs to be stored, it sends the BlockManagerSlaves instructions to delete the corresponding Blocks.


[Figure: the logical and physical architecture of an RDD]

A user program operates on an RDD through a series of functions, transforming it into new RDDs. The BlockManager manages the RDD's physical partitions; each Block is a data block on some node and can live in memory or on disk.

A partition of an RDD is a logical data block that corresponds to a physical Block. In essence, an RDD in code is a metadata structure for the data: it records the data partitions and their logical mapping, as well as the dependency and transformation relationships between RDDs.

[Figure: the Spark programming model]

[Figure: the relationship between data and computation]
