MapReduce Working Mechanism and Serialization

MapReduce Execution Flow

image

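MapReduce execution steps

1. Map task processing

1.1 Read the file from HDFS and parse each line into a <k,v> pair. The map function is called once for each pair. Example: <0,hello you> <10,hello me>

1.2 Override map(): receive the <k,v> pairs produced in 1.1, process them, and emit new <k,v> pairs. Example: <hello,1> <you,1> <hello,1> <me,1>

1.3 Partition the <k,v> pairs emitted in 1.2. By default there is a single partition. See Partitioner.

1.4 Within each partition, sort the data by key and group it. Grouping means collecting the values of identical keys into one collection. After sorting: <hello,1> <hello,1> <me,1> <you,1> After grouping: <hello,{1,1}><me,{1}><you,{1}>

1.5 (Optional) Reduce the grouped data locally. See Combiner.

2. Reduce task processing

2.1 The outputs of the map tasks are copied over the network to the reduce nodes, partition by partition. See the analysis of the shuffle process.

2.2 Merge and sort the outputs of the map tasks. Override reduce(): it receives the grouped data, applies your own business logic, and emits new <k,v> pairs. Example: <hello,2> <me,1> <you,1>

2.3 Write the <k,v> pairs emitted by reduce to HDFS.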
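The steps above can be traced without a cluster. The following is a minimal sketch in plain Java, not Hadoop code: the class name `WordCountFlowSketch` and its methods are made up for illustration, and it assumes words are separated by whitespace. It runs the map, sort/group, and reduce steps in memory on the two example lines.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory walk-through of steps 1.2, 1.4 and 2.2 (no Hadoop involved).
final class WordCountFlowSketch {

    /** Step 1.2: split one line into <word, 1> pairs. */
    static List<Map.Entry<String, Long>> map(String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1L));
            }
        }
        return out;
    }

    /** Step 1.4: sort by key (TreeMap keeps keys ordered) and group values of equal keys. */
    static TreeMap<String, List<Long>> sortAndGroup(List<Map.Entry<String, Long>> pairs) {
        TreeMap<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Step 2.2: sum the values of each group. */
    static TreeMap<String, Long> reduce(TreeMap<String, List<Long>> grouped) {
        TreeMap<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, List<Long>> group : grouped.entrySet()) {
            long total = 0L;
            for (long v : group.getValue()) {
                total += v;
            }
            result.put(group.getKey(), total);
        }
        return result;
    }

    /** Runs all steps on the given input lines. */
    static TreeMap<String, Long> run(List<String> lines) {
        List<Map.Entry<String, Long>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        return reduce(sortAndGroup(mapped));
    }

    public static void main(String[] args) {
        // prints {hello=2, me=1, you=1}
        System.out.println(run(Arrays.asList("hello you", "hello me")));
    }
}
```

Partitioning (1.3) and the network copy (2.1) are left out here because a single in-memory map stands in for the one default partition.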

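Java implementation

Note: org.apache.hadoop.fs.FileUtil.java needs to be imported.

1. First create a file named hello and upload it to HDFS

image

2. Then write the code that counts the words in the file (the commented-out lines in the code can be omitted, though keeping them does no harm)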

<pre style="margin: 0px; padding: 0px; white-space: pre-wrap; word-wrap: break-word; font-family: "Courier New" !important; font-size: 12px !important;">package mapreduce;

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountApp {
    static final String INPUT_PATH = "hdfs://chaoren:9000/hello";
    static final String OUT_PATH = "hdfs://chaoren:9000/out";

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Remove the output directory if it already exists
        FileSystem fileSystem = FileSystem.get(new URI(INPUT_PATH), conf);
        Path outPath = new Path(OUT_PATH);
        if (fileSystem.exists(outPath)) {
            fileSystem.delete(outPath, true);
        }

        Job job = new Job(conf, WordCountApp.class.getSimpleName());

        // 1.1 Where the input file is located
        FileInputFormat.setInputPaths(job, INPUT_PATH);
        // How the input is formatted: each line is parsed into a key-value pair
        //job.setInputFormatClass(TextInputFormat.class);

        // 1.2 The custom map class
        job.setMapperClass(MyMapper.class);
        // Types of the map output <k2,v2>. Can be omitted when <k3,v3> has the same types as <k2,v2>
        //job.setMapOutputKeyClass(Text.class);
        //job.setMapOutputValueClass(LongWritable.class);

        // 1.3 Partitioning
        //job.setPartitionerClass(org.apache.hadoop.mapreduce.lib.partition.HashPartitioner.class);
        // Run a single reduce task
        //job.setNumReduceTasks(1);

        // 1.4 Sorting and grouping

        // 1.5 Combining

        // 2.2 The custom reduce class
        job.setReducerClass(MyReducer.class);
        // Types of the reduce output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // 2.3 Where to write the output
        FileOutputFormat.setOutputPath(job, outPath);
        // The class that formats the output file
        //job.setOutputFormatClass(TextOutputFormat.class);

        // Submit the job to the JobTracker and wait for it to finish
        job.waitForCompletion(true);
    }

    /**
     * KEYIN    (K1) the byte offset of the line
     * VALUEIN  (V1) the text of the line
     * KEYOUT   (K2) a word that appears in the line
     * VALUEOUT (V2) the count for that occurrence, always 1
     */
    static class MyMapper extends
            Mapper<LongWritable, Text, Text, LongWritable> {
        protected void map(LongWritable k1, Text v1, Context context)
                throws java.io.IOException, InterruptedException {
            // Words in the input file are separated by tabs
            String[] splited = v1.toString().split("\t");
            for (String word : splited) {
                context.write(new Text(word), new LongWritable(1));
            }
        }
    }

    /**
     * KEYIN    (K2) a word that appears in the lines
     * VALUEIN  (V2) the counts emitted for that word
     * KEYOUT   (K3) each distinct word
     * VALUEOUT (V3) the total count of that distinct word
     */
    static class MyReducer extends
            Reducer<Text, LongWritable, Text, LongWritable> {
        protected void reduce(Text k2, java.lang.Iterable<LongWritable> v2s,
                Context ctx) throws java.io.IOException,
                InterruptedException {
            long times = 0L;
            for (LongWritable count : v2s) {
                times += count.get();
            }
            ctx.write(k2, new LongWritable(times));
        }
    }
}</pre>

3. After the job completes successfully, the result can be viewed from Linux

image

</div>

Serialization in MapReduce

<div class="mdContent">
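Characteristics of Hadoop serialization:

Characteristics of the serialization format:

1. Compact: uses storage space efficiently.
2. Fast: little overhead when reading and writing data.
3. Extensible: data written in older formats can still be read transparently.
4. Interoperable: supports interaction across multiple languages.

The main difference between Hadoop serialization and Java serialization: for complex object types, Hadoop serialization does not transmit the multi-level parent/child class relationships the way Java object serialization does. It transmits only the values of the fields that are needed, which greatly reduces network transfer overhead.

Role of Hadoop serialization: <div class="mdContent">

1. In a distributed environment, serialization is used for communication between processes and between nodes over the network.

2. Data transfer between Hadoop nodes:

Node 1: (serialize to binary data) -------> (binary stream message) Node 2: (deserialize binary data)

In MR, keys must be objects that implement the WritableComparable interface (values only need Writable); only such objects are Hadoop-serializable.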

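package com.feihao;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class StudentWritable implements WritableComparable<StudentWritable> {
    private String name;
    private int age;

    // Serialize the fields in a fixed order
    public void write(DataOutput out) throws IOException {
        out.writeUTF(this.name);
        out.writeInt(this.age);
    }

    // Deserialize the fields in the same order they were written
    public void readFields(DataInput in) throws IOException {
        this.name = in.readUTF();
        this.age = in.readInt();
    }

    // Order by name, then by age, so that sorting and grouping by key are well defined
    public int compareTo(StudentWritable o) {
        int byName = this.name.compareTo(o.name);
        return byName != 0 ? byName : Integer.compare(this.age, o.age);
    }
}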
</div>
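To see what the write/readFields pair does to an object, the round trip can be tried with `java.io` alone. This is a sketch under assumptions: `StudentRoundTripSketch` and its nested `Student` class are illustrative stand-ins that mirror the methods of a Writable without depending on Hadoop, and the in-memory byte array stands in for the network stream between two nodes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a Writable; only java.io is used.
final class StudentRoundTripSketch {

    static final class Student {
        String name;
        int age;

        Student() { }

        Student(String name, int age) {
            this.name = name;
            this.age = age;
        }

        // Only the field values are written: no class name, no class hierarchy.
        void write(DataOutput out) throws IOException {
            out.writeUTF(name);
            out.writeInt(age);
        }

        // Fields are read back in exactly the order they were written.
        void readFields(DataInput in) throws IOException {
            name = in.readUTF();
            age = in.readInt();
        }
    }

    /** "Node 1": turn the object into bytes. */
    static byte[] serialize(Student s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            s.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** "Node 2": rebuild the object from the bytes. */
    static Student deserialize(byte[] bytes) {
        Student s = new Student();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            s.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return s;
    }

    public static void main(String[] args) {
        byte[] wire = serialize(new Student("Tom", 20));
        Student copy = deserialize(wire);
        // 2-byte length + 3 bytes of "Tom" + 4-byte int = 9 bytes
        System.out.println(wire.length + " bytes -> " + copy.name + ", " + copy.age);
    }
}
```

The 9-byte payload illustrates the "compact" property: only the field values travel, which is the contrast with Java object serialization drawn above.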

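Combiner

I. Background of the Combiner

1.1 Review of the five steps of the Map phase

We have already covered the eight steps of MapReduce, five of which belong to the Map phase, as shown in the figure below:

map section

Step 1.5 is optional; it is the map-side reduction (combine) stage covered here. Now look again at the first counter screenshot from the previous post, "Counters and Custom Counters":

image

It contains two counters, Combine output records and Combine input records, and both are 0 because our code did not perform any map-side reduction.

1.2 Why map-side reduction is needed

The Hadoop framework uses the Mapper to turn data into <key,value> pairs, shuffles them between network nodes, and then uses the Reducer to process the data and produce the final output.

image

This process has at least two performance bottlenecks:

(1) With 1 billion input records, the Mapper generates 1 billion key-value pairs that travel over the network. If we only want the maximum value, though, each Mapper only needs to emit the largest value it has seen. This reduces network load and can also improve the program's efficiency considerably.

Summary: heavy use of network bandwidth lowers program efficiency.

(2) Take the country field of the US patent dataset as an illustration of data skew. Such data is far from uniformly distributed: because most patents come from the United States, most of the key-value pairs, both in the Mapper output and in the intermediate (shuffle) stage, end up on a single Reducer and overwhelm it, which greatly degrades performance.

Summary: overloading a single node lowers program performance.

So, is there a way to address both problems?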

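II. A first look at the Combiner

2.1 Enter the Combiner

In the MapReduce programming model there is an important component between the Mapper and the Reducer that addresses the bottlenecks above: the Combiner.

PS:

① Unlike the mapper and reducer, the combiner has no default implementation; it takes effect only when it is set explicitly in the job configuration.

② Not every job can use a combiner; only operations that are associative can. A combine operation has the shape opt(opt(1, 2, 3), opt(4, 5, 6)). If opt is a sum or a maximum, a combiner can be used; if it is a median, it cannot.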
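The associativity condition in ② can be checked numerically. The sketch below is plain Java with made-up names (`CombinerSafetySketch`, `sum`, `max`, `median`); it compares applying an operation to per-group partial results against applying it once to all values. The sample values are chosen so that the median case visibly disagrees.

```java
import java.util.Arrays;

// Hypothetical check of opt(opt(a...), opt(b...)) == opt(a..., b...).
final class CombinerSafetySketch {

    static double sum(double... xs) {
        return Arrays.stream(xs).sum();
    }

    static double max(double... xs) {
        return Arrays.stream(xs).max().getAsDouble();
    }

    /** Median: middle element, or the mean of the two middle elements. */
    static double median(double... xs) {
        double[] sorted = xs.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    public static void main(String[] args) {
        // Two map-side groups: {1, 1, 1} and {2, 3, 100}
        System.out.println(sum(sum(1, 1, 1), sum(2, 3, 100)) == sum(1, 1, 1, 2, 3, 100));          // true
        System.out.println(max(max(1, 1, 1), max(2, 3, 100)) == max(1, 1, 1, 2, 3, 100));          // true
        System.out.println(median(median(1, 1, 1), median(2, 3, 100)) == median(1, 1, 1, 2, 3, 100)); // false
    }
}
```

For the median, the combined partial results give 2.0 while the true median of all six values is 1.5, so using such an operation as a combiner would change the job's answer.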

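Each map may produce a large amount of local output. The Combiner merges the map-side output first, which reduces the amount of data transferred between the map and reduce nodes and improves network I/O performance. It is one of MapReduce's optimizations; its roles are described below.

(1) At its most basic, the Combiner aggregates keys locally: the map output is sorted by key and the values are iterated over, as follows:

map: (K1, V1) → list(K2, V2)
combine: (K2, list(V2)) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)

(2) The Combiner can also act as a local reduce (it is essentially a reduce). In Hadoop's bundled wordcount example and in a program that finds the maximum value, the combiner and the reduce are exactly the same, as follows:

map: (K1, V1) → list(K2, V2)
combine: (K2, list(V2)) → list(K3, V3)
reduce: (K3, list(V3)) → list(K4, V4)

PS: Without a combiner in wordcount, all of the aggregation is done by reduce, which is relatively inefficient. With a combiner, maps that finish first aggregate locally, which speeds things up. In the bundled wordcount example the value is just a number to be added up, so the summation can start as soon as a map finishes instead of waiting for every map to finish.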

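2.2 MapReduce with a Combiner

image

The code in the earlier posts omitted one step that can reduce the bandwidth a MapReduce job uses: the Combiner, which runs after the Mapper and before the Reducer. The Combiner is optional. If it suits your job, Combiner instances run on every node that runs map tasks. A Combiner takes the output of the Mapper instances on its node as input, and its output, rather than the Mapper output, is sent to the Reducers. The Combiner is a "mini reduce" that only processes data produced on a single machine.

2.3 Using MyReducer as the Combiner

Adding the following single line to the WordCount code from the earlier post enables the Combiner:

// Set the map-side reduction (Combiner)
job.setCombinerClass(MyReducer.class);

Using the file content below as the example again, let's see how the counters change.

(1) Content of the uploaded test file

hello edison
hello kevin

(2) Counter log output after the run

image

The Combine input records and Combine output records counters, which were both 0 before, have changed. The map output count equals the combine input count, and the combine output count equals the reduce input count. This shows that the reduction took place, at the end of the map side and before reduce.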

III. Defining your own Combiner

To understand more clearly how the Combiner works, we define our own Combiner class instead of reusing MyReducer as the Combiner. The code is walked through below.

3.1 Rewrite the map method of the Mapper class

public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    protected void map(LongWritable key, Text value,
            Mapper<LongWritable, Text, Text, LongWritable>.Context context)
            throws java.io.IOException, InterruptedException {
        String line = value.toString();
        String[] splited = line.split(" ");
        for (String word : splited) {
            context.write(new Text(word), new LongWritable(1L));
            // Print the Mapper's output pairs so the effect is visible
            System.out.println("Mapper output <" + word + "," + 1 + ">");
        }
    }
}

3.2 Rewrite the reduce method of the Reducer class

public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    protected void reduce(Text key,
            java.lang.Iterable<LongWritable> values,
            Reducer<Text, LongWritable, Text, LongWritable>.Context context)
            throws java.io.IOException, InterruptedException {
        // Printed once per call to reduce, i.e. once per k2 group
        System.out.println("Reducer input group <" + key.toString() + ",N(N>=1)>");
        long count = 0L;
        for (LongWritable value : values) {
            count += value.get();
            // Printed once per incoming <k2,v2> pair
            System.out.println("Reducer input pair <" + key.toString() + ","
                    + value.get() + ">");
        }
        context.write(key, new LongWritable(count));
    }
}

3.3 Add a MyCombiner class and override its reduce method

public static class MyCombiner extends Reducer<Text, LongWritable, Text, LongWritable> {
    protected void reduce(
            Text key,
            java.lang.Iterable<LongWritable> values,
            org.apache.hadoop.mapreduce.Reducer<Text, LongWritable, Text, LongWritable>.Context context)
            throws java.io.IOException, InterruptedException {
        // Printed once per call to the combine function, i.e. once per k2 group
        System.out.println("Combiner input group <" + key.toString() + ",N(N>=1)>");
        long count = 0L;
        for (LongWritable value : values) {
            count += value.get();
            // Printed once per incoming <k2,v2> pair
            System.out.println("Combiner input pair <" + key.toString() + ","
                    + value.get() + ">");
        }
        context.write(key, new LongWritable(count));
        // Printed once per outgoing <k2,v2> pair
        System.out.println("Combiner output pair <" + key.toString() + "," + count + ">");
    }
}

3.4 Add the code that sets the Combiner

// Set the map-side reduction (Combiner)
job.setCombinerClass(MyCombiner.class);

3.5 Console output from a debug run

(1) Mapper

Mapper output <hello,1>
Mapper output <edison,1>
Mapper output <hello,1>
Mapper output <kevin,1>

(2) Combiner

Combiner input group <edison,N(N>=1)>
Combiner input pair <edison,1>
Combiner output pair <edison,1>
Combiner input group <hello,N(N>=1)>
Combiner input pair <hello,1>
Combiner input pair <hello,1>
Combiner output pair <hello,2>
Combiner input group <kevin,N(N>=1)>
Combiner input pair <kevin,1>
Combiner output pair <kevin,1></pre>

The Combiner has performed a local reduce here, which lightens the merging load on the remote reduce node.

(3) Reducer

Reducer input group <edison,N(N>=1)>
Reducer input pair <edison,1>
Reducer input group <hello,N(N>=1)>
Reducer input pair <hello,2>
Reducer input group <kevin,N(N>=1)>
Reducer input pair <kevin,1>

For hello, the Reducer finishes the merge with a single input pair.

For comparison, here is the console output without the Combiner:

(1) Mapper

Mapper output <hello,1>
Mapper output <edison,1>
Mapper output <hello,1>
Mapper output <kevin,1>

(2) Reducer

Reducer input group <edison,N(N>=1)>
Reducer input pair <edison,1>
Reducer input group <hello,N(N>=1)>
Reducer input pair <hello,1>
Reducer input pair <hello,1>
Reducer input group <kevin,N(N>=1)>
Reducer input pair <kevin,1>

Without the Combiner, all merging of hello is done on the Reducer node, which is why hello appears twice among the Reducer's input pairs.

Summary: the console output shows that combine simply merges the two identical hello records, so the reduce receives <hello,2>. On a real Hadoop cluster, MapReduce runs across many hosts. With a combine step, each host first reduces its own local data before the cluster-wide reduce runs, which saves a great deal of reduce time and speeds up the MapReduce job.
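The effect on the shuffle can be counted in a small simulation. This is a sketch in plain Java rather than Hadoop code; `CombinerEffectSketch`, `combine`, and `shuffledRecords` are hypothetical names, and each inner list stands for the lines handled by one map task.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model: counts the <k,v> records that leave the map side.
final class CombinerEffectSketch {

    /** Local aggregation for ONE map task: word -> count over that task's lines only. */
    static Map<String, Long> combine(List<String> linesOfOneTask) {
        Map<String, Long> local = new TreeMap<>();
        for (String line : linesOfOneTask) {
            for (String word : line.split(" ")) {
                local.merge(word, 1L, Long::sum);
            }
        }
        return local;
    }

    /** Number of records shuffled to the reducers, with or without a combiner. */
    static int shuffledRecords(List<List<String>> mapTasks, boolean useCombiner) {
        int records = 0;
        for (List<String> task : mapTasks) {
            if (useCombiner) {
                records += combine(task).size();          // one record per distinct word in the task
            } else {
                for (String line : task) {
                    records += line.split(" ").length;    // one record per word occurrence
                }
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<List<String>> oneTask = Arrays.asList(Arrays.asList("hello edison", "hello kevin"));
        List<List<String>> twoTasks = Arrays.asList(
                Arrays.asList("hello edison"), Arrays.asList("hello kevin"));
        System.out.println(shuffledRecords(oneTask, false));  // 4
        System.out.println(shuffledRecords(oneTask, true));   // 3
        System.out.println(shuffledRecords(twoTasks, true));  // 4: each task sees hello only once
    }
}
```

The last case reflects the point made earlier that a combiner only merges data produced by a single map task: when the two hello records come from different tasks, both still reach the reducer.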
</div>

Partitioning

<div class='mdContent'>
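Partitioner in the old API

The Partitioner splits the intermediate results produced by the Mapper so that data belonging to the same group is handled by the same Reducer; it directly affects load balancing in the Reduce phase. The figure shows the class diagram of the Partitioner in the old API. It extends JobConfigurable and can be initialized through the configure method. It declares a single method to implement, getPartition, whose three parameters are all passed in by the framework: the first two are the key and value, and the third, numPartitions, is the number of partitions per Mapper, which equals the number of Reducers.

image

MapReduce provides two Partitioner implementations: HashPartitioner and TotalOrderPartitioner. HashPartitioner is the default and partitions by hash value, as follows: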

<pre style="margin: 0px; padding: 0px; white-space: pre-wrap; word-wrap: break-word; font-family: Consolas, "Courier New", 宋體, Courier, mono, serif; font-size: 12px !important; line-height: 1;">public int getPartition(K2 key, V2 value, int numReduceTasks) {
return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}</pre>
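The `& Integer.MAX_VALUE` in this method is what keeps the result non-negative. The sketch below is plain Java with hypothetical names (`HashPartitionSketch`, `partition`, `naivePartition`); it contrasts the masked form with a naive `Math.abs` variant on a key whose hash code is `Integer.MIN_VALUE`.

```java
// Hypothetical comparison of two ways to turn a hash code into a partition index.
final class HashPartitionSketch {

    /** Same arithmetic as the snippet above: clear the sign bit, then take the remainder. */
    static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Naive variant: Math.abs(Integer.MIN_VALUE) is still negative, so this can go wrong. */
    static int naivePartition(Object key, int numReduceTasks) {
        return Math.abs(key.hashCode()) % numReduceTasks;
    }

    public static void main(String[] args) {
        Integer worstCase = Integer.MIN_VALUE;   // Integer.hashCode() returns the value itself
        System.out.println(partition(worstCase, 3));       // 0
        System.out.println(naivePartition(worstCase, 3));  // -2, not a valid partition index
        // Equal keys always land in the same partition
        System.out.println(partition("hello", 3) == partition("hello", 3)); // true
    }
}
```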

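TotalOrderPartitioner provides a range-based partitioning method and is typically used for total-order sorting. In MapReduce, the obvious approach to a total sort is a merge sort: each Map Task sorts locally in the Map phase, and a single Reduce Task performs the global sort in the Reduce phase. Because the job can then have only one Reduce Task, the Reduce phase becomes the bottleneck. To improve the performance and scalability of global sorting, MapReduce provides TotalOrderPartitioner. It divides the data by value into several ranges (partitions) and guarantees that every record in a later range is greater than every record in the preceding range. A total sort then proceeds as follows:
Step 1: Data sampling. The client samples the data to obtain the split points. Hadoop ships with several sampling algorithms, such as IntervalSampler, RandomSampler, and SplitSampler (see the InputSampler class in the org.apache.hadoop.mapred.lib package). An example follows.
Sampled data: b, abc, abd, bcd, abcd, efg, hii, afd, rrr, mnk
After sorting: abc, abcd, abd, afd, b, bcd, efg, hii, mnk, rrr
If there are 4 Reduce Tasks, the quartile points of the sample are abd, bcd, and mnk, and these 3 strings are used as the split points.
Step 2: Map phase. This phase involves two components, the Mapper and the Partitioner. The Mapper can be IdentityMapper, which emits its input unchanged, but the Partitioner must be TotalOrderPartitioner. It stores the split points from step 1 in a trie so that the range of any record can be located quickly. Each Map Task thus produces R ranges (R being the number of Reduce Tasks), and the ranges are ordered relative to one another. TotalOrderPartitioner looks up the Reduce Task number of each record through the trie. As shown in the figure, the split points are stored in a trie of depth 2. Given the two input strings "abg" and "mnz", "abg" maps to partition1, the 2nd Reduce Task, and "mnz" maps to partition3, the 4th Reduce Task.

image

Step 3: Reduce phase. Each Reducer sorts the range assigned to it, and together the outputs form the totally sorted data. As these steps show, the efficiency of a total sort based on TotalOrderPartitioner depends directly on the key distribution and the sampling algorithm: the more evenly the keys are distributed and the more representative the sample, the better balanced the Reduce Tasks and the more efficient the sort. TotalOrderPartitioner has two typical applications: TeraSort and HBase bulk loading. TeraSort is an example application that ships with Hadoop. It once took first place in a terabyte-scale sort benchmark, and TotalOrderPartitioner was distilled from it. HBase is a NoSQL data store built on top of Hadoop. It divides data into Regions; data inside a Region is ordered by key, and the Regions are ordered relative to one another. The R output files of a MapReduce total-sort job therefore map directly onto R HBase Regions.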
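The three steps can be followed end to end in a small in-memory sketch. It is not Hadoop code: `TotalOrderSketch` is a made-up class, the split points are taken from the example above rather than computed by a sampler, and a simple linear scan over the split points replaces the trie lookup. A key equal to a split point is placed in the upper range, which is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical in-memory model of range partitioning followed by per-range sorting.
final class TotalOrderSketch {

    /** Partition index = number of split points that are <= key (linear scan, no trie). */
    static int partitionOf(String key, String[] sortedSplitPoints) {
        int partition = 0;
        for (String split : sortedSplitPoints) {
            if (split.compareTo(key) <= 0) {
                partition++;
            }
        }
        return partition;
    }

    /** Step 2 (route each record to its range) and step 3 (sort each range, then concatenate). */
    static List<String> totalSort(List<String> records, String[] sortedSplitPoints) {
        List<List<String>> ranges = new ArrayList<>();
        for (int i = 0; i <= sortedSplitPoints.length; i++) {
            ranges.add(new ArrayList<>());
        }
        for (String record : records) {
            ranges.get(partitionOf(record, sortedSplitPoints)).add(record);
        }
        List<String> output = new ArrayList<>();
        for (List<String> range : ranges) {
            Collections.sort(range);   // each "reducer" sorts only its own range
            output.addAll(range);      // ranges are already ordered relative to each other
        }
        return output;
    }

    public static void main(String[] args) {
        String[] splits = {"abd", "bcd", "mnk"};
        System.out.println(partitionOf("abg", splits)); // 1
        System.out.println(partitionOf("mnz", splits)); // 3
        System.out.println(totalSort(
                Arrays.asList("b", "abc", "abd", "bcd", "abcd", "efg", "hii", "afd", "rrr", "mnk"),
                splits));
        // [abc, abcd, abd, afd, b, bcd, efg, hii, mnk, rrr]
    }
}
```

Because every key in range i is smaller than every key in range i+1, concatenating the independently sorted ranges yields the global order without a single-reducer merge.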

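Partitioner in the new API

The figure shows the class diagram of the Partitioner in the new API. It no longer implements the JobConfigurable interface. When a Partitioner needs to be initialized from a JobConf object, it can implement the Configurable interface itself, for example:

image

Where partitioning sits

image

The main job of partitioning is to send each map result to the appropriate reduce. This places two requirements on it:

1) Load balance: distribute work across the reducers as evenly as possible.

2) Efficiency: assigning a partition must be fast.

Partitioners provided by MapReduce

image

Partitioner class structure

1. Partitioner<k,v> is the base type of all partitioners; a custom partitioner must also derive from it. The source code is as follows: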

/**
 * <p><code>Partitioner</code> controls the partitioning of the keys of the
 * intermediate map-outputs. The key (or a subset of the key) is used to derive
 * the partition, typically by a hash function. The total number of partitions
 * is the same as the number of reduce tasks for the job. Hence this controls
 * which of the <code>m</code> reduce tasks the intermediate key (and hence the
 * record) is sent for reduction.</p>
 *
 * @see Reducer
 * @deprecated Use {@link org.apache.hadoop.mapreduce.Partitioner} instead.
 */
@Deprecated
public interface Partitioner<K2, V2> extends JobConfigurable {

  /**
   * Get the paritition number for a given key (hence record) given the total
   * number of partitions i.e. number of reduce-tasks for the job.
   *
   * <p>Typically a hash function on a all or a subset of the key.</p>
   *
   * @param key the key to be paritioned.
   * @param value the entry value.
   * @param numPartitions the total number of partitions.
   * @return the partition number for the <code>key</code>.
   */
  int getPartition(K2 key, V2 value, int numPartitions);
}</pre>

2. HashPartitioner<k,v> is MapReduce's default partitioner. The source code is as follows:

<pre style="margin: 0px; padding: 0px; white-space: pre-wrap; word-wrap: break-word; font-family: Consolas, "Courier New", 宋體, Courier, mono, serif; font-size: 12px !important; line-height: 1;">package org.apache.hadoop.mapreduce.lib.partition;

import org.apache.hadoop.mapreduce.Partitioner;

/** Partition keys by their {@link Object#hashCode()}. */
public class HashPartitioner<K, V> extends Partitioner<K, V> {

  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}</pre>

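3. BinaryPartitioner extends Partitioner<BinaryComparable, V> and is a partially specialized subclass of Partitioner<k,v>. It provides leftOffset and rightOffset; when computing which reducer a record goes to, it hashes only the [leftOffset, rightOffset] range of the key K.

4. KeyFieldBasedPartitioner<K2, V2> is also a hash-based partitioner. Unlike BinaryPartitioner, it supports multiple ranges for computing the hash. When the number of ranges is 0, KeyFieldBasedPartitioner degenerates into HashPartitioner. The source code is as follows: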

<pre style="margin: 0px; padding: 0px; white-space: pre-wrap; word-wrap: break-word; font-family: Consolas, "Courier New", 宋體, Courier, mono, serif; font-size: 12px !important; line-height: 1;">package org.apache.hadoop.mapred.lib;

import java.io.UnsupportedEncodingException;
import java.util.List;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;
import org.apache.hadoop.mapred.lib.KeyFieldHelper.KeyDescription;

/**
 * Defines a way to partition keys based on certain key fields (also see
 * {@link KeyFieldBasedComparator}.
 * The key specification supported is of the form -k pos1[,pos2], where,
 * pos is of the form f[.c][opts], where f is the number
 * of the key field to use, and c is the number of the first character from
 * the beginning of the field. Fields and character posns are numbered
 * starting with 1; a character position of zero in pos2 indicates the
 * field's last character. If '.c' is omitted from pos1, it defaults to 1
 * (the beginning of the field); if omitted from pos2, it defaults to 0
 * (the end of the field).
 */
public class KeyFieldBasedPartitioner<K2, V2> implements Partitioner<K2, V2> {
  private static final Log LOG = LogFactory.getLog(KeyFieldBasedPartitioner.class.getName());
  private int numOfPartitionFields;
  private KeyFieldHelper keyFieldHelper = new KeyFieldHelper();

  public void configure(JobConf job) {
    String keyFieldSeparator = job.get("map.output.key.field.separator", "\t");
    keyFieldHelper.setKeyFieldSeparator(keyFieldSeparator);
    if (job.get("num.key.fields.for.partition") != null) {
      LOG.warn("Using deprecated num.key.fields.for.partition. " +
          "Use mapred.text.key.partitioner.options instead");
      this.numOfPartitionFields = job.getInt("num.key.fields.for.partition",0);
      keyFieldHelper.setKeyFieldSpec(1,numOfPartitionFields);
    } else {
      String option = job.getKeyFieldPartitionerOption();
      keyFieldHelper.parseOption(option);
    }
  }

  public int getPartition(K2 key, V2 value, int numReduceTasks) {
    byte[] keyBytes;
    List <KeyDescription> allKeySpecs = keyFieldHelper.keySpecs();
    if (allKeySpecs.size() == 0) {
      return getPartition(key.toString().hashCode(), numReduceTasks);
    }
    try {
      keyBytes = key.toString().getBytes("UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new RuntimeException("The current system does not " +
          "support UTF-8 encoding!", e);
    }
    // return 0 if the key is empty
    if (keyBytes.length == 0) {
      return 0;
    }
    int []lengthIndicesFirst = keyFieldHelper.getWordLengths(keyBytes, 0,
        keyBytes.length);
    int currentHash = 0;
    for (KeyDescription keySpec : allKeySpecs) {
      int startChar = keyFieldHelper.getStartOffset(keyBytes, 0, keyBytes.length,
          lengthIndicesFirst, keySpec);
      // no key found! continue
      if (startChar < 0) {
        continue;
      }
      int endChar = keyFieldHelper.getEndOffset(keyBytes, 0, keyBytes.length,
          lengthIndicesFirst, keySpec);
      currentHash = hashCode(keyBytes, startChar, endChar,
          currentHash);
    }
    return getPartition(currentHash, numReduceTasks);
  }

  protected int hashCode(byte[] b, int start, int end, int currentHash) {
    for (int i = start; i <= end; i++) {
      currentHash = 31*currentHash + b[i];
    }
    return currentHash;
  }

  protected int getPartition(int hash, int numReduceTasks) {
    return (hash & Integer.MAX_VALUE) % numReduceTasks;
  }
}</pre>

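5. TotalOrderPartitioner makes it possible to produce totally ordered output. Unlike the three partitioners above, it is not hash-based. It is described in detail below.

The TotalOrderPartitioner class

By default the output of each reducer is sorted, but when the input is unordered there is no ordering across reducers. TotalOrderPartitioner is what you use to get totally ordered output.

To use TotalOrderPartitioner you must supply it with a partition file. The file must contain exactly (number of reducers - 1) keys, known as the split points, in ascending order. Why such a file is needed, and its details, are covered further below.

TotalOrderPartitioner handles keys in two ways, depending on their data type:

1) For keys that are not BinaryComparable, TotalOrderPartitioner uses binary search to find the index at which the current key falls.

For example, with 5 reducers the partition file provides the 4 split points [2, 4, 6, 8]. For the key/value pair <4, "good">, binary search finds index = 1, and index + 1 = 2, so the pair is sent to the reducer for partition 2. For the pair <4.5, "good">, binary search returns -3; adding 1 to -3 and negating the result also gives 2, which is the partition this pair goes to.

For numeric data, binary search has complexity O(log(reducer count)), which is fast.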
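The index arithmetic in this example follows directly from the return convention of `java.util.Arrays.binarySearch`, which returns the index when the key is found and `-(insertionPoint) - 1` otherwise. The sketch below uses a hypothetical class `SplitLookupSketch` with the split points from the example, and shows both the raw search result and the partition derived from it.

```java
import java.util.Arrays;

// Hypothetical lookup over the split points [2, 4, 6, 8] from the example.
final class SplitLookupSketch {

    static final double[] SPLITS = {2, 4, 6, 8};

    /** Raw result: the index if found, otherwise -(insertion point) - 1. */
    static int rawSearch(double key) {
        return Arrays.binarySearch(SPLITS, key);
    }

    /** Add 1 to the raw result, and negate it if it is negative. */
    static int partitionOf(double key) {
        int pos = rawSearch(key) + 1;
        return pos < 0 ? -pos : pos;
    }

    public static void main(String[] args) {
        System.out.println(rawSearch(4) + " -> " + partitionOf(4));     // 1 -> 2
        System.out.println(rawSearch(4.5) + " -> " + partitionOf(4.5)); // -3 -> 2
        System.out.println(rawSearch(1) + " -> " + partitionOf(1));     // -1 -> 0
        System.out.println(rawSearch(9) + " -> " + partitionOf(9));     // -5 -> 4
    }
}
```

With 4 split points the result is always in the range 0 to 4, one partition per reducer.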

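2) For BinaryComparable keys (which can simply be thought of as strings). Strings can also be sorted, in lexicographic order.

Split points can therefore be given for strings too, so that different string keys are assigned to different reducers. The handling is similar to that for numeric types.

For example, with 5 reducers the partition file provides the 4 split points ["abc", "bce", "eaa", "fhc"]. The string "ab" is assigned to the first reducer because it is smaller than the first split point, "abc".

Unlike numeric data, however, strings cannot be searched and compared the way numbers are. MapReduce uses a trie for string lookup (for background on tries, see "Trie Tree"). The lookup time complexity is O(m), where m is the depth of the tree, and the space complexity is O(255^m-1). It is a typical case of trading space for time.

Building the trie

Assume the maximum depth of the tree is 3 and the split points are [aaad, aaaf, aaaeh, abbx]

image

The trie in MapReduce consists mainly of two kinds of nodes:

1) InnerTrieNode
In MapReduce an InnerTrieNode is a fairly long array covering 255 characters. The example in the figure above contains only the 26 English letters.
2) Leaf nodes {UnsplitTrieNode, SinglySplitTrieNode, LeafTrieNode}
UnsplitTrieNode is a leaf node that contains no split point.
SinglySplitTrieNode is a leaf node that contains exactly one split point.
LeafTrieNode is a leaf node that contains several split points. (This only happens when the maximum depth of the tree is reached and is rare in practice.)

Searching the trie

Continuing the example above:
1) Given the key/value pair <aad, 10>, the search reaches the LeafTrieNode in the figure. Binary search continues inside that node and returns the index of aad in the split-point array. If the key is not found, the index of the closest split point is returned.
2) If the search reaches a SinglySplitTrieNode, it returns that node's index when the key is equal to or smaller than the node's split point, and index + 1 when the key is larger.
3) If the search reaches an UnsplitTrieNode, it returns the preceding index. For example, <zaa, 20> returns the index of abbx in the split-point array.

Open question about TotalOrderPartitioner

As described above, a partitioner has two requirements: speed and load balance. The trie speeds up the search, but how do we obtain a partition file whose split points happen to balance the load?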

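InputSampler
The input sampling class samples the data under the input directory. It provides three sampling methods.

image

Sampler class diagram

Comparison of the sampling methods:

| Class | Sampling method | Constructor arguments | Efficiency | Notes |
|---|---|---|---|---|
| SplitSampler<K,V> | Samples the first n records | Total number of samples, number of splits | Highest | |
| RandomSampler<K,V> | Scans all the data and samples at random | Sampling frequency, total number of samples, number of splits | Lowest | |
| IntervalSampler<K,V> | Samples at a fixed interval | Sampling frequency, number of splits | Medium | Well suited to sorted data |

The writePartitionFile method is the key piece. It takes the samples provided by the sampler class, sorts them, and then selects (number of reducers - 1) of them, taken at evenly spaced positions in the sorted samples, and writes them to the partition file. With split points derived from sampled data, each range holds roughly the same number of key/value pairs, which is what balances the load.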

The source code of the SplitSampler class is as follows:

/**
 * Inexpensive way to sample random data.
 */
public static class SplitSampler<K,V> implements Sampler<K,V> {
  private final int numSamples;
  private final int maxSplitsSampled;

  /**
   * Create a SplitSampler sampling <em>all</em> splits.
   * Takes the first numSamples / numSplits records from each split.
   * @param numSamples Total number of samples to obtain from all selected
   *                   splits.
   */
  public SplitSampler(int numSamples) {
    this(numSamples, Integer.MAX_VALUE);
  }

  /**
   * Create a new SplitSampler.
   * @param numSamples Total number of samples to obtain from all selected
   *                   splits.
   * @param maxSplitsSampled The maximum number of splits to examine.
   */
  public SplitSampler(int numSamples, int maxSplitsSampled) {
    this.numSamples = numSamples;
    this.maxSplitsSampled = maxSplitsSampled;
  }

  /**
   * From each split sampled, take the first numSamples / numSplits records.
   */
  @SuppressWarnings("unchecked") // ArrayList::toArray doesn't preserve type
  public K[] getSample(InputFormat<K,V> inf, JobConf job) throws IOException {
    InputSplit[] splits = inf.getSplits(job, job.getNumMapTasks());
    ArrayList<K> samples = new ArrayList<K>(numSamples);
    int splitsToSample = Math.min(maxSplitsSampled, splits.length);
    int splitStep = splits.length / splitsToSample;
    int samplesPerSplit = numSamples / splitsToSample;
    long records = 0;
    for (int i = 0; i < splitsToSample; ++i) {
      RecordReader<K,V> reader = inf.getRecordReader(splits[i * splitStep],
          job, Reporter.NULL);
      K key = reader.createKey();
      V value = reader.createValue();
      while (reader.next(key, value)) {
        samples.add(key);
        key = reader.createKey();
        ++records;
        if ((i+1) * samplesPerSplit <= records) {
          break;
        }
      }
      reader.close();
    }
    return (K[])samples.toArray();
  }
}</pre>

The source code of the RandomSampler class is as follows:

/**
 * General-purpose sampler. Takes numSamples / maxSplitsSampled inputs from
 * each split.
 */
public static class RandomSampler<K,V> implements Sampler<K,V> {
  private double freq;
  private final int numSamples;
  private final int maxSplitsSampled;

  /**
   * Create a new RandomSampler sampling <em>all</em> splits.
   * This will read every split at the client, which is very expensive.
   * @param freq Probability with which a key will be chosen.
   * @param numSamples Total number of samples to obtain from all selected
   *                   splits.
   */
  public RandomSampler(double freq, int numSamples) {
    this(freq, numSamples, Integer.MAX_VALUE);
  }

  /**
   * Create a new RandomSampler.
   * @param freq Probability with which a key will be chosen.
   * @param numSamples Total number of samples to obtain from all selected
   *                   splits.
   * @param maxSplitsSampled The maximum number of splits to examine.
   */
  public RandomSampler(double freq, int numSamples, int maxSplitsSampled) {
    this.freq = freq;
    this.numSamples = numSamples;
    this.maxSplitsSampled = maxSplitsSampled;
  }

  /**
   * Randomize the split order, then take the specified number of keys from
   * each split sampled, where each key is selected with the specified
   * probability and possibly replaced by a subsequently selected key when
   * the quota of keys from that split is satisfied.
   */
  @SuppressWarnings("unchecked") // ArrayList::toArray doesn't preserve type
  public K[] getSample(InputFormat<K,V> inf, JobConf job) throws IOException {
    InputSplit[] splits = inf.getSplits(job, job.getNumMapTasks());
    ArrayList<K> samples = new ArrayList<K>(numSamples);
    int splitsToSample = Math.min(maxSplitsSampled, splits.length);

    Random r = new Random();
    long seed = r.nextLong();
    r.setSeed(seed);
    LOG.debug("seed: " + seed);
    // shuffle splits
    for (int i = 0; i < splits.length; ++i) {
      InputSplit tmp = splits[i];
      int j = r.nextInt(splits.length);
      splits[i] = splits[j];
      splits[j] = tmp;
    }
    // our target rate is in terms of the maximum number of sample splits,
    // but we accept the possibility of sampling additional splits to hit
    // the target sample keyset
    for (int i = 0; i < splitsToSample || (i < splits.length && samples.size() < numSamples); ++i) {
      RecordReader<K,V> reader = inf.getRecordReader(splits[i], job,
          Reporter.NULL);
      K key = reader.createKey();
      V value = reader.createValue();
      while (reader.next(key, value)) {
        if (r.nextDouble() <= freq) {
          if (samples.size() < numSamples) {
            samples.add(key);
          } else {
            // When exceeding the maximum number of samples, replace a
            // random element with this one, then adjust the frequency
            // to reflect the possibility of existing elements being
            // pushed out
            int ind = r.nextInt(numSamples);
            if (ind != numSamples) {
              samples.set(ind, key);
            }
            freq *= (numSamples - 1) / (double) numSamples;
          }
          key = reader.createKey();
        }
      }
      reader.close();
    }
    return (K[])samples.toArray();
  }
}</pre>

The source code of the IntervalSampler class is:

/**
 * Useful for sorted data.
 */
public static class IntervalSampler<K,V> implements Sampler<K,V> {
  private final double freq;
  private final int maxSplitsSampled;

  /**
   * Create a new IntervalSampler sampling <em>all</em> splits.
   * @param freq The frequency with which records will be emitted.
   */
  public IntervalSampler(double freq) {
    this(freq, Integer.MAX_VALUE);
  }

  /**
   * Create a new IntervalSampler.
   * @param freq The frequency with which records will be emitted.
   * @param maxSplitsSampled The maximum number of splits to examine.
   * @see #getSample
   */
  public IntervalSampler(double freq, int maxSplitsSampled) {
    this.freq = freq;
    this.maxSplitsSampled = maxSplitsSampled;
  }

  /**
   * For each split sampled, emit when the ratio of the number of records
   * retained to the total record count is less than the specified
   * frequency.
   */
  @SuppressWarnings("unchecked") // ArrayList::toArray doesn't preserve type
  public K[] getSample(InputFormat<K,V> inf, JobConf job) throws IOException {
    InputSplit[] splits = inf.getSplits(job, job.getNumMapTasks());
    ArrayList<K> samples = new ArrayList<K>();
    int splitsToSample = Math.min(maxSplitsSampled, splits.length);
    int splitStep = splits.length / splitsToSample;
    long records = 0;
    long kept = 0;
    for (int i = 0; i < splitsToSample; ++i) {
      RecordReader<K,V> reader = inf.getRecordReader(splits[i * splitStep],
          job, Reporter.NULL);
      K key = reader.createKey();
      V value = reader.createValue();
      while (reader.next(key, value)) {
        ++records;
        if ((double) kept / records < freq) {
          ++kept;
          samples.add(key);
          key = reader.createKey();
        }
      }
      reader.close();
    }
    return (K[])samples.toArray();
  }
}</pre>
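The `kept / records < freq` rule is what produces fixed-interval behaviour. The sketch below isolates that rule in plain Java; `IntervalSamplingSketch` and `sample` are hypothetical names, and a plain list stands in for the records of a split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone version of the interval rule: keep a record whenever
// the fraction kept so far has dropped below the target frequency.
final class IntervalSamplingSketch {

    static <T> List<T> sample(List<T> records, double freq) {
        List<T> samples = new ArrayList<>();
        long seen = 0;
        long kept = 0;
        for (T record : records) {
            ++seen;
            if ((double) kept / seen < freq) {
                ++kept;
                samples.add(record);
            }
        }
        return samples;
    }

    public static void main(String[] args) {
        List<Integer> sorted = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12);
        // prints [1, 5, 9]: one record out of every four
        System.out.println(sample(sorted, 0.25));
    }
}
```

On sorted input the kept records are spread evenly across the key range, which is why the comparison table calls this sampler well suited to sorted data.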

The complete source code of the InputSampler class is as follows:

image

InputSampler

TotalOrderPartitioner example

<pre style="margin: 0px; padding: 0px; white-space: pre-wrap; word-wrap: break-word; font-family: Consolas, "Courier New", 宋體, Courier, mono, serif; font-size: 12px !important; line-height: 1;">public class SortByTemperatureUsingTotalOrderPartitioner extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    JobConf conf = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (conf == null) {
      return -1;
    }
    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setOutputKeyClass(IntWritable.class);
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    SequenceFileOutputFormat.setCompressOutput(conf, true);
    SequenceFileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(conf, CompressionType.BLOCK);

    conf.setPartitionerClass(TotalOrderPartitioner.class);
    InputSampler.Sampler<IntWritable, Text> sampler =
        new InputSampler.RandomSampler<IntWritable, Text>(0.1, 10000, 10);
    Path input = FileInputFormat.getInputPaths(conf)[0];
    input = input.makeQualified(input.getFileSystem(conf));
    Path partitionFile = new Path(input, "_partitions");
    TotalOrderPartitioner.setPartitionFile(conf, partitionFile);
    InputSampler.writePartitionFile(conf, sampler);

    // Add to DistributedCache
    URI partitionUri = new URI(partitionFile.toString() + "#_partitions");
    DistributedCache.addCacheFile(partitionUri, conf);
    DistributedCache.createSymlink(conf);
    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(
        new SortByTemperatureUsingTotalOrderPartitioner(), args);
    System.exit(exitCode);
  }
}</pre>

</div>
