HBase is a NoSQL database built on top of Hadoop and known for its high write throughput. With the rapid rise of big data in recent years, that throughput and its fast retrieval have made HBase more and more popular. Even so, a full-volume load through the normal write path will sooner or later run into a compaction/split storm. Since region data ultimately lives as files on HDFS, the project recommends another way to load data quickly: bulk load.
Two links first:
http://www.cloudera.com/documentation/enterprise/5-3-x/topics/admin_hbase_import.html
http://hbase.apache.org/book.html#arch.bulk.load
The first link above is the Cloudera (CDH) documentation; the second is the official HBase reference guide.
BulkLoad is an excellent mechanism: it lets us load data into an HBase cluster quickly while sidestepping the problems above.
The principle and workflow of BulkLoad:
- Generate HBase's underlying HFiles from data already on HDFS (or from an external source).
- Use the BulkLoad tool that ships with HBase to move the generated HFiles into the table's directories.
In short: source data -> MapReduce job -> HFiles -> bulk load into the target table.
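Once the HFiles exist, the second step by itself comes down to a handful of calls. The sketch below only illustrates that step; the table name and HFile directory are placeholders, not values from this article:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadExistingHFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Placeholder table name and HFile directory; replace with your own.
        HTable table = new HTable(conf, "my_table");
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
        // Hands the HFiles under /tmp/hfiles over to the region servers,
        // which adopt them directly instead of writing through the memstore.
        loader.doBulkLoad(new Path("/tmp/hfiles"), table);
        table.close();
    }
}
```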
Recently I needed to compute, for a given job posting, its page-view count over a period of time. The full code is as follows:
```java
import java.io.IOException;
import java.net.URLDecoder;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class JobBrowseCountToHbase {

    // Working directory on HDFS where the HFiles for the bulk load are generated.
    private static final String bulkPath = "/home/xxx/xxx/job_browse_bulkload";

    public static class JobBrowseMapper extends Mapper<Object, Text, Text, IntWritable> {

        public static String fileName = null;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            /* if (context.getInputSplit() instanceof FileSplit) {
                FileSplit split = (FileSplit) context.getInputSplit();
                Path filePath = split.getPath();
                fileName = filePath.getName();
                System.out.println("splitPath is " + filePath.toString() + " splitName is " + fileName);
            } */
        }

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String itermContent = value.toString();
            String[] iterms = itermContent.split("\t");
            /* StringBuffer buff = new StringBuffer();
            for (String it : iterms) {
                buff.append(it).append("###");
            }
            System.out.println(buff.toString()); */
            // Keep only "pv" records that carry both a company id and a landing URL.
            if (iterms.length >= 26 && iterms[iterms.length - 1] != null
                    && iterms[iterms.length - 1].equals("pv")
                    && isNotBlank(iterms[iterms.length - 2])) {
                String urlUndecoded = iterms[25];
                if (isNotBlank(urlUndecoded)) {
                    String urlDecoded = URLDecoder.decode(urlUndecoded, "UTF-8");
                    if (isNotBlank(urlDecoded)
                            && (urlDecoded.matches("^(http://www.xxx.com/job/){1}[0-9a-z.]*")
                                || urlDecoded.matches("^(http://m.xxx.com/job/){1}[0-9a-z.]*"))) {
                        String[] pathContent = urlDecoded.split("/");
                        String itermUrl = pathContent[pathContent.length - 1];
                        if (itermUrl != null) {
                            String[] jobs = itermUrl.split("\\.");
                            String jobId = jobs[0];
                            String comId = iterms[iterms.length - 2];
                            if (isNotBlank(comId) && !comId.equals("-")) {
                                String addtimeAll = iterms[19];
                                if (isNotBlank(addtimeAll)) {
                                    String addtime = addtimeAll.substring(0, 8);
                                    // Redundant rowkeys: one ordered by date, one by job id,
                                    // so a single table can serve both query dimensions.
                                    String rowkey1 = comId + "_" + "1" + "_" + addtime + "_" + jobId;
                                    String rowkey2 = comId + "_" + "2" + "_" + jobId + "_" + addtime;
                                    context.write(new Text(rowkey1), new IntWritable(1));
                                    context.write(new Text(rowkey2), new IntWritable(1));
                                }
                            }
                        }
                    }
                }
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
        }

        public boolean isNotBlank(String content) {
            return content != null && !"".equals(content);
        }
    }
    public static class JobBrowseReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
        }

        @Override
        protected void reduce(Text text, Iterable<IntWritable> iterator, Context context)
                throws IOException, InterruptedException {
            if (text != null && !"".equals(text.toString())) {
                int browseCount = 0;
                for (IntWritable count : iterator) {
                    browseCount += count.get();
                }
                context.write(text, new IntWritable(browseCount));
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
        }
    }
    public static class LoadJobBrowseToHbaseMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line is the first job's output: <rowkey>\t<count>.
            String content = value.toString();
            String[] kvContents = content.split("\t");
            if (kvContents.length >= 2) {
                byte[] rowkey = Bytes.toBytes(kvContents[0]);
                ImmutableBytesWritable rowKeyWritable = new ImmutableBytesWritable(rowkey);
                // Store the count as a string value in column info:count.
                KeyValue kv = new KeyValue(rowkey, Bytes.toBytes("info"), Bytes.toBytes("count"),
                        Bytes.toBytes(kvContents[1]));
                context.write(rowKeyWritable, kv);
            }
        }
    }
    public static void main(String[] args) throws Exception {
        String tableName = "hdp_xxx:xxx_job_browse";
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: JobBrowseCountToHbase <in> [<in>...] <out>");
            System.exit(2);
        }
        conf.set("mapreduce.job.queuename", "root.offline.xxx.normal");

        // Job 1: count page views per rowkey and write <rowkey>\t<count> lines to HDFS.
        Job job = Job.getInstance(conf, "job_browse_to_hdfs");
        job.setJarByClass(JobBrowseCountToHbase.class);
        job.setMapperClass(JobBrowseMapper.class);
        job.setReducerClass(JobBrowseReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileSystem fs = FileSystem.get(conf);
        for (int i = 0; i < otherArgs.length - 1; i++) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        if (fs.exists(new Path(otherArgs[otherArgs.length - 1]))) {
            fs.delete(new Path(otherArgs[otherArgs.length - 1]), true);
        }
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
        job.setNumReduceTasks(20);
        if (!job.waitForCompletion(true)) {
            System.exit(1); // stop here if the counting job failed
        }

        // Job 2: turn the counting job's output into HFiles.
        Job LoadHbaseJob = Job.getInstance(conf, "job_browse_to_hbase");
        LoadHbaseJob.setJarByClass(JobBrowseCountToHbase.class);
        LoadHbaseJob.setMapperClass(LoadJobBrowseToHbaseMapper.class);
        /*
        LoadHbaseJob.setMapOutputKeyClass(ImmutableBytesWritable.class); // already set by configureIncrementalLoad
        LoadHbaseJob.setMapOutputValueClass(KeyValue.class);
        */
        if (fs.exists(new Path(bulkPath))) {
            fs.delete(new Path(bulkPath), true); // remove the HFile output directory if it already exists
        }
        FileInputFormat.addInputPath(LoadHbaseJob, new Path(otherArgs[otherArgs.length - 1]));
        FileOutputFormat.setOutputPath(LoadHbaseJob, new Path(bulkPath));

        Configuration hbaseConfiguration = HBaseConfiguration.create();
        HTable wordCountTable = new HTable(hbaseConfiguration, tableName);
        // Sets the sort reducer, the partitioner and one reducer per region of the table.
        HFileOutputFormat2.configureIncrementalLoad(LoadHbaseJob, wordCountTable);
        int loadHbaseJobFlag = LoadHbaseJob.waitForCompletion(true) ? 0 : 1;
        if (loadHbaseJobFlag != 0) {
            System.exit(loadHbaseJobFlag);
        }

        // Move the generated HFiles into the table's region directories.
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(hbaseConfiguration);
        loader.doBulkLoad(new Path(bulkPath), wordCountTable);
        System.out.println("Load HBase success");
        System.exit(loadHbaseJobFlag);
    }
}
```
As the code shows, the whole process is implemented as two jobs: the first does the counting, the second performs the bulk load.
A couple of key points in the code above:
The map emits every record twice because the counts are needed along two dimensions, company and job. The rowkey is therefore stored redundantly in two layouts, which lets a single table serve both kinds of query (see the scan sketch below).
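To make the two layouts concrete, here is a rough sketch of how each dimension could be scanned. The company id, job id, and date range are made-up values; only the table name and the info:count column are taken from the code above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class JobBrowseScanSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "hdp_xxx:xxx_job_browse");
        String comId = "123456"; // hypothetical company id
        String jobId = "987654"; // hypothetical job id

        // Dimension 1: all jobs of one company within a date range
        // (rowkey layout: comId_1_addtime_jobId).
        Scan byDate = new Scan();
        byDate.setStartRow(Bytes.toBytes(comId + "_1_20160101"));
        byDate.setStopRow(Bytes.toBytes(comId + "_1_20160201"));

        // Dimension 2: one specific job across all dates
        // (rowkey layout: comId_2_jobId_addtime).
        byte[] jobPrefix = Bytes.toBytes(comId + "_2_" + jobId + "_");
        Scan byJob = new Scan();
        byJob.setStartRow(jobPrefix);
        byJob.setFilter(new PrefixFilter(jobPrefix));

        for (Scan scan : new Scan[] { byDate, byJob }) {
            ResultScanner scanner = table.getScanner(scan);
            for (Result r : scanner) {
                String count = Bytes.toString(r.getValue(Bytes.toBytes("info"), Bytes.toBytes("count")));
                System.out.println(Bytes.toString(r.getRow()) + " -> " + count);
            }
            scanner.close();
        }
        table.close();
    }
}
```

Setting the start row to the prefix in the second scan avoids scanning from the beginning of the table before the prefix filter kicks in.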
The second key point is HFileOutputFormat2.configureIncrementalLoad(LoadHbaseJob, wordCountTable). This call is essential: HBase stores its data sorted by rowkey, so the data has to be sorted before the HFiles are written. Internally it handles sorting for three kinds of map output value:
```java
Configuration conf = job.getConfiguration();
job.setOutputKeyClass(ImmutableBytesWritable.class);
job.setOutputValueClass(KeyValue.class);
job.setOutputFormatClass(cls);

if (KeyValue.class.equals(job.getMapOutputValueClass())) {
    job.setReducerClass(KeyValueSortReducer.class);
} else if (Put.class.equals(job.getMapOutputValueClass())) {
    job.setReducerClass(PutSortReducer.class);
} else if (Text.class.equals(job.getMapOutputValueClass())) {
    job.setReducerClass(TextSortReducer.class);
} else {
    LOG.warn("Unknown map output value type:" + job.getMapOutputValueClass());
}

conf.setStrings("io.serializations", new String[] {
    conf.get("io.serializations"),
    MutationSerialization.class.getName(),
    ResultSerialization.class.getName(),
    KeyValueSerialization.class.getName() });

LOG.info("Looking up current regions for table " + table.getName());
List<ImmutableBytesWritable> startKeys = getRegionStartKeys(regionLocator);
LOG.info("Configuring " + startKeys.size() + " reduce partitions "
    + "to match current region count");
job.setNumReduceTasks(startKeys.size());

configurePartitioner(job, startKeys);
configureCompression(table, conf);
configureBloomType(table, conf);
configureBlockSize(table, conf);
configureDataBlockEncoding(table, conf);

TableMapReduceUtil.addDependencyJars(job);
TableMapReduceUtil.initCredentials(job);
LOG.info("Incremental table " + table.getName() + " output configured.");
```
As you can see, configureIncrementalLoad registers a sort reducer for the job (so the sorting happens in the reduce phase), supports the three value types listed above (KeyValue, Put, Text), and sets the number of reduce tasks to match the current region count of the target table.
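One practical consequence: because the reducer count and partition boundaries come from the target table's current regions, a brand-new table with a single region gives the HFile-generating job a single reducer. If that becomes a bottleneck, the table can be pre-split before loading. This is not part of the original code; the split points below are invented for illustration and only assume that rowkeys start with a numeric comId:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class CreatePreSplitJobBrowseTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("hdp_xxx:xxx_job_browse"));
        desc.addFamily(new HColumnDescriptor("info"));

        // Hypothetical split points on the leading digit of comId: with 4 split points
        // the table starts with 5 regions, so configureIncrementalLoad sets up 5 reducers.
        byte[][] splitKeys = {
            Bytes.toBytes("2"), Bytes.toBytes("4"), Bytes.toBytes("6"), Bytes.toBytes("8")
        };
        admin.createTable(desc, splitKeys);
        admin.close();
    }
}
```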
When to use BulkLoad:
To borrow the wording of the CDH documentation:
Use Cases for BulkLoad:
1. Loading your original dataset into HBase for the first time - Your initial dataset might be quite large, and bypassing the HBase write path can speed up the process considerably.
2. Incremental Load - To load new data periodically, use BulkLoad to import it in batches at your preferred intervals. This alleviates latency problems and helps you to achieve service-level agreements (SLAs). However, one trigger for compaction is the number of HFiles on a RegionServer. Therefore, importing a large number of HFiles at frequent intervals can cause major compactions to happen more often than they otherwise would, negatively impacting performance. You can mitigate this by tuning the compaction settings such that the maximum number of HFiles that can be present without triggering a compaction is very high, and relying on other factors, such as the size of the Memstore, to trigger compactions.
3. Data needs to originate elsewhere - If an existing system is capturing the data you want to have in HBase and needs to remain active for business reasons, you can periodically BulkLoad data from the system into HBase so that you can perform operations on it without impacting the system.
As you can see, bulk load is not restricted to a one-off initial load: if your business updates data on a fixed daily schedule, a daily bulk load is also a perfectly good option.
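One extra note on the incremental case: when you bulk load every day, the imported HFiles accumulate and compactions may fire at times you do not control. Besides raising the compaction thresholds as the CDH text suggests, another option (not from this article's code) is to trigger a major compaction yourself right after the nightly load, during off-peak hours. A rough sketch using the old HBaseAdmin API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CompactAfterBulkLoad {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        // Asks the region servers to major-compact the table; the call is asynchronous,
        // so it returns immediately and the compaction runs in the background.
        admin.majorCompact("hdp_xxx:xxx_job_browse");
        admin.close();
    }
}
```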