Spark: Batch-Saving Records to HBase

Spark's PairRDDFunctions provides two API methods, saveAsHadoopDataset and saveAsNewAPIHadoopDataset, for writing an RDD out to any Hadoop-supported storage system.

PairRDDFunctions

Method descriptions

def saveAsHadoopDataset(conf: JobConf): Unit
    Output the RDD to any Hadoop-supported storage system, 
    using a Hadoop JobConf object for that storage system. 
    The JobConf should set an OutputFormat and any output 
    paths required (e.g. a table name to write to) in the 
    same way as it would be configured for a Hadoop MapReduce job.
def saveAsNewAPIHadoopDataset(conf: Configuration): Unit
    Output the RDD to any Hadoop-supported storage system with new Hadoop API, 
    using a Hadoop Configuration object for that storage system. 
    The Conf should set an OutputFormat and any output paths required 
    (e.g. a table name to write to) in the same way as it would be 
    configured for a Hadoop MapReduce job.
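
The example below uses the new-API method saveAsNewAPIHadoopDataset together with org.apache.hadoop.hbase.mapreduce.TableOutputFormat. For comparison, a minimal sketch of the old-API path via saveAsHadoopDataset and a JobConf could look like the following (the object name, table name and ZooKeeper quorum are illustrative assumptions, not part of the original example):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.{ SparkConf, SparkContext }

object SaveHbaseOldApi {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("SaveHbaseOldApi"))

    // Old (mapred) API: output format and table name are carried in a JobConf
    val jobConf = new JobConf(HBaseConfiguration.create())
    jobConf.set("hbase.zookeeper.quorum", "sparkmaster1")
    jobConf.setOutputFormat(classOf[TableOutputFormat])
    jobConf.set(TableOutputFormat.OUTPUT_TABLE, "wetag")

    sc.parallelize(Seq((1, "Jazz"), (2, "Andy")))
      .map { case (key, name) =>
        val p = new Put(Bytes.toBytes(key))
        p.addColumn(Bytes.toBytes("attr"), Bytes.toBytes("name"), Bytes.toBytes(name))
        (new ImmutableBytesWritable, p)
      }
      .saveAsHadoopDataset(jobConf)

    sc.stop()
  }
}

The per-record conversion into (ImmutableBytesWritable, Put) pairs is the same in both variants; what differs is the configuration object (JobConf vs. Configuration) and the TableOutputFormat package (mapred vs. mapreduce).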

Example

package com.welab.wetag.mine

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{ Put, Durability }
import org.apache.hadoop.hbase.util.Bytes

object SaveHbase {
  // Convert a (key, name, age) tuple into the (ImmutableBytesWritable, Put) pair expected by TableOutputFormat
  def convert(triple: (Int, String, Int)) = {
    val (key, name, age) = triple
    val p = new Put(Bytes.toBytes(key))
    p.setDurability(Durability.SKIP_WAL)
    p.addColumn(Bytes.toBytes("attr"), Bytes.toBytes("name"), Bytes.toBytes(name))
    p.addColumn(Bytes.toBytes("attr"), Bytes.toBytes("age"), Bytes.toBytes(age))
    (new ImmutableBytesWritable, p)
  }

  def main(args: Array[String]) {
    val appName: String = this.getClass.getSimpleName.split("\\$").last
    val sc = new SparkContext(new SparkConf().setAppName(appName))

    // HBase configuration; make sure the table "wetag" (with column family "attr") has already been created
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    conf.set("hbase.zookeeper.quorum", "sparkmaster1")
    conf.set(TableOutputFormat.OUTPUT_TABLE, "wetag")

    val job = Job.getInstance(conf)
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

    val rawData = Array((1, "Jazz", 14), (2, "Andy", 18), (3, "Vincent", 38))
    /* Alternatively, the conversion logic can be written inline in the map:
    val xdata = sc.parallelize(rawData).map {
      case (key, name, age) => {
        val p = new Put(Bytes.toBytes(key))
        p.setDurability(Durability.SKIP_WAL)
        p.addColumn(Bytes.toBytes("attr"), Bytes.toBytes("name"), Bytes.toBytes(name))
        p.addColumn(Bytes.toBytes("attr"), Bytes.toBytes("age"), Bytes.toBytes(age))
        (new ImmutableBytesWritable, p)
      }
    }
    */
    val xdata = sc.parallelize(rawData).map(convert)
    xdata.saveAsNewAPIHadoopDataset(job.getConfiguration)
  }
}

Compile the code above into a jar and submit it for execution with spark-submit.
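
A typical submission might look like the following (the jar name, master URL and the <hbase-client-jars> list are placeholders; the HBase client classes must be visible to both the driver and the executors, for example via --jars):

spark-submit --class com.welab.wetag.mine.SaveHbase --master yarn --jars <hbase-client-jars> wetag.jar

Once the job completes, the inserted rows can be checked in the HBase shell, e.g. with scan 'wetag'.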
