1. Introduction
To be honest, the title is a bit misleading: what this article describes actually has little to do with Spark Streaming itself.
In the previous article, http://www.lxweimin.com/p/a73c0c95d2fe, we showed how to insert data into a database from Spark Streaming. As you may have noticed, the records were inserted one at a time, which is inefficient. So how can we speed up the database writes?
Writing to a database is an I/O-bound task, and adding parallelism does not necessarily make it faster. This article focuses on two approaches: batch commit and Bulk Copy Insert.
2. Batch Commit
Batch commit simply means using JDBC Statement's executeBatch. Let's go straight to the code.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.ConsumerStrategies._
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.slf4j.LoggerFactory
/**
* Read data from Kafka and write it to the database in batches
*/
object KafkaToDB {
val logger = LoggerFactory.getLogger(this.getClass)
def main(args: Array[String]): Unit = {
// Validate arguments
if (args.length < 2) {
System.err.println(
s"""
|Usage: KafkaToDB <brokers> <topics>
| <brokers> is a list of one or more Kafka brokers
| <topics> is a list of one or more kafka topics to consume from
|""".stripMargin)
System.exit(1)
}
// Parse arguments
val Array(brokers, topics) = args
// Topics are comma-separated
val topicSet: Set[String] = topics.split(",").toSet
val kafkaParams: Map[String, Object] = Map[String, Object](
"bootstrap.servers" -> brokers,
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "example",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
// Create the streaming context with a 2-second batch interval
val sparkConf = new SparkConf().setAppName("KafkaToDB")
val streamingContext = new StreamingContext(sparkConf, Seconds(2))
// 1. Create the input stream to receive data. Stream operations are based on DStream; InputDStream extends DStream
val stream = KafkaUtils.createDirectStream[String, String](
streamingContext,
PreferConsistent,
Subscribe[String, String](topicSet, kafkaParams)
)
// 2. Transformations on the DStream
// Take each message's value, split it on commas, and convert it to a Tuple3
val values = stream.map(_.value.split(","))
.filter(x => x.length == 3)
.map(x => new Tuple3[String, String, String](x(0), x(1), x(2)))
// Print the first 10 records to the console for debugging
values.print()
// 3. Save to the database via foreachRDD
val sql = "insert into kafka_message(timeseq,timeseq2, thread, message) values (?,?,?,?)"
values.foreachRDD(rdd => {
val count = rdd.count()
println("-----------------count:" + count)
if (count > 0) {
rdd.foreachPartition(partitionOfRecords => {
val conn = ConnectionPool.getConnection.orNull
if (conn != null) {
val ps = conn.prepareStatement(sql)
try{
// Disable auto-commit
conn.setAutoCommit(false)
partitionOfRecords.foreach(data => {
ps.setString(1, data._1)
ps.setString(2,System.currentTimeMillis().toString)
ps.setString(3, data._2)
ps.setString(4, data._3)
ps.addBatch()
})
ps.executeBatch()
conn.commit()
} catch {
case e: Exception =>
logger.error("Error in execution of insert. " + e.getMessage)
}finally {
ps.close()
ConnectionPool.closeConnection(conn)
}
}
})
}
})
streamingContext.start() // Start the computation
streamingContext.awaitTermination() // Wait for the computation to terminate
}
}
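The ConnectionPool helper used above comes from the previous article. In case you do not have it at hand, a minimal stand-in that matches the two calls made here (getConnection returning an Option[Connection], and closeConnection) could look like the sketch below; the JDBC URL and credentials are placeholders, and the original pool may well be implemented differently (for example, on top of a real connection pool such as HikariCP).
import java.sql.{Connection, DriverManager}
import scala.util.Try

// Minimal stand-in for the ConnectionPool used above; not the original implementation
object ConnectionPool {
  private val url = "jdbc:postgresql://localhost:5432/test" // placeholder
  private val user = "user"                                 // placeholder
  private val password = "password"                         // placeholder

  // Return Some(connection), or None if the connection cannot be established
  def getConnection: Option[Connection] =
    Try(DriverManager.getConnection(url, user, password)).toOption

  // Close the connection; a pooled implementation would return it to the pool instead
  def closeConnection(conn: Connection): Unit =
    if (conn != null) conn.close()
}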
3. Bulk Copy Insert
We use PostgreSQL, whose JDBC driver provides a Copy API. The main steps are:
- 1. Get a database connection
- 2. Create a CopyManager
- 3. Wrap the streaming data from Spark Streaming as an InputStream
- 4. Execute the copy insert (COPY ... FROM STDIN), as sketched right after this list
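Before the full Spark Streaming program, here is a minimal sketch of just these four steps outside of Spark; the connection URL, credentials, CSV payload and the kafka_message table are illustrative assumptions.
import java.io.ByteArrayInputStream
import java.sql.DriverManager
import org.postgresql.copy.CopyManager
import org.postgresql.core.BaseConnection

object CopyInsertSketch {
  def main(args: Array[String]): Unit = {
    // 1. Get a database connection (URL and credentials are placeholders)
    val conn = DriverManager.getConnection("jdbc:postgresql://localhost:5432/test", "user", "password")
    try {
      // 2. Create a CopyManager from the underlying PostgreSQL connection
      //    (a pooled connection would first need to be unwrapped, as in the full program below)
      val copyManager = new CopyManager(conn.asInstanceOf[BaseConnection])
      // 3. Wrap the rows as an InputStream of CSV lines
      val csv = "1,1510000000000,thread-1,hello\n2,1510000000001,thread-2,world\n"
      val in = new ByteArrayInputStream(csv.getBytes("UTF-8"))
      // 4. Execute COPY ... FROM STDIN; copyIn returns the number of rows copied
      val rows = copyManager.copyIn("COPY kafka_message FROM STDIN WITH CSV", in)
      println(s"copied $rows rows")
    } finally {
      conn.close()
    }
  }
}
The full Spark Streaming version follows.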
import java.io.{ByteArrayInputStream, IOException, InputStream}
import java.sql.Connection
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.ConsumerStrategies._
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.postgresql.copy.CopyManager
import org.postgresql.core.BaseConnection
import org.slf4j.LoggerFactory
object CopyInsert {
val logger = LoggerFactory.getLogger(this.getClass)
def main(args: Array[String]): Unit = {
// Validate arguments
if (args.length < 4) {
System.err.println(
s"""
|Usage: CopyInsert <brokers> <topics> <duration> <batchsize>
| <brokers> is a list of one or more Kafka brokers
| <topics> is a list of one or more kafka topics to consume from
| <duration> is the streaming batch interval, in seconds
| <batchsize> is the number of records sent per COPY
|""".stripMargin)
System.exit(1)
}
// Parse arguments
val Array(brokers, topics, duration, batchsize) = args
// Topics are comma-separated
val topicSet: Set[String] = topics.split(",").toSet
val kafkaParams: Map[String, Object] = Map[String, Object](
"bootstrap.servers" -> brokers,
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "example",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
// Create the streaming context with a batch interval of <duration> seconds
val sparkConf = new SparkConf().setAppName("CopyInsertIntoPostgreSQL")
val streamingContext = new StreamingContext(sparkConf, Seconds(duration.toInt))
// 1. Create the input stream to receive data. Stream operations are based on DStream; InputDStream extends DStream
val stream = KafkaUtils.createDirectStream[String, String](
streamingContext,
PreferConsistent,
Subscribe[String, String](topicSet, kafkaParams)
)
// 2. Transformations on the DStream
// Take each message's value, split it on commas, and convert it to a Tuple3
val values = stream.map(_.value.split(","))
.filter(x => x.length == 3)
.map(x => new Tuple3[String, String, String](x(0), x(1), x(2)))
// Print the first 10 records to the console for debugging
values.print()
// 3. Save to the database via foreachRDD
// http://rostislav-matl.blogspot.jp/2011/08/fast-inserts-to-postgresql-with-jdbc.html
values.foreachRDD(rdd => {
val count = rdd.count()
println("-----------------count:" + count)
if (count > 0) {
rdd.foreachPartition(partitionOfRecords => {
val start = System.currentTimeMillis()
val conn: Connection = ConnectionPool.getConnection.orNull
if (conn != null) {
val batch = batchsize.toInt
var counter: Int = 0
val sb: StringBuilder = new StringBuilder()
// Unwrap the pooled connection to the underlying PostgreSQL BaseConnection
val baseConnection = conn.getMetaData.getConnection.asInstanceOf[BaseConnection]
// Create the CopyManager
val cpManager: CopyManager = new CopyManager(baseConnection)
partitionOfRecords.foreach(record => {
counter += 1
sb.append(record._1).append(",")
.append(System.currentTimeMillis()).append(",")
.append(record._2).append(",")
.append(record._3).append("\n")
if (counter == batch) {
// Build an InputStream over the accumulated CSV lines
val in: InputStream = new ByteArrayInputStream(sb.toString().getBytes())
// Execute COPY ... FROM STDIN
cpManager.copyIn("COPY kafka_message FROM STDIN WITH CSV", in)
println("-----------------batch---------------: " + batch)
counter = 0
sb.delete(0, sb.length)
closeInputStream(in)
}
})
// Flush any remaining records that did not fill a complete batch
if (counter > 0) {
val lastIn: InputStream = new ByteArrayInputStream(sb.toString().getBytes())
cpManager.copyIn("COPY kafka_message FROM STDIN WITH CSV", lastIn)
sb.delete(0, sb.length)
counter = 0
closeInputStream(lastIn)
}
val end = System.currentTimeMillis()
println("-----------------duration---------------ms :" + (end - start))
}
})
}
})
streamingContext.start() // Start the computation
streamingContext.awaitTermination() // Wait for the computation to terminate
}
def closeInputStream(in: InputStream): Unit ={
try{
in.close()
}catch{
case e: IOException =>
logger.error("Error on close InputStream. " + e.getMessage)
}
}
}
Other databases should offer similar bulk-load mechanisms. MySQL, for example, has a setLocalInfileInputStream method on com.mysql.jdbc.Statement whose function should be similar to the Copy Insert described above, but I have not yet written an example to verify it. The documentation describes it as follows, for reference (original documentation link):
Sets an InputStream instance that will be used to send data to the MySQL server for a "LOAD DATA LOCAL INFILE" statement rather than a FileInputStream or URLInputStream that represents the path given as an argument to the statement.
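Based on that description, an unverified sketch in the spirit of the Copy Insert above might look like this; the JDBC URL, the allowLoadLocalInfile parameter, the table name and the CSV payload are all assumptions, and the cast targets the Connector/J 5.1 com.mysql.jdbc.Statement interface.
import java.io.ByteArrayInputStream
import java.sql.DriverManager

// Unverified sketch: bulk load into MySQL via setLocalInfileInputStream (assumptions noted above)
object MySqlBulkLoadSketch {
  def main(args: Array[String]): Unit = {
    // LOAD DATA LOCAL INFILE usually has to be allowed on both server and client;
    // the URL parameter below is an assumption about the client side.
    val conn = DriverManager.getConnection(
      "jdbc:mysql://localhost:3306/test?allowLoadLocalInfile=true", "user", "password")
    try {
      val stmt = conn.createStatement()
      val csv = "1,1510000000000,thread-1,hello\n"
      // Hand the data to the driver as an InputStream instead of a file on disk
      stmt.asInstanceOf[com.mysql.jdbc.Statement]
        .setLocalInfileInputStream(new ByteArrayInputStream(csv.getBytes("UTF-8")))
      // Per the quoted documentation, the path in the statement is not read when a stream is set
      stmt.execute(
        "LOAD DATA LOCAL INFILE 'stream' INTO TABLE kafka_message " +
        "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'")
    } finally {
      conn.close()
    }
  }
}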
(End)