The code is as follows (this version fails):
import java.sql.DriverManager

val data = sc.parallelize(List(("192.168.34.5", "pc", 5, 12)))
val url = "jdbc:mysql://host:port/database"
val user = "username"
val pwd = "password"
classOf[com.mysql.jdbc.Driver]  // force-load the MySQL JDBC driver
val conn = DriverManager.getConnection(url, user, pwd)
try {
  conn.setAutoCommit(false)
  val prep = conn.prepareStatement(
    "INSERT INTO info (ip, source, hour, count) VALUES (?, ?, ?, ?)")
  // BUG: this closure captures prep, which is not serializable,
  // so Spark throws "Task not serializable" (explained below)
  data.map { case (ip, source, hour, count) =>
    prep.setString(1, ip)
    prep.setString(2, source)
    prep.setInt(3, hour)
    prep.setInt(4, count)
    prep.addBatch()
  }
  prep.executeBatch()
  conn.commit()
} catch {
  case e: Exception => e.printStackTrace()
} finally {
  conn.close()
}
Solution:
Replace:
data.map { case (ip, source, hour, count, count_all, ...
with:
data.collect().foreach { case (ip, source, hour, count, ...
Reason:
prep is a PreparedStatement object, and PreparedStatement cannot be serialized. Any object captured by a function passed to map must be shipped to the executors: it is serialized on the driver, sent over the network, and deserialized on the receiving machine. For a Java class to be (de)serialized it must implement the Serializable interface, and PreparedStatement does not. The prep object lives on the driver, and after collect() the data is on the driver as well, so prep no longer needs to be serialized and shipped to the executors.
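To make the elided snippet above concrete, here is a minimal sketch of the collect-based version, assuming the same four-column info table and the url/user/pwd values defined earlier:

// Sketch only: collect() brings every row to the driver, so prep is
// never shipped anywhere and serialization is no longer an issue.
val conn = DriverManager.getConnection(url, user, pwd)
try {
  conn.setAutoCommit(false)
  val prep = conn.prepareStatement(
    "INSERT INTO info (ip, source, hour, count) VALUES (?, ?, ?, ?)")
  data.collect().foreach { case (ip, source, hour, count) =>
    prep.setString(1, ip)
    prep.setString(2, source)
    prep.setInt(3, hour)
    prep.setInt(4, count)
    prep.addBatch()
  }
  prep.executeBatch()
  conn.commit()
} catch {
  case e: Exception => e.printStackTrace()
} finally {
  conn.close()
}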
However, collect() pulls the entire dataset onto the driver, which causes performance (and memory) problems for large RDDs. A better solution: use foreachPartition to maintain one MySQL connection per partition and do the inserts there.
data.foreachPartition { it =>
  // one connection per partition, opened on the executor itself
  val conn = DriverManager.getConnection(url, user, pwd)
  try {
    conn.setAutoCommit(false)
    val prep = conn.prepareStatement(
      "INSERT INTO reg_ip_info (ip, source, hour, count) VALUES (?, ?, ?, ?)")
    it.foreach { case (ip, source, hour, count) =>
      prep.setString(1, ip)
      prep.setString(2, source)
      prep.setInt(3, hour)
      prep.setInt(4, count)
      // prep.setTimestamp(4, new Timestamp(System.currentTimeMillis()))
      prep.addBatch()
    }
    prep.executeBatch()
    conn.commit()
  } catch {
    case e: Exception => e.printStackTrace()
  } finally {
    conn.close()
  }
}
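One caveat with the version above: on a very large partition the JDBC batch grows without bound until executeBatch() runs. A sketch of a variation that flushes every 1000 rows (the threshold is an arbitrary illustrative choice, not something from the original post):

data.foreachPartition { it =>
  val conn = DriverManager.getConnection(url, user, pwd)
  try {
    conn.setAutoCommit(false)
    val prep = conn.prepareStatement(
      "INSERT INTO reg_ip_info (ip, source, hour, count) VALUES (?, ?, ?, ?)")
    var pending = 0
    it.foreach { case (ip, source, hour, count) =>
      prep.setString(1, ip)
      prep.setString(2, source)
      prep.setInt(3, hour)
      prep.setInt(4, count)
      prep.addBatch()
      pending += 1
      if (pending >= 1000) {  // flush periodically to bound memory use
        prep.executeBatch()
        pending = 0
      }
    }
    if (pending > 0) prep.executeBatch()  // flush the remainder
    conn.commit()
  } catch {
    case e: Exception => e.printStackTrace()
  } finally {
    conn.close()
  }
}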
Reference: https://stackoverflow.com/questions/37462803/prepare-batch-statement-to-store-all-the-rdd-to-mysql-generated-from-spark-strea
P.S.:
mapPartitions is similar to map, except that the mapping function's argument changes from each element of the RDD to an iterator over each partition of the RDD. If the mapping needs to repeatedly create expensive auxiliary objects, mapPartitions is much more efficient than map. For example, when writing all the data in an RDD to a database over JDBC, using map could mean creating one connection per element, which is very costly; with mapPartitions, only one connection per partition is needed.
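A toy sketch of that difference; ExpensiveResource is a hypothetical stand-in for a costly object such as a JDBC connection:

// Hypothetical stand-in for an expensive-to-create object.
class ExpensiveResource {
  def transform(x: Int): Int = x * 2
}

val nums = sc.parallelize(1 to 10000, 4)  // 4 partitions

// With map: one resource per element (created 10000 times).
val viaMap = nums.map { x =>
  val res = new ExpensiveResource()
  res.transform(x)
}

// With mapPartitions: one resource per partition (4 in total here).
// Iterator.map is lazy, so res is created once and reused for the
// whole partition; nothing runs until an action is invoked.
val viaPartitions = nums.mapPartitions { it =>
  val res = new ExpensiveResource()
  it.map(res.transform)
}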