1. Writing a Spark DataFrame to MySQL
Writing a DataFrame to MySQL needs little special attention. "Spark" here covers Spark Core, Spark SQL, and Spark Streaming; the operation is the same in all of them. The code below comes from an actual project and writes an entire DataFrame to MySQL in one go (the DataFrame's schema must match the column names defined in the MySQL table).
// Write the query result to MySQL via Spark's JDBC data source.
import java.sql.Timestamp;
import java.util.Date;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.functions;

Dataset<Row> resultDF = spark.sql("select hphm,clpp,clys,tgsj,kkbh from t_cltgxx where id in ("
        + id.split("_")[0] + "," + id.split("_")[1] + ")");
resultDF.show();
// Add the batch number (jsbh) and creation time as literal columns.
Dataset<Row> resultDF2 = resultDF
        .withColumn("jsbh", functions.lit(new Date().getTime()))
        .withColumn("create_time", functions.lit(new Timestamp(new Date().getTime())));
resultDF2.show();
// Append the whole DataFrame to the t_tpc_result table.
resultDF2.write()
        .format("jdbc")
        .option("url", "jdbc:mysql://lin01.cniao5.com:3306/traffic?characterEncoding=UTF-8")
        .option("dbtable", "t_tpc_result")
        .option("user", "root")
        .option("password", "123456")
        .mode(SaveMode.Append)
        .save();
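The rest of this post uses Scala, so for comparison, here is a minimal Scala sketch of the same write using the DataFrameWriter.jdbc overload instead of format("jdbc"). The URL, table name, and credentials simply mirror the Java example above and should be treated as placeholders, and resultDF2 is assumed to be the same DataFrame built there.

import java.util.Properties
import org.apache.spark.sql.SaveMode

// Sketch only: resultDF2 is assumed to be the DataFrame built above.
val props = new Properties()
props.put("user", "root")
props.put("password", "123456")

resultDF2.write
  .mode(SaveMode.Append)
  .jdbc("jdbc:mysql://lin01.cniao5.com:3306/traffic?characterEncoding=UTF-8", "t_tpc_result", props)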
2. Writing a Spark RDD to MySQL
Inside the RDD, call foreach/foreachPartition, then open a connection -> prepare the SQL -> execute -> release the connection. The advantage of this approach is that the data can be processed however you need before being written to the table; you do not necessarily have to go through a whole DataFrame. The code is as follows:
import java.sql.{Connection, DriverManager, PreparedStatement}

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkStreamingForPartition {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("NetCatWordCount")
    conf.setMaster("local[3]")
    val ssc = new StreamingContext(conf, Seconds(5))
    val dstream = ssc.socketTextStream("hadoopMaster", 9999)
      .flatMap(_.split(" "))
      .map(x => (x, 1))
      .reduceByKey(_ + _)

    dstream.foreachRDD(rdd => {
      /** Embedded function: writes one partition's records to MySQL. */
      def func(records: Iterator[(String, Int)]): Unit = {
        /** Connect to MySQL. */
        var conn: Connection = null
        var stmt: PreparedStatement = null
        try {
          val url = "jdbc:mysql://hadoopMaster:3306/streaming"
          val user = "root"
          val password = "hadoop"
          conn = DriverManager.getConnection(url, user, password)
          // Prepare the statement once and reuse it for every record in the partition.
          stmt = conn.prepareStatement("insert into wordcounts values (?,?)")
          records.foreach(word => {
            stmt.setString(1, word._1)
            stmt.setInt(2, word._2)
            stmt.executeUpdate()
          })
        } catch {
          case e: Exception => e.printStackTrace()
        } finally {
          if (stmt != null) {
            stmt.close()
          }
          if (conn != null) {
            conn.close()
          }
        }
      }

      val repartitionedRDD = rdd.repartition(3)
      repartitionedRDD.foreachPartition(func)
    })

    ssc.start()
    ssc.awaitTermination()
  }
}
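When a partition holds many records, executing one INSERT per record costs one database round trip each. A hedged sketch of a batched variant of func, using PreparedStatement.addBatch/executeBatch (it assumes the same wordcounts table and connection details as the example above):

import java.sql.{Connection, DriverManager, PreparedStatement}

// Sketch: same table and credentials as the example above.
def funcBatched(records: Iterator[(String, Int)]): Unit = {
  var conn: Connection = null
  var stmt: PreparedStatement = null
  try {
    conn = DriverManager.getConnection("jdbc:mysql://hadoopMaster:3306/streaming", "root", "hadoop")
    stmt = conn.prepareStatement("insert into wordcounts values (?,?)")
    records.foreach { case (word, count) =>
      stmt.setString(1, word)
      stmt.setInt(2, count)
      stmt.addBatch()        // queue the row instead of executing it immediately
    }
    stmt.executeBatch()      // send all queued rows in one batch
  } catch {
    case e: Exception => e.printStackTrace()
  } finally {
    if (stmt != null) stmt.close()
    if (conn != null) conn.close()
  }
}

It plugs into the code above as repartitionedRDD.foreachPartition(funcBatched).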
A point that needs attention: the difference between foreachPartition and mapPartitions
Explanation: foreachPartition is an action, whereas mapPartitions is a transformation. The practical difference is that mapPartitions returns a value, so you can keep chaining operations on the resulting RDD, while foreachPartition returns nothing and, being an action, is typically used at the end of a program, for example to persist data to a storage system such as MySQL, Elasticsearch, or HBase.
Of course, you can also persist data inside a transformation, but it then depends on an action to trigger it: transformations are lazily evaluated, and without any action to trigger them they will never run. Keep this in mind.
A foreachPartition example:
import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf().setAppName("spark demo example")
val sc = new SparkContext(sparkConf)
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5), 3)
rdd.foreachPartition(partition => {
  // Do not call partition.size here: the iterator can only be traversed once,
  // so the foreach below would then see no data (see the sketch after this example).
  partition.foreach(line => {
    // save(line)  // persist the element
  })
})
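If a partition's size and its contents are both needed, one option (just a sketch, and only sensible when each partition comfortably fits in memory) is to materialize the iterator once before using it:

rdd.foreachPartition(partition => {
  // Materialize the iterator once so it can be counted and traversed.
  val rows = partition.toList
  println(s"partition size: ${rows.size}")
  rows.foreach(line => {
    // save(line)  // persist the element (save is a hypothetical sink)
  })
})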
A mapPartitions example:
import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf().setAppName("spark demo example")
val sc = new SparkContext(sparkConf)
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5), 3)
val saved = rdd.mapPartitions(partition => {
  // Use map rather than foreach here: foreach returns nothing,
  // and mapPartitions must return an iterator.
  partition.map(line => {
    // save(line)  // persist the element
  })
})
saved.count() // an action is needed to trigger the lazy mapPartitions
sc.stop()