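At the time of writing, the latest Spark release is 2.3.0, which updates the API that connects Spark Streaming to Kafka. The new API is still marked as experimental and may change before it is finalized. This post shows how to use the API as it stands in 2.3.0.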
This version of the integration is marked as experimental, so the API is potentially subject to change.
pom.xml configuration
Add the following dependencies:
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.3.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.3.0</version>
  </dependency>
</dependencies>
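For projects built with sbt rather than Maven, the same two dependencies might be declared roughly as follows. This is only a sketch: it assumes a Scala 2.11 build (matching the _2.11 suffix of the artifacts above), and the project name and the Scala patch version are placeholders that do not come from the original setup.

```scala
// build.sbt (sketch; the name and the scalaVersion patch level are assumptions)
name := "spark-streaming-kafka-example"
scalaVersion := "2.11.12"

// %% appends the Scala binary version (_2.11) to the artifact name
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "2.3.0",
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.3.0"
)
```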
Code
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, TaskContext}

object SparkStreamingNewAPIExample {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("SparkStreamingNewAPIExample")
    // One micro-batch every 10 seconds
    val ssc = new StreamingContext(sparkConf, Seconds(10))

    // Settings passed through to the underlying Kafka consumer
    val kafkaParams = scala.collection.Map[String, Object](
      "bootstrap.servers" -> "hostA:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "testGroup",
      "auto.offset.reset" -> "latest",
      "partition.assignment.strategy" -> "org.apache.kafka.clients.consumer.RangeAssignor",
      "enable.auto.commit" -> (true: java.lang.Boolean)
    )
    val topics = Array("topic1", "topic2")

    // Each element of the stream is a ConsumerRecord[String, String]
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )

    stream.foreachRDD { rdd =>
      // The RDD returned by the direct stream carries its Kafka offset ranges
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { item =>
        // Each RDD partition corresponds to one Kafka topic partition
        val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
        println(s"The record from topic [${o.topic}] is in partition ${o.partition} which offset from ${o.fromOffset} to ${o.untilOffset}")
        println(s"The record content is ${item.toList.mkString}")
      }
      rdd.count()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
Analysis
The code above has Spark Streaming consume topic1 and topic2 in 10-second batches and print information about each RDD to standard output.
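Note that KafkaUtils.createDirectStream differs considerably from its Spark 1.6.x counterpart, both in its parameters and in its return value. The return value in particular has changed: the elements of the returned RDDs are no longer key-value pairs but the richer ConsumerRecord[K, V] type.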
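Because each element is now a ConsumerRecord, code written against the older key-value shape needs a small adaptation. A minimal sketch, assuming String keys and values as in the example above; the object and method names here are illustrative and are not part of Spark:

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.DStream

object RecordAdapters {
  // Recover the (key, value) pair shape of the older API from a ConsumerRecord.
  def toPair(record: ConsumerRecord[String, String]): (String, String) =
    (record.key, record.value)

  // Apply the conversion to every record of a direct stream.
  def toPairs(stream: DStream[ConsumerRecord[String, String]]): DStream[(String, String)] =
    stream.map(toPair _)
}
```

Passing the stream created in the example to RecordAdapters.toPairs would yield a DStream of pairs, while the topic, partition, and offset remain available on the original records when they are needed.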
For example, log output like the following shows exactly which Kafka topic, partition, and offsets the data currently being processed by Spark came from.
The record is in partition 0 which offset from 23 to 25
The record content is ConsumerRecord(topic = topic1, partition = 0, offset = 23, CreateTime = 1487209064531, checksum = 2357653885, serialized key size = -1, serialized value size = 6, key = null, value = aaaaaa)ConsumerRecord(topic = topic1, partition = 0, offset = 24, CreateTime = 1487209065989, checksum = 2696444472, serialized key size = -1, serialized value size = 8, key = null, value = bbbbbbbb)
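The offset range in the first log line is half-open: fromOffset is inclusive and untilOffset is exclusive, which is why a range of 23 to 25 corresponds to the two records at offsets 23 and 24. A small sketch using the values from the log above (the object name is illustrative):

```scala
import org.apache.spark.streaming.kafka010.OffsetRange

object OffsetRangeSketch {
  // Values copied from the log output above.
  val range: OffsetRange = OffsetRange("topic1", 0, 23L, 25L)

  // Offsets actually covered by the range; untilOffset is exclusive.
  val offsets: Seq[Long] = range.fromOffset until range.untilOffset

  // Number of records in the range (untilOffset - fromOffset).
  val recordCount: Long = range.count()
}
```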
Parameter notes
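In the code above, enable.auto.commit is set to true. This means that offsets are committed automatically once data has been consumed, so if the Spark Streaming program stops for some reason and is later restarted, it will not re-consume data it has already consumed. This creates a problem: from a business perspective, the consumed data may not yet have been through the business processing, so it is not "fully consumed" in any real sense. If the parameter is set to false instead, the application decides what counts as consumed. Offsets then have to be committed manually, which only requires adding stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges) just before rdd.count(), that is, after the foreachPartition processing.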
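The manual-commit variant described above could look roughly like the following. This is a sketch under stated assumptions rather than a definitive implementation: process is a hypothetical stand-in for the application's business logic, the object and method names are illustrative, and the stream is assumed to have been created with enable.auto.commit set to false.

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

object ManualCommitSketch {
  // Override to merge into kafkaParams so the consumer stops auto-committing.
  val autoCommitOff: Map[String, Object] =
    Map("enable.auto.commit" -> (false: java.lang.Boolean))

  // `process` is hypothetical: whatever the business considers "consuming" a record.
  def consumeAndCommit(
      stream: InputDStream[ConsumerRecord[String, String]],
      process: ConsumerRecord[String, String] => Unit): Unit = {
    stream.foreachRDD { rdd =>
      // Capture the offset ranges before doing any work on the RDD.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // Run the business logic on the executors.
      rdd.foreachPartition { records => records.foreach(process) }

      // Commit only after the batch has been processed successfully.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }
  }
}
```

Note that commitAsync is called on the stream returned by createDirectStream itself, not on a stream derived from it through transformations. If the job fails between processing and committing, the batch is read again on restart, so the processing should tolerate repeated records.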