In earlier articles we walked through the source code of receiver-based Spark Streaming applications. More and more Spark Streaming applications, however, are now written in the No Receivers (Direct Approach) style. Its advantages are: 1. greater freedom of control; 2. semantic consistency (exactly-once).
The No Receivers approach also matches more closely how we naturally read and operate on data. Spark itself is a compute framework that sits on top of data sources; without receivers we operate on the data source directly, which is the more natural way of working. To operate on a data source we need a wrapper around it, and that wrapper has to be an RDD. Taking direct access to Kafka as the example, here is the example code from the Spark sources that reads Kafka data directly:
import kafka.serializer.StringDecoder

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectKafkaWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println(s"""
        |Usage: DirectKafkaWordCount <brokers> <topics>
        |  <brokers> is a list of one or more Kafka brokers
        |  <topics> is a list of one or more kafka topics to consume from
        |
        """.stripMargin)
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()

    val Array(brokers, topics) = args

    // Create context with 2 second batch interval
    val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // Create direct kafka stream with brokers and topics
    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)

    // Get the lines, split them into words, count the words and print
    val lines = messages.map(_._2)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
    wordCounts.print()

    // Start the computation
    ssc.start()
    ssc.awaitTermination()
  }
}
Spark Streaming wraps this data source in an RDD, namely KafkaRDD:
/**
 * A batch-oriented interface for consuming from Kafka.
 * Starting and ending offsets are specified in advance,
 * so that you can control exactly-once semantics.
 * @param kafkaParams Kafka configuration parameters.
 *   Requires "metadata.broker.list" or "bootstrap.servers" to be set
 *   with Kafka broker(s) specified in host1:port1,host2:port2 form.
 * @param offsetRanges offset ranges that define the Kafka data belonging to this RDD
 * @param messageHandler function for translating each message into the desired type
 */
private[kafka]
class KafkaRDD[
  K: ClassTag,
  V: ClassTag,
  U <: Decoder[_]: ClassTag,
  T <: Decoder[_]: ClassTag,
  R: ClassTag] private[spark] (
    sc: SparkContext,
    kafkaParams: Map[String, String],
    val offsetRanges: Array[OffsetRange],  // the offset ranges of this RDD's data
    leaders: Map[TopicAndPartition, (String, Int)],
    messageHandler: MessageAndMetadata[K, V] => R
  ) extends RDD[R](sc, Nil) with Logging with HasOffsetRanges
As you can see, KafkaRDD mixes in HasOffsetRanges, which is a trait:
trait HasOffsetRanges {
  def offsetRanges: Array[OffsetRange]
}
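Because the RDDs produced by the direct stream mix in this trait, application code can cast an RDD back to HasOffsetRanges to see exactly which offsets each batch covers. A minimal sketch, following the pattern in the Spark Streaming + Kafka documentation and reusing the messages stream from the example above:

import org.apache.spark.streaming.kafka.HasOffsetRanges

messages.foreachRDD { rdd =>
  // the cast must be applied to the RDD produced directly by createDirectStream,
  // before any transformation changes its type
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
}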
OffsetRange identifies the topic, partition, starting offset and ending offset of the RDD's data:
final class OffsetRange private(
    val topic: String,
    val partition: Int,
    val fromOffset: Long,
    val untilOffset: Long) extends Serializable
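Since the constructor is private, OffsetRange instances are built through the companion object's apply/create methods. As a hedged sketch, such ranges can also be handed to KafkaUtils.createRDD to build a one-off batch KafkaRDD; the topic name and offsets below are invented for illustration, and sc (e.g. ssc.sparkContext) and kafkaParams are assumed to exist as in the example above:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

// read partition 0 of "myTopic" from offset 0 (inclusive) to offset 100 (exclusive)
val ranges = Array(OffsetRange("myTopic", 0, 0L, 100L))
val batchRdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, ranges)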
Back in KafkaRDD, let's look at its getPartitions method:
override def getPartitions: Array[Partition] = {
  offsetRanges.zipWithIndex.map { case (o, i) =>
    val (host, port) = leaders(TopicAndPartition(o.topic, o.partition))
    new KafkaRDDPartition(i, o.topic, o.partition, o.fromOffset, o.untilOffset, host, port)
  }.toArray
}
It returns KafkaRDDPartition objects:
private[kafka]
class KafkaRDDPartition(
  val index: Int,
  val topic: String,
  val partition: Int,
  val fromOffset: Long,
  val untilOffset: Long,
  val host: String,
  val port: Int
) extends Partition {
  /** Number of messages this partition refers to */
  def count(): Long = untilOffset - fromOffset
}
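Note that getPartitions creates exactly one Spark partition per OffsetRange, i.e. per Kafka topic-partition, so a topic with three partitions yields a KafkaRDD with three partitions. If downstream processing needs more parallelism than the topic's partition count, one option is to repartition after the initial read; this is a sketch, and the factor 8 is arbitrary:

// one Spark partition per Kafka partition; repartition trades a shuffle for parallelism
val moreParallel = messages.map(_._2).repartition(8)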
A KafkaRDDPartition describes precisely where its data lives; the data of each KafkaRDDPartition is then computed by KafkaRDD's compute method:
override def compute(thePart: Partition, context: TaskContext): Iterator[R] = {
  val part = thePart.asInstanceOf[KafkaRDDPartition]
  assert(part.fromOffset <= part.untilOffset, errBeginAfterEnd(part))
  if (part.fromOffset == part.untilOffset) {
    log.info(s"Beginning offset ${part.fromOffset} is the same as ending offset " +
      s"skipping ${part.topic} ${part.partition}")
    Iterator.empty
  } else {
    new KafkaRDDIterator(part, context)
  }
}
KafkaRDD's compute method returns a KafkaRDDIterator object:
private class KafkaRDDIterator(
    part: KafkaRDDPartition,
    context: TaskContext) extends NextIterator[R] {

  context.addTaskCompletionListener{ context => closeIfNeeded() }

  log.info(s"Computing topic ${part.topic}, partition ${part.partition} " +
    s"offsets ${part.fromOffset} -> ${part.untilOffset}")

  val kc = new KafkaCluster(kafkaParams)
  val keyDecoder = classTag[U].runtimeClass.getConstructor(classOf[VerifiableProperties])
    .newInstance(kc.config.props)
    .asInstanceOf[Decoder[K]]
  val valueDecoder = classTag[T].runtimeClass.getConstructor(classOf[VerifiableProperties])
    .newInstance(kc.config.props)
    .asInstanceOf[Decoder[V]]
  val consumer = connectLeader
  var requestOffset = part.fromOffset
  var iter: Iterator[MessageAndOffset] = null
  // ..................
}
Inside KafkaRDDIterator a KafkaCluster object is created, which is used to interact with the Kafka cluster and fetch the data.
Back to the example at the beginning: we created the InputDStream with KafkaUtils.createDirectStream:
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topicsSet)
Let's look at the source of createDirectStream:
def createDirectStream[
  K: ClassTag,
  V: ClassTag,
  KD <: Decoder[K]: ClassTag,
  VD <: Decoder[V]: ClassTag] (
    ssc: StreamingContext,
    kafkaParams: Map[String, String],
    topics: Set[String]
): InputDStream[(K, V)] = {
  val messageHandler = (mmd: MessageAndMetadata[K, V]) => (mmd.key, mmd.message)
  // create a KafkaCluster object
  val kc = new KafkaCluster(kafkaParams)
  // use kc to look up the starting offsets
  val fromOffsets = getFromOffsets(kc, kafkaParams, topics)
  new DirectKafkaInputDStream[K, V, KD, VD, (K, V)](
    ssc, kafkaParams, fromOffsets, messageHandler)
}
It first queries the Kafka cluster through KafkaCluster to obtain the starting offsets, then creates and returns a DirectKafkaInputDStream object.
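KafkaUtils also exposes an overload of createDirectStream that takes the starting offsets explicitly, which is useful when you manage offsets yourself (for example, restoring them from ZooKeeper or a database on restart). A hedged sketch, where the offsets map is invented for illustration and ssc and kafkaParams come from the earlier example:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder

// resume topic "myTopic", partition 0, from offset 100 (an illustrative value)
val fromOffsets = Map(TopicAndPartition("myTopic", 0) -> 100L)
val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets, messageHandler)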
Here is the source of DirectKafkaInputDStream's compute method:
override def compute(validTime: Time): Option[KafkaRDD[K, V, U, T, R]] = {
  // compute the ending offsets for this batch (clamped by the rate limit)
  val untilOffsets = clamp(latestLeaderOffsets(maxRetries))
  // create a KafkaRDD from the offset ranges
  val rdd = KafkaRDD[K, V, U, T, R](
    context.sparkContext, kafkaParams, currentOffsets, untilOffsets, messageHandler)

  // Report the record number and metadata of this batch interval to InputInfoTracker.
  val offsetRanges = currentOffsets.map { case (tp, fo) =>
    val uo = untilOffsets(tp)
    OffsetRange(tp.topic, tp.partition, fo, uo.offset)
  }
  val description = offsetRanges.filter { offsetRange =>
    // Don't display empty ranges.
    offsetRange.fromOffset != offsetRange.untilOffset
  }.map { offsetRange =>
    s"topic: ${offsetRange.topic}\tpartition: ${offsetRange.partition}\t" +
      s"offsets: ${offsetRange.fromOffset} to ${offsetRange.untilOffset}"
  }.mkString("\n")
  // Copy offsetRanges to immutable.List to prevent from being modified by the user
  val metadata = Map(
    "offsets" -> offsetRanges.toList,
    StreamInputInfo.METADATA_KEY_DESCRIPTION -> description)
  val inputInfo = StreamInputInfo(id, rdd.count, metadata)
  ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)

  currentOffsets = untilOffsets.map(kv => kv._1 -> kv._2.offset)
  Some(rdd)
}
As the compute method of DirectKafkaInputDStream shows, it first obtains the offsets from the Kafka cluster and then creates a KafkaRDD from those offsets, which is different from how RDDs are created in the receiver-based approach.
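The clamp call is where per-batch rate limiting happens: it bounds untilOffsets so that a single batch cannot read arbitrarily far ahead. The bound is controlled by the spark.streaming.kafka.maxRatePerPartition setting (records per second, per Kafka partition); as a sketch, extending the sparkConf from the example above with an illustrative value:

val sparkConf = new SparkConf()
  .setAppName("DirectKafkaWordCount")
  // at the 2-second batch interval used above this caps each partition
  // at roughly 20,000 records per batch
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")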
Summary:
A KafkaRDDPartition can only belong to one topic; a partition never spans multiple topics, so each partition consumes a single Kafka topic-partition directly. As data keeps arriving in the topic the offsets keep advancing; an offset is simply a pointer into the Kafka data.
Data flows into Kafka continuously. If batchDuration is, say, ten seconds, then every ten seconds a batch consumes data from the configured topics up to the computed ending offsets, and the next batch reads either from the beginning or from where the previous batch left off, depending on the configured starting offsets.
Comparing fetching Kafka data directly with reading it through a receiver:
Benefit one:
When fetching Kafka data directly there is no receiver-side buffering, so buffering-related problems such as out-of-memory errors do not arise. With the receiver-based Kafka reader the data is buffered first, and you have to tune the ingestion rate, block interval, and related settings.
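For comparison, a hedged sketch of the receiver-side tuning alluded to above (the values are illustrative only); none of this buffering exists in the direct approach:

val receiverConf = new SparkConf()
  .setAppName("ReceiverBasedKafka")
  // cap how many records per second each receiver ingests
  .set("spark.streaming.receiver.maxRate", "10000")
  // how often buffered data is cut into blocks, which become RDD partitions
  .set("spark.streaming.blockInterval", "200ms")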
Benefit two:
With the receiver approach, a receiver is by default bound to a single executor on a worker, which makes ingestion awkward to distribute (it can be configured to use multiple receivers, but that takes extra work). With the direct approach, the KafkaRDD's partitions are by default spread across executors on multiple workers, so the data arrives naturally distributed, whereas receiver-based data is less convenient to compute on.
Benefit three:
Data consumption: in practice the receiver approach has the drawback that when data cannot be processed in time, i.e. processing lags far behind ingestion, the Spark Streaming application may crash. This does not happen with the direct approach, because it reads Kafka data on demand: if a batch is delayed, the next batchDuration simply does not read ahead.
Benefit four:
Full semantic consistency: data is neither consumed twice nor lost. Because the application interacts with Kafka directly, offsets are recorded only after the data has actually been processed successfully, which is what makes exactly-once semantics possible.
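A hedged sketch of "record offsets only after the work succeeds"; saveOffsets here is a hypothetical helper writing to a store of your choosing (ZooKeeper, a database, etc.), not part of the Spark API:

messages.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // 1. process / write out this batch's data first
  rdd.map(_._2).foreachPartition(records => records.foreach(println))
  // 2. only then persist the offsets, so that a failure replays the batch
  //    instead of silently skipping it
  saveOffsets(offsetRanges)  // saveOffsets is a hypothetical helper, not a Spark API
}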
In production environments the direct approach is strongly recommended for reading Kafka data.