Spark Streaming Source Code Walkthrough: Demystifying the Internals of Data Cleanup

A Spark Streaming application runs continuously. Without effective management of memory resources, memory can be exhausted very quickly.

A Spark Streaming application therefore must have its own cleanup mechanism for objects, data, and metadata.

Once you have studied Spark Streaming thoroughly, you will be able to master all kinds of Spark applications.


The objects, data, and metadata in a Spark Streaming application are produced as we operate on DStreams.

DStream:

private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()

The RDDs that a DStream generates for each batch time are stored in this generatedRDDs map.

Persistence of a DStream:

/** Persist RDDs of this DStream with the default storage level (MEMORY_ONLY_SER) */
def persist(): DStream[T] = persist(StorageLevel.MEMORY_ONLY_SER)

/** Persist RDDs of this DStream with the default storage level (MEMORY_ONLY_SER) */
def cache(): DStream[T] = persist()

Caching a DStream is really caching the RDDs it generates.
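For illustration, here is a minimal sketch of calling cache() in a streaming application (the socket source, host, and port are hypothetical); each batch's generated RDD is then persisted with MEMORY_ONLY_SER until cleanup:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamCacheExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DStreamCacheExample").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Hypothetical socket source; any input DStream would do.
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))

    // cache() == persist(StorageLevel.MEMORY_ONLY_SER): each batch's RDD is
    // registered in generatedRDDs and persisted until it is cleaned up.
    words.cache()

    words.count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}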

The generation and release of RDDs are also driven by the clock. In JobGenerator:

private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
  longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")

This timer keeps posting GenerateJobs events, one per batch interval.
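As a rough sketch of the mechanism (a simplified stand-in, not Spark's actual RecurringTimer), such a recurring timer is just a background scheduler that fires a callback once per batch interval, and the callback posts an event carrying the batch time:

import java.util.concurrent.{Executors, TimeUnit}

// Simplified stand-in for the GenerateJobs message posted by the real timer.
case class GenerateJobsEvent(batchTimeMs: Long)

class SimpleRecurringTimer(periodMs: Long, callback: Long => Unit) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  // Fire the callback immediately and then once every periodMs milliseconds.
  def start(): Unit =
    scheduler.scheduleAtFixedRate(
      new Runnable { def run(): Unit = callback(System.currentTimeMillis()) },
      0, periodMs, TimeUnit.MILLISECONDS)

  def stop(): Unit = scheduler.shutdown()
}

// Usage sketch: post a GenerateJobsEvent once per 1000 ms "batch interval".
// new SimpleRecurringTimer(1000, t => println(GenerateJobsEvent(t))).start()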


JobScheduler's JobHandler posts a JobCompleted message when a job finishes.

JobScheduler.JobHandler.run:

...
if (_eventLoop != null) {
  _eventLoop.post(JobStarted(job, clock.getTimeMillis()))
  // Disable checks for existing output directories in jobs launched by the streaming
  // scheduler, since we may need to write output to an existing directory during checkpoint
  // recovery; see SPARK-4835 for more details.
  PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
    job.run()
  }
  _eventLoop = eventLoop
  if (_eventLoop != null) {
    _eventLoop.post(JobCompleted(job, clock.getTimeMillis()))
  }
} else {
  // JobScheduler has been stopped.
}
...
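The post calls above and the processEvent handler below are connected through Spark's internal event loop. As a rough, simplified sketch of that pattern (not Spark's actual EventLoop class): producers put events on a queue, and a single background thread drains the queue and dispatches each event to a handler:

import java.util.concurrent.LinkedBlockingQueue

// Simplified stand-ins for the JobSchedulerEvent messages used above.
sealed trait SchedulerEvent
case class JobStartedEvent(jobId: String, timeMs: Long) extends SchedulerEvent
case class JobCompletedEvent(jobId: String, timeMs: Long) extends SchedulerEvent

class SimpleEventLoop(handle: SchedulerEvent => Unit) {
  private val queue = new LinkedBlockingQueue[SchedulerEvent]()
  @volatile private var stopped = false

  private val thread = new Thread("simple-event-loop") {
    override def run(): Unit =
      try {
        while (!stopped) handle(queue.take()) // blocks until an event is posted
      } catch {
        case _: InterruptedException => // loop was stopped
      }
  }

  def start(): Unit = { thread.setDaemon(true); thread.start() }
  def post(event: SchedulerEvent): Unit = queue.put(event)
  def stop(): Unit = { stopped = true; thread.interrupt() }
}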

JobScheduler.processEvent:

private def processEvent(event: JobSchedulerEvent) {
  try {
    event match {
      case JobStarted(job, startTime) => handleJobStart(job, startTime)
      case JobCompleted(job, completedTime) => handleJobCompletion(job, completedTime)
      case ErrorReported(m, e) => handleError(m, e)
    }
  } catch {
    case e: Throwable =>
      reportError("Error in job scheduler", e)
  }
}

The JobCompleted event is handled by calling handleJobCompletion.

JobScheduler.handleJobCompletion:

private def handleJobCompletion(job: Job, completedTime: Long) {
  val jobSet = jobSets.get(job.time)
  jobSet.handleJobCompletion(job)
  job.setEndTime(completedTime)
  listenerBus.post(StreamingListenerOutputOperationCompleted(job.toOutputOperationInfo))
  logInfo("Finished job " + job.id + " from job set of time " + jobSet.time)
  if (jobSet.hasCompleted) {
    jobSets.remove(jobSet.time)
    jobGenerator.onBatchCompletion(jobSet.time)
    logInfo("Total delay: %.3f s for time %s (execution: %.3f s)".format(
      jobSet.totalDelay / 1000.0, jobSet.time.toString,
      jobSet.processingDelay / 1000.0
    ))
    listenerBus.post(StreamingListenerBatchCompleted(jobSet.toBatchInfo))
  }
  job.result match {
    case Failure(e) =>
      reportError("Error running job " + job, e)
    case _ =>
  }
}

The completed JobSet is removed from jobSets, and jobGenerator.onBatchCompletion is called.

JobGenerator.onBatchCompletion:

/**
 * Callback called when a batch has been completely processed.
 */
def onBatchCompletion(time: Time) {
  eventLoop.post(ClearMetadata(time))
}

The ClearMetadata message, like the GenerateJobs message we saw earlier, is handled in JobGenerator.processEvent.

JobGenerator.processEvent:

/** Processes all events */
private def processEvent(event: JobGeneratorEvent) {
  logDebug("Got event " + event)
  event match {
    case GenerateJobs(time) => generateJobs(time)
    case ClearMetadata(time) => clearMetadata(time)
    case DoCheckpoint(time, clearCheckpointDataLater) =>
      doCheckpoint(time, clearCheckpointDataLater)
    case ClearCheckpointData(time) => clearCheckpointData(time)
  }
}

It also contains the handler for the metadata-cleanup event (ClearMetadata).

JobGenerator.clearMetadata:

/** Clear DStream metadata for the given `time`. */
private def clearMetadata(time: Time) {
  ssc.graph.clearMetadata(time)

  // If checkpointing is enabled, then checkpoint,
  // else mark batch to be fully processed
  if (shouldCheckpoint) {
    eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = true))
  } else {
    // If checkpointing is not enabled, then delete metadata information about
    // received blocks (block data not saved in any case). Otherwise, wait for
    // checkpointing of this batch to complete.
    val maxRememberDuration = graph.getMaxInputStreamRememberDuration()
    jobScheduler.receiverTracker.cleanupOldBlocksAndBatches(time - maxRememberDuration)
    jobScheduler.inputInfoTracker.cleanup(time - maxRememberDuration)
    markBatchFullyProcessed(time)
  }
}

As you can see, several pieces of cleanup happen here: the DStreamGraph's metadata is cleared, and, when checkpointing is disabled, old received blocks and input info are cleaned up as well.

DStreamGraph.clearMetadata:

def clearMetadata(time: Time) {
  logDebug("Clearing metadata for time " + time)
  this.synchronized {
    outputStreams.foreach(_.clearMetadata(time))
  }
  logDebug("Cleared old metadata for time " + time)
}

Here the metadata of the output streams (typically ForeachDStreams) is cleared.

DStream.clearMetadata:

/**
 * Clear metadata that are older than `rememberDuration` of this DStream.
 * This is an internal method that should not be called directly. This default
 * implementation clears the old generated RDDs. Subclasses of DStream may override
 * this to clear their own metadata along with the generated RDDs.
 */
private[streaming] def clearMetadata(time: Time) {
  val unpersistData = ssc.conf.getBoolean("spark.streaming.unpersist", true)
  val oldRDDs = generatedRDDs.filter(_._1 <= (time - rememberDuration))
  logDebug("Clearing references to old RDDs: [" +
    oldRDDs.map(x => s"${x._1} -> ${x._2.id}").mkString(", ") + "]")
  generatedRDDs --= oldRDDs.keys
  if (unpersistData) {
    logDebug("Unpersisting old RDDs: " + oldRDDs.values.map(_.id).mkString(", "))
    oldRDDs.values.foreach { rdd =>
      rdd.unpersist(false)
      // Explicitly remove blocks of BlockRDD
      rdd match {
        case b: BlockRDD[_] =>
          logInfo("Removing blocks of RDD " + b + " of time " + time)
          b.removeBlocks()
        case _ =>
      }
    }
  }
  logDebug("Cleared " + oldRDDs.size + " RDDs that were older than " +
    (time - rememberDuration) + ": " + oldRDDs.keys.mkString(", "))
  dependencies.foreach(_.clearMetadata(time))
}

The spark.streaming.unpersist setting controls whether old RDDs are unpersisted automatically; set it to false if you want to manage cleanup manually.

If you need the generated RDDs to survive across batch durations, you can increase rememberDuration (for example via StreamingContext.remember).
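A minimal sketch of these two knobs (the values are examples only):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("CleanupConfigExample")
  .setMaster("local[2]")
  // true is the default; set to "false" to manage unpersisting yourself.
  .set("spark.streaming.unpersist", "true")

val ssc = new StreamingContext(conf, Seconds(10))

// Keep each DStream's generated RDDs for at least 5 minutes, i.e. across many batches.
ssc.remember(Minutes(5))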

In clearMetadata itself, the old RDDs are removed, and the cleanup is also propagated to the DStream's dependencies.

BlockRDD.removeBlocks:

/**
 * Remove the data blocks that this BlockRDD is made from. NOTE: This is an
 * irreversible operation, as the data in the blocks cannot be recovered back
 * once removed. Use it with caution.
 */
private[spark] def removeBlocks() {
  blockIds.foreach { blockId =>
    sparkContext.env.blockManager.master.removeBlock(blockId)
  }
  _isValid = false
}

Note:

Source: DT_大數據夢工廠 (Spark distribution customization).

For more exclusive content, follow the WeChat public account: DT_Spark.

If you are interested in big data and Spark, you can attend the free Spark public course given by Wang Jialin every evening at 20:00, YY room number: 68917580.
