A Spark Streaming application runs continuously. Without effective memory management, its memory can be exhausted very quickly.
A Spark Streaming application therefore must have its own mechanism for cleaning up objects, data, and metadata.
If you study Spark Streaming thoroughly, you will be in a position to master all kinds of Spark applications.
The objects, data, and metadata in a Spark Streaming application are produced as we operate on DStreams.
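To make this concrete, here is a minimal sketch of a streaming app (the host, port, and names are illustrative, not from the text): every operation below creates a DStream object, each DStream generates one RDD per batch interval, and the scheduler tracks metadata for the receiver blocks and jobs behind them:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MemoryDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("MemoryDemo")
    val ssc = new StreamingContext(conf, Seconds(5))
    // The input DStream's receiver stores incoming data as blocks in executor memory.
    val lines = ssc.socketTextStream("localhost", 9999)
    // Each transformation creates another DStream object; every batch interval,
    // each DStream generates an RDD and records it (in generatedRDDs, shown below).
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print() // registers a ForEachDStream as an output stream
    ssc.start()
    ssc.awaitTermination()
  }
}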
DStream:
private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()
The RDDs that a DStream generates for each batch time are stored in this generatedRDDs map.
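For reference, a simplified sketch of how this map gets populated; the real DStream.getOrCompute also validates the batch time and handles checkpointing, which is elided here:
private[streaming] def getOrCompute(time: Time): Option[RDD[T]] = {
  generatedRDDs.get(time).orElse {
    val rddOption = compute(time) // subclass-specific RDD generation
    rddOption.foreach { newRDD =>
      if (storageLevel != StorageLevel.NONE) newRDD.persist(storageLevel)
      generatedRDDs.put(time, newRDD) // remember the RDD for this batch time
    }
    rddOption
  }
}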
Persistence of a DStream:
/** Persist RDDs of this DStream with the default storage level (MEMORY_ONLY_SER) */
def persist(): DStream[T] = persist(StorageLevel.MEMORY_ONLY_SER)

/** Persist RDDs of this DStream with the default storage level (MEMORY_ONLY_SER) */
def cache(): DStream[T] = persist()
Caching a DStream is really caching the RDDs it generates.
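In application code this looks like the following (a minimal usage sketch, reusing the ssc from the example above; the storage level is just an example):
val lines = ssc.socketTextStream("localhost", 9999)
lines.cache() // each generated RDD will be persisted as MEMORY_ONLY_SER
// or pick a storage level explicitly instead of cache():
// lines.persist(StorageLevel.MEMORY_AND_DISK_SER)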
RDD creation and release should also be tied to the clock. JobGenerator:
private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
  longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")
This timer keeps posting GenerateJobs events, one per batch interval.
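RecurringTimer itself is a small utility; here is a minimal sketch of the idea (the class and names are illustrative, not Spark's actual implementation):
// Illustrative periodic timer: invokes `callback` once per `period` ms,
// aligned to multiples of the period, like JobGenerator's batch ticks.
class SimpleRecurringTimer(period: Long, callback: Long => Unit, name: String) {
  @volatile private var stopped = false
  private val thread = new Thread(s"timer-$name") {
    override def run(): Unit = {
      var nextTime = (System.currentTimeMillis() / period + 1) * period
      while (!stopped) {
        val sleepMs = nextTime - System.currentTimeMillis()
        if (sleepMs > 0) Thread.sleep(sleepMs)
        callback(nextTime) // e.g. post GenerateJobs(new Time(nextTime))
        nextTime += period
      }
    }
  }
  def start(): Unit = thread.start()
  def stop(): Unit = { stopped = true }
}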
JobScheduler's JobHandler posts a JobCompleted message when a job has finished.
JobScheduler.JobHandler.run:
...
if (_eventLoop != null) {
  _eventLoop.post(JobStarted(job, clock.getTimeMillis()))
  // Disable checks for existing output directories in jobs launched by the streaming
  // scheduler, since we may need to write output to an existing directory during checkpoint
  // recovery; see SPARK-4835 for more details.
  PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
    job.run()
  }
  // Re-read eventLoop: the scheduler may have been stopped while the job was running.
  _eventLoop = eventLoop
  if (_eventLoop != null) {
    _eventLoop.post(JobCompleted(job, clock.getTimeMillis()))
  }
} else {
  // JobScheduler has been stopped.
}
...
JobScheduler.processEvent:
private def processEvent(event: JobSchedulerEvent) {
  try {
    event match {
      case JobStarted(job, startTime) => handleJobStart(job, startTime)
      case JobCompleted(job, completedTime) => handleJobCompletion(job, completedTime)
      case ErrorReported(m, e) => handleError(m, e)
    }
  } catch {
    case e: Throwable =>
      reportError("Error in job scheduler", e)
  }
}
The JobCompleted event is handled by calling handleJobCompletion.
JobScheduler.handleJobCompletion:
private def handleJobCompletion(job: Job, completedTime: Long) {
  val jobSet = jobSets.get(job.time)
  jobSet.handleJobCompletion(job)
  job.setEndTime(completedTime)
  listenerBus.post(StreamingListenerOutputOperationCompleted(job.toOutputOperationInfo))
  logInfo("Finished job " + job.id + " from job set of time " + jobSet.time)
  if (jobSet.hasCompleted) {
    jobSets.remove(jobSet.time)
    jobGenerator.onBatchCompletion(jobSet.time)
    logInfo("Total delay: %.3f s for time %s (execution: %.3f s)".format(
      jobSet.totalDelay / 1000.0, jobSet.time.toString,
      jobSet.processingDelay / 1000.0
    ))
    listenerBus.post(StreamingListenerBatchCompleted(jobSet.toBatchInfo))
  }
  job.result match {
    case Failure(e) =>
      reportError("Error running job " + job, e)
    case _ =>
  }
}
Here the JobSet is updated and, once the whole batch's job set has completed, removed from jobSets; jobGenerator.onBatchCompletion is then called. JobSet itself is simple bookkeeping, sketched below.
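A rough sketch of the relevant parts of JobSet (an assumption about its shape, not the full class): it tracks all output jobs of one batch time, and hasCompleted flips to true when the last of them finishes, which is what triggers onBatchCompletion above:
private[streaming] case class JobSet(time: Time, jobs: Seq[Job]) {
  private val incompleteJobs = new scala.collection.mutable.HashSet[Job]()
  incompleteJobs ++= jobs

  def handleJobCompletion(job: Job): Unit = { incompleteJobs -= job }
  def hasCompleted: Boolean = incompleteJobs.isEmpty
}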
JobGenerator.onBatchCompletion:
/**
 * Callback called when a batch has been completely processed.
 */
def onBatchCompletion(time: Time) {
  eventLoop.post(ClearMetadata(time))
}
Like the GenerateJobs message we saw earlier, the ClearMetadata message is handled in JobGenerator.processEvent. So the chain is: JobCompleted -> handleJobCompletion -> onBatchCompletion -> ClearMetadata.
JobGenerator.processEvent:
/** Processes all events */
private def processEvent(event: JobGeneratorEvent) {
  logDebug("Got event " + event)
  event match {
    case GenerateJobs(time) => generateJobs(time)
    case ClearMetadata(time) => clearMetadata(time)
    case DoCheckpoint(time, clearCheckpointDataLater) =>
      doCheckpoint(time, clearCheckpointDataLater)
    case ClearCheckpointData(time) => clearCheckpointData(time)
  }
}
Among its cases is the handler for the metadata-cleanup event (ClearMetadata).
JobGenerator.clearMetadata:
/** Clear DStream metadata for the given `time`. */
private def clearMetadata(time: Time) {
  ssc.graph.clearMetadata(time)

  // If checkpointing is enabled, then checkpoint,
  // else mark batch to be fully processed.
  if (shouldCheckpoint) {
    eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = true))
  } else {
    // If checkpointing is not enabled, then delete metadata information about
    // received blocks (block data not saved in any case). Otherwise, wait for
    // checkpointing of this batch to complete.
    val maxRememberDuration = graph.getMaxInputStreamRememberDuration()
    jobScheduler.receiverTracker.cleanupOldBlocksAndBatches(time - maxRememberDuration)
    jobScheduler.inputInfoTracker.cleanup(time - maxRememberDuration)
    markBatchFullyProcessed(time)
  }
}
As you can see, several things are cleaned up here: the DStream graph's metadata, old receiver blocks and batches, and the input info tracker's records.
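The maxRememberDuration above is, roughly, the largest rememberDuration across all input streams. A sketch of what DStreamGraph computes (the null filter reflects my assumption that unused inputs are skipped):
def getMaxInputStreamRememberDuration(): Duration = {
  // Keep received blocks and batch metadata at least as long as the
  // slowest-forgetting input stream still needs them.
  inputStreams.map(_.rememberDuration).filter(_ != null).maxBy(_.milliseconds)
}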
DStreamGraph.clearMetadata:
def clearMetadata(time: Time) {
  logDebug("Clearing metadata for time " + time)
  this.synchronized {
    outputStreams.foreach(_.clearMetadata(time))
  }
  logDebug("Cleared old metadata for time " + time)
}
This clears the registered output streams (the ForEachDStream instances); as we will see next, each DStream also clears its dependencies recursively.
DStream.clearMetadata:
/**
 * Clear metadata that are older than `rememberDuration` of this DStream.
 * This is an internal method that should not be called directly. This default
 * implementation clears the old generated RDDs. Subclasses of DStream may override
 * this to clear their own metadata along with the generated RDDs.
 */
private[streaming] def clearMetadata(time: Time) {
  val unpersistData = ssc.conf.getBoolean("spark.streaming.unpersist", true)
  val oldRDDs = generatedRDDs.filter(_._1 <= (time - rememberDuration))
  logDebug("Clearing references to old RDDs: [" +
    oldRDDs.map(x => s"${x._1} -> ${x._2.id}").mkString(", ") + "]")
  generatedRDDs --= oldRDDs.keys
  if (unpersistData) {
    logDebug("Unpersisting old RDDs: " + oldRDDs.values.map(_.id).mkString(", "))
    oldRDDs.values.foreach { rdd =>
      rdd.unpersist(false)
      // Explicitly remove blocks of BlockRDD
      rdd match {
        case b: BlockRDD[_] =>
          logInfo("Removing blocks of RDD " + b + " of time " + time)
          b.removeBlocks()
        case _ =>
      }
    }
  }
  logDebug("Cleared " + oldRDDs.size + " RDDs that were older than " +
    (time - rememberDuration) + ": " + oldRDDs.keys.mkString(", "))
  dependencies.foreach(_.clearMetadata(time))
}
The spark.streaming.unpersist setting (true by default) controls whether old RDDs are unpersisted automatically here; set it to false if you want to manage their cleanup yourself.
If you need RDDs to stay reachable across batch durations, increase rememberDuration.
So the old RDDs are dropped from generatedRDDs and unpersisted, with BlockRDD blocks removed explicitly, and the dependencies are cleaned recursively. A usage sketch follows.
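A minimal sketch of how to tune both knobs in an application (the values are illustrative):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("CleanupDemo")
  .set("spark.streaming.unpersist", "true") // default; "false" means you unpersist yourself
val ssc = new StreamingContext(conf, Seconds(5))
// Keep each DStream's generated RDDs reachable for at least one minute,
// i.e. across multiple 5-second batch durations.
ssc.remember(Minutes(1))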
BlockRDD.removeBlocks:
/**
 * Remove the data blocks that this BlockRDD is made from. NOTE: This is an
 * irreversible operation, as the data in the blocks cannot be recovered back
 * once removed. Use it with caution.
 */
private[spark] def removeBlocks() {
  blockIds.foreach { blockId =>
    sparkContext.env.blockManager.master.removeBlock(blockId)
  }
  _isValid = false
}