Flume OG
- OG: "Original Generation"
- Versions 0.9.x, or CDH3 and earlier
- Built from agent, collector, master, and other components
Flume NG
- NG: "Next/New Generation"
- Versions 1.x, or CDH4 and later
- Built from Agent, Client, and other components
Why the NG version was introduced
- Leaner code
- Simpler architecture
Flume OG basic architecture
Agent
- Collects data
- The place where a data flow originates
- Usually made up of two parts, a source and a sink
- The source obtains data, for example from text files, syslog, or HTTP;
- The sink forwards the data obtained by the source to the downstream Collector.
- Flume ships with many source and sink implementations
- syslogTcp(5440) | agentSink("localhost",35856)
- tail("/etc/service_files") | agentSink("localhost",35856)
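To make the source | sink pairing concrete, here is a minimal Python sketch of the idea. It is not Flume code: `tail_source`, `list_sink`, and `run_agent` are hypothetical names, and the sink writes to an in-memory list rather than to a collector.

```python
from typing import Callable, Iterable, Iterator, List


def tail_source(lines: Iterable[str]) -> Iterator[str]:
    """Hypothetical source: yields one event per non-empty input line."""
    for line in lines:
        stripped = line.rstrip("\n")
        if stripped:
            yield stripped


def list_sink(buffer: List[str]) -> Callable[[str], None]:
    """Hypothetical sink: appends each event to an in-memory list."""
    def deliver(event: str) -> None:
        buffer.append(event)
    return deliver


def run_agent(source: Iterator[str], sink: Callable[[str], None]) -> int:
    """Pump every event from the source into the sink and return the count."""
    count = 0
    for event in source:
        sink(event)
        count += 1
    return count


received: List[str] = []
sent = run_agent(tail_source(["GET /a\n", "\n", "GET /b\n"]), list_sink(received))
```

The agent itself holds no logic beyond moving events from one side to the other, which is why sources and sinks can be swapped freely in the one-line specs above.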
Collector
- Aggregates the results of multiple Agents
- Loads the aggregated results into a backend storage system such as HDFS or HBase
- Flume ships with many collector implementations
- collectorSource(35856) | console
- collectorSource(35856) | collectorSink("file:///tmp/flume/collected", "syslog");
- collectorSource(35856) | collectorSink("hdfs://namenode/user/flume/","syslog");
Mapping between Agents and Collectors
- Can be specified manually or matched automatically
- With automatic matching, the master balances the load across collectors.
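The slides do not say how the master balances load, so the following Python sketch is only one plausible strategy, assumed for illustration: a greedy "fewest agents so far" assignment. The function name `assign_agents` is hypothetical.

```python
from typing import Dict, List


def assign_agents(agents: List[str], collectors: List[str]) -> Dict[str, str]:
    """Assign each agent to the collector that currently serves the fewest agents."""
    if not collectors:
        raise ValueError("at least one collector is required")
    load = {c: 0 for c in collectors}
    mapping: Dict[str, str] = {}
    for agent in agents:
        # min() keeps list order on ties, so the result is deterministic.
        target = min(collectors, key=lambda c: load[c])
        mapping[agent] = target
        load[target] += 1
    return mapping


mapping = assign_agents(
    ["agentA", "agentB", "agentC", "agentD", "agentE", "agentF"],
    ["collectorA", "collectorB", "collectorC"],
)
```

With six agents and three collectors, each collector ends up serving two agents.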
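Question: why introduce the Collector?
- It aggregates Agent data, which avoids producing too many small files;
- It keeps a large number of agent connections from putting excessive load on Hadoop;
- It acts as middleware, hiding the heterogeneity between agents and Hadoop.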
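The small-files point can be illustrated with a short Python sketch. This is a toy model under stated assumptions, not Flume's collector: `BatchingCollector` and `batch_size` are hypothetical, and each inner list in `written` stands in for one output file.

```python
from typing import List


class BatchingCollector:
    """Hypothetical collector that merges events from many agents into few batches."""

    def __init__(self, batch_size: int) -> None:
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        self.batch_size = batch_size
        self.pending: List[str] = []
        self.written: List[List[str]] = []  # each inner list stands for one file

    def receive(self, agent: str, event: str) -> None:
        """Buffer an event and flush once the batch is full."""
        self.pending.append(f"{agent}\t{event}")
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Write the buffered events as a single batch, if there are any."""
        if self.pending:
            self.written.append(self.pending)
            self.pending = []


collector = BatchingCollector(batch_size=4)
for i in range(3):
    for agent in ("agentA", "agentB"):
        collector.receive(agent, f"event-{i}")
collector.flush()
```

Six events from two agents end up in two batches rather than in one file per event or per agent.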
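Master
- Manages and coordinates the configuration of agents and collectors;
- Serves as the controller of the Flume cluster;
- Tracks the final acknowledgments of data flows and notifies the agents;
- Multiple masters are usually configured to avoid a single point of failure;
- ZooKeeper is used to manage the multiple masters.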
Fault tolerance
Three reliability levels
- agentE2ESink[("machine"[,port])]
  The agent treats data as sent only after it receives an acknowledgment; otherwise it retries.
- agentDFOSink[("machine"[,port])]
  When the agent detects that an operation on the collector has failed, it writes the data to local disk and resends it once the collector recovers.
- agentBESink[("machine"[,port])]
  The most efficient level: the agent writes nothing to local disk, and if processing fails at the collector the message is simply dropped.
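The three levels differ only in what the agent does when a delivery is not acknowledged. The Python sketch below mimics that behavior under simplifying assumptions; it is not how Flume implements these sinks. `FlakyCollector`, the `send_*` helpers, the in-memory `spool` list (standing in for local disk), and the `max_retries` bound are all hypothetical.

```python
from typing import Callable, List


class FlakyCollector:
    """Hypothetical collector that rejects the first `failures` delivery attempts."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.stored: List[str] = []

    def deliver(self, event: str) -> bool:
        if self.failures > 0:
            self.failures -= 1
            return False  # no acknowledgment
        self.stored.append(event)
        return True


def send_best_effort(events: List[str], deliver: Callable[[str], bool]) -> int:
    """BE: try once per event and drop on failure. Returns the number dropped."""
    return sum(1 for e in events if not deliver(e))


def send_disk_failover(events: List[str], deliver: Callable[[str], bool],
                       spool: List[str]) -> None:
    """DFO: events that fail are kept in a local spool for later replay."""
    for e in events:
        if not deliver(e):
            spool.append(e)


def replay_spool(spool: List[str], deliver: Callable[[str], bool]) -> None:
    """Resend spooled events; keep only those that still fail."""
    spool[:] = [e for e in spool if not deliver(e)]


def send_end_to_end(events: List[str], deliver: Callable[[str], bool],
                    max_retries: int) -> List[str]:
    """E2E: retry each event until acknowledged or retries run out."""
    unacked: List[str] = []
    for e in events:
        for _ in range(max_retries + 1):
            if deliver(e):
                break
        else:
            unacked.append(e)
    return unacked


events = ["e1", "e2", "e3"]

be = FlakyCollector(failures=1)
be_dropped = send_best_effort(events, be.deliver)

dfo = FlakyCollector(failures=1)
spool: List[str] = []
send_disk_failover(events, dfo.deliver, spool)
replay_spool(spool, dfo.deliver)

e2e = FlakyCollector(failures=1)
unacked = send_end_to_end(events, e2e.deliver, max_retries=3)
```

With one transient failure, best effort loses an event, disk failover delivers everything but out of order, and end-to-end delivers everything in order at the cost of a retry.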
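Building a Flume-based data collection system
1. Both Agents and Collectors can be configured dynamically
2. Configuration can be done from the command line or the web interface
3. Command-line configuration
- On a master node that is already running, enter "flume shell" and then "connect localhost"
  For example, run: exec config a1 'tailDir("/data/logfile")' 'agentSink'
4. Web interface
- Select the node and fill in the source, sink, and other fields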
Common architecture example: topology 1
agentA : tail("/ngnix/logs") | agentSink("collector",35856);
agentB : tail("/ngnix/logs") | agentSink("collector",35856);
agentC : tail("/ngnix/logs") | agentSink("collector",35856);
agentD : tail("/ngnix/logs") | agentSink("collector",35856);
agentE : tail("/ngnix/logs") | agentSink("collector",35856);
agentF : tail("/ngnix/logs") | agentSink("collector",35856);
collector : collectorSource(35856) | collectorSink("hdfs://namenode/flume/","srcdata");
Common architecture example: topology 2
agentA : src | agentE2ESink("collectorA",35856);
agentB : src | agentE2ESink("collectorA",35856);
agentC : src | agentE2ESink("collectorB",35856);
agentD : src | agentE2ESink("collectorB",35856);
agentE : src | agentE2ESink("collectorC",35856);
agentF : src | agentE2ESink("collectorC",35856);
collectorA : collectorSource(35856) | collectorSink("hdfs://...","src");
collectorB : collectorSource(35856) | collectorSink("hdfs://...","src");
collectorC : collectorSource(35856) | collectorSink("hdfs://...","src");
Common architecture example: topology 3
agentA : src | agentE2EChain("collectorA:35853","collectorB:35853");
agentB : src | agentE2EChain("collectorA:35853","collectorC:35853");
agentC : src | agentE2EChain("collectorB:35853","collectorA:35853");
agentD : src | agentE2EChain("collectorB:35853","collectorC:35853");
agentE : src | agentE2EChain("collectorC:35853","collectorA:35853");
agentF : src | agentE2EChain("collectorC:35853","collectorB:35853");
collectorA : collectorSource(35853) | collectorSink("hdfs://...","src");
collectorB : collectorSource(35853) | collectorSink("hdfs://...","src");
collectorC : collectorSource(35853) | collectorSink("hdfs://...","src");
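In the listing above each agent names a primary collector followed by a backup. The Python sketch below shows how such a chain resolves when a collector goes down. It only models the routing decision and is not Flume's implementation; `pick_collector` and `route_all` are hypothetical helpers, and the `CHAINS` table copies the host order from the listing.

```python
from typing import Dict, List, Optional, Set

# Failover chain per agent: primary collector first, then the backup.
CHAINS: Dict[str, List[str]] = {
    "agentA": ["collectorA", "collectorB"],
    "agentB": ["collectorA", "collectorC"],
    "agentC": ["collectorB", "collectorA"],
    "agentD": ["collectorB", "collectorC"],
    "agentE": ["collectorC", "collectorA"],
    "agentF": ["collectorC", "collectorB"],
}


def pick_collector(chain: List[str], down: Set[str]) -> Optional[str]:
    """Return the first collector in the chain that is not down, or None."""
    for collector in chain:
        if collector not in down:
            return collector
    return None


def route_all(chains: Dict[str, List[str]],
              down: Set[str]) -> Dict[str, Optional[str]]:
    """Resolve every agent's chain against the set of failed collectors."""
    return {agent: pick_collector(chain, down) for agent, chain in chains.items()}


normal = route_all(CHAINS, down=set())
after_failure = route_all(CHAINS, down={"collectorA"})
```

Because the backups are staggered, losing collectorA moves agentA to collectorB and agentB to collectorC, so the two surviving collectors each serve three agents instead of one taking the whole extra load.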