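1 Requirements Analysis

Web servers and application servers are spread across many machines, yet we still want to run statistical analysis on the Hadoop platform. How do we get their logs onto Hadoop?

Would something as simple as this do?

Use a shell script to cp the logs to a machine in the Hadoop cluster;
hadoop fs -put ... /

This approach clearly runs into a series of problems: fault tolerance, load balancing, high latency, data compression, and so on.
It cannot meet the requirement.

So why not turn to Flume?

Flume is driven by a configuration file and is designed to handle the problems above.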

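2 Flume Overview

2.1 Official Site

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
It has a simple and flexible architecture based on streaming data flows.
It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms.
It uses a simple, extensible data model that allows for online analytic applications.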

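2.2 Design Goals

Reliability
When a node fails, logs can be delivered to other nodes without being lost. Flume (in its original Flume-OG design) offers three levels of reliability guarantee, from strongest to weakest: end-to-end (on receiving data, the agent first writes the event to disk and deletes it only after the data has been delivered successfully; if delivery fails, the data can be resent), store on failure (the strategy Scribe also uses: when the receiver crashes, data is written locally and sending resumes once the receiver recovers), and best effort (data is sent to the receiver without any acknowledgement).

Scalability
Flume-OG uses a three-tier architecture of agent, collector, and storage, and each tier can be scaled horizontally.
All agents and collectors are managed centrally by the master, which makes the system easy to monitor and maintain. Multiple masters are allowed (managed and load-balanced through ZooKeeper), which avoids a single point of failure.

Manageability
All agents and collectors are managed centrally by the master, which makes the system easy to maintain. With multiple masters, Flume uses ZooKeeper and gossip to keep dynamic configuration data consistent. On the master, users can view the status of each data source or data flow, and can configure and dynamically load each data source. Flume provides two ways to manage data flows: a web interface and shell script commands.

Functional extensibility
Users can add their own agents, collectors, or storage as needed. Flume also ships with many components, including various agents (file, syslog, and so on), collectors, and storage (file, HDFS, and so on).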

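2.3 Comparison with Mainstream Alternatives

Other options include:

Logstash: the "L" in ELK (Elasticsearch, Logstash, Kibana).
Chukwa: Yahoo/Apache, written in Java; load balancing is not very good, and it is no longer maintained.
Fluentd: similar to Flume, written in Ruby.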

2.4 History

Cloudera released version 0.9.2, known as Flume-OG
2011: issue FLUME-728, a major milestone (Flume-NG); contributed to the Apache community
July 2012: version 1.0
May 2015: version 1.6
Up to now: version 1.9

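3 Core Architecture and Components

3.1 Core architecture

[Figure: Flume core architecture]

3.2 Core components

Let's also see how the official documentation describes them.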

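3.2.1 Source - collection

Specifies where the data comes from (Avro, Thrift, Spooling Directory, Kafka, Exec)

3.2.2 Channel - aggregation

Buffers the data temporarily (Memory, File, and Kafka are the most commonly used)

3.2.3 Sink - output

Writes the data to a destination (HDFS, Hive, Logger, Avro, Thrift, File, ES, HBase, Kafka, and so on)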
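
To make the three component types concrete, here is a minimal sketch of one agent that combines a Spooling Directory source, a File channel, and an HDFS sink. The agent name a1, the component names, and every local and HDFS path are assumptions chosen for illustration, not values from this article.

```properties
# Hypothetical agent "a1": spooldir source -> file channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: ingest files dropped into a local directory (path is an assumption)
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/log/app/spool
a1.sources.r1.channels = c1

# Channel: keep buffered events on disk so they survive an agent restart
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data

# Sink: write events to HDFS as plain text (NameNode address is an assumption)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# Resolve %Y-%m-%d from the agent's local time, since no timestamp header is set here
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
```

Swapping the File channel for a Memory channel would trade durability for throughput; which one fits depends on how much data loss the flow can tolerate.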

multi-agent flow

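To flow data across multiple agents or hops, the sink of the previous agent and the source of the current hop both need to be of type avro, with the sink pointing to the hostname (or IP address) and port of the source.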
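
The hop-to-hop link can be sketched as a pair of config fragments. The agent names hop1 and hop2, the host collector.example.com, port 4545, and the channel name c1 are all hypothetical; the remaining source, channel, and sink definitions of each agent are omitted.

```properties
# Upstream agent "hop1": its avro sink points at the downstream avro source
hop1.sinks = avro-out
hop1.sinks.avro-out.type = avro
hop1.sinks.avro-out.hostname = collector.example.com
hop1.sinks.avro-out.port = 4545
hop1.sinks.avro-out.channel = c1

# Downstream agent "hop2": its avro source listens on the port the sink targets
hop2.sources = avro-in
hop2.sources.avro-in.type = avro
hop2.sources.avro-in.bind = 0.0.0.0
hop2.sources.avro-in.port = 4545
hop2.sources.avro-in.channels = c1
```

The only coupling between the two agents is the hostname and port pair, so each agent would normally live in its own file on its own host.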

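Consolidation

A very common scenario in log collection is a large number of log-producing clients sending data to a few consumer agents that are attached to the storage subsystem. For example, logs collected from hundreds of web servers are sent to a dozen agents that write to an HDFS cluster.

This can be achieved in Flume by configuring a number of first-tier agents with an avro sink, all pointing to the avro source of a single agent (again, you could use thrift sources/sinks/clients in this scenario). The source on the second-tier agent consolidates the received events into a single channel, which is consumed by a sink that delivers them to the final destination.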
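
A possible shape for this two-tier layout is sketched here. The agent names web and collector, the host collector01.example.com, port 4141, the channel sizes, and the HDFS path are assumptions for illustration; the first-tier agent's own source and channel are omitted.

```properties
# First-tier agent, deployed with the same sink settings on every web server
web.sinks = to-collector
web.sinks.to-collector.type = avro
web.sinks.to-collector.hostname = collector01.example.com
web.sinks.to-collector.port = 4141
web.sinks.to-collector.channel = c1

# Second-tier agent: one avro source merges all incoming events into one channel
collector.sources = from-web
collector.channels = merged
collector.sinks = to-hdfs

collector.sources.from-web.type = avro
collector.sources.from-web.bind = 0.0.0.0
collector.sources.from-web.port = 4141
collector.sources.from-web.channels = merged

collector.channels.merged.type = memory
collector.channels.merged.capacity = 10000
collector.channels.merged.transactionCapacity = 1000

# A single sink drains the merged channel to the final destination
collector.sinks.to-hdfs.type = hdfs
collector.sinks.to-hdfs.hdfs.path = hdfs://namenode:8020/flume/weblogs
collector.sinks.to-hdfs.channel = merged
```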

Multiplexing the flow

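Flume supports multiplexing the event flow to one or more destinations. This is done by defining a flow multiplexer that can replicate or selectively route an event to one or more channels.

In the official example, a source in agent "foo" fans the flow out to three different channels. The fan-out can be replicating or multiplexing. With a replicating flow, each event is sent to all three channels. With a multiplexing flow, an event is delivered to a subset of the available channels when one of its attributes matches a preconfigured value. For example, if an event attribute called "txnType" is set to "customer", the event goes to channel1 and channel3; if it is "vendor", it goes to channel2; otherwise it goes to channel3. The mapping is set in the agent's configuration file.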
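
The txnType routing described above could be written roughly as follows. The source name r1 is hypothetical, the source type and the channel and sink definitions are omitted, and the event attribute is assumed to be carried as an event header.

```properties
# Agent "foo": one source fanning out to three channels
foo.sources = r1
foo.channels = channel1 channel2 channel3
foo.sources.r1.channels = channel1 channel2 channel3

# Multiplexing selector: route on the value of the "txnType" header
foo.sources.r1.selector.type = multiplexing
foo.sources.r1.selector.header = txnType
foo.sources.r1.selector.mapping.customer = channel1 channel3
foo.sources.r1.selector.mapping.vendor = channel2
# Events with any other value (or no txnType header) go here
foo.sources.r1.selector.default = channel3

# For a replicating flow, drop the header/mapping/default lines and use:
# foo.sources.r1.selector.type = replicating
```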

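4 Environment Setup and Deployment

4.1 System requirements

System: macOS 10.14.14
Java runtime environment: Java 1.8 or later
Memory: sufficient memory for the configurations used by sources, channels, or sinks
Disk space: sufficient disk space for the configurations used by channels or sinks
Directory permissions: read/write permissions for the directories used by the agent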

4.2 Download and install

4.3 Configuration

Check the installation path
System configuration file

export FLUME_VERSION=1.9.0
export FLUME_HOME=/usr/local/Cellar/flume/1.9.0/libexec
export FLUME_CONF_DIR=$FLUME_HOME/conf
export PATH=$FLUME_HOME/bin:$PATH

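Flume configuration file
Configure JAVA_HOME
Verify
Executable commands under bin

Installation succeeded

5 Hands-on

The core of using Flume is the configuration file:

Configure the Source
Configure the Channel
Configure the Sink
Wire them together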

5.1 Scenario 1 - Collect data from a given network port and print it to the console

Let's look at the first example on the official site

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

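a1: agent name
r1: Source name
k1: Sink name
c1: Channel name

Let's look at each component type used here.

Source: netcat

A netcat-like source that listens on a given port and turns each line of text into an event. It behaves like nc -k -l [host] [port]. In other words, it opens a specified port and listens for data. The data is expected to be newline-separated text. Each line of text is turned into a Flume event and sent through the connected channel.

In the property tables of the official documentation, required properties are shown in bold.

Sink: logger

Logs events at INFO level. Typically used for testing and debugging. This sink is the only exception that does not need the extra configuration explained in the "Logging raw data" section of the official documentation.

Channel: memory

Events are stored in an in-memory queue with a configurable maximum size. It is a good fit for flows that need higher throughput and can afford to lose the staged data if the agent fails.

Hands-on

Create the example.conf configuration

Under the conf directory

Start an agent

An agent is started with a shell script called flume-ng, located in the bin directory of the Flume distribution. You need to specify the agent name, the config directory, and the config file on the command line: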

bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template

Recap: the same command with each argument spelled out

bin/flume-ng agent \
--name a1 \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/example.conf \
-Dflume.root.logger=INFO,console

The agent now starts running the sources and sinks configured in the given properties file.

Test and verify with telnet

Note

telnet 127.0.0.1 44444

Two lines of data were sent
and the agent received them

Let's look at the received data in detail

2019-06-12 17:52:39,711 (SinkRunner-PollingRunner-DefaultSinkProcessor)
[INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] 
Event: { headers:{} body: 4A 61 76 61 45 64 67 65 0D                      JavaEdge. }

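The Event is the basic unit of data transfer in Flume.
Event = optional headers + byte array
Here the body bytes 4A 61 76 61 45 64 67 65 spell "JavaEdge", and the trailing 0D is the carriage return sent by telnet, shown as "." in the text column.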

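5.2 Scenario 2 - Monitor a file and print newly appended data to the console in real time

Exec Source

The Exec source runs a given Unix command on start-up and expects that process to continuously produce data on standard output (stderr is simply discarded unless the property logStdErr is set to true). If the process exits for any reason, the source also exits and produces no further data. This means configurations such as cat [named pipe] or tail -F [file] will produce the desired results, whereas date probably will not: the first two commands produce streams of data, while the last produces a single event and exits.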
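
The two behaviours mentioned above (discarded stderr, and a source that stops when its command exits) can both be changed through source properties. The fragment below is a sketch only: the agent name a1, source name r1, channel name c1, and the log path are assumptions, and the channel and sink definitions are omitted.

```properties
# Hypothetical exec source "r1" on agent "a1"; the log path is an assumption
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
# Log the command's stderr instead of discarding it
a1.sources.r1.logStdErr = true
# Restart the command if it exits, waiting 10000 ms between attempts
a1.sources.r1.restart = true
a1.sources.r1.restartThrottle = 10000
a1.sources.r1.channels = c1
```

Even with restart enabled, an exec source has no way to tell the command that an event was not delivered, so data can still be lost if the channel fills up or the agent dies.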

Agent selection

exec source + memory channel + logger sink

Configuration file

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /Volumes/doc/data/data.log
a1.sources.r1.shell = /bin/sh -c

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Create the configuration file under conf as shown above

Contents of data.log
Received successfully

5.3 Scenario 3 - Collect logs from server A in real time and deliver them to server B

Technology selection

exec source + memory channel + avro sink
avro source + memory channel + logger sink

Configuration files

exec-memory-avro.conf

# Name the components on this agent
exec-memory-avro.sources = exec-source
exec-memory-avro.sinks = avro-sink
exec-memory-avro.channels = memory-channel

# Describe/configure the source
exec-memory-avro.sources.exec-source.type = exec
exec-memory-avro.sources.exec-source.command = tail -F /Volumes/doc/data/data.log
exec-memory-avro.sources.exec-source.shell = /bin/sh -c

# Describe the sink
exec-memory-avro.sinks.avro-sink.type = avro
exec-memory-avro.sinks.avro-sink.hostname = localhost
exec-memory-avro.sinks.avro-sink.port = 44444

# Use a channel which buffers events in memory
exec-memory-avro.channels.memory-channel.type = memory
exec-memory-avro.channels.memory-channel.capacity = 1000
exec-memory-avro.channels.memory-channel.transactionCapacity = 100

# Bind the source and sink to the channel
exec-memory-avro.sources.exec-source.channels = memory-channel
exec-memory-avro.sinks.avro-sink.channel = memory-channel
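
The technology selection for this scenario also calls for a receiving agent (avro source + memory channel + logger sink). A minimal sketch of that side follows, assuming the agent and file name avro-memory-logger and the same localhost:44444 endpoint that the avro sink above points to; the component names are hypothetical.

```properties
# avro-memory-logger.conf (sketch of the receiving agent)
avro-memory-logger.sources = avro-source
avro-memory-logger.sinks = logger-sink
avro-memory-logger.channels = memory-channel

# Listen where exec-memory-avro's avro sink sends its events
avro-memory-logger.sources.avro-source.type = avro
avro-memory-logger.sources.avro-source.bind = localhost
avro-memory-logger.sources.avro-source.port = 44444

# Print received events to the log at INFO level
avro-memory-logger.sinks.logger-sink.type = logger

avro-memory-logger.channels.memory-channel.type = memory
avro-memory-logger.channels.memory-channel.capacity = 1000
avro-memory-logger.channels.memory-channel.transactionCapacity = 100

# Bind the source and sink to the channel
avro-memory-logger.sources.avro-source.channels = memory-channel
avro-memory-logger.sinks.logger-sink.channel = memory-channel
```

On two real machines, the sink's hostname would be server B's address and the source's bind would be an interface reachable from server A. Starting the receiving agent first is the usual order, so the avro sink has something to connect to.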

References

https://tech.meituan.com/2013/12/09/meituan-flume-log-system-architecture-and-design.html
