Parsing nginx logs with Logstash and shipping them to S3

Background

We recently started collecting customers' browsing and visit records to lay the groundwork for future user-behavior analysis and user profiling. The data flow is shown in the figure below:

[Figure: traffic log collection pipeline]

This post covers only the nginx → S3 leg, a small part of the diagram above, using Logstash version 5.4.3.

Note: when I first asked ops to install it, the default install was version 1.4.5, and writing to S3 failed with all sorts of errors; it only worked after upgrading.


Logstash

Logstash is a log-collection tool. Installed on a server, it can parse logs with all kinds of pattern matching, making it easy to pull exactly the content you want out of messy log files. See the official tutorial for installation and usage.
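Once installed, Logstash is started by pointing it at a config file. A minimal sketch (the config path here is just an example, not from this post):

bin/logstash -f /etc/logstash/nginx-s3.conf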


Nginx logs

The log format can be tuned by ops together with the front end: if you need cookies, ops can log the Cookie header; if you need other parameters, the front end can append them to the URL. Let's look at a raw nginx log line:

192.168.199.63 - - [07/Jul/2017:20:55:38 +0800] "GET /c.gif?ezspm=4.0.2.0.0&keyword=humidifiers&catalog=SG HTTP/1.1" 304 0 "http://m2.sg.65emall.net/search?keyword=humidifiers&ezspm=4.0.2.0.0&listing_origin=20000002" "Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Mobile Safari/537.36" "140.206.112.194" 0.001 _ga=GA1.3.1284908540.1465133186; tma=228981083.10548698.1465133185510.1465133185510.1465133185510.1; bfd_g=bffd842b2b4843320000156c000106ca575405eb; G_ENABLED_IDPS=google; __zlcmid=azfbRwlP4TTFyi; ez_cookie_id=dce5aaf7-6eef-4193-b9d2-dd65dd0a2de5; 65_customer=9C7E020D4493C5A9,DPS:1dSKhm:7XnxmT6xJXDmu5h3mYdecAMwgmg; _ga=GA1.2.1284908540.1465133186; _gid=GA1.2.1865898003.1499149209

The line is space-delimited and breaks down as follows:

- 1st field: the client IP.
- 2nd and 3rd fields: user identity info, logged as "-" when absent.
- 4th field: the bracketed "[...]" server timestamp of the request.
- Next, a long quoted string with the HTTP request line, in the fixed format "method URL HTTP-version".
- Then the numeric HTTP response code.
- Then the size of the response body.
- Then, in order: the referring page URL, the user-agent string, the server IP, and the request time.
- Anything after that is cookie data.
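For reference, a log_format directive along these lines would produce a line like the one above. The actual nginx config isn't shown in this post, so treat the exact variable list as an assumption; in particular, the third quoted IP field could be $http_x_forwarded_for or a server-address variable depending on the setup:

log_format  spm  '$remote_addr - $remote_user [$time_local] "$request" '
                 '$status $body_bytes_sent "$http_referer" "$http_user_agent" '
                 '"$http_x_forwarded_for" $request_time $http_cookie';

Here $http_cookie is what carries the trailing run of cookie data.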

The trailing cookie data also has a regular structure. Taking my logs as an example:


_ga=GA1.3.1284908540.1465133186; tma=228981083.10548698.1465133185510.1465133185510.1465133185510.1; bfd_g=bffd842b2b4843320000156c000106ca575405eb; G_ENABLED_IDPS=google; __zlcmid=azfbRwlP4TTFyi; ez_cookie_id=dce5aaf7-6eef-4193-b9d2-dd65dd0a2de5; 65_customer=9C7E020D4493C5A9,DPS:1dSKhm:7XnxmT6xJXDmu5h3mYdecAMwgmg; _ga=GA1.2.1284908540.1465133186; _gid=GA1.2.1865898003.1499149209

From this I need two values: ez_cookie_id and 65_customer.

Logstash configuration

filter

Let's build a Logstash grok pattern:

grok {
  match => { "message" => "%{IP:client_ip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] \"%{WORD:method} /%{NOTSPACE:request_page} HTTP/%{NUMBER:http_version}\" %{NUMBER:server_response}" }
}

This pattern only parses as far as the server response code; if you need more fields, you can keep extending it in the same style.
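Run against the sample line above, the pattern yields roughly these fields:

client_ip       => 192.168.199.63
ident           => -
auth            => -
timestamp       => 07/Jul/2017:20:55:38 +0800
method          => GET
request_page    => c.gif?ezspm=4.0.2.0.0&keyword=humidifiers&catalog=SG
http_version    => 1.1
server_response => 304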

Next we also need to parse the URL parameters out of request_page, using the kv filter:

kv {
  source => "request_page"
  field_split => "&?"
  value_split => "="
  trim_value => ";"
  include_keys => ["ezspm","keyword","info","catalog","referrer","categoryid","productid"]
}

A URL can carry many parameter values; include_keys lets you pick out only the ones you want.
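For the sample request above, this leaves us with:

ezspm   => 4.0.2.0.0
keyword => humidifiers
catalog => SG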

In the same way, the kv filter can pull the cookies out of the raw log line:

kv {
  source => "message"
  field_split => " "
  value_split => "="
  trim_value => ";"
  include_keys => ["ez_cookie_id","65_customer"]
}
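Which, for the sample line, extracts:

ez_cookie_id => dce5aaf7-6eef-4193-b9d2-dd65dd0a2de5
65_customer  => 9C7E020D4493C5A9,DPS:1dSKhm:7XnxmT6xJXDmu5h3mYdecAMwgmg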

output

s3 {
  access_key_id => "access-id"
  secret_access_key => "access-key"
  region => "ap-southeast-1"
  prefix => "nginxlog/%{+YYYY-MM-dd}"
  bucket => "bi-spm-test"
  time_file => 60
  codec => "json_lines"
}

Note that bucket should not include the "s3://" scheme; just give the bare top-level bucket name. prefix lets you split the stored logs into folders by date.

With the configuration above, my output path ends up as:

s3://bi-spm-test/nginxlog/2017-07-07/ls.s3.....txt

Folders are cut by day, and each day's logs go into that day's folder.

date configuration

Logstash defaults to UTC, which is eight hours behind us. Take, for example, a log produced at

2017-07-07 06:00:00

Stored in Logstash, the time becomes 2017-07-06T22:00:00Z. If we don't adjust the date, the log gets filed under the 2017-07-06 folder, even though it was plainly produced on July 7.

Likewise, when COPYing from S3 later on, 2017-07-06T22:00:00Z is loaded directly as 2017-07-06 22:00:00, leaving us with inaccurate data.

Someone in the Logstash community raised exactly the same issue: the date timezone problem.

Here we use the date filter to handle the date:

date {
  locale => "en"
  match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss +0800"]
  target => "@timestamp"
  timezone => "UTC"
}

Setting timezone to UTC effectively adds eight hours on top of it: the "+0800" in the match pattern is treated as literal text rather than parsed as an offset, so the local wall-clock time is interpreted as if it were already UTC.

A log produced at 2017-07-07 06:00:00 is then stored as

2017-07-07T06:00:00Z, which is an ISO date; strictly speaking, a consumer should add eight hours back when using it. But all I need is for logs to land in the right folder and for the date to come out right when COPYing from S3, so this little trick is good enough for my purposes.
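To make the trick concrete, here is how the sample timestamp flows through the filter (as I understand the date filter's behavior):

07/Jul/2017:20:55:38 +0800          raw nginx timestamp (local time, UTC+8)
  -> "+0800" in the pattern is matched as literal text, so no offset is parsed
  -> timezone => "UTC" reads 20:55:38 as UTC
  -> @timestamp = 2017-07-07T20:55:38.000Z
  -> the S3 prefix %{+YYYY-MM-dd} resolves to nginxlog/2017-07-07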


The complete Logstash config:

input {
  file {
    path => ["/etc/logstash/test.log"]
    type => "system"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  # parse the fixed prefix of the nginx log line
  grok {
    match => { "message" => "%{IP:client_ip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] \"%{WORD:method} /%{NOTSPACE:request_page} HTTP/%{NUMBER:http_version}\" %{NUMBER:server_response}" }
  }

  # pull the two cookies we care about out of the raw line
  kv {
    source => "message"
    field_split => " "
    value_split => "="
    trim_value => ";"
    include_keys => ["ez_cookie_id","65_customer"]
  }

  # pull the tracking parameters out of the request URL
  kv {
    source => "request_page"
    field_split => "&?"
    value_split => "="
    trim_value => ";"
    include_keys => ["ezspm","keyword","info","catalog","referrer","categoryid","productid"]
  }

  urldecode {
    all_fields => true
  }

  # drop intermediate fields we no longer need
  mutate {
    remove_field => ["message","request_page","host","path","method","type","server_response","ident","auth","@version"]
  }

  # drop events that failed parsing (any tags set, e.g. _grokparsefailure)
  # and events that carry no ezspm parameter
  if [tags] { drop {} }
  if ![ezspm] { drop {} }

  # fill in defaults so every output record has the same JSON schema
  if ![65_customer] { mutate { add_field => { "65_customer" => "" } } }
  if ![categoryid]  { mutate { add_field => { "categoryid" => 0 } } }
  if ![productid]   { mutate { add_field => { "productid" => 0 } } }
  if ![keyword]     { mutate { add_field => { "keyword" => "" } } }
  if ![referrer]    { mutate { add_field => { "referrer" => "" } } }
  if ![info]        { mutate { add_field => { "info" => "" } } }

  date {
    locale => "en"
    match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss +0800"]
    target => "@timestamp"
    timezone => "UTC"
  }

  mutate { remove_field => ["timestamp"] }
}

output {
  s3 {
    access_key_id => "access_id"
    secret_access_key => "access_key"
    region => "ap-southeast-1"
    prefix => "nginxlog/%{+YYYY-MM-dd}"
    bucket => "bi-spm-test"
    time_file => 60
    codec => "json_lines"
  }
}

Here is what a collected log record finally looks like:

{"ez_cookie_id":"dce5aaf7-6eef-4193-b9d2-dd65dd0a2de5","productid":"20000000080550","65_customer":"9C7E020D4493C5A9,DPS:1dSKhm:7XnxmT6xJXDmu5h3mYdecAMwgmg","catalog":"SG","http_version":"1.1","referrer":"","@timestamp":"2017-07-07T20:55:58.000Z","ezspm":"4.20000002.22.0.0","client_ip":"192.168.199.63","keyword":"","categoryid":"0","info":""}


COPY from S3

copy dw.ods_nginx_spm
from 's3://bi-spm-test/nginxlog/2017-07-07-20/ls.s3'
region 'ap-southeast-1'
access_key_id 'access-id'
secret_access_key 'access-key'
timeformat 'auto'
format as json 's3://bi-spm-test/test1.txt';
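The last argument is a JSONPaths file (this COPY syntax is Amazon Redshift's) that maps JSON keys to table columns in order. Its actual contents aren't shown in this post; a plausible sketch matching the fields we collect would be something like the following, where the path order is an assumption and must match the column order of dw.ods_nginx_spm:

{
  "jsonpaths": [
    "$['@timestamp']",
    "$['client_ip']",
    "$['ezspm']",
    "$['keyword']",
    "$['catalog']",
    "$['categoryid']",
    "$['productid']",
    "$['info']",
    "$['referrer']",
    "$['ez_cookie_id']",
    "$['65_customer']"
  ]
}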

Let's check the result:


[Figure: loaded log results]