Background
We recently started collecting customers' browsing and visit data to lay the groundwork for future user-behavior analysis and user profiling. The data flow is shown in the figure below:
This post covers only the nginx-to-S3 step, a small piece of the diagram above, using Logstash version 5.4.3.
Note: ops originally installed the default version, 1.4.5, and writing to S3 kept throwing errors; it only worked after upgrading.
Logstash
A tool for collecting logs. Installed on the server, it parses log files, supports all kinds of pattern matching, and makes it easy to pull exactly what you want out of messy logs. See the official tutorial for installation and usage.
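If you just want to experiment with filters before touching the real logs, a minimal throwaway config like the one below (not part of the production setup described in this post) lets you paste a line into stdin and see the parsed event printed to the console:

input  { stdin { } }
filter {
  # put the grok / kv filters you are testing here
  grok { match => { "message" => "%{IP:client_ip} %{USER:ident} %{USER:auth} %{GREEDYDATA:rest}" } }
}
output { stdout { codec => rubydebug } }

Run it with bin/logstash -f test.conf and paste a log line into the terminal.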
Nginx logs
The log format can be adjusted by ops together with the front end: if you need cookies, ops can log them; if you need other parameters, the front end can append them to the URL. Let's look at a raw nginx log line:
192.168.199.63 - - [07/Jul/2017:20:55:38 +0800] "GET /c.gif?ezspm=4.0.2.0.0&keyword=humidifiers&catalog=SG HTTP/1.1" 304 0 "http://m2.sg.65emall.net/search?keyword=humidifiers&ezspm=4.0.2.0.0&listing_origin=20000002" "Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Mobile Safari/537.36" "140.206.112.194" 0.001 _ga=GA1.3.1284908540.1465133186; tma=228981083.10548698.1465133185510.1465133185510.1465133185510.1; bfd_g=bffd842b2b4843320000156c000106ca575405eb; G_ENABLED_IDPS=google; __zlcmid=azfbRwlP4TTFyi; ez_cookie_id=dce5aaf7-6eef-4193-b9d2-dd65dd0a2de5; 65_customer=9C7E020D4493C5A9,DPS:1dSKhm:7XnxmT6xJXDmu5h3mYdecAMwgmg; _ga=GA1.2.1284908540.1465133186; _gid=GA1.2.1865898003.1499149209
The line is space-delimited and can be read as follows:
- the first field is the client IP;
- the second and third are "-": they would hold user identity information, and default to "-" when there is none;
- the fourth is the content inside "[]": the server time at which the line was logged;
- the long quoted string after that is the HTTP request, in the fixed format "method request-URL HTTP-version";
- then comes the HTTP response code (a number);
- then the size of the response body;
- after that come, in order, the URL of the current page, the browser (user-agent) string, the server IP address, and the request time;
- anything beyond that is cookie information.
The trailing cookie data also follows a regular pattern. Taking my log as an example:
_ga=GA1.3.1284908540.1465133186; tma=228981083.10548698.1465133185510.1465133185510.1465133185510.1; bfd_g=bffd842b2b4843320000156c000106ca575405eb; G_ENABLED_IDPS=google; __zlcmid=azfbRwlP4TTFyi; ez_cookie_id=dce5aaf7-6eef-4193-b9d2-dd65dd0a2de5; 65_customer=9C7E020D4493C5A9,DPS:1dSKhm:7XnxmT6xJXDmu5h3mYdecAMwgmg; _ga=GA1.2.1284908540.1465133186; _gid=GA1.2.1865898003.1499149209
I need to pull out two values from it: ez_cookie_id and 65_customer.
Logstash configuration
filter
Let's build a Logstash grok pattern:
grok {
  match => { "message" => "%{IP:client_ip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] \"%{WORD:method} /%{NOTSPACE:request_page} HTTP/%{NUMBER:http_version}\" %{NUMBER:server_response}" }
}
This pattern only parses up to the server response code; if you need more fields, keep extending it in the same format.
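As a sketch only (the extra field names here are my own, adjust them to your log format), a pattern extended to cover the rest of the sample line above, through to the cookies, could look like:

grok {
  match => { "message" => "%{IP:client_ip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] \"%{WORD:method} /%{NOTSPACE:request_page} HTTP/%{NUMBER:http_version}\" %{NUMBER:server_response} %{NUMBER:bytes} %{QS:referrer_url} %{QS:agent} %{QS:forwarded_ip} %{NUMBER:request_time} %{GREEDYDATA:cookies}" }
}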
Next, the URL parameters inside request_page also need to be parsed out; here we use the kv filter:
kv {
  source => "request_page"
  field_split => "&?"
  value_split => "="
  trim_value => ";"
  include_keys => ["ezspm","keyword","info","catalog","referrer","categoryid","productid"]
}
The URL may carry many parameters; include_keys lets you keep only the ones you actually want.
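For the sample request above, /c.gif?ezspm=4.0.2.0.0&keyword=humidifiers&catalog=SG, this filter should end up adding roughly:

ezspm    => "4.0.2.0.0"
keyword  => "humidifiers"
catalog  => "SG"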
In the same way, the kv filter can pull the cookies out of the log line:
kv {
  source => "message"
  field_split => " "
  value_split => "="
  trim_value => ";"
  include_keys => ["ez_cookie_id","65_customer"]
}
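Here field_split is a space, so each cookie pair becomes its own token, and trim_value strips the trailing ";" left over from the "; " separators. For the sample line this should yield roughly:

ez_cookie_id => "dce5aaf7-6eef-4193-b9d2-dd65dd0a2de5"
65_customer  => "9C7E020D4493C5A9,DPS:1dSKhm:7XnxmT6xJXDmu5h3mYdecAMwgmg"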
output
s3 {
  access_key_id => "access-id"
  secret_access_key => "access-key"
  region => "ap-southeast-1"
  prefix => "nginxlog/%{+YYYY-MM-dd}"
  bucket => "bi-spm-test"
  time_file => 60
  codec => "json_lines"
}
Note that bucket should not include "s3://"; just give the top-level bucket name. prefix lets you split the stored logs into folders by date.
With the configuration above, my output path ends up as:
s3://bi-spm-test/nginxlog/2017-07-07/ls.s3.....txt
Folders are split by day, and each day's logs go under that day's folder.
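If you want finer-grained folders (the COPY example at the end of this post actually reads from an hour-level folder), the prefix can include the hour as well, for example:

prefix => "nginxlog/%{+YYYY-MM-dd-HH}"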
date configuration
Logstash uses UTC by default, which is 8 hours behind our local time. For example, a log produced at
2017-07-07 06:00:00
will be stored by Logstash with the time 2017-07-06T22:00:00Z. If the date is left untouched, that log lands in the 2017-07-06 folder, even though it was clearly produced on July 7th.
And when the data is later COPYed out of S3, 2017-07-06T22:00:00Z simply turns into:
2017-07-06 22:00:00, which makes the data inaccurate.
Someone in the Logstash community raised exactly the same issue: date timezone problem
Here we use the date filter to handle the date:
date {
  locale => "en"
  match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss +0800"]
  target => "@timestamp"
  timezone => "UTC"
}
Because "+0800" is written into the match pattern as literal text rather than as a timezone token, and timezone is set to UTC, the date filter keeps the local wall-clock time instead of converting it, which effectively adds 8 hours compared with the default behaviour.
So the time 2017-07-07 06:00:00 ends up stored as
2017-07-07T06:00:00Z. That is an ISO date format, so strictly speaking you should add another 8 hours whenever you consume it; but all I need is for the log to land in the correct folder and for the date to be right when it is copied out of S3, so this little hack is good enough for my purposes.
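As a concrete illustration, using the timestamp from the sample line earlier:

07/Jul/2017:20:55:38 +0800    original nginx timestamp
2017-07-07T12:55:38Z          what the pattern "dd/MMM/yyyy:HH:mm:ss Z" would store (properly converted to UTC)
2017-07-07T20:55:38Z          what the pattern above stores (wall-clock value kept)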
The complete Logstash configuration:
input {
  file {
    path => ["/etc/logstash/test.log"]
    type => "system"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  grok {
    match => { "message" => "%{IP:client_ip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] \"%{WORD:method} /%{NOTSPACE:request_page} HTTP/%{NUMBER:http_version}\" %{NUMBER:server_response}" }
  }
  # cookies from the raw log line
  kv {
    source => "message"
    field_split => " "
    value_split => "="
    trim_value => ";"
    include_keys => ["ez_cookie_id","65_customer"]
  }
  # URL parameters from the request path
  kv {
    source => "request_page"
    field_split => "&?"
    value_split => "="
    trim_value => ";"
    include_keys => ["ezspm","keyword","info","catalog","referrer","categoryid","productid"]
  }
  urldecode {
    all_fields => true
  }
  mutate {
    remove_field => ["message","request_page","host","path","method","type","server_response","ident","auth","@version"]
  }
  # drop events that have tags set (e.g. a grok parse failure) or that carry no ezspm parameter
  if [tags] { drop {} }
  if ![ezspm] { drop {} }
  # fill in defaults so every event carries the same set of fields
  if ![65_customer] { mutate { add_field => {"65_customer" => ""} } }
  if ![categoryid] { mutate { add_field => {"categoryid" => 0} } }
  if ![productid] { mutate { add_field => {"productid" => 0} } }
  if ![keyword] { mutate { add_field => {"keyword" => ""} } }
  if ![referrer] { mutate { add_field => {"referrer" => ""} } }
  if ![info] { mutate { add_field => {"info" => ""} } }
  date {
    locale => "en"
    match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss +0800"]
    target => "@timestamp"
    timezone => "UTC"
  }
  mutate { remove_field => ["timestamp"] }
}
output {
  s3 {
    access_key_id => "access_id"
    secret_access_key => "access_key"
    region => "ap-southeast-1"
    prefix => "nginxlog/%{+YYYY-MM-dd}"
    bucket => "bi-spm-test"
    time_file => 60
    codec => "json_lines"
  }
}
The events finally collected look like this:
{"ez_cookie_id":"dce5aaf7-6eef-4193-b9d2-dd65dd0a2de5","productid":"20000000080550","65_customer":"9C7E020D4493C5A9,DPS:1dSKhm:7XnxmT6xJXDmu5h3mYdecAMwgmg","catalog":"SG","http_version":"1.1","referrer":"","@timestamp":"2017-07-07T20:55:58.000Z","ezspm":"4.20000002.22.0.0","client_ip":"192.168.199.63","keyword":"","categoryid":"0","info":""}
COPY from S3
copy dw.ods_nginx_spm
from 's3://bi-spm-test/nginxlog/2017-07-07-20/ls.s3'
REGION 'ap-southeast-1'
access_key_id 'access-id'
secret_access_key 'access-key'
timeformat 'auto'
FORMAT AS JSON 's3://bi-spm-test/test1.txt';
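The last argument, test1.txt, is a JSONPaths file (this is Redshift-style COPY syntax) that tells COPY how to map keys in the JSON events onto the target table's columns. Mine isn't shown here, but for the event above it would look roughly like the sketch below; the entries have to be listed in the same order as the columns of dw.ods_nginx_spm, so treat this as an assumption, not the actual file:

{
  "jsonpaths": [
    "$['@timestamp']",
    "$['client_ip']",
    "$['ez_cookie_id']",
    "$['65_customer']",
    "$['ezspm']",
    "$['keyword']",
    "$['catalog']",
    "$['categoryid']",
    "$['productid']",
    "$['referrer']",
    "$['info']",
    "$['http_version']"
  ]
}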
Let's take a look at the result: