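I. Data source:
We analyze the Tomcat access log of a technical forum website and compute some of the forum's key metrics, which its operators can consult when making decisions.
The system was built to obtain business-specific metrics that third-party tools cannot provide.
Each line of the log consists of five parts: visitor IP, access time, requested resource, access status (HTTP status code), and the traffic of the request.
Here is part of the data: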
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/faq.gif HTTP/1.1" 200 1127
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /data/cache/style_1_widthauto.css?y7a HTTP/1.1" 200 1292
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/hot_1.gif HTTP/1.1" 200 680
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/hot_2.gif HTTP/1.1" 200 682
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/filetype/common.gif HTTP/1.1" 200 90
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /source/plugin/wsh_wx/img/wsh_zk.css HTTP/1.1" 200 1482
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /data/cache/style_1_forum_index.css?y7a HTTP/1.1" 200 2331
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /source/plugin/wsh_wx/img/wx_jqr.gif HTTP/1.1" 200 1770
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/recommend_1.gif HTTP/1.1" 200 1030
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/logo.png HTTP/1.1" 200 4542
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /data/attachment/common/c8/common_2_verify_icon.png HTTP/1.1" 200 582
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /static/js/logging.js?y7a HTTP/1.1" 200 603
8.35.201.144 - - [30/May/2013:17:38:20 +0800] "GET /uc_server/avatar.php?uid=29331&size=middle HTTP/1.1" 301 -
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /data/cache/common_smilies_var.js?y7a HTTP/1.1" 200 3184
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/pn.png HTTP/1.1" 200 592
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/swfupload.swf?preventswfcaching=1369906718144 HTTP/1.1" 200 13333
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/editor/editor.gif HTTP/1.1" 200 13648
8.35.201.165 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/data/avatar/000/05/94/42_avatar_middle.jpg HTTP/1.1" 200 6153
8.35.201.164 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/data/avatar/000/03/13/51_avatar_middle.jpg HTTP/1.1" 200 5087
8.35.201.163 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/data/avatar/000/04/87/94_avatar_middle.jpg HTTP/1.1" 200 5117
8.35.201.165 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/data/avatar/000/01/01/03_avatar_middle.jpg HTTP/1.1" 200 5844
8.35.201.160 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/data/avatar/000/04/12/85_avatar_middle.jpg HTTP/1.1" 200 3174
8.35.201.164 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/avatar.php?uid=53635&size=middle HTTP/1.1" 301 -
8.35.201.163 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/arw_r.gif HTTP/1.1" 200 65
8.35.201.166 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/px.png HTTP/1.1" 200 210
8.35.201.144 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/pmto.gif HTTP/1.1" 200 152
8.35.201.161 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/search.png HTTP/1.1" 200 3047
8.35.201.163 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/avatar.php?uid=57232&size=middle HTTP/1.1" 301 -
8.35.201.164 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/data/avatar/000/05/83/35_avatar_middle.jpg HTTP/1.1" 200 7171
8.35.201.160 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/data/avatar/000/01/54/22_avatar_middle.jpg HTTP/1.1" 200 5396
8.35.201.166 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/arrow_top.gif HTTP/1.1" 200 51
8.35.201.160 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/arw_l.gif HTTP/1.1" 200 844
8.35.201.144 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/qmenu.png HTTP/1.1" 200 1744
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/smile.gif HTTP/1.1" 200 1662
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/sad.gif HTTP/1.1" 200 1237
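The five-part layout described above can also be captured with one regular expression. The following is only a minimal sketch and not part of the original project: the class name `AccessLogLineSketch` and the pattern itself are assumptions derived from the sample lines, and a line in any other layout simply yields `null`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AccessLogLineSketch {
    // ip, two ignored fields ("- -"), [time], "request", status, traffic
    private static final Pattern LINE = Pattern.compile(
            "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)$");

    /** Returns {ip, time, request, status, traffic}, or null if the line does not match. */
    static String[] split(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4), m.group(5) };
    }

    public static void main(String[] args) {
        // A redirect line from the sample: the traffic field is "-" instead of a number
        String sample = "8.35.201.144 - - [30/May/2013:17:38:20 +0800] "
                + "\"GET /uc_server/avatar.php?uid=29331&size=middle HTTP/1.1\" 301 -";
        System.out.println(String.join(" | ", split(sample)));
    }
}
```

Keeping the traffic field as a string is deliberate, because the sample shows that it is not always numeric.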
II. Data cleaning
1. Clean the data into the following format (IP, access time as yyyyMMddHHmmss, requested resource):
110.52.250.126 20130530173820 data/cache/style_1_widthauto.css?y7a
110.52.250.126 20130530173820 source/plugin/wsh_wx/img/wsh_zk.css
110.52.250.126 20130530173820 data/cache/style_1_forum_index.css?y7a
110.52.250.126 20130530173820 source/plugin/wsh_wx/img/wx_jqr.gif
27.19.74.143 20130530173820 data/attachment/common/c8/common_2_verify_icon.png
27.19.74.143 20130530173820 data/cache/common_smilies_var.js?y7a
2. Write the MapReduce cleaning program
2.1 Utility class
package com.neusoft;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

/**
 * Created by Administrator on 2019/1/8.
 */
public class LogParser {
    // SimpleDateFormat is not thread-safe; these shared instances assume single-threaded use
    public static final SimpleDateFormat FORMAT = new SimpleDateFormat(
            "d/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
    public static final SimpleDateFormat dateformat1 = new SimpleDateFormat(
            "yyyyMMddHHmmss");

    public static void main(String[] args) {
        final String S1 = "27.19.74.143 - - [30/May/2013:17:38:20 +0800] \"GET /static/image/common/faq.gif HTTP/1.1\" 200 1127";
        LogParser parser = new LogParser();
        final String[] array = parser.parse(S1);
        System.out.println("Sample data: " + S1);
        System.out.format(
                "Parse result: ip=%s, time=%s, url=%s, status=%s, traffic=%s",
                array[0], array[1], array[2], array[3], array[4]);
    }

    /**
     * Parses an English-locale time string.
     *
     * @param string the time string, e.g. 30/May/2013:17:38:20
     * @return the parsed date, or null if the string cannot be parsed
     */
    private Date parseDateFormat(String string) {
        Date parse = null;
        try {
            parse = FORMAT.parse(string);
        } catch (ParseException e) {
            e.printStackTrace();
        }
        return parse;
    }

    /**
     * Parses one line of the log.
     *
     * @param line one log line
     * @return an array of 5 elements: ip, time, url, status, traffic
     */
    public String[] parse(String line) {
        String ip = parseIP(line);
        String time = parseTime(line);
        String url = parseURL(line);
        String status = parseStatus(line);
        String traffic = parseTraffic(line);
        return new String[] { ip, time, url, status, traffic };
    }

    // The status code and the traffic follow the closing quote of the request
    private String parseTraffic(String line) {
        final String trim = line.substring(line.lastIndexOf("\"") + 1)
                .trim();
        String traffic = trim.split(" ")[1];
        return traffic;
    }

    private String parseStatus(String line) {
        final String trim = line.substring(line.lastIndexOf("\"") + 1)
                .trim();
        String status = trim.split(" ")[0];
        return status;
    }

    // The request ("GET /path HTTP/1.1") is the text between the first and the last quote
    private String parseURL(String line) {
        final int first = line.indexOf("\"");
        final int last = line.lastIndexOf("\"");
        String url = line.substring(first + 1, last);
        return url;
    }

    // Assumes the +0800 offset that every line of this log carries
    private String parseTime(String line) {
        final int first = line.indexOf("[");
        final int last = line.indexOf("+0800]");
        String time = line.substring(first + 1, last).trim();
        Date date = parseDateFormat(time);
        return dateformat1.format(date);
    }

    private String parseIP(String line) {
        String ip = line.split("- -")[0].trim();
        return ip;
    }
}
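The time conversion performed by the utility class (from `30/May/2013:17:38:20 +0800` to `20130530173820`) can also be written with `java.time`, whose formatters are immutable and thread-safe. This is a hedged sketch rather than the author's implementation: the class name `LogTimeSketch` and the method `toCompact` are hypothetical, and it assumes the bracketed field always has the layout seen in the sample.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

final class LogTimeSketch {
    // Input layout of the bracketed field, including the UTC offset (e.g. +0800)
    private static final DateTimeFormatter IN =
            DateTimeFormatter.ofPattern("dd/MMM/uuuu:HH:mm:ss Z", Locale.ENGLISH);
    // Compact layout used in the cleaned data
    private static final DateTimeFormatter OUT =
            DateTimeFormatter.ofPattern("uuuuMMddHHmmss");

    /** Converts the content of the [...] field to yyyyMMddHHmmss, keeping the local wall-clock time. */
    static String toCompact(String bracketContent) {
        return OffsetDateTime.parse(bracketContent, IN).format(OUT);
    }

    public static void main(String[] args) {
        System.out.println(toCompact("30/May/2013:17:38:20 +0800"));
    }
}
```

Unlike `parseDateFormat`, this version throws a `DateTimeParseException` on malformed input instead of returning `null`, so a caller has to decide whether to skip or fail on such a line.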
2.2 Mapper program
package com.neusoft;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class CleanMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    // Reused across map() calls to avoid allocating new objects for every record
    LogParser logParser = new LogParser();
    Text outputValue = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        final String[] parsed = logParser.parse(value.toString());
        // step 1: filter out requests for static resources.
        // For well-formed lines parsed[2] still ends with " HTTP/1.1" here, so the two
        // endsWith checks never match; .css/.js requests are caught by the contains() filter below.
        if (parsed[2].startsWith("GET /static/")
                || parsed[2].startsWith("GET /uc_server")
                || parsed[2].endsWith(".css")
                || parsed[2].endsWith(".js")) {
            return;
        }
        // step 2: strip the leading method prefix
        if (parsed[2].startsWith("GET /")) {
            parsed[2] = parsed[2].substring("GET /".length());
        } else if (parsed[2].startsWith("POST /")) {
            parsed[2] = parsed[2].substring("POST /".length());
        }
        // step 3: strip the trailing protocol string
        if (parsed[2].endsWith(" HTTP/1.1")) {
            parsed[2] = parsed[2].substring(0, parsed[2].length()
                    - " HTTP/1.1".length());
        }
        // Drop any remaining static resources, including URLs with a query string such as style.css?y7a
        if (parsed[2].contains(".css")
                || parsed[2].contains(".js")
                || parsed[2].contains(".jpg")
                || parsed[2].contains(".png")
                || parsed[2].contains(".gif")
                || parsed[2].contains(".jpeg")) {
            return;
        }
        // step 4: write only the first three fields (ip, time, url)
        outputValue.set(parsed[0] + "\t" + parsed[1] + "\t" + parsed[2]);
        context.write(key, outputValue);
    }
}
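The filtering and trimming steps of the mapper can be exercised without a Hadoop cluster by pulling them into a plain function. The sketch below mirrors those steps under stated assumptions: the class `RequestFilterSketch` and the method `clean` are hypothetical names, the two `endsWith` checks are left out because they do not match well-formed lines, and `null` stands for "drop this record".

```java
final class RequestFilterSketch {
    private static final String[] STATIC_MARKERS = { ".css", ".js", ".jpg", ".png", ".gif", ".jpeg" };
    private static final String PROTOCOL_SUFFIX = " HTTP/1.1";

    /** Returns the cleaned resource path, or null when the request should be dropped. */
    static String clean(String request) {
        // step 1: drop static directories and avatar requests
        if (request.startsWith("GET /static/") || request.startsWith("GET /uc_server")) {
            return null;
        }
        // step 2: strip the method prefix
        String url = request;
        if (url.startsWith("GET /")) {
            url = url.substring("GET /".length());
        } else if (url.startsWith("POST /")) {
            url = url.substring("POST /".length());
        }
        // step 3: strip the protocol suffix
        if (url.endsWith(PROTOCOL_SUFFIX)) {
            url = url.substring(0, url.length() - PROTOCOL_SUFFIX.length());
        }
        // drop remaining static resources, with or without a query string
        for (String marker : STATIC_MARKERS) {
            if (url.contains(marker)) {
                return null;
            }
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(clean("GET /member.php?mod=register HTTP/1.1"));
        System.out.println(clean("GET /data/cache/style_1_widthauto.css?y7a HTTP/1.1"));
    }
}
```

One caveat of the substring test, which the mapper shares: `contains(".js")` also matches dynamic pages whose name merely contains `.js`, for example a `.jsp` page, so a stricter check on the path extension may be preferable for other sites.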
2.3 Reducer program
package com.neusoft;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class CleanReducer extends Reducer<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void reduce(LongWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // The key is the byte offset of the line within its file, so lines from different
        // input files can share a key; write every value instead of only the first one.
        for (Text value : values) {
            context.write(value, NullWritable.get());
        }
    }
}
2.4 Driver program
package com.neusoft;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CleanDriver {
    public static void main(String[] args) throws Exception {
        // Settings for the local Windows development machine; adjust them for other environments
        System.setProperty("HADOOP_USER_NAME", "root");
        System.setProperty("hadoop.home.dir", "e:/hadoop-2.8.3");
        // Both the input path (args[0]) and the output path (args[1]) are required
        if (args == null || args.length < 2) {
            System.err.println("Usage: CleanDriver <input path> <output path>");
            return;
        }
        // Delete a previous output directory on the local file system,
        // because the job fails if the output path already exists
        com.neusoft.FileUtil.deleteDir(args[1]);
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);
        // jar
        job.setJarByClass(CleanDriver.class);
        job.setMapperClass(CleanMapper.class);
        job.setReducerClass(CleanReducer.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // Cap each input split at 1 MB
        FileInputFormat.setMaxInputSplitSize(job, 1024 * 1024);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        boolean bResult = job.waitForCompletion(true);
        System.out.println("--------------------------------");
        System.exit(bResult ? 0 : 1);
    }
}
2.5 Utility class for deleting a directory
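package com.neusoft;

import java.io.File;

/**
 * Created by bee on 3/25/17.
 */
public class FileUtil {
    /**
     * Recursively deletes a local file or directory.
     *
     * @return true if the path existed and was deleted, false otherwise
     */
    public static boolean deleteDir(String path) {
        File dir = new File(path);
        if (!dir.exists()) {
            System.out.println("File (or directory) does not exist!");
            return false;
        }
        // listFiles() returns null when the path is a regular file
        File[] children = dir.listFiles();
        if (children != null) {
            for (File f : children) {
                if (f.isDirectory()) {
                    deleteDir(f.getAbsolutePath());
                } else {
                    f.delete();
                }
            }
        }
        return dir.delete();
    }
}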
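The same recursive deletion can be expressed with `java.nio.file`, which reports failures as exceptions instead of silent `false` return values. This is a sketch under assumptions, not a replacement the original project uses: `NioDeleteSketch`, `deleteRecursively`, and `demo` are hypothetical names, and like `FileUtil` it only handles the local file system, not HDFS.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

final class NioDeleteSketch {
    /** Deletes root and everything below it; returns false if root does not exist. */
    static boolean deleteRecursively(Path root) {
        if (!Files.exists(root)) {
            return false;
        }
        // Reverse order visits children before their parent directory
        try (Stream<Path> paths = Files.walk(root)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return true;
    }

    /** Creates a temporary directory with one nested file, deletes it, and reports whether it is gone. */
    static boolean demo() {
        try {
            Path root = Files.createTempDirectory("cleaned-demo");
            Path nested = Files.createDirectories(root.resolve("a").resolve("b"));
            Files.write(nested.resolve("part-r-00000"), "x".getBytes());
            return deleteRecursively(root) && !Files.exists(root);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("deleted: " + demo());
    }
}
```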
III. Analyze the cleaned data
1. Create the table
1.1 Create a directory on HDFS to hold the external table
hadoop fs -mkdir -p /project/techbbs/cleaned
1.2 Create the external table
Enter Hive and use it to create an external table:
CREATE EXTERNAL TABLE techbbs(ip string, atime string, url string) PARTITIONED BY (logdate string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/project/techbbs/cleaned';
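1.3 Add a partition
Once the partitioned table exists, a partition has to be added to it. The statement is as follows (here the partition is for the log of a single day, 20150425):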
hive>ALTER TABLE techbbs ADD PARTITION(logdate='2015_04_25') LOCATION '/project/techbbs/cleaned/2015_04_25';
1.4 Load the data
Load the cleaned data into the table that was just created:
0: jdbc:hive2://localhost:10000> load data local inpath '/root/cleaned' into table techbbs partition(logdate='2015_04_25');
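2. Compute the statistics
2.1 Page views (PV)
Page views (PV) are the total number of pages viewed by all users; every page opened by an individual user is counted once. Here we only need to count the records in the log. The HQL is: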
0: jdbc:hive2://localhost:10000> SELECT COUNT(1) AS PV FROM techbbs WHERE logdate='2015_04_25';
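2.2 Number of registered users
The forum's registration page is member.php, and when a user clicks register the request goes to the URL member.php?mod=register. So we only need to count the log records whose URL is member.php?mod=register. The HQL is: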
0: jdbc:hive2://localhost:10000> select count(*) from techbbs where url like '%member.php?mod=register%';
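2.3 Number of unique IPs
This is the number of distinct IPs that visited the site within one day. An IP counts once no matter how many pages it visited. So we only need to count the distinct IPs in the log; in SQL this is done with the DISTINCT keyword, and HQL uses the same keyword: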
0: jdbc:hive2://localhost:10000> SELECT COUNT(DISTINCT ip) AS IP FROM techbbs WHERE logdate='2015_04_25';
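2.4 Number of bounce users
This is the number of visits that left the site after viewing only one page and did not come back. Here we group the records by the user's IP; if a group contains only one record, that IP is a bounce user. Counting these users gives the number of bounce users. The HQL is: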
0: jdbc:hive2://localhost:10000> select count(*) from (select ip,count(ip) as num from techbbs group by ip) as tmpTable where tmpTable.num = 1;
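PS: The bounce rate is the percentage of visits that left the site after viewing only one page out of all visits, that is, single-page visits / total visits. Here, dividing the number of bounce users obtained above by the PV count gives the bounce rate.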
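To check the HQL results on a small sample, the four metrics and the bounce rate can be recomputed in plain Java over cleaned records. This is a minimal sketch under assumptions: the class `ForumMetricsSketch`, its method names, and the sample records (including the URL `forum.php`) are hypothetical, each record is taken to be `ip<TAB>time<TAB>url`, and the bounce rate follows the definition used in the text (bounce users / PV).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ForumMetricsSketch {
    // Hypothetical cleaned records: ip, time, url separated by tabs
    static final List<String> SAMPLE = Arrays.asList(
            "27.19.74.143\t20130530173820\tmember.php?mod=register",
            "27.19.74.143\t20130530173821\tforum.php",
            "110.52.250.126\t20130530173820\tforum.php",
            "8.35.201.144\t20130530173821\tmember.php");

    /** Number of records per IP, the equivalent of GROUP BY ip. */
    static Map<String, Integer> hitsPerIp(List<String> records) {
        Map<String, Integer> hits = new HashMap<>();
        for (String record : records) {
            hits.merge(record.split("\t")[0], 1, Integer::sum);
        }
        return hits;
    }

    static long pageViews(List<String> records) {
        return records.size();
    }

    static long registrations(List<String> records) {
        return records.stream()
                .filter(r -> r.split("\t")[2].contains("member.php?mod=register"))
                .count();
    }

    static long uniqueIps(List<String> records) {
        return hitsPerIp(records).size();
    }

    /** IPs that appear in exactly one record. */
    static long bounceUsers(List<String> records) {
        return hitsPerIp(records).values().stream().filter(n -> n == 1).count();
    }

    static double bounceRate(List<String> records) {
        return (double) bounceUsers(records) / pageViews(records);
    }

    public static void main(String[] args) {
        System.out.println("PV=" + pageViews(SAMPLE)
                + ", registrations=" + registrations(SAMPLE)
                + ", unique IPs=" + uniqueIps(SAMPLE)
                + ", bounce users=" + bounceUsers(SAMPLE)
                + ", bounce rate=" + bounceRate(SAMPLE));
    }
}
```

Everything is held in memory here, so the sketch is only suitable for spot checks on a sample; the Hive queries remain the way to compute the metrics over the full log.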