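Background
Internet companies and data-analysis firms often use crawlers to fetch pages and then analyze the data. Crawlers deal with many kinds of data. Since I have only worked with Java, this post covers only a simple Java crawler. Thanks to everyone who has shared material on the web. Based on my own experience, here is a small Java demo.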
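Crawling a page
Many sites now use anti-crawler measures, and only companies with fairly powerful servers (to put it loosely) can tolerate crawlers fetching their pages. Pages such as Sina Weibo's are hard to crawl. This post takes an easy-to-crawl page as its example and fetches it with a GET request. The URL is:
http://top.baidu.com/buzz?b=259&c=9&fr=topbuzz_b612_c9 (Baidu's historical-figures ranking. Key data such as the person's name, a short biography, related Tieba forums, and related videos are easy to pick out of this page, but this post only fetches the content and saves it to a file.) Baidu search also lets you embed a customized ranking list, which is convenient; the embedding options are shown in the figure below.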
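The demo itself stops at saving the raw page, but picking out a key field such as the person's name could be sketched roughly as follows. This is only an illustration: the class name `TitleExtractor`, the method `extractTitles`, and above all the assumption that each name sits inside an `<a class="list-title">` element are hypothetical and were not checked against the real page. A regular expression is used only to keep the sketch dependency-free; a proper HTML parser would be more robust.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original demo.
class TitleExtractor {
    // ASSUMPTION: names appear as <a ... class="list-title" ...>name</a>.
    // The real page markup may differ.
    private static final Pattern TITLE = Pattern.compile(
            "<a\\s[^>]*class=\"list-title\"[^>]*>([^<]*)</a>");

    // Returns the trimmed text of every anchor that matches the assumed markup.
    static List<String> extractTitles(String html) {
        List<String> titles = new ArrayList<>();
        Matcher m = TITLE.matcher(html);
        while (m.find()) {
            titles.add(m.group(1).trim());
        }
        return titles;
    }

    public static void main(String[] args) {
        // Made-up sample input that follows the assumed markup.
        String sample = "<td><a class=\"list-title\" href=\"#\">Example Name</a></td>";
        System.out.println(extractTitles(sample));
    }
}
```

In the demo, the string returned by `HttpRequestBean.getMethod` would be passed to `extractTitles` in place of the sample.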
Demo code
GET request class
package yuxiSoftware.cn.demo;

import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.ParseException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

/**
 * Fetches page content with a GET request.
 * @author yuxiSoftware
 */
public class HttpRequestBean {
    public String getMethod(String uri) {
        // Create the default HttpClient instance; try-with-resources closes it on exit
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            // Create the GET request
            HttpGet httpGet = new HttpGet(uri);
            // Execute the request; the response is closed automatically as well
            try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                // Get the response entity
                HttpEntity entity = response.getEntity();
                // Print the response status line
                System.out.println(response.getStatusLine());
                if (entity != null) {
                    // Return the response body as a string
                    return EntityUtils.toString(entity, "gb2312");
                }
            }
        } catch (ParseException | IOException e) {
            // ClientProtocolException is a subclass of IOException, so it is covered here
            e.printStackTrace();
        }
        // Return an empty string when the request fails or the response has no body
        return "";
    }
}
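A note on the "gb2312" argument above: as far as I understand the HttpClient 4.x behavior, `EntityUtils.toString(entity, "gb2312")` uses that charset only when the response does not declare one itself. The JDK-only sketch below illustrates that "declared charset first, fallback second" logic. The class `CharsetSniffer` and its method `fromContentType` are hypothetical names made up for this illustration, not HttpClient API.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper that illustrates the fallback logic; not HttpClient API.
class CharsetSniffer {
    // Returns the charset named in a Content-Type header value,
    // or the fallback when none is declared or the name is not usable.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback
                    return fallback;
                }
            }
        }
        return fallback;
    }
}
```

With this helper, a header value of `text/html; charset=UTF-8` yields UTF-8, while a bare `text/html` yields whatever fallback the caller passes in.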
Output class for the crawled content
package yuxiSoftware.cn.demo;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/**
 * Writes a string to a file.
 * @author yuxiSoftware
 */
public class OutPutToFile {
    public void writeFile(String s, String url) {
        File f = new File(url);
        // try-with-resources closes the stream even if the write fails, and
        // avoids calling write() on a null stream when the file cannot be opened
        try (OutputStream out = new FileOutputStream(f)) {
            // getBytes() with no argument uses the platform default charset
            out.write(s.getBytes());
        } catch (IOException e) {
            // FileNotFoundException is a subclass of IOException, so it is covered here
            e.printStackTrace();
        }
    }
}
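As an alternative to the stream-based class above, the same step could be sketched with `java.nio.file.Files`, which also makes it easy to fix the output encoding. Treat this as a sketch under assumptions: the class `PageWriter` is a hypothetical name, and writing UTF-8 is my choice here, not something the original demo specifies.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical alternative to OutPutToFile; not part of the original demo.
class PageWriter {
    // Writes the content as UTF-8, creating missing parent directories first.
    static void write(String content, Path target) throws IOException {
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        // ASSUMPTION: UTF-8 output is acceptable to whatever reads the file
        Files.write(target, content.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Writes a small sample file into the system temp directory
        Path target = Paths.get(System.getProperty("java.io.tmpdir"), "crawl-demo", "page.txt");
        write("<html>sample</html>", target);
        System.out.println("Wrote " + target);
    }
}
```

Unlike the original method, this version lets the `IOException` propagate, so the caller decides how to handle a failed write.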
Main class
package yuxiSoftware.cn.demo;

/**
 * A small crawler demo.
 * @author yuxiSoftware
 */
public class WormDemo {
    public static void main(String[] args) {
        // Page to fetch
        String url = "http://top.baidu.com/buzz?b=259&c=9&fr=topbuzz_b612_c9";
        OutPutToFile outPutToFile = new OutPutToFile();
        HttpRequestBean httpRequestBean = new HttpRequestBean();
        // Fetch the page content; an empty string means the request failed
        String s = httpRequestBean.getMethod(url);
        // Output file path (Windows-specific; adjust for your machine)
        String fileurl = "C:\\Users\\yuxiSoftware\\Desktop\\百度爬取頁面.txt";
        outPutToFile.writeFile(s, fileurl);
    }
}
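Output
Note
This Java technique is provided for learning and entertainment only, not for commercial use. If anything here is inappropriate, please contact the author to have it removed. Thank you!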