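For a project, I looked into how to scrape data from web pages for analysis. jsoup alone can fetch and parse a page, but during testing I found that jsoup sometimes hit connection timeouts when fetching pages. I therefore combined httpclient with jsoup: httpclient fetches the page and jsoup parses it.
The Maven dependencies for httpclient and jsoup are as follows: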
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.3.6</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.3</version>
</dependency>
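Inspecting the target page showed that it loads its data through a POST request, so the POST request is wrapped with httpclient. Here is the code: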
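import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.http.HttpEntity;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

/**
 * Sends a form-encoded POST request and returns the response body.
 * @param url the URL to request
 * @param map the form parameters
 * @param charset the character encoding for the request and response bodies
 * @return the response body, or null if the request failed or had no body
 */
public static String doPost(String url, Map<String, String> map, String charset) {
    String result = null;
    // DefaultHttpClient is deprecated as of httpclient 4.3; HttpClients.createDefault() replaces it.
    // try-with-resources closes the client and the response so connections are not leaked.
    try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
        HttpPost httpPost = new HttpPost(url);
        // Set the form parameters
        List<NameValuePair> list = new ArrayList<>();
        for (Map.Entry<String, String> elem : map.entrySet()) {
            list.add(new BasicNameValuePair(elem.getKey(), elem.getValue()));
        }
        if (!list.isEmpty()) {
            httpPost.setEntity(new UrlEncodedFormEntity(list, charset));
        }
        try (CloseableHttpResponse response = httpClient.execute(httpPost)) {
            HttpEntity resEntity = response.getEntity();
            if (resEntity != null) {
                result = EntityUtils.toString(resEntity, charset);
            }
        }
    } catch (Exception ex) {
        ex.printStackTrace();
    }
    return result;
}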
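To make clearer what the form entity in doPost sends, here is a minimal sketch that builds an application/x-www-form-urlencoded body using only the JDK's URLEncoder. It approximates the idea and is not httpclient's own implementation. The class name FormBodySketch and the parameter names "keyword" and "page" are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class FormBodySketch {
    /** Encodes parameters as key=value pairs joined by '&', percent-encoding each part. */
    static String encode(Map<String, String> params, String charset) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(e.getKey(), charset))
                    .append('=')
                    .append(URLEncoder.encode(e.getValue(), charset));
            }
        } catch (UnsupportedEncodingException ex) {
            // Surface an unknown charset name as an unchecked exception
            throw new IllegalArgumentException("Unsupported charset: " + charset, ex);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("keyword", "data mining"); // hypothetical parameter names
        params.put("page", "1");
        System.out.println(encode(params, "UTF-8")); // prints keyword=data+mining&page=1
    }
}
```

The charset matters for non-ASCII values: the same character is percent-encoded differently under UTF-8 and GBK, so it should match what the target page expects.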
The response returned above is parsed with jsoup, using the Jsoup.parse method. The wrapper method is as follows:
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public static List<String> getElement(String content) {
    List<String> list = new ArrayList<>();
    if (content == null) {
        return list; // doPost returns null when the request fails
    }
    // Parse HTML that has already been downloaded.
    // Jsoup.connect(url).get() would fetch and parse a URL directly instead.
    Document document = Jsoup.parse(content);
    Elements tableElements = document.getElementsByClass("viewTable");
    if (tableElements.isEmpty()) {
        return list; // no element with class "viewTable" on the page
    }
    Elements trElements = tableElements.get(0).getElementsByTag("tr");
    // Start at index 1 to skip the first row (presumably the table header)
    for (int i = 1; i < trElements.size(); i++) {
        // Turn the row's space-separated text into a comma-separated line
        list.add(trElements.get(i).text().replace(" ", ","));
    }
    return list;
}
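One caveat with replacing every space by a comma is that a cell whose own text contains a space is split into several fields. If that matters for your data, one option is to collect the cell texts separately (for example by calling text() on each td element of a row) and join them afterwards. The sketch below shows only the joining step, using plain JDK classes. The class CsvRowSketch and the sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CsvRowSketch {
    /** Quotes a cell if it contains a comma, a double quote, or a line break. */
    static String escape(String cell) {
        if (cell.contains(",") || cell.contains("\"")
                || cell.contains("\n") || cell.contains("\r")) {
            return "\"" + cell.replace("\"", "\"\"") + "\"";
        }
        return cell;
    }

    /** Joins the cells of one table row into a single CSV line. */
    static String toCsvLine(List<String> cells) {
        List<String> escaped = new ArrayList<>();
        for (String cell : cells) {
            escaped.add(escape(cell));
        }
        return String.join(",", escaped);
    }

    public static void main(String[] args) {
        // Text of a whole row, as returned by text() on the tr element (made-up sample)
        String rowText = "Li Lei New York";
        System.out.println(rowText.replace(" ", ","));                      // prints Li,Lei,New,York
        // The same row with its two cells kept separate
        System.out.println(toCsvLine(Arrays.asList("Li Lei", "New York"))); // prints Li Lei,New York
    }
}
```

If no cell ever contains a space, the simpler replacement used in getElement gives the same result.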
In testing, the processed result looked like this:
[image.png: screenshot of the processed result]
The results can then be processed, stored in a database, analyzed, queried, and displayed as needed to meet your own goals.
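As one possible illustration of the processing step, the sketch below splits the comma-separated lines returned by getElement into fields and counts rows by one column. The class RowStatsSketch, the column layout, and the sample rows are all made up. The real columns depend on the target page.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RowStatsSketch {
    /** Counts how many rows share each value in the given column (0-based). */
    static Map<String, Integer> countByColumn(List<String> rows, int column) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String row : rows) {
            String[] fields = row.split(",", -1); // -1 keeps trailing empty fields
            if (column < 0 || column >= fields.length) {
                continue; // skip rows that do not have this column
            }
            counts.merge(fields[column], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up rows in a hypothetical date,city,value layout
        List<String> rows = Arrays.asList(
                "2017-08-01,Beijing,12",
                "2017-08-01,Shanghai,7",
                "2017-08-02,Beijing,9");
        System.out.println(countByColumn(rows, 1)); // prints {Beijing=2, Shanghai=1}
    }
}
```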