My recent web-scraping practice has gotten a bit obsessive. This time the target is the jokes (duanzi) on Budejie (百思不得姐).
Inspecting the first page with F12 shows that the text of every joke can be located with the CSS selector
'div.j-r-list > ul > li > div.j-r-list-c > div.j-r-list-c-desc'
All we need to do is grab each joke and write it into a txt file.
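As a quick sanity check, the selector can be exercised against a hand-written HTML snippet that mimics the page structure described above (the class names follow the real page; the joke text here is made up):

```python
from bs4 import BeautifulSoup

# A minimal stand-in for one page of the joke list.
html = """
<div class="j-r-list">
  <ul>
    <li>
      <div class="j-r-list-c">
        <div class="j-r-list-c-desc">First joke text</div>
      </div>
    </li>
    <li>
      <div class="j-r-list-c">
        <div class="j-r-list-c-desc">Second joke text</div>
      </div>
    </li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for node in soup.select('div.j-r-list > ul > li > div.j-r-list-c > div.j-r-list-c-desc'):
    print(node.get_text())
```

Each `>` in the selector is a direct-child combinator, so it only matches the description div nested exactly as shown, which keeps unrelated `j-r-list-c-desc` elements elsewhere on the page out of the results.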
The full code is below:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import time

import requests
from bs4 import BeautifulSoup as BS

budejie_url = "http://www.budejie.com/"

# Optional proxy -- replace with your own, or drop the `proxies=`
# argument below to connect directly.
proxies = {
    "http": "http://yourproxy.com:8080/",
    "https": "https://yourproxy.com:8080/",
}
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}

text_of_each_page = ""
for i in range(100):
    url_of_each_page = budejie_url + "text/" + str(i + 1)
    r = requests.get(url_of_each_page, headers=headers, proxies=proxies)
    if r.status_code != 200:
        continue
    soup = BS(r.text, "lxml")
    text_lists = soup.select('div.j-r-list > ul > li > div.j-r-list-c > div.j-r-list-c-desc')
    for text_of_duanzi in text_lists:
        text_of_each_page += text_of_duanzi.get_text()
    time.sleep(3)  # pause between pages so we don't hammer the site

with open("budejie.txt", "w", encoding="utf-8") as myfile:
    myfile.write(text_of_each_page)
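One weakness of the script is that it keeps every page's text in memory and only writes at the very end, so a crash on page 90 loses everything. A sketch of a crash-tolerant variant (the function names and output filename here are my own, not from the original script) parses each page with a small helper and appends its jokes to the file immediately:

```python
import time

import requests
from bs4 import BeautifulSoup

SELECTOR = 'div.j-r-list > ul > li > div.j-r-list-c > div.j-r-list-c-desc'

def extract_jokes(html):
    """Return the joke texts found in one page's HTML."""
    soup = BeautifulSoup(html, "html.parser")  # html.parser avoids the lxml dependency
    return [node.get_text() for node in soup.select(SELECTOR)]

def scrape_to_file(n_pages, outfile="budejie_incremental.txt"):
    """Append each page's jokes to outfile as soon as the page is
    parsed, so a crash partway through loses at most one page."""
    session = requests.Session()  # reuses one connection across pages
    session.headers["User-Agent"] = "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"
    with open(outfile, "w", encoding="utf-8") as f:
        for page in range(1, n_pages + 1):
            r = session.get("http://www.budejie.com/text/%d" % page)
            if r.status_code == 200:
                for joke in extract_jokes(r.text):
                    f.write(joke + "\n")
                f.flush()  # push this page's jokes to disk right away
            time.sleep(3)  # stay polite between requests
```

Splitting the parsing into `extract_jokes` also makes it easy to test against a saved HTML snippet without touching the network.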