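Here I do a simple scrape of the short jokes (duanzi) on Jandan (jandan.net). Some jokes on Jandan get blocked, so that case needs special handling.
Otherwise the processing is routine. The full code is below.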
import requests
from lxml import etree
url = 'http://jandan.net/duan'
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Host': 'jandan.net',
    'Pragma': 'no-cache',
    'Referer': 'http://jandan.net/qa',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3100.0 Safari/537.36',
}
html = requests.get(url, headers=headers)
html.encoding = "utf-8"
# Parse the page and select every div with class "row"
root = etree.HTML(html.text)
result = root.xpath("//div[@class='row']")
for row in result:
    author = row.xpath(".//div[@class='author']/strong/text()")
    text_div = row.xpath(".//div[@class='text']")[0]
    # A blocked joke contains a <p class="bad_content">; in that case take the second <p>
    if text_div.xpath("./p[@class='bad_content']"):
        text = row.xpath(".//div[@class='text']/p[2]/text()")
    else:
        text = row.xpath(".//div[@class='text']/p/text()")
    print('Author:', author[0], 'Content:', text[0])
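A note on the XPath .//div[@class='author']/strong/text() used above: starting from a div with class row, it finds the div with class author beneath it, then goes to the strong tag inside that div and takes the text inside that tag.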
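To try the path logic without hitting the site, here is a minimal offline sketch. It makes two assumptions: the SAMPLE markup is made up to mimic the structure described above (it is not copied from jandan.net), and it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs with no extra packages. ElementTree has no text() step, so the text is read through the .text attribute; with lxml's xpath(), the trailing /text() returns the strings directly. The helper name extract is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup mimicking the row/author/text structure;
# it is NOT copied from jandan.net.
SAMPLE = """
<body>
  <div class="row">
    <div class="author"><strong>alice</strong></div>
    <div class="text"><p>first joke</p></div>
  </div>
  <div class="row">
    <div class="author"><strong>bob</strong></div>
    <div class="text">
      <p class="bad_content">hidden notice</p>
      <p>second joke</p>
    </div>
  </div>
</body>
"""


def extract(row):
    """Return (author, text) for one div.row element (hypothetical helper)."""
    # div with class "author" anywhere below the row, then its <strong> child
    author = row.find(".//div[@class='author']/strong").text
    text_div = row.find(".//div[@class='text']")
    if text_div.find("./p[@class='bad_content']") is not None:
        # Blocked entry: skip the notice and read the second <p>
        text = text_div.find("./p[2]").text
    else:
        # Normal entry: read the first <p>
        text = text_div.find("./p").text
    return author, text


root = ET.fromstring(SAMPLE.strip())
jokes = [extract(row) for row in root.findall(".//div[@class='row']")]
for author, text in jokes:
    print("Author:", author, "Content:", text)
```

The second row shows the blocked case: the p with class bad_content comes first, so the sketch reads p[2], mirroring the branch in the scraper above.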