Parse a local web page and extract each item's title, image URL, price, review count, and star rating.
The page looks like this:
作業1.2.png
Code
from bs4 import BeautifulSoup

# Raw string so the backslashes in the Windows path are not read as escapes
with open(r'D:\宣宣\homework/index.html', 'r') as wb_data:
    soup = BeautifulSoup(wb_data, 'lxml')  # parse the page content
    images = soup.select('body > div > div > div.col-md-9 > div > div > div > img')
    titles = soup.select('body > div > div > div.col-md-9 > div > div > div > div.caption > h4 > a')
    prices = soup.select('body > div > div > div.col-md-9 > div > div > div > div.caption > h4.pull-right')
    reviews = soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p.pull-right')
    stars = soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p:nth-of-type(2)')
    # print(images, titles, prices, reviews, stars, sep='\n--------------\n')

# Each list holds one entry per product, so zip() lines them up item by item
for title, image, price, review, star in zip(titles, images, prices, reviews, stars):
    data = {
        'title': title.get_text(),  # extract the text
        'image': image.get('src'),  # image URL; src is the attribute that holds it
        'price': price.get_text(),
        'review': review.get_text(),
        # count the filled-star spans inside the star paragraph
        'star': len(star.find_all("span", class_='glyphicon glyphicon-star'))
    }
    print(data)
'''
Selectors as copied from the browser, kept for reference:
body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > img
body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > div.caption > h4:nth-child(2) > a
body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > div.ratings > p:nth-child(2) > span:nth-child(3)
body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > div.ratings > p.pull-right
body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > div.caption > h4.pull-right
'''
Output
122.png
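The same extraction can also be written one product card at a time, which avoids relying on five separate lists staying aligned inside zip(). The following is only a sketch: the inline HTML is a hypothetical miniature of the page, the class name thumbnail for the card wrapper is an assumption (the selectors above only show an unnamed div there), and parse_cards is a helper name made up for illustration. It uses the built-in 'html.parser' so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page: two product cards laid out the way
# the caption/ratings selectors suggest. "thumbnail" is an assumed class name.
SAMPLE_HTML = """
<div class="col-md-9"><div class="row">
  <div class="col-sm-4"><div class="thumbnail">
    <img src="img/pic_0000.jpg">
    <div class="caption">
      <h4 class="pull-right">$24.99</h4>
      <h4><a href="#">First item</a></h4>
    </div>
    <div class="ratings">
      <p class="pull-right">65 reviews</p>
      <p>
        <span class="glyphicon glyphicon-star"></span>
        <span class="glyphicon glyphicon-star"></span>
        <span class="glyphicon glyphicon-star-empty"></span>
      </p>
    </div>
  </div></div>
  <div class="col-sm-4"><div class="thumbnail">
    <img src="img/pic_0001.jpg">
    <div class="caption">
      <h4 class="pull-right">$64.99</h4>
      <h4><a href="#">Second item</a></h4>
    </div>
    <div class="ratings">
      <p class="pull-right">12 reviews</p>
      <p>
        <span class="glyphicon glyphicon-star"></span>
        <span class="glyphicon glyphicon-star-empty"></span>
        <span class="glyphicon glyphicon-star-empty"></span>
      </p>
    </div>
  </div></div>
</div></div>
"""


def parse_cards(html):
    """Return one dict per product card found in the given HTML string."""
    soup = BeautifulSoup(html, 'html.parser')
    items = []
    for card in soup.select('div.thumbnail'):
        # Every selector below is evaluated inside this single card
        items.append({
            'title': card.select_one('div.caption > h4 > a').get_text(strip=True),
            'image': card.select_one('img').get('src'),
            'price': card.select_one('div.caption > h4.pull-right').get_text(strip=True),
            'review': card.select_one('div.ratings > p.pull-right').get_text(strip=True),
            # span.glyphicon-star matches filled stars only, not glyphicon-star-empty
            'star': len(card.select('div.ratings > p:nth-of-type(2) > span.glyphicon-star')),
        })
    return items


for item in parse_cards(SAMPLE_HTML):
    print(item)
```

Because each field is looked up inside its own card, a card that is missing one element cannot shift the values of the cards after it, which is what can happen when parallel lists of different lengths are zipped together.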
Summary
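1. To scrape a web page with Python, you first need a basic understanding of the page itself: how to find the HTML behind a given image or piece of text in the browser, and then how to use Copy CSS selector to pull out the useful parts.
2. For the star rating the code uses stars = soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p:nth-of-type(2)'), while the selector copied from the browser was body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > div.ratings > p:nth-child(2) > span:nth-child(3). At first I did not remove the trailing span:nth-child(3), so star came out as 0: the selector pointed at a single span, which has no star spans inside it. Later I understood that to count all the stars the selector has to stop at the parent tag, p:nth-child(2), so that every span under it is counted. nth-child caused an error here, so it has to be changed to nth-of-type(2); p:nth-of-type(2) matches every p element that is the second p child of its parent.
3. By making mistakes over and over, comparing against the answer, and reading the documentation, my understanding of the code got deeper. Getting the code to run in the end was a real joy, and it keeps me motivated to continue learning.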
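The star-counting mistake from point 2 can be reproduced on a small fragment. This is a minimal sketch, assuming a ratings block shaped like the one the selectors describe; the HTML, the count_stars helper, and the variable names are made up for illustration. The span is picked with nth-of-type(3) instead of the copied nth-child(3), which selects the same element here because every child of the paragraph is a span.

```python
from bs4 import BeautifulSoup

# Hypothetical ratings block: a review count, then a row of four filled
# stars and one empty star.
RATINGS_HTML = """
<div class="ratings">
  <p class="pull-right">15 reviews</p>
  <p>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star-empty"></span>
  </p>
</div>
"""


def count_stars(node):
    """Count filled-star spans nested inside the given tag."""
    return len(node.find_all('span', class_='glyphicon glyphicon-star'))


soup = BeautifulSoup(RATINGS_HTML, 'html.parser')

# Selector that goes one level too deep: it returns a single <span>,
# and find_all() only searches below that tag, where nothing is nested.
single_span = soup.select_one('div.ratings > p:nth-of-type(2) > span:nth-of-type(3)')
stars_from_span = count_stars(single_span)

# Selector that stops at the parent <p>: all star spans are below it.
star_row = soup.select_one('div.ratings > p:nth-of-type(2)')
stars_from_row = count_stars(star_row)

print(stars_from_span, stars_from_row)  # prints: 0 4
```

The empty star is not counted because its class attribute is glyphicon glyphicon-star-empty, which does not equal the string passed to class_.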