Hands-on Plan 0430: Shitou's Practice Assignment
The assignment requirements are as follows:
(Image: assignment requirements)
The key HTML structure is as follows:
<div class="col-sm-4 col-lg-4 col-md-4">
    <div class="thumbnail">
        <img src="img/pic_0005_828148335519990171_c234285520ff.jpg" alt="">
        <div class="caption">
            <h4 class="pull-right">$64.99</h4>
            <h4><a href="#">New Pocket</a></h4>
            <p>This is a short description. Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
        </div>
        <div class="ratings">
            <p class="pull-right">12 reviews</p>
            <p>
                <span class="glyphicon glyphicon-star"></span>
                <span class="glyphicon glyphicon-star"></span>
                <span class="glyphicon glyphicon-star"></span>
                <span class="glyphicon glyphicon-star"></span>
                <span class="glyphicon glyphicon-star-empty"></span>
            </p>
        </div>
    </div>
</div>
The implementation is as follows:
__author__ = 'daijielei'

from bs4 import BeautifulSoup

# Load the local HTML file and parse it with BeautifulSoup
# so that the required content is easy to locate
with open('1_2_homework_required/index.html', 'r') as file:
    soup = BeautifulSoup(file.read(), 'lxml')

for item in soup.select('.thumbnail'):
    data = {}
    # The if/else checks guard against missing elements, so a block
    # without the expected tag yields '' instead of raising an error
    data['title'] = item.select('a')[0].getText() if item.select('a') else ''
    data['imgurl'] = item.select('img')[0].get('src') if item.select('img') else ''
    data['price'] = item.select('h4.pull-right')[0].getText() if item.select('h4.pull-right') else ''
    data['review'] = item.select('p.pull-right')[0].getText() if item.select('p.pull-right') else ''
    # Compute the score from the number of star tags
    data['rate'] = 1*len(item.select('.glyphicon.glyphicon-star')) + 0.5*len(item.select('.glyphicon.glyphicon-star-empty'))
    print(data)
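Result
Notes, Thoughts, and Summary
1. The HTML is organized into blocks with a very clear structure, so I first select each whole block. After that, everything to process lies inside a single block, which keeps the logic clear.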
for item in soup.select('.thumbnail'):
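To make the block-first approach easy to try, here is a minimal sketch that parses a small inline HTML string instead of the homework file. The sample markup is a trimmed-down assumption (the second product, "Sample Item", is made up), the `extract_blocks` helper is hypothetical, and the built-in `html.parser` is swapped in for `lxml` so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: two trimmed-down product blocks.
SAMPLE_HTML = """
<div class="thumbnail">
  <div class="caption">
    <h4 class="pull-right">$64.99</h4>
    <h4><a href="#">New Pocket</a></h4>
  </div>
</div>
<div class="thumbnail">
  <div class="caption">
    <h4 class="pull-right">$10.00</h4>
    <h4><a href="#">Sample Item</a></h4>
  </div>
</div>
"""


def extract_blocks(html):
    """Return one (title, price) pair per .thumbnail block."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for item in soup.select('.thumbnail'):
        # Selectors run against `item`, so they only see this block's tags.
        title = item.select('a')[0].get_text()
        price = item.select('h4.pull-right')[0].get_text()
        rows.append((title, price))
    return rows


print(extract_blocks(SAMPLE_HTML))
```

Because every selector is called on `item` rather than on `soup`, a title can never be paired with a price from a different product.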
2. When processing each block, I added conditional checks so that nothing breaks when an expected element is missing. Because I came back to do this homework after finishing the course, the code is much more concise than it would have been when I first started.
data['title'] = item.select('a')[0].getText() if item.select('a') else ''  # the if/else check guards against invalid scraped data
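The same guard can be factored into small helpers so that each selector is evaluated only once. This is just a sketch of one possible refactoring, not the code used in the homework: the names `first_text` and `first_attr` are hypothetical, and it relies on Beautiful Soup's `select_one`, which returns `None` when nothing matches.

```python
from bs4 import BeautifulSoup


def first_text(item, selector, default=''):
    """Text of the first tag matching `selector`, or `default` if none."""
    tag = item.select_one(selector)  # None when nothing matches
    return tag.get_text(strip=True) if tag is not None else default


def first_attr(item, selector, attr, default=''):
    """Attribute `attr` of the first matching tag, or `default` if none."""
    tag = item.select_one(selector)
    return tag.get(attr, default) if tag is not None else default


# Hypothetical block that has a title but no price and no image.
soup = BeautifulSoup(
    '<div class="thumbnail"><h4><a href="#">New Pocket</a></h4></div>',
    'html.parser',
)
item = soup.select('.thumbnail')[0]

print(first_text(item, 'a'))                    # title is present
print(repr(first_text(item, 'h4.pull-right')))  # missing price falls back to ''
print(repr(first_attr(item, 'img', 'src')))     # missing image falls back to ''
```

Compared with the inline `if ... else ''` form, the selector string appears once per field, which makes it harder for the check and the lookup to drift apart.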
3. Of the extracted fields, the rating is the hardest to get, because it cannot be read directly from a single tag. The HTML looks like this:
<div class="ratings">
    <p class="pull-right">12 reviews</p>
    <p>
        <span class="glyphicon glyphicon-star"></span>
        <span class="glyphicon glyphicon-star"></span>
        <span class="glyphicon glyphicon-star"></span>
        <span class="glyphicon glyphicon-star"></span>
        <span class="glyphicon glyphicon-star-empty"></span>
    </p>
</div>
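In practice, this means counting how many full-star tags (`glyphicon-star`) and how many half-star tags there are, then converting the counts into a score. The code treats each `glyphicon-star-empty` tag as a half star.
data['rate'] = 1*len(item.select('.glyphicon.glyphicon-star')) + 0.5*len(item.select('.glyphicon.glyphicon-star-empty'))  # compute the score from the number of stars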
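The conversion can be checked on the ratings snippet alone. The sketch below is an illustration under assumptions: `star_counts`, `rate`, and the `empty_weight` parameter are hypothetical names, and `html.parser` stands in for `lxml`. The default weight of 0.5 mirrors the line above. Since Bootstrap 3 uses `glyphicon-star-empty` for an outlined star, passing `empty_weight=0` may be closer to what the page displays, so the weight is left adjustable.

```python
from bs4 import BeautifulSoup

RATINGS_HTML = """
<div class="ratings">
  <p class="pull-right">12 reviews</p>
  <p>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star-empty"></span>
  </p>
</div>
"""


def star_counts(block):
    """Count full and empty star tags inside `block`."""
    # A class selector matches whole class names, so '.glyphicon-star'
    # does not also match 'glyphicon-star-empty'.
    full = len(block.select('.glyphicon-star'))
    empty = len(block.select('.glyphicon-star-empty'))
    return full, empty


def rate(full, empty, empty_weight=0.5):
    """Convert star counts to a score; `empty_weight` is an assumption."""
    return 1 * full + empty_weight * empty


soup = BeautifulSoup(RATINGS_HTML, 'html.parser')
full, empty = star_counts(soup)
print(full, empty)                        # 4 1
print(rate(full, empty))                  # 4.5, same weighting as the line above
print(rate(full, empty, empty_weight=0))  # 4, counting full stars only
```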