Always keep a curious mind; never let what you already know confine you.
I was tempted to skip writing this second post, but that won't do.
Task A: crawl Stack Overflow questions and their detail pages, then save the data as a CSV file.
Analyze the page elements with Firefox:
As you can see, every link we want to crawl can be reached from this listing page, which gives us the URLs of the target pages:
Open one of the target question pages and analyze its elements:
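Before writing the full spider, a quick way to sanity-check the selectors is Scrapy's interactive shell (a minimal sketch; the selector assumes the listing-page markup inspected in Firefox above, which may change over time):

> scrapy shell "https://stackoverflow.com/questions?sort=votes"
# inside the shell: this should print the relative URL of the top question if the selector still matches
>>> response.css('.question-summary h3 a::attr(href)').extract_first()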
網(wǎng)頁(yè)分析完成以后編寫(xiě)一簡(jiǎn)單爬蟲(chóng)進(jìn)行試驗(yàn):
import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = "Stackoverflow"
    start_urls = ["https://stackoverflow.com/questions?sort=votes"]

    def parse(self, response):
        # collect every question link on the listing page
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        # extract the fields we want from the question page
        yield {
            'title': response.css('.inner-content h1 a::text').extract()[0],
            'votes': response.css('.vote .vote-count-post::text').extract()[0],
            'body': response.css('.post-text').extract()[0],
            'tags': response.css('.post-taglist .post-tag::text').extract(),
            'link': response.url,
        }
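Here parse() turns each relative question link into an absolute URL with response.urljoin() and schedules a Request whose callback is parse_question(); every dict yielded by parse_question() becomes one scraped item, which the feed exporter later writes out as one row of the CSV file.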
Run the spider and save the results as a CSV file:
> scrapy runspider scrapy1.py -o abc.csv
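Incidentally, the -o flag picks the export format from the file extension, so the same spider can also write JSON or JSON Lines without any code changes, for example:

> scrapy runspider scrapy1.py -o abc.json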
You will then find abc.csv, already populated with data, in the working directory. The key thing to study here is how Scrapy's CSS selectors work.
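As a quick reference for the selector rules used above (a minimal sketch against the old Stack Overflow markup, which may need updating):

# '::text' selects the text nodes inside the matched elements
response.css('.post-taglist .post-tag::text').extract()            # list of tag strings
# '::attr(name)' selects the value of an attribute
response.css('.question-summary h3 a::attr(href)').extract_first() # first matching href
# with no pseudo-element, the full HTML of each matched element is returned
response.css('.post-text').extract()[0]                            # HTML of the question body
# .extract() returns a list of all matches; .extract()[0] or .extract_first() gives the first one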