Exercise 3
Goal
- Building on Exercise 2, crawl the data on every page by following the pagination links.
First, let's grab the link to the next page in the Scrapy shell:
>>> response.css('.next a::attr(href)').extract_first()
'?page=2'
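On the last page there is no .next link, so extract_first() returns None and the shell prints nothing; this is why the parse method below guards with a None check:

>>> response.css('.next a::attr(href)').extract_first()
>>>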
Next, modify the parse method:
def parse(self, response):
    # ... item extraction from Exercise 2 goes here ...
    next_page = response.css('.next a::attr(href)').extract_first()
    if next_page is not None:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse)
When parse yields a Request, Scrapy schedules that URL for crawling and calls the given callback once its response arrives.
In the code we use response.urljoin(next_page). This method joins a relative URL against the URL of the current response and returns an absolute URL; scrapy.Request only accepts absolute URLs.
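For example, if the page currently being crawled were https://example.com/news (a placeholder URL, not the actual site), urljoin would resolve the relative link like this:

>>> response.urljoin('?page=2')
'https://example.com/news?page=2'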
A shortcut
Besides yielding scrapy.Request as shown above, we can also use response.follow:
def parse(self, response):
    # ItemLoader comes from scrapy.loader; NewsItem is the item class from Exercise 2
    items = response.css("div.news__item")
    for item in items:
        load = ItemLoader(item=NewsItem(), selector=item)
        load.add_css('url', "div.news__item-info h4 a::attr(href)")
        load.add_css('praise', "div.stream__item-zan span.stream__item-zan-number::text")
        load.add_css('title', "div.news__item-info h4 a::text")
        yield load.load_item()

    # follow the next-page link; no urljoin needed with response.follow
    for a in response.css('.next a'):
        yield response.follow(a, callback=self.parse)
response.follow accepts relative URLs, and takes either a string or a Selector; when given a selector for an <a> element, it automatically extracts the link from its href attribute, so neither urljoin nor manual attribute extraction is needed.
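For reference, these forms are equivalent ways to follow the same link (a sketch using the selectors from above):

# pass a string (relative URLs are fine)
next_page = response.css('.next a::attr(href)').extract_first()
if next_page is not None:
    yield response.follow(next_page, callback=self.parse)

# pass a Selector that yields a URL
for href in response.css('.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)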