Preface
In Scrapy Basics — Spider, I gave a brief introduction to the Spider class. Spider can already do quite a lot, but if you want to crawl an entire site such as Zhihu or Jianshu, you need a more powerful weapon.
CrawlSpider is built on top of Spider, and you could say it was born for whole-site crawling.
Brief overview
CrawlSpider is the go-to spider for crawling sites whose URLs follow certain patterns. It is based on Spider and adds a few attributes of its own:
- rules: a collection of Rule objects, used to match the target links on a site and filter out the noise.
- parse_start_url: handles the responses to the start URLs; it must return an Item or a Request.
Since rules is a collection of Rule objects, Rule deserves an introduction as well. Its parameters are: link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None.
The link_extractor can be something you define yourself or an instance of the built-in LinkExtractor class, whose main parameters are:
- allow: URLs matching the given regular expression(s) are extracted; if empty, everything matches.
- deny: URLs matching this regular expression (or list of regular expressions) are never extracted.
- allow_domains: domains from which links will be extracted.
- deny_domains: domains from which links will never be extracted.
- restrict_xpaths: XPath expressions that, together with allow, restrict which regions of the page links are extracted from. There is a similar restrict_css parameter.
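Before wiring a LinkExtractor into a Rule, it can help to run one by hand against a response to see what it extracts. Below is a minimal sketch; the page body and the XPath restriction are made up purely for illustration:

```python
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# A made-up page, just to show what extract_links() returns.
body = b'''<html><body>
    <div id="nav"><a href="/category.php?id=1">Category 1</a></div>
    <div id="ads"><a href="/subsection.php?id=9">Advert</a></div>
</body></html>'''
response = HtmlResponse(url='http://www.example.com', body=body, encoding='utf-8')

le = LinkExtractor(allow=(r'category\.php',),
                   deny=(r'subsection\.php',),
                   restrict_xpaths=('//div[@id="nav"]',))
for link in le.extract_links(response):
    # Prints the absolute URL and the anchor text of each matching link.
    print(link.url, link.text)
```

Here only the /category.php link survives: the /subsection.php one is rejected by deny, and it also falls outside the region selected by restrict_xpaths.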
Below is the example from the official documentation. Starting from it, I'll use the source code to walk through some common questions:
```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
```
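One caveat before digging into the internals: as we'll see below, CrawlSpider implements parse itself, so you should not override parse in your own spider. If you want to handle the responses of the start_urls themselves, override parse_start_url instead. A minimal sketch of a method you might add to MySpider above (the logging line is just for illustration):

```python
    def parse_start_url(self, response):
        # Called for each start_urls response; may return items and/or requests.
        self.logger.info('Start page: %s', response.url)
        return []
```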
Question: How does CrawlSpider work?
Because CrawlSpider inherits from Spider, it has all of Spider's methods. First, start_requests issues a request for every URL in start_urls (via make_requests_from_url), and each response is received by parse. In a plain Spider we have to define parse ourselves, but CrawlSpider already defines parse to dispatch the response: self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True).
_parse_response then does different things depending on whether a callback was passed, on follow, and on self._follow_links:
```python
def _parse_response(self, response, callback, cb_kwargs, follow=True):
    # If a callback was passed in, use it to parse the page and collect the
    # requests or items it produces.
    if callback:
        cb_res = callback(response, **cb_kwargs) or ()
        cb_res = self.process_results(response, cb_res)
        for requests_or_item in iterate_spider_output(cb_res):
            yield requests_or_item

    # Then, if following is enabled, use _requests_to_follow to extract any
    # links in the response that match the rules.
    if follow and self._follow_links:
        for request_or_item in self._requests_to_follow(response):
            yield request_or_item
```
_requests_to_follow in turn asks link_extractor (the LinkExtractor we passed in) to extract the links from the page (link_extractor.extract_links(response)), post-processes the URLs with process_links (user-defined), and issues a Request for every matching link. Each of those requests is then passed through process_request (also user-defined).
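To make those two hooks concrete, here is a hedged sketch of a Rule that uses both. The spider, the method names and the filtering logic are invented for the example; note that in the version of the source quoted in this post, process_request receives only the request (newer Scrapy versions also pass the response):

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class HooksDemoSpider(CrawlSpider):
    name = 'hooks_demo'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        Rule(LinkExtractor(allow=(r'item\.php',)),
             callback='parse_item',
             process_links='drop_print_versions',   # clean up the extracted links
             process_request='tag_request'),        # tweak every generated Request
    )

    def drop_print_versions(self, links):
        # Hypothetical filter: skip "print-friendly" duplicates of item pages.
        return [link for link in links if 'print=1' not in link.url]

    def tag_request(self, request):
        # Must return the (possibly modified) request; returning None drops it.
        request.meta['from_rule'] = True
        return request

    def parse_item(self, response):
        self.logger.info('Item page: %s', response.url)
```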
Question: How does CrawlSpider obtain its rules?
CrawlSpider calls the _compile_rules method from __init__. There it makes a shallow copy of each Rule in rules and, for each copy, resolves the callback to use (callback), the link post-processing hook (process_links) and the request post-processing hook (process_request):
```python
def _compile_rules(self):
    def get_method(method):
        if callable(method):
            return method
        elif isinstance(method, six.string_types):
            return getattr(self, method, None)

    self._rules = [copy.copy(r) for r in self.rules]
    for rule in self._rules:
        rule.callback = get_method(rule.callback)
        rule.process_links = get_method(rule.process_links)
        rule.process_request = get_method(rule.process_request)
```
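Two details of this method are worth seeing in action: rules is only shallow-copied, and string names are resolved to bound methods via getattr (which is why the official example can write callback='parse_item'). A small sketch, with a made-up spider, that simply inspects the compiled rules:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class DemoSpider(CrawlSpider):
    name = 'demo'
    rules = (Rule(LinkExtractor(allow=(r'item\.php',)), callback='parse_item'),)

    def parse_item(self, response):
        pass


spider = DemoSpider()
compiled = spider._rules[0]
print(compiled.callback)             # the bound method DemoSpider.parse_item
print(DemoSpider.rules[0].callback)  # still the string 'parse_item'; only the copies were mutated
```

Also note that get_method falls back to None for an unknown name, so in this version a misspelled callback string fails silently rather than raising an error.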
So how is Rule itself defined?
```python
class Rule(object):

    def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None,
                 process_links=None, process_request=identity):
        self.link_extractor = link_extractor
        self.callback = callback
        self.cb_kwargs = cb_kwargs or {}
        self.process_links = process_links
        self.process_request = process_request
        if follow is None:
            self.follow = False if callback else True
        else:
            self.follow = follow
```
So the LinkExtractor we pass in is stored as link_extractor, and when follow is not specified it defaults to True only if there is no callback.
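Put differently, whether a rule keeps the crawl going depends on whether you gave it a callback, unless you say otherwise. A sketch of the three combinations, reusing the hypothetical URL patterns from above (author.php is equally made up):

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class FollowDemoSpider(CrawlSpider):
    name = 'follow_demo'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # No callback -> follow defaults to True: these pages are only used to find more links.
        Rule(LinkExtractor(allow=(r'category\.php',))),
        # Callback given -> follow defaults to False: item pages are parsed but not expanded further.
        Rule(LinkExtractor(allow=(r'item\.php',)), callback='parse_item'),
        # To both parse a page and keep following its links, set follow=True explicitly.
        Rule(LinkExtractor(allow=(r'author\.php',)), callback='parse_author', follow=True),
    )

    def parse_item(self, response):
        pass

    def parse_author(self, response):
        pass
```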
Responses that have a callback are handled by the function we specified; which function handles the ones without a callback?
From the walkthrough above, _parse_response handles the responses that do have a callback:

cb_res = callback(response, **cb_kwargs) or ()

while _requests_to_follow sets self._response_downloaded as the callback of every Request it creates for the matching URLs on the page:

r = Request(url=link.url, callback=self._response_downloaded)

_response_downloaded then looks up the Rule that produced the request (via response.meta['rule']) and feeds the response back into _parse_response with that rule's callback and follow setting.
How to simulate a login with CrawlSpider
Because CrawlSpider, just like Spider, kicks things off with start_requests, we can override it. The code below, adapted from Andrew_liu, shows how to simulate a login to Zhihu:
```python
# Needed at module level: from scrapy import Request, FormRequest, Selector

# Replace the original start_requests; its callback is post_login.
def start_requests(self):
    return [Request("http://www.zhihu.com/#signin",
                    meta={'cookiejar': 1},
                    callback=self.post_login)]

def post_login(self, response):
    print('Preparing login')
    # Grab the _xsrf field from the returned page; it is needed for the form
    # submission to succeed.
    xsrf = Selector(response).xpath('//input[@name="_xsrf"]/@value').extract()[0]
    print(xsrf)
    # FormRequest.from_response is a helper Scrapy provides for posting forms.
    # After a successful login, the after_login callback is invoked.
    return [FormRequest.from_response(response,  # "http://www.zhihu.com/login",
                                      meta={'cookiejar': response.meta['cookiejar']},
                                      headers=self.headers,
                                      formdata={
                                          '_xsrf': xsrf,
                                          'email': '1527927373@qq.com',
                                          'password': '321324jia'
                                      },
                                      callback=self.after_login,
                                      dont_filter=True)]

# make_requests_from_url calls parse, which hooks us back into CrawlSpider's parse.
def after_login(self, response):
    for url in self.start_urls:
        yield self.make_requests_from_url(url)
```
That covers the theory. If anything above is missing or unclear, let me know in the comments.
Next, I'll write a spider that crawls all of Jianshu's users to show how CrawlSpider is used in practice.
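As a preview, here is a rough sketch of what that Jianshu spider might look like. The URL pattern for user profile pages (/u/<slug>), the XPath, and the field names are all assumptions made for illustration:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class JianshuUserSpider(CrawlSpider):
    name = 'jianshu_users'
    allowed_domains = ['jianshu.com']
    start_urls = ['http://www.jianshu.com/']

    rules = (
        # Assumed pattern for user profile pages; follow=True keeps discovering
        # new users from the pages of users already found.
        Rule(LinkExtractor(allow=(r'/u/\w+$',)), callback='parse_user', follow=True),
    )

    def parse_user(self, response):
        # Placeholder extraction; the XPath is an assumption about the page layout.
        yield {
            'url': response.url,
            'nickname': response.xpath('//title/text()').extract_first(),
        }
```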
最后貼上Scrapy.spiders.CrawlSpider的源代碼,以便檢查
"""
This modules implements the CrawlSpider which is the recommended spider to use
for scraping typical web sites that requires crawling pages.
See documentation in docs/topics/spiders.rst
"""
import copy
import six
from scrapy.http import Request, HtmlResponse
from scrapy.utils.spider import iterate_spider_output
from scrapy.spiders import Spider
def identity(x):
return x
class Rule(object):
def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=identity):
self.link_extractor = link_extractor
self.callback = callback
self.cb_kwargs = cb_kwargs or {}
self.process_links = process_links
self.process_request = process_request
if follow is None:
self.follow = False if callback else True
else:
self.follow = follow
class CrawlSpider(Spider):
rules = ()
def __init__(self, *a, **kw):
super(CrawlSpider, self).__init__(*a, **kw)
self._compile_rules()
def parse(self, response):
return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)
def parse_start_url(self, response):
return []
def process_results(self, response, results):
return results
def _requests_to_follow(self, response):
if not isinstance(response, HtmlResponse):
return
seen = set()
for n, rule in enumerate(self._rules):
links = [lnk for lnk in rule.link_extractor.extract_links(response)
if lnk not in seen]
if links and rule.process_links:
links = rule.process_links(links)
for link in links:
seen.add(link)
r = Request(url=link.url, callback=self._response_downloaded)
r.meta.update(rule=n, link_text=link.text)
yield rule.process_request(r)
def _response_downloaded(self, response):
rule = self._rules[response.meta['rule']]
return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)
def _parse_response(self, response, callback, cb_kwargs, follow=True):
if callback:
cb_res = callback(response, **cb_kwargs) or ()
cb_res = self.process_results(response, cb_res)
for requests_or_item in iterate_spider_output(cb_res):
yield requests_or_item
if follow and self._follow_links:
for request_or_item in self._requests_to_follow(response):
yield request_or_item
def _compile_rules(self):
def get_method(method):
if callable(method):
return method
elif isinstance(method, six.string_types):
return getattr(self, method, None)
self._rules = [copy.copy(r) for r in self.rules]
for rule in self._rules:
rule.callback = get_method(rule.callback)
rule.process_links = get_method(rule.process_links)
rule.process_request = get_method(rule.process_request)
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)
spider._follow_links = crawler.settings.getbool(
'CRAWLSPIDER_FOLLOW_LINKS', True)
return spider
def set_crawler(self, crawler):
super(CrawlSpider, self).set_crawler(crawler)
self._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', True)