Preface
In Scrapy Basics — Spider, I gave a brief introduction to the Spider class. Spider can already do quite a lot, but if you want to crawl an entire site such as Zhihu or Jianshu, you need a more powerful weapon.
CrawlSpider is built on top of Spider, and you could say it was born for whole-site crawling.
Brief overview
CrawlSpider is the go-to spider for crawling sites whose URLs follow certain patterns. It is based on Spider and adds a few attributes of its own:
- rules: a collection of Rule objects, used to match the target links on a site and filter out the noise.
- parse_start_url: handles the responses to the start URLs; it must return an Item or a Request.
Since rules is a collection of Rule objects, Rule deserves an introduction as well. Its parameters are: link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None.
The link_extractor can be something you define yourself or an instance of the built-in LinkExtractor class, whose main parameters are:
- allow: URLs matching the given regular expression(s) are extracted; if empty, everything matches.
- deny: URLs matching this regular expression (or list of regular expressions) are never extracted.
- allow_domains: domains from which links will be extracted.
- deny_domains: domains from which links will never be extracted.
- restrict_xpaths: XPath expressions that, together with allow, restrict which regions of the page links are extracted from. There is a similar restrict_css parameter.
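Before wiring a LinkExtractor into a Rule, it can help to run one by hand against a response to see what it extracts. Below is a minimal sketch; the page body and the XPath restriction are made up purely for illustration:

```python
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# A made-up page, just to show what extract_links() returns.
body = b'''<html><body>
    <div id="nav"><a href="/category.php?id=1">Category 1</a></div>
    <div id="ads"><a href="/subsection.php?id=9">Advert</a></div>
</body></html>'''
response = HtmlResponse(url='http://www.example.com', body=body, encoding='utf-8')

le = LinkExtractor(allow=(r'category\.php',),
                   deny=(r'subsection\.php',),
                   restrict_xpaths=('//div[@id="nav"]',))
for link in le.extract_links(response):
    # Prints the absolute URL and the anchor text of each matching link.
    print(link.url, link.text)
```

Here only the /category.php link survives: the /subsection.php one is rejected by deny, and it also falls outside the region selected by restrict_xpaths.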
Below is the example from the official documentation. Starting from it, I'll use the source code to walk through some common questions:
```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
```
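One caveat before digging into the internals: as we'll see below, CrawlSpider implements parse itself, so you should not override parse in your own spider. If you want to handle the responses of the start_urls themselves, override parse_start_url instead. A minimal sketch of a method you might add to MySpider above (the logging line is just for illustration):

```python
    def parse_start_url(self, response):
        # Called for each start_urls response; may return items and/or requests.
        self.logger.info('Start page: %s', response.url)
        return []
```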
Question: How does CrawlSpider work?
Because CrawlSpider inherits from Spider, it has all of Spider's methods. First, start_requests issues a request for every URL in start_urls (via make_requests_from_url), and each response is received by parse. In a plain Spider we have to define parse ourselves, but CrawlSpider already defines parse to dispatch the response: self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True).
_parse_response then does different things depending on whether a callback was passed, on follow, and on self._follow_links:
```python
def _parse_response(self, response, callback, cb_kwargs, follow=True):
    # If a callback was passed in, use it to parse the page and collect the
    # requests or items it produces.
    if callback:
        cb_res = callback(response, **cb_kwargs) or ()
        cb_res = self.process_results(response, cb_res)
        for requests_or_item in iterate_spider_output(cb_res):
            yield requests_or_item

    # Then, if following is enabled, use _requests_to_follow to extract any
    # links in the response that match the rules.
    if follow and self._follow_links:
        for request_or_item in self._requests_to_follow(response):
            yield request_or_item
```
_requests_to_follow in turn asks link_extractor (the LinkExtractor we passed in) to extract the links from the page (link_extractor.extract_links(response)), post-processes the URLs with process_links (user-defined), and issues a Request for every matching link. Each of those requests is then passed through process_request (also user-defined).
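To make those two hooks concrete, here is a hedged sketch of a Rule that uses both. The spider, the method names and the filtering logic are invented for the example; note that in the version of the source quoted in this post, process_request receives only the request (newer Scrapy versions also pass the response):

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class HooksDemoSpider(CrawlSpider):
    name = 'hooks_demo'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        Rule(LinkExtractor(allow=(r'item\.php',)),
             callback='parse_item',
             process_links='drop_print_versions',   # clean up the extracted links
             process_request='tag_request'),        # tweak every generated Request
    )

    def drop_print_versions(self, links):
        # Hypothetical filter: skip "print-friendly" duplicates of item pages.
        return [link for link in links if 'print=1' not in link.url]

    def tag_request(self, request):
        # Must return the (possibly modified) request; returning None drops it.
        request.meta['from_rule'] = True
        return request

    def parse_item(self, response):
        self.logger.info('Item page: %s', response.url)
```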
Question: How does CrawlSpider obtain its rules?
CrawlSpider calls the _compile_rules method from __init__. There it makes a shallow copy of each Rule in rules and, for each copy, resolves the callback to use (callback), the link post-processing hook (process_links) and the request post-processing hook (process_request):
```python
def _compile_rules(self):
    def get_method(method):
        if callable(method):
            return method
        elif isinstance(method, six.string_types):
            return getattr(self, method, None)

    self._rules = [copy.copy(r) for r in self.rules]
    for rule in self._rules:
        rule.callback = get_method(rule.callback)
        rule.process_links = get_method(rule.process_links)
        rule.process_request = get_method(rule.process_request)
```
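Two details of this method are worth seeing in action: rules is only shallow-copied, and string names are resolved to bound methods via getattr (which is why the official example can write callback='parse_item'). A small sketch, with a made-up spider, that simply inspects the compiled rules:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class DemoSpider(CrawlSpider):
    name = 'demo'
    rules = (Rule(LinkExtractor(allow=(r'item\.php',)), callback='parse_item'),)

    def parse_item(self, response):
        pass


spider = DemoSpider()
compiled = spider._rules[0]
print(compiled.callback)             # the bound method DemoSpider.parse_item
print(DemoSpider.rules[0].callback)  # still the string 'parse_item'; only the copies were mutated
```

Also note that get_method falls back to None for an unknown name, so in this version a misspelled callback string fails silently rather than raising an error.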
So how is Rule itself defined?
```python
class Rule(object):

    def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None,
                 process_links=None, process_request=identity):
        self.link_extractor = link_extractor
        self.callback = callback
        self.cb_kwargs = cb_kwargs or {}
        self.process_links = process_links
        self.process_request = process_request
        if follow is None:
            self.follow = False if callback else True
        else:
            self.follow = follow
```
So the LinkExtractor we pass in is stored as link_extractor, and when follow is not specified it defaults to True only if there is no callback.
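Put differently, whether a rule keeps the crawl going depends on whether you gave it a callback, unless you say otherwise. A sketch of the three combinations, reusing the hypothetical URL patterns from above (author.php is equally made up):

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class FollowDemoSpider(CrawlSpider):
    name = 'follow_demo'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # No callback -> follow defaults to True: these pages are only used to find more links.
        Rule(LinkExtractor(allow=(r'category\.php',))),
        # Callback given -> follow defaults to False: item pages are parsed but not expanded further.
        Rule(LinkExtractor(allow=(r'item\.php',)), callback='parse_item'),
        # To both parse a page and keep following its links, set follow=True explicitly.
        Rule(LinkExtractor(allow=(r'author\.php',)), callback='parse_author', follow=True),
    )

    def parse_item(self, response):
        pass

    def parse_author(self, response):
        pass
```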
Responses that have a callback are handled by the function we specified; which function handles the ones without a callback?
From the walkthrough above, _parse_response handles the responses that do have a callback:

cb_res = callback(response, **cb_kwargs) or ()

while _requests_to_follow sets self._response_downloaded as the callback of every Request it creates for the matching URLs on the page:

r = Request(url=link.url, callback=self._response_downloaded)

_response_downloaded then looks up the Rule that produced the request (via response.meta['rule']) and feeds the response back into _parse_response with that rule's callback and follow setting.
How to simulate a login with CrawlSpider
Because CrawlSpider, just like Spider, kicks things off with start_requests, we can override it. The code below, adapted from Andrew_liu, shows how to simulate a login to Zhihu:
```python
# Needed at module level: from scrapy import Request, FormRequest, Selector

# Replace the original start_requests; its callback is post_login.
def start_requests(self):
    return [Request("http://www.zhihu.com/#signin",
                    meta={'cookiejar': 1},
                    callback=self.post_login)]

def post_login(self, response):
    print('Preparing login')
    # Grab the _xsrf field from the returned page; it is needed for the form
    # submission to succeed.
    xsrf = Selector(response).xpath('//input[@name="_xsrf"]/@value').extract()[0]
    print(xsrf)
    # FormRequest.from_response is a helper Scrapy provides for posting forms.
    # After a successful login, the after_login callback is invoked.
    return [FormRequest.from_response(response,  # "http://www.zhihu.com/login",
                                      meta={'cookiejar': response.meta['cookiejar']},
                                      headers=self.headers,
                                      formdata={
                                          '_xsrf': xsrf,
                                          'email': '1527927373@qq.com',
                                          'password': '321324jia'
                                      },
                                      callback=self.after_login,
                                      dont_filter=True)]

# make_requests_from_url calls parse, which hooks us back into CrawlSpider's parse.
def after_login(self, response):
    for url in self.start_urls:
        yield self.make_requests_from_url(url)
```
That covers the theory. If anything above is missing or unclear, let me know in the comments.
Next, I'll write a spider that crawls all of Jianshu's users to show how CrawlSpider is used in practice.
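As a preview, here is a rough sketch of what that Jianshu spider might look like. The URL pattern for user profile pages (/u/<slug>), the XPath, and the field names are all assumptions made for illustration:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class JianshuUserSpider(CrawlSpider):
    name = 'jianshu_users'
    allowed_domains = ['jianshu.com']
    start_urls = ['http://www.jianshu.com/']

    rules = (
        # Assumed pattern for user profile pages; follow=True keeps discovering
        # new users from the pages of users already found.
        Rule(LinkExtractor(allow=(r'/u/\w+$',)), callback='parse_user', follow=True),
    )

    def parse_user(self, response):
        # Placeholder extraction; the XPath is an assumption about the page layout.
        yield {
            'url': response.url,
            'nickname': response.xpath('//title/text()').extract_first(),
        }
```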
最后貼上Scrapy.spiders.CrawlSpider的源代碼,以便檢查
"""
This modules implements the CrawlSpider which is the recommended spider to use
for scraping typical web sites that requires crawling pages.
See documentation in docs/topics/spiders.rst
"""
import copy
import six
from scrapy.http import Request, HtmlResponse
from scrapy.utils.spider import iterate_spider_output
from scrapy.spiders import Spider
def identity(x):
return x
class Rule(object):
def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=identity):
self.link_extractor = link_extractor
self.callback = callback
self.cb_kwargs = cb_kwargs or {}
self.process_links = process_links
self.process_request = process_request
if follow is None:
self.follow = False if callback else True
else:
self.follow = follow
class CrawlSpider(Spider):
rules = ()
def __init__(self, *a, **kw):
super(CrawlSpider, self).__init__(*a, **kw)
self._compile_rules()
def parse(self, response):
return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)
def parse_start_url(self, response):
return []
def process_results(self, response, results):
return results
def _requests_to_follow(self, response):
if not isinstance(response, HtmlResponse):
return
seen = set()
for n, rule in enumerate(self._rules):
links = [lnk for lnk in rule.link_extractor.extract_links(response)
if lnk not in seen]
if links and rule.process_links:
links = rule.process_links(links)
for link in links:
seen.add(link)
r = Request(url=link.url, callback=self._response_downloaded)
r.meta.update(rule=n, link_text=link.text)
yield rule.process_request(r)
def _response_downloaded(self, response):
rule = self._rules[response.meta['rule']]
return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)
def _parse_response(self, response, callback, cb_kwargs, follow=True):
if callback:
cb_res = callback(response, **cb_kwargs) or ()
cb_res = self.process_results(response, cb_res)
for requests_or_item in iterate_spider_output(cb_res):
yield requests_or_item
if follow and self._follow_links:
for request_or_item in self._requests_to_follow(response):
yield request_or_item
def _compile_rules(self):
def get_method(method):
if callable(method):
return method
elif isinstance(method, six.string_types):
return getattr(self, method, None)
self._rules = [copy.copy(r) for r in self.rules]
for rule in self._rules:
rule.callback = get_method(rule.callback)
rule.process_links = get_method(rule.process_links)
rule.process_request = get_method(rule.process_request)
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)
spider._follow_links = crawler.settings.getbool(
'CRAWLSPIDER_FOLLOW_LINKS', True)
return spider
def set_crawler(self, crawler):
super(CrawlSpider, self).set_crawler(crawler)
self._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', True)