This article covers the basics of using Scrapy and walks through a small demo that shows what the framework can do.

scrapy

Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

For now, it is enough to think of it as a powerful crawler framework.

Quotes to Scrape

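The site we will practice on is the one officially provided by Scrapy: quotes.toscrape.com. Scraping it gives a basic understanding of the framework and makes getting started easier. When I first encountered frameworks, they always seemed mysterious and complicated, and less approachable than simply using the requests library.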

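The site mostly consists of famous quotes. It looks plain, yet it contains text, tags, hyperlinks, and the other elements most websites have, which makes it a natural choice for getting started with Scrapy.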

Demo

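Crawl workflow

We scrape the first page to get its content and the link to the next page, follow that link to paginate through the site, and then save the scraped content in a specific format and store it in a database.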
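The workflow above can be sketched without any network access. The snippet below is only an illustration: FAKE_SITE and crawl are hypothetical names, and the in-memory dictionary stands in for real pages. It shows the loop of "extract content, then follow the next link until there is none".

```python
# A toy, in-memory stand-in for a paginated site (hypothetical data).
FAKE_SITE = {
    "/page/1/": {"quotes": ["quote A", "quote B"], "next": "/page/2/"},
    "/page/2/": {"quotes": ["quote C"], "next": "/page/3/"},
    "/page/3/": {"quotes": ["quote D"], "next": None},  # last page: no next link
}


def crawl(start_url, site):
    """Yield every quote, following 'next' links until none is left."""
    url = start_url
    while url is not None:
        page = site[url]           # stands in for downloading the page
        yield from page["quotes"]  # extract the content of this page
        url = page["next"]         # move on to the next page, if any


stored = list(crawl("/page/1/", FAKE_SITE))  # stands in for saving the results
print(stored)
```

Scrapy handles the downloading and scheduling itself; the spider only has to describe the extraction step and the next link.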

Creating the project

At the command line, run:

scrapy startproject quotes

Then create the spider:

cd quotes
scrapy genspider quote quotes.toscrape.com

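With the spider file created, we can start filling in the code.

Initial test

Let's test the framework first. In the generated spider file, we start by printing the page's status code and HTML source.

The code is: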

# -*- coding: utf-8 -*-
import scrapy


class QuoteSpider(scrapy.Spider):
    name = "quote"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        print(response.status)
        print(response.text)

Running the spider prints the page's status code and HTML source as expected.

Now we can start scraping in earnest.

Completing the code

items.py

We want to scrape each quote's text, author, and tags, so we first define those fields in items.py.

import scrapy


class QuotesItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

spider.py

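Before going through this part of the code, here is a powerful tool worth knowing: the Scrapy shell. It makes working out how to extract page content much easier.

At the command line, run: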

scrapy shell http://quotes.toscrape.com/

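This opens an interactive session in which you can try out selectors and debug.

I used it for some simple debugging, and you will likely find it just as useful.

Let's continue with the code.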

# -*- coding: utf-8 -*-
import scrapy
from quotes.items import QuotesItem


class QuoteSpider(scrapy.Spider):
    name = "quote"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            # Create a fresh item per quote so yielded items do not share state
            item = QuotesItem()

            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()

            item['text'] = text
            item['author'] = author
            item['tags'] = tags

            yield item
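A note on where the item is created inside parse. A generator that mutates and re-yields one object hands every consumer the same reference, so anything that holds on to earlier items sees them change. The sketch below uses plain dicts as a stand-in for QuotesItem (the function names are hypothetical); whether this bites in a given Scrapy run depends on when each item is consumed, but creating a new item per iteration avoids the question.

```python
def reuse_one(values):
    record = {}             # created once, outside the loop
    for v in values:
        record["text"] = v  # mutates the same object each time
        yield record


def fresh_each_time(values):
    for v in values:
        yield {"text": v}   # a new object per iteration


reused = list(reuse_one(["a", "b", "c"]))
fresh = list(fresh_each_time(["a", "b", "c"]))
print(reused)  # [{'text': 'c'}, {'text': 'c'}, {'text': 'c'}]
print(fresh)   # [{'text': 'a'}, {'text': 'b'}, {'text': 'c'}]
```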

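Running the spider now already shows the scraped results.

So far this covers only the first page; next we want every page. Getting the next page is straightforward: find the next-page link on the current page, build a request from it, and repeat until the last page is reached.

The complete code for scraping all pages is: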

# -*- coding: utf-8 -*-
import scrapy
from quotes.items import QuotesItem


class QuoteSpider(scrapy.Spider):
    name = "quote"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            # Create a fresh item per quote so yielded items do not share state
            item = QuotesItem()

            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()

            item['text'] = text
            item['author'] = author
            item['tags'] = tags

            yield item

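        next_page = response.css('.pager .next a::attr(href)').extract_first()
        if next_page is not None:  # the last page has no "next" link
            url = response.urljoin(next_page)
            yield scrapy.Request(url=url, callback=self.parse)

A few words on the new lines at the end of parse. The first finds the link to the next page. The if check stops the crawl on the last page, where no such link exists. response.urljoin then builds an absolute URL, and scrapy.Request takes that URL with self.parse as the callback, so parse is called again for each new page. Running the spider now collects the quotes from every page.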
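To see what building an absolute URL means here, the standard library's urljoin performs the same kind of resolution. This is a sketch rather than Scrapy's own code: next_page_url is a hypothetical helper, and the href value is illustrative.

```python
from urllib.parse import urljoin


def next_page_url(page_url, href):
    """Resolve a next-page href against the current page URL, or return None."""
    if href is None:  # no "next" link: this is the last page
        return None
    return urljoin(page_url, href)


current = "http://quotes.toscrape.com/page/1/"
print(next_page_url(current, "/page/2/"))  # http://quotes.toscrape.com/page/2/
print(next_page_url(current, None))        # None
```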

Saving the results

Saving to a local file

How do we save the scraped content? Scrapy's command line can export it in several file formats.

Run:

scrapy crawl quote -o quotes.json

This produces a JSON file containing everything we just scraped.
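The exported file is a single JSON array of objects with the item fields, so it can be read back with the standard json module. The sketch below is illustrative: quotes_per_author is a hypothetical helper, and the sample records are made up to match the item shape so the example runs without a crawl.

```python
import json
import os
import tempfile
from collections import Counter


def quotes_per_author(path):
    """Count exported quotes per author in a JSON feed file."""
    with open(path, encoding="utf-8") as f:
        items = json.load(f)  # the feed is one JSON array of objects
    return Counter(item["author"] for item in items)


# Hypothetical sample in the same shape as the exported items.
sample = [
    {"text": "first quote", "author": "Author A", "tags": ["life"]},
    {"text": "second quote", "author": "Author B", "tags": []},
    {"text": "third quote", "author": "Author A", "tags": ["books", "life"]},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "quotes.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    counts = quotes_per_author(path)

print(counts.most_common(1))  # [('Author A', 2)]
```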

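Saving to a database

Before saving to the database, we post-process the scraped text: if a quote is longer than 50 characters, it is cut off and ends with an ellipsis. This is easy to do in pipelines.py.

The text-processing code is: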

from scrapy.exceptions import DropItem


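class TextPipeline(object):

    def __init__(self):
        self.limit = 50  # maximum number of characters kept

    def process_item(self, item, spider):
        if not item['text']:
            # DropItem must be raised, not returned, for Scrapy to discard the item
            raise DropItem('Missing Text.')
        if len(item['text']) > self.limit:
            item['text'] = item['text'][0:self.limit].strip() + "..."
        return item  # returned for short and truncated quotes alike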
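The truncation rule can be checked on its own, outside Scrapy. The function below is a minimal sketch of that rule (truncate is a hypothetical name, not part of the project), covering both the short and the long case.

```python
def truncate(text, limit=50):
    """Return text unchanged if short enough, else cut to limit chars plus '...'."""
    if len(text) <= limit:
        return text
    return text[:limit].strip() + "..."


short = truncate("A short quote.")
long_text = truncate("x" * 60)
print(short)
print(len(long_text))  # 53: fifty characters plus the three dots
```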

Then enable it in settings.py.

ITEM_PIPELINES = {
    'quotes.pipelines.TextPipeline': 300,
}

With this setting in place, a run yields the processed results. Next, we save them to the database.

First, add the MongoDB connection details to settings.py:

MONGO_URI = 'localhost'
MONGO_DB = 'quotes'

Back in pipelines.py, add:

import pymongo


class MongoPipline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

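    def process_item(self, item, spider):
        # insert_one replaces Collection.insert, which PyMongo 4 removed
        self.db['quotes'].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        # Release the connection when the crawl finishes
        self.client.close()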

This code reads the MongoDB settings through from_crawler, opens the connection when the spider starts, and inserts each item into the database as the crawl runs.
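Scrapy only runs pipelines that are listed in ITEM_PIPELINES, and the article does not show the Mongo pipeline being registered. A fragment along these lines would do it; the priority 400 is an assumed value, chosen only so that it runs after TextPipeline (lower numbers run first), and the class name follows the spelling used in pipelines.py.

```python
# settings.py (fragment)
ITEM_PIPELINES = {
    "quotes.pipelines.TextPipeline": 300,  # runs first: trims long quotes
    "quotes.pipelines.MongoPipline": 400,  # runs second; 400 is an assumed value
}
```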

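After running the spider again, open MongoDB and you will see that the processed text has been saved.

Summary

Original post on the blog: AlPha - Scrapy Study Notes (1)

The code for this article is on GitHub.

Even this simple project gives a sense of how powerful Scrapy is. It is only the tip of the iceberg, though, and there is much more to learn; later tutorials will cover each Scrapy module in more detail.

Scrapy tutorial (Chinese)

Scrapy tutorial (English)

Let's keep improving together! :-)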
