Scraping StackOverflow with Scrapy and Storing the Results in MongoDB

Adapted from: Web Scraping With Scrapy and MongoDB

0x00 Overview

This post uses the Scrapy crawling framework to scrape the newest questions on StackOverflow, together with each question's URL, and stores the results in MongoDB.

0x01 Define the Item

# -*- coding: utf-8 -*-

import scrapy

class QuestionItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()

First we define the Item. We only need to store two pieces of information, so declaring the two fields above is enough. An Item can be used like a Python dict, and scrapy.Field() simply declares each field (it is itself just a dict that holds per-field metadata).
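
A minimal sketch of that dict-like behavior (the sample values here are made up for illustration):

from stackOverflow.items import QuestionItem

item = QuestionItem()
item['title'] = 'How do I sort a list of dicts?'  # hypothetical sample data
item['url'] = '/questions/12345'                  # hypothetical sample data
print(item['title'])   # read access works like a dict
print(dict(item))      # convert to a plain dict, as the pipeline does later
# item['votes'] = 1    # would raise KeyError: only declared fields are allowed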

0x02 Write the Spider

#!/usr/bin/env python3
#coding=utf-8

import scrapy
from stackOverflow.items import QuestionItem

class QuestionSpider(scrapy.Spider):
    # spider name
    name = "question"
    # requests to domains outside allowed_domains are filtered out
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "http://stackoverflow.com/questions?page=1&sort=newest"
    ]

    def parse(self, response):
        for question in response.xpath('//div[@class="summary"]/h3'):
            item = QuestionItem()
            item['title'] = question.xpath('a[@class="question-hyperlink"]/text()').extract_first()
            item['url'] = question.xpath('a[@class="question-hyperlink"]/@href').extract_first()
            yield item

        # queue the first 10 listing pages; Scrapy's duplicate
        # filter drops requests that were already scheduled
        for i in range(1, 11):
            next_page = "http://stackoverflow.com/questions?page=%d&sort=newest" % i
            yield scrapy.Request(next_page, callback=self.parse)

Because my computer's disk is way too small (really, way too small!), I only scrape the first 10 pages to get the idea across.
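
Note that the loop above re-yields the same 10 URLs on every call to parse and relies on Scrapy's built-in duplicate filter to drop the repeats. A slightly cleaner sketch is to seed all 10 requests exactly once via start_requests (same spider, same URLs; start_urls is then no longer needed):

    def start_requests(self):
        # replaces start_urls: queue the 10 listing pages exactly once
        for i in range(1, 11):
            url = "http://stackoverflow.com/questions?page=%d&sort=newest" % i
            yield scrapy.Request(url, callback=self.parse)

With this version, parse only needs the item-extraction loop.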

0x03 Store the Data in MongoDB

First, define what we need in settings.py. The integer assigned to each entry in ITEM_PIPELINES ranges from 0 to 1000 and sets the order in which pipelines are called: the smaller the number, the higher the priority.

ITEM_PIPELINES = {
    'stackOverflow.pipelines.MongoDBPipeline': 300,
}
MONGO_URI = 'mongodb://localhost/'
MONGO_DATABASE = 'stackoverflow'
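
As an illustration of the priority numbers, a hypothetical second pipeline (DuplicatesPipeline is a made-up name, not part of this project) with a lower number would run before the MongoDB insert:

ITEM_PIPELINES = {
    'stackOverflow.pipelines.DuplicatesPipeline': 100,  # hypothetical; lower number, runs first
    'stackOverflow.pipelines.MongoDBPipeline': 300,     # runs second
}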

Next, connect to the database and insert the data in pipelines.py.

# -*- coding: utf-8 -*-
import pymongo
from scrapy.exceptions import DropItem

class MongoDBPipeline(object):
    # name of the MongoDB collection
    collection_name = 'questions'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri = crawler.settings.get('MONGO_URI'),
            mongo_db = crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # drop items with any missing/empty field
        for field, value in item.items():
            if not value:
                raise DropItem("Missing {0}!".format(field))
        self.db[self.collection_name].insert_one(dict(item))
        spider.logger.debug("Question added to MongoDB database!")
        return item
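
A note on the structure: from_crawler is the hook Scrapy calls to build the pipeline, which is why the MongoDB settings are read from crawler.settings instead of being hard-coded, and open_spider/close_spider bracket the whole crawl, so a single MongoClient connection is reused for every item.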

0x04 Run MongoDB and the Spider

  1. In one terminal, run mongod to start MongoDB.

  2. In another terminal, run the spider with scrapy crawl question, where question is the spider's name.

==這里看到10頁一共爬到了549條問題
