Reference: Web Scraping With Scrapy and MongoDB
0x00
This post uses the Scrapy crawler framework to scrape the newest questions on StackOverflow along with their URLs, and stores the results in MongoDB.
0x01 Defining the Item
# -*- coding: utf-8 -*-
import scrapy


class QuestionItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()
First we define the Item. We only need to store two pieces of information, so declaring the two fields above is enough. scrapy.Field() simply declares each field, and the resulting Item can be used like a Python dictionary.
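For a quick feel of how the Item behaves, here is a minimal sketch (the title and url values are made-up placeholders, not real scraped data):

from stackOverflow.items import QuestionItem

# A QuestionItem can be filled and read like an ordinary dict.
item = QuestionItem()
item['title'] = 'Example question title'            # placeholder value
item['url'] = '/questions/12345/example-question'   # placeholder value
print(item['title'])
print(dict(item))  # convert to a plain dict, as the pipeline later does before inserting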
0x02 Writing the Spider
#!/usr/bin/env python3
# coding=utf-8
import scrapy
from stackOverflow.items import QuestionItem


class QuestionSpider(scrapy.Spider):
    # spider name
    name = "question"
    # only scrape domains listed in allowed_domains
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "http://stackoverflow.com/questions?page=1&sort=newest"
    ]

    def parse(self, response):
        for question in response.xpath('//div[@class="summary"]/h3'):
            item = QuestionItem()
            item['title'] = question.xpath('a[@class="question-hyperlink"]/text()').extract_first()
            item['url'] = question.xpath('a[@class="question-hyperlink"]/@href').extract_first()
            yield item

        for i in range(1, 11):
            next_page = "http://stackoverflow.com/questions?page=%s&sort=newest" % str(i)
            yield scrapy.Request(next_page, callback=self.parse)
Since my computer has very little disk space, I only scrape the first 10 pages as a demonstration.
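If you would rather not hardcode the page range, one common alternative is to follow the pagination link on each listing page. The sketch below is only a rough variant of the parse method above: it assumes the listing page exposes a rel="next" link (the selector may need adjusting to the real markup) and that your Scrapy version provides response.follow (1.4+); otherwise build an absolute URL with response.urljoin and yield a scrapy.Request instead.

    def parse(self, response):
        # yield QuestionItem objects exactly as in the version above
        for question in response.xpath('//div[@class="summary"]/h3'):
            item = QuestionItem()
            item['title'] = question.xpath('a[@class="question-hyperlink"]/text()').extract_first()
            item['url'] = question.xpath('a[@class="question-hyperlink"]/@href').extract_first()
            yield item

        # Assumption: the listing page has an <a rel="next"> pagination link.
        next_page = response.xpath('//a[@rel="next"]/@href').extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)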
0x03 Storing the Data in MongoDB
First, define what we need in settings.py. The integer assigned to each entry in ITEM_PIPELINES ranges from 0 to 1000 and determines the order in which pipelines are called: the smaller the number, the higher the priority.
ITEM_PIPELINES = {
    'stackOverflow.pipelines.MongoDBPipeline': 300,
}
MONGO_URI = 'mongodb://localhost/'
MONGO_DATABASE = 'stackoverflow'
Then connect to the database and store the items:
# -*- coding: utf-8 -*-
import pymongo
from scrapy.exceptions import DropItem


class MongoDBPipeline(object):
    # name of the MongoDB collection the items are written to
    collection_name = 'questions'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read MONGO_URI and MONGO_DATABASE from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # drop the item if any of its fields is empty
        for field in item:
            if not item[field]:
                raise DropItem("Missing {0}!".format(field))
        self.db[self.collection_name].insert_one(dict(item))
        spider.logger.debug("Question added to MongoDB database!")
        return item
0x04 Running MongoDB and the Spider
In one terminal, run
mongod
to start MongoDB.
In another terminal, run the spider with
scrapy crawl question
where question is the name of our spider.
After crawling these 10 pages, a total of 549 questions were stored.
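To double-check what landed in the database, you can query the collection directly with pymongo (a minimal sketch, assuming mongod is still running locally and the MONGO_URI / MONGO_DATABASE values from settings.py above; count_documents needs pymongo 3.7+, older versions use count()):

import pymongo

# Connect with the same settings the pipeline used.
client = pymongo.MongoClient('mongodb://localhost/')
db = client['stackoverflow']

print(db['questions'].count_documents({}))  # number of stored questions
for doc in db['questions'].find().limit(3):
    print(doc['title'], doc['url'])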