Preface
System environment: CentOS 7
This article assumes you have already installed virtualenv and activated a virtual environment named ENV1. If not, see: Creating a Python sandbox (virtual) environment with virtualenv.
Goal
Use Scrapy's command-line tool to create a project and a spider, edit the code in PyCharm, and run the spider inside the virtual environment to scrape the article and author information from http://quotes.toscrape.com/, saving the results to a txt file.
Walkthrough
1. Create the project with the command-line tool, specifying the project path. Usage:
scrapy startproject <project_name> [project_dir]
<project_name>: the project name
[project_dir]: the project path; defaults to the current directory when omitted
In this article the project name is quotes and the project path is PycharmProjects/quotes.
(ENV1) [eason@localhost ~]$ scrapy startproject quotes PycharmProjects/quotes
New Scrapy project 'quotes', using template directory '/home/eason/ENV1/lib/python2.7/site-packages/scrapy/templates/project', created in:
/home/eason/PycharmProjects/quotes
You can start your first spider with:
cd PycharmProjects/quotes
scrapy genspider example example.com
(ENV1) [eason@localhost ~]$
2. Enter the project directory and create a spider. Usage:
scrapy genspider [-t template] <name> <domain>
[-t template]: the template used to generate the spider; one of the following four, defaulting to basic when omitted
basic
crawl
csvfeed
xmlfeed
<name>: sets the spider's name
<domain>: sets allowed_domains and start_urls
In this article the spider name is quotes_spider.
(ENV1) [eason@localhost ~]$ cd PycharmProjects/quotes
(ENV1) [eason@localhost quotes]$ scrapy genspider quotes_spider quotes.toscrape.com
Created spider 'quotes_spider' using template 'basic' in module:
quotes.spiders.quotes_spider
(ENV1) [eason@localhost quotes]$
At this point, the work of creating the project and the spider is done.
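For reference, the basic template generates a spider skeleton roughly like the one below (the exact contents vary with the Scrapy version; this sketch matches the fields we will fill in later):
# -*- coding: utf-8 -*-
import scrapy


class QuotesSpiderSpider(scrapy.Spider):
    name = "quotes_spider"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass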
3. Open the project we just created in PyCharm.
The directory structure of the newly created project looks like this:
├── quotes
│   ├── spiders
│   │   ├── __init__.py
│   │   └── quotes_spider.py
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   └── settings.py
└── scrapy.cfg
The official documentation explains these files as follows:
quotes/
    the project's Python module; you'll import your code from here
quotes/spiders/
    the directory where you'll later put your spiders (the code that crawls the pages)
quotes/spiders/quotes_spider.py
    the spider file that was just generated
quotes/items.py
    the project's items definition file (the structure definition of the data to be scraped)
quotes/pipelines.py
    the project's pipelines file (where you define how the scraped data is saved)
quotes/settings.py
    the project's settings file
scrapy.cfg
    the deploy configuration file
4. Opening quotes_spider.py at this point raises an error saying the scrapy module cannot be found. This is because PyCharm opened the project with the global interpreter, and scrapy is not installed in the global environment, so we change the project settings to let PyCharm use the packages and modules from the virtual environment.
Click File --> Settings to open the settings dialog, then in the Project Interpreter dropdown select the virtual environment that is already activated. Your path may differ; in this article it is /home/eason/ENV1/bin/python.
Click OK, reopen quotes_spider.py, and the error is gone.
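As an extra sanity check (not part of the original steps), you can also confirm from the terminal that the virtual environment's interpreter can import scrapy:
(ENV1) [eason@localhost ~]$ python -c 'import scrapy; print(scrapy.__version__)'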
5. Edit items.py to define the data structure:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class QuotesItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    article = scrapy.Field()
    author = scrapy.Field()
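An Item behaves like a dict, so a quick interactive check (with made-up values, run from the project root) could look like this:
from quotes.items import QuotesItem

item = QuotesItem()
item['article'] = 'An example quote.'  # hypothetical value
item['author'] = 'Somebody'            # hypothetical value
print(item['author'])                  # Somebody
print(dict(item))                      # {'article': 'An example quote.', 'author': 'Somebody'}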
6. Edit quotes_spider.py to add the scraping rules:
# -*- coding: utf-8 -*-
import scrapy

from ..items import QuotesItem


class QuotesSpiderSpider(scrapy.Spider):
    name = "quotes_spider"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        items = []
        # each quote on the page lives in a <div class="quote"> block
        articles = response.xpath("//div[@class='quote']")
        for article in articles:
            item = QuotesItem()
            content = article.xpath("span[@class='text']/text()").extract_first()
            author = article.xpath("span/small[@class='author']/text()").extract_first()
            item['article'] = content.encode('utf-8')
            item['author'] = author.encode('utf-8')
            items.append(item)
        return items
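Before committing to these XPath expressions, they are easy to try out interactively with Scrapy's built-in shell, for example (output omitted):
(ENV1) [eason@localhost quotes]$ scrapy shell 'http://quotes.toscrape.com/'
>>> response.xpath("//div[@class='quote']/span[@class='text']/text()").extract_first()
>>> response.xpath("//div[@class='quote']/span/small[@class='author']/text()").extract_first()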
7. Edit pipelines.py to decide how the scraped data is saved; in this article it is appended to a text file, result.txt:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class QuotesPipeline(object):
    def process_item(self, item, spider):
        # the scraped data is saved under /home/eason/PycharmProjects/quotes/
        f = open(r"/home/eason/PycharmProjects/quotes/result.txt", "a")
        f.write(item['article'] + '\t' + item['author'] + '\n')
        f.close()
        return item
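Opening and closing the file once per item works, but Scrapy pipelines also provide open_spider and close_spider hooks that let you hold the file open for the whole crawl. A possible variant (same path as above, otherwise just a sketch):
class QuotesPipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts: open the output file
        self.f = open(r"/home/eason/PycharmProjects/quotes/result.txt", "a")

    def close_spider(self, spider):
        # called once when the spider finishes: release the file handle
        self.f.close()

    def process_item(self, item, spider):
        self.f.write(item['article'] + '\t' + item['author'] + '\n')
        return item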
8. For the pipeline to take effect, it must also be registered in settings.py. The number after the pipeline class (300 here) determines execution order: pipelines run from lower to higher values, conventionally in the 0-1000 range.
# -*- coding: utf-8 -*-
# Scrapy settings for quotes project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'quotes'
SPIDER_MODULES = ['quotes.spiders']
NEWSPIDER_MODULE = 'quotes.spiders'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'quotes.pipelines.QuotesPipeline': 300,
}
9. Open the Terminal in PyCharm, activate the virtual environment, and run the spider:
[eason@localhost quotes]$ source /home/eason/ENV1/bin/activate
(ENV1) [eason@localhost quotes]$ scrapy crawl quotes_spider
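As an aside, if you only need a raw dump of the items, Scrapy's built-in feed exports can do this without a custom pipeline; the -o option writes the collected items straight to a file, e.g.:
(ENV1) [eason@localhost quotes]$ scrapy crawl quotes_spider -o result.json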
10. When the crawl finishes, result.txt is generated under /home/eason/PycharmProjects/quotes/; each line holds one quote and its author, separated by a tab.
11. Done!