Getting Started with Scrapy
<strong>Note: Python 2.7 is required</strong>
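<strong>Series editor's note</strong>: Scrapy is written in Python. If you are new to the language and curious about its features and about Scrapy in detail, the following may help. For experienced programmers who already know another language and want to pick up Python quickly: Learn Python The Hard Way. For newcomers who want to start programming with Python: the list of Python learning resources for non-programmers.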
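1. Define the Item model
First, model the item after the data we want to fetch from dmoz.org. We need the name, the URL, and the description of each site listed on DMOZ, so we define a matching field in the item for each. Edit the items.py file in the tutorial directory: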
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
# Creator yuluoxinsheng
import scrapy

class WikiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
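Declaring the fields up front matters because a Scrapy Item behaves like a dictionary that only accepts the declared keys; assigning an undeclared field raises a KeyError. The sketch below imitates that behaviour with a plain dict subclass so it runs without Scrapy installed. FieldCheckedItem is a hypothetical stand-in written for illustration, not Scrapy's Item class.

```python
class FieldCheckedItem(dict):
    """Hypothetical stand-in for a declared-fields item; not Scrapy's Item."""

    # The same three fields declared in items.py above.
    fields = ("title", "link", "desc")

    def __setitem__(self, key, value):
        # Reject keys that were never declared, so typos surface immediately.
        if key not in self.fields:
            raise KeyError("unsupported field: %s" % key)
        dict.__setitem__(self, key, value)


item = FieldCheckedItem()
item["title"] = "Example title"
item["link"] = "http://example.com/"

try:
    item["titel"] = "typo"  # misspelled on purpose
except KeyError as exc:
    print("rejected: %s" % exc)

print(sorted(item.keys()))
```

Only the two correctly spelled fields end up in the item; the misspelled one is rejected instead of silently creating a new key.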
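2. Write the first spider
To create a spider, subclass scrapy.Spider and define the following three attributes:
name: identifies the spider. The name must be unique; you cannot give two different spiders the same name.
allowed_domains: the domains the spider is allowed to crawl, typically matched against the host of each request URL.
start_urls: the list of URLs the spider crawls at startup. The first page fetched will therefore be one of these; later URLs are extracted from the data fetched from the initial URLs.
parse() is a method of the spider. When it is called, the Response object produced once a start URL has finished downloading is passed to it as its only argument. The method is responsible for parsing the response data, extracting data (producing items), and generating Request objects for URLs that need further processing.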
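To make the role of allowed_domains concrete, here is a minimal sketch of a host check of that kind. It is an illustration of the idea only, not Scrapy's own implementation; the helper name host_is_allowed and the example URLs are assumptions made for this example.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2.7


def host_is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = ["dmoztools.net"]
print(host_is_allowed("http://www.dmoztools.net/Computers/", allowed))  # True
print(host_is_allowed("http://example.com/", allowed))                  # False
```

Under this rule a misspelled entry in allowed_domains matches none of the site's pages, which is worth checking whenever follow-up links appear to be ignored.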
import scrapy


class WikiSpider(scrapy.Spider):
    name = "wiki"
    # Must cover the host used in start_urls
    allowed_domains = ["dmoztools.net"]
    start_urls = [
        "http://www.dmoztools.net/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoztools.net/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Save each page body to a file named after the last URL path segment
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
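The file name in parse() comes from splitting the URL on "/". A short standalone sketch shows which names that produces for the two start URLs; the helper name filename_for is introduced here only for illustration.

```python
start_urls = [
    "http://www.dmoztools.net/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoztools.net/Computers/Programming/Languages/Python/Resources/",
]


def filename_for(url):
    # The URL ends with "/", so the last piece after splitting is empty
    # and the second-to-last piece is the final path segment.
    return url.split("/")[-2]


filenames = [filename_for(u) for u in start_urls]
print(filenames)  # ['Books', 'Resources']
```

Note that the trick relies on the trailing slash: for a URL ending in ".../Python/Books" the same expression would return "Python".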
3. Create the Scrapy project
> scrapy startproject wiki
console output
I ran into the Windows environment variable problem again and had no choice but to try a workaround.
Solved:
Crawling (crawl)
<h1> Using the scrapy tool </h1>
Scrapy X.Y - no active project
Usage:
  scrapy <command> [options] [args]
Available commands:
  crawl         Run a spider
  fetch         Fetch a URL using the Scrapy downloader
[...]
Go to the project root directory (mine is e:\Spider\wiki) and run the following command to start the spider:
> scrapy crawl wiki
console output
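Look at the output lines containing [wiki]. The log lists the initial URLs defined in start_urls, in one-to-one correspondence with the spider. You can see that GET requests were sent to the two links we configured and to no other pages.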
<strong>4. How the spider parses data:</strong>
 1. Scrapy creates a scrapy.Request object for each URL in the spider's start_urls attribute and assigns the parse method to the Request as its callback.
 2. Each Request is scheduled and executed, producing a scrapy.http.Response object that is sent back to the spider's parse() method.
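The two steps above can be sketched as a toy model in plain Python. This is not how Scrapy's engine is implemented: Request, Response, fake_download, and run below are hypothetical stand-ins, and no network access takes place.

```python
from collections import namedtuple

# Hypothetical stand-ins; these are not Scrapy's Request/Response classes.
Request = namedtuple("Request", ["url", "callback"])
Response = namedtuple("Response", ["url", "body"])


def fake_download(request):
    """Pretend to fetch the page; a real downloader would perform an HTTP GET."""
    return Response(url=request.url, body=b"<html>stub</html>")


def run(start_urls, callback):
    # Step 1: one request per start URL, each carrying the callback.
    requests = [Request(url, callback) for url in start_urls]
    results = []
    # Step 2: each request is executed and its response goes to the callback.
    for request in requests:
        response = fake_download(request)
        results.append(request.callback(response))
    return results


def parse(response):
    return "parsed %s (%d bytes)" % (response.url, len(response.body))


for line in run(["http://example.com/a/", "http://example.com/b/"], parse):
    print(line)
```

The point of the model is the wiring: the callback travels with the request, and the response is handed to whatever callback its request carried.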
From the project root directory, run the following command to start the shell and load the page to be parsed:
> scrapy shell "http://www.dmoztools.net/Computers/Programming/Languages/Python/Books/"
Once the shell has loaded, a local variable named response holds the response data. This is the same response.body that the parse function wrote to a file earlier; the shell shows the body fetched from DMOZ.
console output
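Not all of the data parsed this way fits our needs; for example, we only want the resource links under /Languages. Entering response.selector returns a selector that can be used to query the returned data. The shortcuts response.xpath() and response.css() map to response.selector.xpath() and response.selector.css().
The selector automatically chooses the most suitable parsing rules (XML vs HTML) based on the type of the response.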
5. Using DMOZ as the example, analyze the page structure as before
The tag hierarchy is as follows:
title aside->div->h2->a->font
link aside->div->h2->a href
desc aside->div->h3->a->font
Update Spider\WikiSpider; because the DMOZ site has moved to a new domain, change the URLs:
import scrapy
from wikiSpider.items import WikispiderItem


class WikiSpider(scrapy.Spider):
    name = "wiki"
    allowed_domains = ["dmoztools.net"]
    start_urls = [
        "http://dmoztools.net/Computers/Programming/Languages/Python/Books/",
        "http://dmoztools.net/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        item = WikispiderItem()
        # Debug output: text nodes directly under each candidate div
        content = response.xpath('//section/aside/div/text()')
        print content
        for sel in response.xpath('//section/aside/div'):
            item['title'] = sel.xpath('h2/a/text()')
            item['link'] = sel.xpath('h2/a/@href')
            item['desc'] = sel.xpath('h3/a/text()')
            return item
Restart the spider with the command:
> scrapy crawl wiki
console output: Error: xpath could not resolve the path
Python console
Start the shell and parse the Books page
scrapy shell http://dmoztools.net/Computers/Programming/Languages/Python/Books/
6. Test XPath against each tag on the page
Test Result:
7. Re-analyze the page's tag hierarchy
- "title" div class="title-and-desc" -> a -> div class="site-title" -> text()
- "link" div class="title-and-desc" -> a -> @href
- "desc" div class="site-descr" -> text()
console Input
response.xpath("//div[@class='title-and-desc']")
Update WikiSpider.py
import scrapy
from wikiSpider.items import WikispiderItem


class WikiSpider(scrapy.Spider):
    name = "wiki"
    allowed_domains = ["dmoztools.net"]
    start_urls = [
        "http://dmoztools.net/Computers/Programming/Languages/Python/Books/",
        "http://dmoztools.net/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        item = WikispiderItem()
        # Debug output: text nodes directly under each site-item div
        content = response.xpath("//div[contains(@class,'site-item')]/text()")
        print content
        for sel in response.xpath("//div[contains(@class,'site-item')]"):
            # Paths are relative to the current site-item div
            item['title'] = sel.xpath("div[@class='title-and-desc']/a/div[@class='site-title']/text()")
            item['link'] = sel.xpath("div[@class='title-and-desc']/a/@href")
            item['desc'] = sel.xpath("div[@class='site-descr']/text()")
            # return exits parse() on the first iteration: one item per page
            return item
console output
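Some divs were not parsed correctly. Inspecting the nested selectors shows why: XPath is a query language for XML documents and treats the class attribute as a plain string, with no knowledge of CSS class rules. An exact comparison such as @class='...' therefore fails whenever the attribute holds anything beyond that exact text, for example a second class name. Try contains() on the attribute node instead: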
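The difference between the matching strategies can be illustrated with ordinary string operations. The sketch below mirrors three XPath predicates in plain Python; the sample class values are made up for illustration, and the helper names are hypothetical.

```python
def exact_match(class_attr, name):
    # Mirrors @class='name': the whole attribute string must be equal.
    return class_attr == name


def contains_match(class_attr, name):
    # Mirrors contains(@class, 'name'): a plain substring test.
    return name in class_attr


def token_match(class_attr, name):
    # Mirrors contains(concat(' ', normalize-space(@class), ' '), ' name '):
    # the name must appear as a whole whitespace-separated token.
    return name in class_attr.split()


samples = ["site-item", "site-item ", "site-item featured", "site-item-extra"]
for attr in samples:
    print("%r exact=%s contains=%s token=%s" % (
        attr,
        exact_match(attr, "site-item"),
        contains_match(attr, "site-item"),
        token_match(attr, "site-item"),
    ))
```

The exact comparison accepts only the first sample, while contains() accepts all four, including the unrelated "site-item-extra". The token form is the stricter alternative when such near-miss class names exist on the page.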
console Input
response.xpath("//div[contains(@class,'site-item')]")
console output: the lines marked > > now return the expected tag values
F:\pythonProject\wikiSpider>scrapy crawl wiki
2017-07-09 22:24:14 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: wikiSpider)
2017-07-09 22:24:14 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'wikiSpider.spiders', 'SPIDER_MODULES': ['wikiSpider.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'wikiSpider'}
2017-07-09 22:24:14 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-07-09 22:24:15 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-07-09 22:24:15 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-07-09 22:24:15 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-07-09 22:24:15 [scrapy.core.engine] INFO: Spider opened
2017-07-09 22:24:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-07-09 22:24:15 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6027
2017-07-09 22:24:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://dmoztools.net/robots.txt> (referer: None)
2017-07-09 22:24:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://dmoztools.net/Computers/Programming/Languages/Python/Books/> (referer: None)
[<Selector xpath="//div[contains(@class,'site-item')]/text()" data=u'\r\n '>, <Selector xpath="//div[contains(@class,'site-item')]/text()" data=u'\r\n '>, <Selector xpath="//div[contains(@class,'site-item')]/text()" data=u'\r\n\r\n '>,
[...]
<Selector xpath="//div[contains(@class,'site-item')]/text()" data=u'\r\n '>]
2017-07-09 22:24:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://dmoztools.net/Computers/Programming/Languages/Python/Books/>
{'desc': [],
'link': [<Selector xpath="div[@class='title-and-desc']/a/@href" data=u'http://www.brpreiss.com/books/opus7/html'>],
'title': [<Selector xpath="div[@class='title-and-desc']/a/div[@class='site-title']/text()" data=u'Data Structures and Algorithms with Obje'>]}
2017-07-09 22:24:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://dmoztools.net/Computers/Programming/Languages/Python/Resources/> (referer: None)
[<Selector xpath="//div[contains(@class,'site-item')]/text()" data=u'\r\n '>, <Selector xpath="//div[contains(@class,'site-item')]/text()" data=u'\r\n '>, <Selector xpath="//div[contains(@class,'site-item')]/text()" data=u'\r\n\r\n '>,
[...]
<Selector xpath="//div[contains(@class,'site-item')]/text()" data=u'\r\n '>]
> > 2017-07-09 22:24:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://dmoztools.net/Computers/Programming/Languages/Python/Resources/>
{'desc': [],
'link': [<Selector xpath="div[@class='title-and-desc']/a/@href" data=u'http://www.pythonware.com/daily/'>],
'title': [<Selector xpath="div[@class='title-and-desc']/a/div[@class='site-title']/text()" data=u"eff-bot's Daily Python URL ">]}
2017-07-09 22:24:19 [scrapy.core.engine] INFO: Closing spider (finished)
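The log shows two things: each page produces a single item, and the fields hold Selector objects, not plain strings. A common way to get one item per listing is to build a fresh item inside the loop and yield it. The sketch below shows that loop shape using the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors, so it runs without Scrapy. The PAGE fragment, its example.com links, and its titles are made up for illustration and are not taken from the real site.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment shaped like the hierarchy in step 7.
PAGE = """
<section>
  <div class="site-item">
    <div class="title-and-desc">
      <a href="http://example.com/book-one">
        <div class="site-title">Book One</div>
      </a>
      <div class="site-descr">First description</div>
    </div>
  </div>
  <div class="site-item">
    <div class="title-and-desc">
      <a href="http://example.com/book-two">
        <div class="site-title">Book Two</div>
      </a>
      <div class="site-descr">Second description</div>
    </div>
  </div>
</section>
""".strip()


def parse(root):
    """Yield one plain dict per listing instead of returning after the first."""
    for entry in root.findall(".//div[@class='site-item']"):
        item = {}  # a fresh item for every listing
        link = entry.find(".//div[@class='title-and-desc']/a")
        item["title"] = link.find("div[@class='site-title']").text.strip()
        item["link"] = link.get("href")
        # Descendant search, so the lookup does not depend on nesting depth.
        item["desc"] = entry.find(".//div[@class='site-descr']").text.strip()
        yield item


items = list(parse(ET.fromstring(PAGE)))
for item in items:
    print(item["title"], item["link"], item["desc"])
```

In the Scrapy spider the corresponding changes would be to create the item inside the loop, call .extract() on each sel.xpath(...) result to obtain strings, and use yield item in place of return item.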