So far, every experiment and example has used a single spider. A real-world crawling project, however, rarely stops at one. That raises two questions: 1. How do you create multiple spiders inside the same project? 2. Once you have several spiders, how do you run them all?
Note: this article builds on the previous posts and experiments. If you missed them, or anything here is unclear, you can catch up with:
Scrapy spider growth diary: creating a project, extracting data, and saving it as JSON
I. Creating spiders
1. Create additional spiders with scrapy genspider spidername domain
scrapy genspider CnblogsHomeSpider cnblogs.com
The command above creates a spider named CnblogsHomeSpider whose start_urls is http://www.cnblogs.com/.
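For reference, the generated file under cnblogs/spiders/ should look roughly like the sketch below. The exact class name and template details depend on your Scrapy version, so treat this as illustrative only:

import scrapy

class CnblogsHomeSpider(scrapy.Spider):  # class name is illustrative; genspider derives it from the spider name
    name = "CnblogsHomeSpider"
    allowed_domains = ["cnblogs.com"]
    start_urls = (
        'http://www.cnblogs.com/',
    )

    def parse(self, response):
        pass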
2. Check how many spiders the project has: scrapy list
[root@bogon cnblogs]# scrapy list
CnblogsHomeSpider
CnblogsSpider
So the project has two spiders: one named CnblogsHomeSpider and another named CnblogsSpider.
For more on Scrapy commands, see: http://doc.scrapy.org/en/latest/topics/commands.html
II. Running several spiders at once
Our project now has two spiders, so how do we run both of them? You might suggest a shell script that calls them one after another, or a Python script that runs them in turn. Quite a few people on stackoverflow.com do exactly that. The official documentation, however, describes the following approaches.
1. Run Scrapy from a script
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
The key here is scrapy.crawler.CrawlerProcess, which lets you run a spider from inside a script. More examples are available at: https://github.com/scrapinghub/testspiders
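If you want to run one of the spiders that already lives in the project (rather than a spider class defined inside the script), you can load the project settings and refer to the spider by name. This is only a minimal sketch, assuming the script sits in the project root next to scrapy.cfg so that get_project_settings() can find the cnblogs project:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project's settings.py so pipelines, middlewares, etc. are applied
process = CrawlerProcess(get_project_settings())

# The spider can be referenced by its name, as shown by "scrapy list"
process.crawl('CnblogsHomeSpider')
process.start()  # blocks until crawling is finished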
2. Running multiple spiders in the same process
Using CrawlerProcess:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished
Using CrawlerRunner:
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished
Using CrawlerRunner and chaining deferreds to run the spiders sequentially:
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()
@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()
crawl()
reactor.run() # the script will block here until the last crawl call is finished
These are the approaches the official documentation offers for running spiders from a script.
III. Running spiders through a custom Scrapy command
1. Create a commands directory
mkdir commands
Note: the commands directory is a sibling of the spiders directory.
2. Add a file named crawlall.py under commands
The idea is to adapt Scrapy's built-in crawl command so that it runs every spider at once. The source of crawl can be found here: https://github.com/scrapy/scrapy/blob/master/scrapy/commands/crawl.py
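After this step the relevant part of the project should look roughly like this (assuming the project is named cnblogs, as in the earlier examples):

cnblogs/                # project root, contains scrapy.cfg
    scrapy.cfg
    cnblogs/
        __init__.py
        settings.py
        commands/       # new directory, sibling of spiders/
        spiders/
            __init__.py
            ...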
from scrapy.commands import ScrapyCommand
from scrapy.crawler import CrawlerRunner
from scrapy.exceptions import UsageError
from scrapy.utils.conf import arglist_to_dict

class Command(ScrapyCommand):

    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def add_options(self, parser):
        ScrapyCommand.add_options(self, parser)
        parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
                          help="set spider argument (may be repeated)")
        parser.add_option("-o", "--output", metavar="FILE",
                          help="dump scraped items into FILE (use - for stdout)")
        parser.add_option("-t", "--output-format", metavar="FORMAT",
                          help="format to use for dumping items with -o")

    def process_options(self, args, opts):
        ScrapyCommand.process_options(self, args, opts)
        try:
            opts.spargs = arglist_to_dict(opts.spargs)
        except ValueError:
            raise UsageError("Invalid -a value, use -a NAME=VALUE", print_help=False)

    def run(self, args, opts):
        #settings = get_project_settings()
        spider_loader = self.crawler_process.spider_loader
        for spidername in args or spider_loader.list():
            print "*********crawlall spidername************" + spidername
            self.crawler_process.crawl(spidername, **opts.spargs)
        self.crawler_process.start()
The key points are self.crawler_process.spider_loader.list(), which returns every spider in the project, and self.crawler_process.crawl, which schedules each of them to run.
3. Add an __init__.py file under the commands directory
touch __init__.py
Note: do not skip this step. I spent a whole day tripped up by it; blame it on being self-taught, I suppose.
If you leave it out, you will get an exception like this:
Traceback (most recent call last):
? File "/usr/local/bin/scrapy", line 9, in
? ? load_entry_point('Scrapy==1.0.0rc2', 'console_scripts', 'scrapy')()
? File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 122, in execute
? ? cmds = _get_commands_dict(settings, inproject)
? File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 50, in _get_commands_dict
? ? cmds.update(_get_commands_from_module(cmds_module, inproject))
? File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 29, in _get_commands_from_module
? ? for cmd in _iter_command_classes(module):
? File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 20, in _iter_command_classes
? ? for module in walk_modules(module_name):
? File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/utils/misc.py", line 63, in walk_modules
? ? mod = import_module(path)
? File "/usr/local/lib/python2.7/importlib/__init__.py", line 37, in import_module
? ? __import__(name)
ImportError: No module named commands
At first I could not figure out what was wrong, and it cost me an entire day; in the end, users on http://stackoverflow.com/ pointed me to the answer. Thanks again to the almighty Internet (how much nicer things would be without that wall!). But I digress; back to the topic.
4. Create a setup.py in the same directory as settings.py (this step can be dropped with no effect; I am not sure why the official documentation includes it. As far as I can tell, the scrapy.commands entry point only matters if you package and install the module with setuptools, whereas inside a project the COMMANDS_MODULE setting in step 5 is enough.)
from setuptools import setup, find_packages
setup(name='scrapy-mymodule',
  entry_points={
    'scrapy.commands': [
      'crawlall=cnblogs.commands:crawlall',
    ],
  },
)
This file declares a command named crawlall: cnblogs.commands is the module that holds the command, and crawlall is the command name.
5. Add the following setting to settings.py:
COMMANDS_MODULE = 'cnblogs.commands'
6. Run the command: scrapy crawlall
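Based on the run() method above, running the command with no arguments falls back to spider_loader.list() and schedules every spider in the project, so CnblogsHomeSpider and CnblogsSpider run together in a single process. A few illustrative invocations (the -a argument name below is just a placeholder, not a real spider argument in this project):

scrapy crawlall                        # run every spider in the project
scrapy crawlall CnblogsHomeSpider      # run only the spiders named on the command line
scrapy crawlall -a somearg=somevalue   # pass -a NAME=VALUE arguments to every spider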