1. Install scrapy: pip install scrapy
   Install scrapy-redis: pip install scrapy-redis
2. Install MongoDB
mongod.exe is the server; mongo.exe is the client.
MongoDB is installed under F:\php\mongodb.
F:\php\mongodb\bin>dir   — list the directory contents
Start the server, pointing --dbpath at the data directory (here F:/php/mongodb):
mongod.exe --dbpath F:/php/mongodb
Then open another cmd window, change into the bin directory, and run mongo.exe to connect.
Install pymongo for Python:
pip install pymongo
3. Install redis
Crawl target: lottery draw results from http://www.bwlc.net/
First create the project:
scrapy startproject fucai
Change into the fucai directory, then generate the spider:
scrapy genspider ff bwlc.net
Edit items.py:
import scrapy

class FucaiItem(scrapy.Item):
    # fields for the scraped data: issue number, winning numbers, draw date
    qihao = scrapy.Field()
    kaijiang = scrapy.Field()
    riqi = scrapy.Field()
Then edit ff.py in the spiders directory:
import scrapy
from scrapy.http import Request
from fucai.items import FucaiItem
from scrapy_redis.spiders import RedisSpider

class FfSpider(RedisSpider):
    name = "ff"
    # the spider pulls its start URLs from this redis list
    redis_key = 'ff:start_urls'
    allowed_domains = ["bwlc.net"]

    def parse(self, response):
        # total page count shown in the pager (printed for reference)
        url = response.xpath('//div[@class="fc_fanye"]/span[2]/b[@class="col_red"]/text()').extract()
        print(url)
        # crawl the first two result pages
        for j in range(1, 3):
            page = "http://www.bwlc.net/bulletin/prevqck3.html?page=" + str(j)
            yield Request(url=page, callback=self.next2)

    def next2(self, response):
        for i in response.xpath('//tr[@class]'):
            item = FucaiItem()
            cells = i.xpath('td/text()').extract()
            item["qihao"] = cells[0]
            item["kaijiang"] = cells[1]
            item["riqi"] = cells[2]
            yield item
Add the scrapy-redis and MongoDB settings to settings.py:
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderQueue'
ITEM_PIPELINES = {
    'fucai.pipelines.FucaiPipeline': 300,
}
MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'jike'
MONGODB_DOCNAME = 'reada'
Push the start URL into redis, then run scrapy crawl ff to start crawling.