Scrapy + Redis + MySQL: Distributed Crawling of Product Listings

The source code is based on a Python 3 distributed Taobao crawler built with Scrapy. I made some changes: updated paths that no longer work and added some content. The project uses a random User-Agent, scrapy-redis for distributed crawling, and a MySQL database to store the data.


Contents
Step 1: Create and configure the Scrapy project
Step 2: Export the data to a JSON file and to MySQL
Step 3: Set up random User-Agent request headers
Step 4: Configure scrapy-redis for distributed crawling

Data analysis part: Taobao foundation-makeup market analysis, July 2018


Development environment

  • OS: macOS High Sierra
  • Python third-party libraries: scrapy, pymysql, scrapy-redis, redis, redis-py
  • Python version: Anaconda 4.5.8, bundled Python 3.6.4
  • Databases: MySQL 8.0.11, Redis 4.0.1

Step 1: Create the Scrapy project

In the terminal, run:

scrapy startproject taobao
cd taobao
scrapy genspider -t basic tb taobao.com
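
After these commands, the generated project should look roughly like this (standard Scrapy scaffolding; the exact files may differ slightly between Scrapy versions):

taobao/
├── scrapy.cfg
└── taobao/
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── tb.py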

1. Writing the spider: tb.py

  • Added scraping of sales volume and product-description fields on top of the original code;
  • Updated the way URLs are classified (Taobao vs. Tmall);
  • The page format returned by the comment-count endpoint (found via packet capture) has changed, so the regular expression was updated.
# -*- coding: utf-8 -*-
import scrapy
import re
from scrapy.http import Request
from taobao.items import TaobaoItem
import urllib.request

class TbSpider(scrapy.Spider):
    name = 'tb'
    allowed_domains = ['taobao.com']
    start_urls = ['http://taobao.com/']

    def parse(self, response):
        key = input("Enter the keyword to crawl\t")
        pages = input("Enter the number of pages to crawl\t")
        print("\n")
        print("Current keyword:", key)
        print("\n")
        for i in range(0, int(pages)):
            url = "https://s.taobao.com/search?q=" + str(key) + "&s=" + str(44*i)
            yield Request(url=url, callback=self.page)
        pass
    # Search results page
    def page(self, response):
        body = response.body.decode('utf-8', 'ignore')

        pat_id = '"nid":"(.*?)"'    # match the item id
        pat_now_price = '"view_price":"(.*?)"'      # match the current price
        pat_address = '"item_loc":"(.*?)"'      # match the seller location
        pat_sale = '"view_sales":"(.*?)人付款"' # sales volume ("人付款" = "people paid" in the page source)

        all_id = re.compile(pat_id).findall(body)
        all_now_price = re.compile(pat_now_price).findall(body)
        all_address = re.compile(pat_address).findall(body)
        all_sale = re.compile(pat_sale).findall(body)

        for i in range(0, len(all_id)):
            this_id = all_id[i]
            now_price = all_now_price[i]
            address = all_address[i]
            sale_count = all_sale[i] 
            url = "https://item.taobao.com/item.htm?id=" + str(this_id)
            yield Request(url=url, callback=self.next, meta={ 'now_price': now_price, 'address': address,'sale_count':sale_count})
            pass
        pass
    # Item detail page
    def next(self, response):
        item = TaobaoItem()
        url = response.url

        # Taobao and Tmall load some of their data via different Ajax mechanisms, so branch on the URL
        if 'tmall' in url:  # Tmall, Tmall Supermarket, Tmall Global
            title = response.xpath("//html/head/title/text()").extract()  # product title
            #price = response.xpath("//span[@class='tm-count']/text()").extract()
            # Original price: this XPath keeps returning an empty value even though it validates in XPath Finder;
            # the cause is unknown for now. Since it would break the database insert, it is commented out.
            # The fields below come from the product description panel, located by matching the label text:
            brand = response.xpath("//li[@id='J_attrBrandName']/text()").re('品牌:\xa0(.*?)$')   # brand
            produce = response.xpath("//li[contains(text(),'產(chǎn)地')]/text()").re('產(chǎn)地:\xa0(.*?)$') # place of origin
            effect = response.xpath("//li[contains(text(),'功效')]/text()").re('功效:\xa0(.*?)$') # effect
            pat_id = 'id=(.*?)&'
            this_id = re.compile(pat_id).findall(url)[0]
            pass
        else:       # Taobao
            title = response.xpath("/html/head/title/text()").extract() # product title
            #price = response.xpath("//em[@class = 'tb-rmb-num']/text()").extract()
            # Original price: commented out for the same reason as above
            brand = response.xpath("//li[contains(text(),'品牌')]/text()").re('品牌:\xa0(.*?)$') # brand
            produce = response.xpath("//li[contains(text(),'產(chǎn)地')]/text()").re('產(chǎn)地:\xa0(.*?)$') # place of origin
            effect = response.xpath("//li[contains(text(),'功效')]/text()").re('功效:\xa0(.*?)$') # effect
            pat_id = 'id=(.*?)$'
            this_id = re.compile(pat_id).findall(url)[0]
            pass

        # Fetch the total comment count from the rate endpoint
        comment_url = "https://rate.taobao.com/detailCount.do?callback=jsonp144&itemId=" + str(this_id)
        comment_data = urllib.request.urlopen(comment_url).read().decode('utf-8', 'ignore')
        each_comment = '"count":(.*?)}'
        comment = re.compile(each_comment).findall(comment_data)


        item['title'] = title
        item['link'] = url
        #item['price'] = price
        item['now_price'] = response.meta['now_price']
        item['comment'] = comment
        item['address'] = response.meta['address']
        item['sale_count'] = response.meta['sale_count']
        item['brand']=brand
        item['produce']=produce
        item['effect']=effect
        
        yield item

2. settings.py configuration

Set the user agent, do not obey robots.txt, and disable cookies.

# -*- coding: utf-8 -*-

# Scrapy settings for taobao project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'taobao'

SPIDER_MODULES = ['taobao.spiders']
NEWSPIDER_MODULE = 'taobao.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0'   # set the user-agent value

# Obey robots.txt rules
ROBOTSTXT_OBEY = False  # do not obey robots.txt

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 100

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY = 0.25 # set a download delay
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False # disable cookies

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'taobao.middlewares.TaobaoSpiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'taobao.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'taobao.pipelines.TaobaoJsonPipeline': 300,  # export to a JSON file
    'taobao.pipelines.TaobaoPipeline': 200,      # export to MySQL
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

3. Add the item container class in items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class TaobaoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    #price = scrapy.Field()
    comment = scrapy.Field()
    now_price = scrapy.Field()
    address = scrapy.Field()
    sale_count = scrapy.Field()
    brand =  scrapy.Field()
    produce = scrapy.Field()
    effect = scrapy.Field()
    pass

Step 2: Export the data and store it in MySQL

1. Export the data as JSON

Write the following into pipelines.py and enable it in settings.py (see the settings.py listing above):

# -*- coding: utf-8 -*-
import json
import codecs

class TaobaoJsonPipeline:
    def __init__(self):
        self.file=codecs.open('taobao.json','w',encoding='utf-8')
    def process_item(self, item, spider):
        lines = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(lines)
        return item
    def close_spider(self, spider):
        self.file.close()

Run the spider by entering in the terminal:

scrapy crawl tb --nolog

The exported file is saved automatically in the spider's project directory.


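Since TaobaoJsonPipeline writes one JSON object per line, the exported file can be read back later for the data-analysis part, for example (a minimal sketch; taobao.json is the file name used in the pipeline above):

import json

# Load the line-delimited JSON written by TaobaoJsonPipeline
items = []
with open('taobao.json', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if line:
            items.append(json.loads(line))

print(len(items), 'items loaded')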

2. Export the data to MySQL

1) First, download and install MySQL

Download link: the dmg package installs with one click. (During installation you are asked to set a password for the root user; choose the legacy password encryption. If you choose the strong encryption option, later connections keep failing.)
After setup, start the database server.


For a GUI, install MySQL Workbench.
Connect to the database in Workbench, create a new schema, then create a table and define its fields (a sketch of the table creation follows after the pymysql installation below):

2) Install the pymysql package

In the terminal: conda install pymysql
or simply pip install pymysql
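
The pipeline in step 3) below inserts into a table named taobaokh with the columns listed in its SQL statement. If you prefer to create the table from Python rather than in Workbench, a minimal sketch with pymysql could look like this (the column types are my assumption, not from the original post; adjust them to your data):

import pymysql

# Assumed schema for the taobaokh table used by the MySQL pipeline below
conn = pymysql.connect(host='127.0.0.1', user='root',
                       passwd='your_database_password', db='your_database_name',
                       charset='utf8')
try:
    with conn.cursor() as cursor:
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS taobaokh (
                title VARCHAR(255),
                link VARCHAR(255),
                comment VARCHAR(50),
                now_price VARCHAR(50),
                address VARCHAR(100),
                sale VARCHAR(50),
                brand VARCHAR(100),
                produce VARCHAR(100),
                effect VARCHAR(255)
            ) DEFAULT CHARSET=utf8
        """)
    conn.commit()
finally:
    conn.close()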

3) pipelines.py setup

Database writes are done asynchronously here, so that inserting rows never falls behind the speed at which pages are crawled and parsed and blocks the pipeline. Python's Twisted framework provides the asynchronous machinery: its connection pool (adbapi) lets the MySQL inserts run asynchronously. For a detailed tutorial, see "Scrapy 入門筆記(4) --- 使用 Pipeline 保存數(shù)據(jù)".

Add the following code to pipelines.py and enable the corresponding pipeline in settings.py (see the settings.py listing above):

# -*- coding: utf-8 -*-
import pymysql
import pymysql.cursors
from twisted.enterprise import adbapi

class TaobaoPipeline(object):
    # Connect to the database
    def __init__(self):
        dbparms = dict(
            host='127.0.0.1',
            db='your_database_name',
            user='root',
            passwd='your_database_password',
            charset='utf8',
            cursorclass=pymysql.cursors.DictCursor,
            use_unicode=True,
        )
        # Specify the DB-API module to use and the connection parameters
        self.dbpool = adbapi.ConnectionPool("pymysql", **dbparms)

    # Use Twisted to run the MySQL insert asynchronously
    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # handle exceptions
        return item  # pass the item on to the next pipeline (the JSON exporter)

    # Handle exceptions raised by the asynchronous insert
    def handle_error(self, failure, item, spider):
        print(failure)
    
    # Perform the actual insert
    def do_insert(self, cursor, item):

        # Pull the fields out of the item
        title = item['title'][0]
        link = item['link']
        #price = item['price'][0]
        comment = item['comment'][0]
        now_price = item['now_price']
        address = item['address']
        sale = item['sale_count']
        brand = item['brand'][0]
        produce = item['produce'][0]
        effect = item['effect'][0]

        print('Title\t', title)
        print('Link\t', link)
        #print('Original price\t', price)
        print('Current price\t', now_price)
        print('Seller location\t', address)
        print('Comment count\t', comment)
        print('Sales\t', sale)
        print('Brand\t', brand)
        print('Origin\t', produce)
        print('Effect\t', effect)

        try:
            sql = "insert into taobaokh(title,link,comment,now_price,address,sale,brand,produce,effect) values(%s,%s,%s,%s,%s,%s,%s,%s,%s)"
            values = (title, link, comment, now_price, address, sale, brand, produce, effect)
            cursor.execute(sql, values)
            print('Inserted successfully')
            print('------------------------------\n')
            return item
        except Exception as err:
            print('Insert failed:', err)

Run the spider:

scrapy crawl tb --nolog

At this point the spider is basically up and running.

Step 3: Set up a random User-Agent

The goal is to switch to a different user-agent on each request, so the crawler masquerades as a browser more convincingly.

1. Updated the UA list from the original code (desktop browsers); append it to the end of settings.py:
USER_AGENT_LIST = [ "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/603.2.4 (KHTML, like Gecko) Version/10.1.1 Safari/603.2.4",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0",
    "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:54.0) Gecko/20100101 Firefox/54.0",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/603.2.5 (KHTML, like Gecko) Version/10.1.1 Safari/603.2.5",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.104 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0",
    "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36",
    "Mozilla/5.0 (iPad; CPU OS 10_3_2 like Mac OS X) AppleWebKit/603.2.4 (KHTML, like Gecko) Version/10.0 Mobile/14F89 Safari/602.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:54.0) Gecko/20100101 Firefox/54.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:54.0) Gecko/20100101 Firefox/54.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.104 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
    "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.1 Safari/603.1.30",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Windows NT 5.1; rv:52.0) Gecko/20100101 Firefox/52.0",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/58.0.3029.110 Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/603.2.5 (KHTML, like Gecko) Version/10.1.1 Safari/603.2.5",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.104 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.104 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:53.0) Gecko/20100101 Firefox/53.0",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36 OPR/46.0.2597.32",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/59.0.3071.109 Chrome/59.0.3071.109 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) Gecko/20100101 Firefox/53.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 OPR/45.0.2552.898",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36 OPR/46.0.2597.39",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:54.0) Gecko/20100101 Firefox/54.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/601.7.7 (KHTML, like Gecko) Version/9.1.2 Safari/601.7.7",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko",
    "Mozilla/5.0 (Windows NT 6.1; rv:52.0) Gecko/20100101 Firefox/52.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36",
                   ]

DOWNLOADER_MIDDLEWARES = {
    'taobao.middlewares.ProcessHeaderMidware': 543,
}

There is also a dedicated user-agent plugin on GitHub that can be called directly (link).

2. Add the following code to middlewares.py:
# encoding: utf-8
from scrapy.utils.project import get_project_settings
import random

settings = get_project_settings()

class ProcessHeaderMidware():
    """Downloader middleware that attaches a random User-Agent to each request."""

    def process_request(self, request, spider):
        """
        Pick a random user-agent from the list and set it on the request.
        """
        ua = random.choice(settings.get('USER_AGENT_LIST'))
        spider.logger.info(msg='now entering download midware')
        if ua:
            request.headers['User-Agent'] = ua
            # Add desired logging message here.
            spider.logger.info(u'User-Agent is : {} {}'.format(request.headers.get('User-Agent'), request))
        pass

Setup complete.

Step 4: Distributed crawling with scrapy-redis

To further improve throughput and resistance to anti-crawling measures, we turn to multi-process, distributed crawling.
Another benefit of scrapy-redis is resumable crawls: when Scrapy got stuck mid-crawl, I simply opened a new terminal, ran the crawl command again, and it carried on from where it left off.

1. Setting up the scrapy-redis environment

Three libraries need to be installed: redis, scrapy-redis, and redis-py.
1) redis
Install directly with conda install redis (or pip install redis).
2) scrapy-redis
Anaconda does not ship a scrapy-redis package, so download the third-party zip archive (download link). Installation: enter the following in the terminal, line by line:

cd /Users/username/Downloads
unzip scrapy-redis-master.zip -d /Users/username/Downloads/  # extract to the given path
cd scrapy-redis-master
python setup.py install  # install the package
password:*****           # enter your password

If you are not using Anaconda, running pip install scrapy-redis in the terminal should also work.
3) redis-py
After installing Redis, my program kept failing with "ImportError: No module named redis". It turns out Python cannot talk to Redis out of the box; the redis-py client has to be installed first (download link).
Install it the same way as above.

2. Modify the Scrapy project files

1) Add the following to settings.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler"  # use the Redis-backed scheduler to store the request queue
SCHEDULER_PERSIST = True    # do not clear the Redis queues, so crawls can be paused/resumed
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # deduplicate requests across all spiders via Redis
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
REDIS_HOST = '127.0.0.1'  # can be changed to localhost if needed
REDIS_PORT = 6379
REDIS_URL = None
2) Add the following to items.py
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst, Join

class TaobaoSpiderLoader(ItemLoader):
    default_item_class = TaobaoItem
    default_input_processor = MapCompose(lambda s: s.strip())
    default_output_processor = TakeFirst()
    description_out = Join()
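
This loader is not actually wired into tb.py in this post; if you wanted to use it, a hypothetical minimal usage inside a detail-page callback could look like the sketch below (the field names come from TaobaoItem above; parse_detail is an illustrative name, not part of the original spider):

# Hypothetical usage of TaobaoSpiderLoader - not part of the original tb.py
from taobao.items import TaobaoSpiderLoader

def parse_detail(self, response):
    loader = TaobaoSpiderLoader(response=response)
    loader.add_xpath('title', '/html/head/title/text()')      # stripped by default_input_processor
    loader.add_value('link', response.url)
    loader.add_value('now_price', response.meta['now_price'])
    yield loader.load_item()                                   # TakeFirst() yields single values, not lists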
3) Modify tb.py

Import the relevant class:

from scrapy_redis.spiders import RedisSpider

Modify the TbSpider class:

class TbSpider(RedisSpider):
    name = 'tb'
    #allowed_domains = ['taobao.com']
    #start_urls = ['http://taobao.com/']
    redis_key = 'Taobao:start_urls'

Configuration complete!

3. Run the distributed crawler

1) Open a terminal and start the Redis server with redis-server:

localhost:~ $ redis-server
3708:C 20 Jul 22:42:41.914 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
3708:C 20 Jul 22:42:41.915 # Redis version=4.0.10, bits=64, commit=00000000, modified=0, pid=3708, just started
3708:C 20 Jul 22:42:41.915 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
3708:M 20 Jul 22:42:41.916 * Increased maximum number of open files to 10032 (it was originally set to 256).
                _._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 4.0.10 (00000000/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._                                   
 (    '      ,       .-`  | `,    )     Running in standalone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
 |    `-._   `._    /     _.-'    |     PID: 3708
  `-._    `-._  `-./  _.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |           http://redis.io        
  `-._    `-._`-.__.-'_.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |                                  
  `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               

3708:M 20 Jul 22:42:41.920 # Server initialized
3708:M 20 Jul 22:42:41.920 * DB loaded from disk: 0.000 seconds
3708:M 20 Jul 22:42:41.920 * Ready to accept connections

Once you see this banner, the Redis server is running; keep this terminal running in the background.

2) Open a new terminal and run the spider:

scrapy crawl tb --nolog

At this point the spider sits idle, waiting for a start URL to be set.

3) Open another new terminal and enter:

redis-cli
127.0.0.1:6379>LPUSH Taobao:start_urls http://taobao.com
(integer) 1 

A return value of (integer) 1 means the seed URL was set successfully. (The Taobao:start_urls in the command matches redis_key = 'Taobao:start_urls' in tb.py.)
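
The same seed URL can also be pushed from Python with redis-py, using the host, port, and key configured above; a minimal sketch:

import redis

# Push the start URL into the list that RedisSpider reads (same key as redis_key in tb.py)
r = redis.Redis(host='127.0.0.1', port=6379)
r.lpush('Taobao:start_urls', 'http://taobao.com')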

4) The spider now starts running. Unlike Windows, macOS does not pop up multiple terminal windows; everything runs in a single terminal, but it is noticeably much faster.

5) To stop the crawl mid-way, press Ctrl+C.
If you then run scrapy crawl tb --nolog again, the crawl resumes from where it stopped, because SCHEDULER_PERSIST = True is set in settings.py.
To disable this behaviour, change True to False.

6) After the crawl finishes, clear the Redis cache:

127.0.0.1:6379>flushdb
ok

Done!

Summary:

A Taobao product crawler was built with Python 3.6 and Scrapy, made distributed with scrapy-redis, and the data was stored in MySQL.


Open issues

  • The original price on Tmall item pages consistently fails to scrape: the XPath validates in XPath Finder but always returns an empty value at runtime; the element is probably loaded asynchronously. To be investigated.
  • While crawling Tmall links, many URLs are redirected (301/302) and no data can be scraped; this looks like an anti-crawling measure that redirects to a login page.

(Disclaimer: this article is for learning and exchange only and not for any other purpose.)

