
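Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. It has a wide range of uses, including data mining, monitoring, and automated testing.
The Scrapy framework on its own can already handle a large share of crawling work. For larger crawls you can add Python multithreading or multiprocessing, and if you have several servers, distributed crawling is the best and most efficient approach.
Scrapy-redis is a Scrapy component built on Redis. It maintains the queue of URLs waiting to be crawled, deduplicates them, and stores request fingerprints for verification. The principle is that Redis holds one shared URL queue: crawlers on different machines push every URL they discover into that queue, every crawler pulls its next URL from the same queue, and all of them save their data to the same database.
I took Cui Qingcai's Zhihu crawler course a while ago, but was never quite clear on how to build a distributed crawler with scrapy-redis. So below I set one up using MongoDB and Redis.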
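
To make the shared-queue idea concrete, here is a minimal in-memory sketch of a URL queue with fingerprint deduplication. It is only an illustration: `SharedFrontier` and `fingerprint` are hypothetical names, the queue and the fingerprint set live in process memory rather than in Redis, and the hash used here is an assumption, not the fingerprint scrapy-redis actually computes.

```python
import hashlib
from collections import deque


def fingerprint(method: str, url: str) -> str:
    """Hash the request method and URL into a fixed-length fingerprint."""
    return hashlib.sha1(f"{method.upper()} {url}".encode("utf-8")).hexdigest()


class SharedFrontier:
    """In-memory stand-in for the Redis-backed URL queue and fingerprint set."""

    def __init__(self):
        self.queue = deque()  # plays the role of the shared request queue
        self.seen = set()     # plays the role of the shared fingerprint set

    def push(self, url: str, method: str = "GET") -> bool:
        """Enqueue the URL unless an identical request was already seen."""
        fp = fingerprint(method, url)
        if fp in self.seen:
            return False
        self.seen.add(fp)
        self.queue.append(url)
        return True

    def pop(self):
        """Return the next URL to crawl, or None when the queue is empty."""
        return self.queue.popleft() if self.queue else None


frontier = SharedFrontier()
# Two "machines" discover overlapping URLs; the duplicate is dropped.
machine_a = ["https://example.com/a", "https://example.com/b"]
machine_b = ["https://example.com/b", "https://example.com/c"]
accepted = [frontier.push(url) for url in machine_a + machine_b]
print(accepted, len(frontier.queue))
```

Because every crawler pushes to and pops from the same structure, a URL found by two machines is only crawled once.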

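1. scrapy-redis distributed architecture diagram:
- The Scheduler takes the URL of the next request from Redis and passes it to the Downloader, which downloads the page. The page is then handed to the spiders, which hold the data-extraction logic, and finally the item objects holding the structured data go through the Item Pipeline and are saved in the Redis database.
- The item processes on the other machines work like the single process shown in the diagram, while the master crawler maintains the URL queue in the Redis database.
  
  Distributed crawler architecture diagram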
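2. Prerequisites:

1. One Linux machine (I use an Alibaba Cloud ECS instance running CentOS 7.2; for the ECS setup process, see my earlier article on installing Alibaba Cloud ECS)
2. Redis for Windows together with RedisDesktopManager for Windows, and Redis for Linux
3. Anaconda (Windows) and Anaconda (Linux)
4. MongoDB (Linux)
5. Robomongo 0.9.0 (a GUI management tool for MongoDB)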


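3. Install the Redis client on Windows and the Redis server on Linux.
- I installed Redis 2.8.2402 and the Redis GUI tool RedisDesktopManager.
- Installing Redis and RedisDesktopManager on Windows is very simple: just click Next through the installers.
- To verify that Redis works, open a Windows command prompt, change to the directory where you installed Redis, and enter the command shown below. My installation directory is the redis directory on drive D:
  
  Starting redis-server
- The Redis binary package includes the Redis client. Open another command prompt and enter the command shown in the figure below to connect to the local Windows Redis database.
  
  Starting the Redis client
- If the command prompt does not appeal to you and feels hard to manage, this is where RedisDesktopManager comes in. After installation, start RedisDesktopManager as shown below and enter the information in the figure to connect to the local Redis database:
  
  RedisDesktopManager
- That completes the Redis installation on Windows. The road ahead still feels long.
  
  A long road ahead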
4. Install Redis on Alibaba Cloud ECS:
- Log in to the Alibaba Cloud ECS terminal with Xshell and run the following command to install Redis:
```
[author@iZpq90f23ft5jyj3s7fmduhZ ~]# yum -y install redis
```
- My Alibaba Cloud system is CentOS 7.2. If yours is Ubuntu, you can install Redis with the following command:
```
[author@iZpq90f23ft5jyj3s7fmduhZ ~]$sudo apt-get install redis
```
- After installation the Redis service starts automatically. Run the following command to check that Redis is running.
```
  [author@iZpq90f23ft5jyj3s7fmduhZ ~]# ps -aux|grep redis
  root     13925  0.0  0.0 112648   964 pts/0    R+   14:42   0:00 grep --color=auto redis
  redis    29418  0.0  0.6 151096 11912 ?        Ssl  Sep22   1:25 /usr/bin/redis-server *:6379
```
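- Running Redis without a password is like running naked down the street. I still vaguely remember the earlier incidents in which MongoDB databases in China and abroad were dumped, so set a password for Redis. The configuration file of a default installation is under /etc/, as shown below. Change the following entries in it:
```
 [author@iZpq90f23ft5jyj3s7fmduhZ ~]# vim /etc/redis.conf
 # bind 127.0.0.1   (comment out the bound IP address; to accept connections only on a specific address, put your own IP address here instead)
   requirepass xxxxxxx   (xxxxxxx is the password you choose; remove the # in front of requirepass)
   port 6379   (the port used to connect to Redis; you can change it, I keep the default)
   protected-mode no   (change the default yes to no)
```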
- After making the changes, save and exit, then restart the Redis service:
```
 [author@iZpq90f23ft5jyj3s7fmduhZ ~]# service redis restart
```
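Before restarting, it can help to double-check which directives are actually active in the file. The following is a small sketch that reads redis.conf-style text and flags the two settings most likely to block a remote, password-protected connection. The function names and the sample text are hypothetical, and the parser only handles simple `name value` lines.

```python
def active_directives(conf_text: str) -> dict:
    """Return the uncommented directives of a redis.conf-style text."""
    directives = {}
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out directives
        name, _, value = line.partition(" ")
        directives[name.lower()] = value.strip()
    return directives


def remote_access_problems(directives: dict) -> list:
    """List reasons why a remote, password-protected connection might fail."""
    problems = []
    if "requirepass" not in directives:
        problems.append("requirepass is not set")
    bind = directives.get("bind")
    if bind and all(addr in ("127.0.0.1", "::1") for addr in bind.split()):
        problems.append("bind only allows local connections")
    return problems


# Hypothetical file content mirroring the four directives edited above.
sample = """
# bind 127.0.0.1
requirepass example-password
port 6379
protected-mode no
"""
settings = active_directives(sample)
problems = remote_access_problems(settings)
print(settings, problems)
```

To check the real file, pass `open("/etc/redis.conf").read()` instead of the sample text.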
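Pitfall
- Connect to Redis on Alibaba Cloud with RedisDesktopManager on Windows:
  
  Connecting to the database
- The unexpected always happens: the connection fails. This is because of Alibaba Cloud's security rules. Port 6379 must be opened in the security group before you can connect.
  
  Tragedy
- Log in to the Alibaba Cloud management console and add a security group rule, as shown below. The authorization object 0.0.0.0/0 allows every IP address to connect to Redis, and the port range 6379/6379 opens only port 6379.
  
  Opening port 6379 for Redis
- With the security group configured, set the IP address and password in RedisDesktopManager to log in to the Redis database on Alibaba Cloud:
  
  Connected to the Redis database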
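
When a GUI client simply times out, it is hard to tell a blocked port from a wrong password. A plain TCP probe separates the two cases. The sketch below uses only the standard library; `port_is_open` is a hypothetical helper, and the demonstration probes a throwaway local socket rather than a real Redis server.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Local demonstration: listen on a free port and probe it.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
server.listen(1)
demo_port = server.getsockname()[1]
reachable = port_is_open("127.0.0.1", demo_port)
server.close()
print(reachable)
```

Against the cloud server you would call `port_is_open("<public IP of the ECS instance>", 6379)`. If it returns False, look at the security group rule before suspecting the Redis password.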
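5. Install Anaconda:
- Installing Anaconda 4.4.0 on Windows is simple: download the executable and click Next through the installer. Anaconda bundles a Python interpreter, and I chose the Python 3.6 version. On Windows, run the following command to see which packages Anaconda has installed: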
```
C:\User\Username>conda list
```
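- Installing the Scrapy framework directly on Windows is troublesome and often fails with obscure dependency errors, so I chose Anaconda, which installs the scrapy, scrapy-redis, pymongo, and redis packages quickly. You can of course also install them with pip.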
```
conda install scrapy
conda install scrapy-redis
conda install pymongo
conda install redis  
```
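- The Anaconda 4.4.0 Linux installer script can be downloaded on Windows and then uploaded to Alibaba Cloud ECS with FileZilla. Once it is on the Linux machine, run the command below. The Linux installer requires you to press Enter manually and asks whether to add the conda command to your environment variables. Whenever it asks a question, just answer yes: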
```
[author@iZpq90f23ft5jyj3s7fmduhZ ~]# bash Anaconda3-4.4.0-Linux-x86_64.sh
```
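- After installing Anaconda, type python at the command line and you will see Python 3.6. The default Python on Alibaba Cloud ECS CentOS 7.2 is Python 2.7. Use Anaconda to install the pymongo, redis, scrapy, and scrapy-redis dependencies.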
```
 [author@iZpq90f23ft5jyj3s7fmduhZ ~]# python
  Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:09:58) 
  [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> 
  >>> 
  [author@iZpq90f23ft5jyj3s7fmduhZ ~]# conda install scrapy
  [author@iZpq90f23ft5jyj3s7fmduhZ ~]# conda install scrapy-redis
  [author@iZpq90f23ft5jyj3s7fmduhZ ~]# conda install pymongo
  [author@iZpq90f23ft5jyj3s7fmduhZ ~]# conda install redis  
```
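
Since the same four packages must be present on both the Windows and the Linux machine, a quick check on each one saves a failed crawl later. This is a small sketch using only the standard library; `missing_modules` is a hypothetical helper, and the list holds the import names of the packages (note `scrapy_redis` with an underscore).

```python
import importlib.util


def missing_modules(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Import names of the four packages installed above.
required = ["scrapy", "scrapy_redis", "pymongo", "redis"]
missing = missing_modules(required)
print("missing:", missing or "none")
```

Run it with the Anaconda interpreter on each machine. Anything listed as missing still needs a `conda install` or `pip install`.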

Why is the installation still not finished?

6. Install MongoDB on Alibaba Cloud ECS:

Download MongoDB 3.4.9 from the MongoDB website and upload it to Alibaba Cloud ECS with FileZilla.
Run the following commands on Alibaba Cloud ECS to install MongoDB. The db that db.createUser runs against is the database the crawler will use later, so switch to it first. For details on db.createUser, see the MongoDB documentation.

 [author@iZpq90f23ft5jyj3s7fmduhZ ~]# tar -vxzf  mongodb-linux-x86_64-amazon-3.4.9.tgz
 [author@iZpq90f23ft5jyj3s7fmduhZ ~]# mv  mongodb-linux-x86_64-amazon-3.4.9 mongodb
 [author@iZpq90f23ft5jyj3s7fmduhZ ~]# cd mongodb
 [author@iZpq90f23ft5jyj3s7fmduhZ mongodb]# mkdir db
 [author@iZpq90f23ft5jyj3s7fmduhZ mongodb]# mkdir logs
 [author@iZpq90f23ft5jyj3s7fmduhZ mongodb]# cd logs
 [author@iZpq90f23ft5jyj3s7fmduhZ logs]# touch mongodb.log
 [author@iZpq90f23ft5jyj3s7fmduhZ logs]# cd ..
 [author@iZpq90f23ft5jyj3s7fmduhZ mongodb]# cd bin
 [author@iZpq90f23ft5jyj3s7fmduhZ bin]# touch mongodb.conf   (configuration file that sets MongoDB's log path and data path)
 # Contents of mongodb.conf:
  dbpath=/author/mongodb/db
  logpath=/author/mongodb/logs/mongodb.log
  port=27017
  fork=true
  nohttpinterface=true
 ##############################
 [author@iZpq90f23ft5jyj3s7fmduhZ bin]# ./mongod --config mongodb.conf   (start MongoDB)
 [author@iZpq90f23ft5jyj3s7fmduhZ bin]# ./mongo   (start the MongoDB client)
  MongoDB shell version v3.4.9
  connecting to: mongodb://127.0.0.1:27017
  MongoDB server version: 3.4.9
  > use zhihu
  > db.createUser({user:"xxx",pwd:"xxx",roles:[{role:"readWrite",db:"zhihu"}]})
 [author@iZpq90f23ft5jyj3s7fmduhZ bin]# kill -9 pid   (pid is the MongoDB process ID; find it with ps -aux|grep mongodb)
 [author@iZpq90f23ft5jyj3s7fmduhZ bin]# ./mongod --config mongodb.conf --auth   (--auth starts MongoDB with authorization required)
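
Once mongod runs with --auth, every client has to present the user created above. One way to do that from Python is a connection URI that names the authentication database. The sketch below is an illustration under assumptions: `build_mongo_uri` and `can_connect` are hypothetical helpers, the host, user, and password are placeholders, and the user is assumed to have been created in the zhihu database.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host, user, password, auth_db, port=27017):
    """Build a MongoDB URI that authenticates against auth_db."""
    # Percent-encode the credentials so characters like '@' or ':' are safe.
    return (f"mongodb://{quote_plus(user)}:{quote_plus(password)}"
            f"@{host}:{port}/?authSource={quote_plus(auth_db)}")


def can_connect(uri):
    """Ping the server; needs pymongo installed and a reachable mongod."""
    import pymongo  # imported lazily so the URI helper works without it
    from pymongo.errors import PyMongoError

    client = pymongo.MongoClient(uri, serverSelectionTimeoutMS=3000)
    try:
        client.admin.command("ping")
        return True
    except PyMongoError:
        return False
    finally:
        client.close()


# Placeholder host and credentials, not values from this setup.
uri = build_mongo_uri("203.0.113.10", "crawler", "p@ss:word", "zhihu")
print(uri)
```

Calling `can_connect(uri)` from the Windows machine is a quick way to confirm both the credentials and the open port before starting the crawler.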

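7. Install the Robomongo GUI tool on Windows:
- Installing Robomongo is simple, so I will not describe it. After installation, the username is the user created a moment ago and zhihu is the database to connect to. At first the connection times out and fails. The cause is the same as with Redis, namely Alibaba Cloud's security rules, so open port 27017 in the same way you opened the Redis port.
  
  Logging in
  
  Login successful
- At last all the required tools are installed. To do a good job you must first sharpen your tools, but it has been more painful than words can say.
  
  Never-ending, isn't it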

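8. The scrapy-redis source code. It is based on Cui Qingcai's source code. Packet capture shows that the format of Zhihu's JSON data has changed, and my own MongoDB installation requires authentication, so I rewrote parts of it. (Cui Qingcai's source code)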

settings.py configuration file:

  # -*- coding: utf-8 -*-
  # Scrapy settings for zhihuuser project
  #
  # For simplicity, this file contains only settings considered important or
  # commonly used. You can find more settings consulting the documentation:
  #
  #     http://doc.scrapy.org/en/latest/topics/settings.html
  #     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
  #     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

    BOT_NAME = 'zhihuuser'

    SPIDER_MODULES = ['zhihuuser.spiders']
    NEWSPIDER_MODULE = 'zhihuuser.spiders'


  # Crawl responsibly by identifying yourself (and your website) on the user-agent
  # USER_AGENT = 'zhihuuser (+http://www.yourdomain.com)'
  # Obey robots.txt rules
    ROBOTSTXT_OBEY = False

  # Configure maximum concurrent requests performed by Scrapy (default: 16)
  #CONCURRENT_REQUESTS = 32

  # Configure a delay for requests for the same website (default: 0)
  # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
  # See also autothrottle settings and docs
  #DOWNLOAD_DELAY = 3
  # The download delay setting will honor only one of:
  #CONCURRENT_REQUESTS_PER_DOMAIN = 16
  #CONCURRENT_REQUESTS_PER_IP = 16
  # Disable cookies (enabled by default)
  #COOKIES_ENABLED = False
  # Disable Telnet Console (enabled by default)
  #TELNETCONSOLE_ENABLED = False
  # Override the default request headers:
    DEFAULT_REQUEST_HEADERS = {
     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
     'Accept-Language':'en',
     'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
     'authorization':'oauth c3cef7c66a1843f8b3a9e6a1e3160e20'
    }  
  # Enable or disable spider middlewares
  # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
  #SPIDER_MIDDLEWARES = {
  #    'zhihuuser.middlewares.ZhihuuserSpiderMiddleware': 543,
  #}

  # Enable or disable downloader middlewares
  # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
  #DOWNLOADER_MIDDLEWARES = {
  #    'zhihuuser.middlewares.MyCustomDownloaderMiddleware': 543,
  #}

  # Enable or disable extensions
  # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
  #EXTENSIONS = {
  #    'scrapy.extensions.telnet.TelnetConsole': None,
  #}

  # Configure item pipelines
  # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
      'zhihuuser.pipelines.MongoPipeline': 300,
      # 'zhihuuser.pipelines.JsonWriterPipeline': 300,
      'scrapy_redis.pipelines.RedisPipeline': 301
    }
  # Enable and configure the AutoThrottle extension (disabled by default)
  # See http://doc.scrapy.org/en/latest/topics/autothrottle.html
  #AUTOTHROTTLE_ENABLED = True
  # The initial download delay
  #AUTOTHROTTLE_START_DELAY = 5
  # The maximum download delay to be set in case of high latencies
  #AUTOTHROTTLE_MAX_DELAY = 60
  # The average number of requests Scrapy should be sending in parallel to
  # each remote server
  #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
  # Enable showing throttling stats for every response received:
  #AUTOTHROTTLE_DEBUG = False
  # Enable and configure HTTP caching (disabled by default)
  # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
  #HTTPCACHE_ENABLED = True
  #HTTPCACHE_EXPIRATION_SECS = 0
  #HTTPCACHE_DIR = 'httpcache'
  #HTTPCACHE_IGNORE_HTTP_CODES = []
  #HTTPCACHE_STORAGE =  'scrapy.extensions.httpcache.FilesystemCacheStorage'
    MONGO_URI='hostIP'
    MONGO_DATABASE='zhihu'
    MONGO_USER="username"
    MONGO_PASS="password"
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    REDIS_URL = 'redis://username:pass@hostIP:6379'
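
A password that contains characters such as `@` or `/` is awkward to embed in a URL. As far as I know, scrapy-redis also reads separate `REDIS_HOST`, `REDIS_PORT`, and `REDIS_PARAMS` settings, so the fragment below is a sketch of that alternative for settings.py. Treat it as an assumption to verify against your installed version. The host and password are placeholders, and `REDIS_URL` should be removed when using this form because it takes precedence when set.

```python
# Placeholder values; replace them with your own server address and password.
REDIS_HOST = "203.0.113.10"
REDIS_PORT = 6379
# Extra keyword arguments handed to the Redis client; the password needs no
# URL escaping here because it is never embedded in a URL.
REDIS_PARAMS = {"password": "p@ss/word"}
```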

pipelines.py, the item pipelines:

    # -*- coding: utf-8 -*-
    # Define your item pipelines here
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
    import pymongo


    class MongoPipeline(object):
        collection_name = "users"

        def __init__(self, mongo_uri, mongo_db, mongo_user, mongo_pass):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db
            self.mongo_user = mongo_user
            self.mongo_pass = mongo_pass

        @classmethod
        def from_crawler(cls, crawler):
            # Read the MongoDB connection settings from settings.py
            return cls(
                mongo_uri=crawler.settings.get('MONGO_URI'),
                mongo_db=crawler.settings.get('MONGO_DATABASE'),
                mongo_user=crawler.settings.get("MONGO_USER"),
                mongo_pass=crawler.settings.get("MONGO_PASS"),
            )

        def open_spider(self, spider):
            # Database.authenticate() was removed in PyMongo 4, so pass the
            # credentials to MongoClient and authenticate against mongo_db.
            self.client = pymongo.MongoClient(
                self.mongo_uri,
                username=self.mongo_user,
                password=self.mongo_pass,
                authSource=self.mongo_db,
            )
            self.db = self.client[self.mongo_db]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            # To update existing users instead of inserting duplicates:
            # self.db[self.collection_name].update_one({'url_token': item['url_token']}, {'$set': dict(item)}, upsert=True)
            # return item
            # insert_one() replaces Collection.insert(), which PyMongo 4 removed.
            self.db[self.collection_name].insert_one(dict(item))
            return item
    # import json
    # class JsonWriterPipeline(object):
    #     def __init__(self):
    #         self.file = open('data.json', 'w',encoding='UTF-8')
    #     def process_item(self, item, spider):
    #         #self.file.write("starting to print\n")
    #         line = json.dumps(dict(item)) + "\n"
    #         self.file.write(line)
    #         return item
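
With several crawlers feeding the same collection, a plain insert can store the same user more than once if a profile is fetched again. The commented-out update line in the pipeline hints at an upsert keyed on `url_token`. The sketch below shows that idea in isolation. `build_upsert` and `save_user` are hypothetical helpers, and `collection` is assumed to be a pymongo `Collection`.

```python
def build_upsert(item: dict, key: str = "url_token"):
    """Return the (filter, update) pair for an upsert keyed on one field."""
    if key not in item:
        raise KeyError(f"item has no {key!r} field")
    return {key: item[key]}, {"$set": dict(item)}


def save_user(collection, item: dict):
    """Insert the user, or refresh the stored copy if it already exists."""
    query, update = build_upsert(item)
    # upsert=True inserts a new document when the filter matches nothing.
    collection.update_one(query, update, upsert=True)


# Hypothetical item with only two of the fields defined in items.py.
query, update = build_upsert({"url_token": "example-user", "answer_count": 3})
print(query, update)
```

Inside `process_item` this would be called as `save_user(self.db[self.collection_name], dict(item))`.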

items.py. Zhihu's JSON data has changed, so I rewrote this part:

    # -*- coding: utf-8 -*-
    # Define here the models for your scraped items
    # See documentation in:
    # http://doc.scrapy.org/en/latest/topics/items.html
    from scrapy import Item, Field


    class ZhihuuserItem(Item):
        allow_message = Field()
        answer_count = Field()
        articles_count = Field()
        avatar_url_template = Field()
        badge = Field()
        employments = Field()
        follower_count = Field()
        gender = Field()
        headline = Field()
        id = Field()
        is_advertiser = Field()
        is_blocking = Field()
        is_followed = Field()
        is_following = Field()
        url = Field()
        url_token = Field()
        user_type = Field()

zhihu.py, the spider:

    # -*- coding: utf-8 -*-
    import json

    from scrapy import Spider, Request

    from zhihuuser.items import ZhihuuserItem


    class ZhihuSpider(Spider):
        name = "zhihu"
        allowed_domains = ["www.zhihu.com"]
        start_urls = ['http://www.zhihu.com/']
        # URL of the list of users that a user follows
        follows_url = "https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&offset={offset}&limit={limit}"
        # URL of a user's detailed profile
        user_url = "https://www.zhihu.com/api/v4/members/{user}?include={include}"
        # The user the crawl starts from
        start_user = "zhang-yu-meng-7"
        # include parameter for the user profile request
        user_query = 'locations,employments,gender,educations,business,voteup_count,thanked_Count,follower_count,following_count,cover_url,following_topic_count,following_question_count,following_favlists_count,following_columns_count,answer_count,articles_count,pins_count,question_count,commercial_question_count,favorite_count,favorited_count,logs_count,marked_answers_count,marked_answers_text,message_thread_token,account_status,is_active,is_force_renamed,is_bind_sina,sina_weibo_url,sina_weibo_name,show_sina_weibo,is_blocking,is_blocked,is_following,is_followed,mutual_followees_count,vote_to_count,vote_from_count,thank_to_count,thank_from_count,thanked_count,description,hosted_live_count,participated_live_count,allow_message,industry_category,org_name,org_homepage,badge[?(type=best_answerer)].topics'
        # include parameter for the followee list request
        follows_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'
        # URL and include parameter for the follower list request
        followers_url = 'https://www.zhihu.com/api/v4/members/{user}/followers?include={include}&offset={offset}&limit={limit}'
        followers_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'

        def start_requests(self):
            # Seed the crawl with the start user's profile, followers, and followees
            yield Request(self.user_url.format(user=self.start_user, include=self.user_query), self.parse_user)
            yield Request(self.followers_url.format(user=self.start_user, include=self.followers_query, limit=20, offset=0), self.parse_followers)
            yield Request(self.follows_url.format(user=self.start_user, include=self.follows_query, limit=20, offset=0), self.parse_follows)

        # Save the user's detailed profile
        def parse_user(self, response):
            result = json.loads(response.text)
            item = ZhihuuserItem()
            # Copy only the fields that are declared on the item
            for field in item.fields:
                if field in result.keys():
                    item[field] = result.get(field)
            yield item

        # Walk the list of users that this user follows
        def parse_follows(self, response):
            results = json.loads(response.text)
            if 'data' in results.keys():
                for result in results.get('data'):
                    yield Request(self.user_url.format(user=result.get('url_token'), include=self.user_query), self.parse_user)
            # Request the next page until the API reports the end of the list
            if 'paging' in results.keys() and results.get('paging').get('is_end') is False:
                next_page = results.get('paging').get('next')
                yield Request(next_page, self.parse_follows)

        # Walk the list of users who follow this user
        def parse_followers(self, response):
            results = json.loads(response.text)
            if 'data' in results.keys():
                for result in results.get('data'):
                    yield Request(self.user_url.format(user=result.get('url_token'), include=self.user_query), self.parse_user)
            if 'paging' in results.keys() and results.get('paging').get('is_end') is False:
                next_page = results.get('paging').get('next')
                yield Request(next_page, self.parse_followers)
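
The two list callbacks in the spider share the same paging logic, and it can be exercised without Scrapy or network access. The sketch below pulls that logic into two plain functions. `next_page_url` and `url_tokens` are hypothetical helpers, and the sample body is invented to match only the fields the spider reads (`data`, `url_token`, `paging.is_end`, `paging.next`).

```python
import json


def next_page_url(body: str):
    """Return the URL of the next page, or None when the listing has ended."""
    payload = json.loads(body)
    paging = payload.get("paging") or {}
    if paging.get("is_end") is False:
        return paging.get("next")
    return None


def url_tokens(body: str):
    """Return the url_token of every user in the page's data array."""
    return [user.get("url_token") for user in json.loads(body).get("data", [])]


# Invented response body, shaped like the fields the spider reads.
sample = json.dumps({
    "data": [{"url_token": "user-a"}, {"url_token": "user-b"}],
    "paging": {"is_end": False, "next": "https://example.com/page2"},
})
print(url_tokens(sample), next_page_url(sample))
```

Keeping the parsing in small functions like these makes it easier to notice when the JSON format changes again.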

Start the crawler process on Windows and on Linux, then look at the data collected:

Start the crawler on Windows:

scrapy crawl zhihu

Start the crawler on Alibaba Cloud Linux:

scrapy crawl zhihu

Check Redis:

Redis
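
The same check can be scripted. The sketch below assumes the default key names and data types that, to my knowledge, scrapy-redis uses for a spider named zhihu: a sorted set for pending requests, a set for fingerprints, and a list for items written by RedisPipeline. Verify them against your installed version. `default_keys` and `crawl_progress` are hypothetical helpers, and `client` is assumed to be a redis-py client.

```python
def default_keys(spider_name: str) -> dict:
    """Return the key names scrapy-redis is assumed to use by default."""
    return {
        "requests": f"{spider_name}:requests",      # pending request queue
        "dupefilter": f"{spider_name}:dupefilter",  # request fingerprints
        "items": f"{spider_name}:items",            # items from RedisPipeline
    }


def crawl_progress(client, spider_name: str) -> dict:
    """Count the entries under each key; `client` is a redis-py client."""
    keys = default_keys(spider_name)
    return {
        "pending_requests": client.zcard(keys["requests"]),     # sorted set
        "seen_fingerprints": client.scard(keys["dupefilter"]),  # set
        "stored_items": client.llen(keys["items"]),             # list
    }


keys = default_keys("zhihu")
print(keys)
```

If both machines are sharing the queue, the pending and fingerprint counts change while either crawler is running.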
Check the MongoDB database:

Data


That completes the scrapy-redis distributed setup.
Reference: Cui Qingcai's blog


Implementing a distributed crawler with Scrapy-redis
