Basic Setup

Environment

Python:
Python 2.7.11 (v2.7.11:6d1b6a68f775, Dec 5 2015, 20:32:19) [MSC v.1500 32 bit (Intel)] on win32
Redis:
Redis server v=3.2.100 sha=00000000:0 malloc=jemalloc-3.6.0 bits=64 build=dd26f1f93c5130ee
Scrapy:
Scrapy 1.1.1
redis-py:
2.10.5
scrapy-redis:
scrapy-redis-0.6.3
jieba:
jieba-0.38
Source code:
https://github.com/fxsjy/jieba
Study notes:
- Introduction
  https://segmentfault.com/a/1190000004061791
- Segmentation modes
  https://segmentfault.com/a/1190000004065927
- DAG (directed acyclic graph)
  https://segmentfault.com/a/1190000004085949
- Detailed usage walkthrough
  http://blog.csdn.net/u010454729/article/details/40476483

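Installation

Go to the directory that contains pip.exe and run pip install redis to install redis-py. Any other missing component can be installed the same way with pip install modulename.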
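After installing, it can help to confirm which of the packages are importable and at what version. The snippet below is a minimal sketch, not part of the original project: module_version is a hypothetical helper name, and it assumes each package exposes a __version__ attribute (it falls back to "unknown" when one does not).

```python
import importlib


def module_version(name):
    """Return the version string of an importable module, or None if it is missing."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    # Not every package exposes __version__, so fall back to a placeholder.
    return getattr(module, "__version__", "unknown")


if __name__ == "__main__":
    # Import names of the packages listed in the environment section.
    for name in ("scrapy", "redis", "scrapy_redis", "jieba"):
        version = module_version(name)
        print("%-13s %s" % (name, version or "missing, install it with pip"))
```

Any package reported as missing can then be installed with pip as described above.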

install redis-py

Debugging

Debugging Python code
http://www.cnblogs.com/qi09/archive/2012/02/10/2344959.html

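Basic Architecture

Scrapy is written on top of Twisted, an event-driven networking framework, so for concurrency it is implemented with non-blocking (asynchronous) code. The data flow in Scrapy is controlled by the execution engine and proceeds as follows:

1. The engine opens a website (open a domain), locates the Spider that handles it, and asks that Spider for the first URL(s) to crawl.
2. The engine gets the first URLs to crawl from the Spider and schedules them in the Scheduler as Requests.
3. The engine asks the Scheduler for the next URL to crawl.
4. The Scheduler returns the next URL to the engine, and the engine sends it to the Downloader through the downloader middleware (request direction).
5. Once the page finishes downloading, the Downloader generates a Response for it and sends it to the engine through the downloader middleware (response direction).
6. The engine receives the Response from the Downloader and sends it to the Spider for processing through the spider middleware (input direction).
7. The Spider processes the Response and returns scraped Items and new Requests (to follow) to the engine.
8. The engine sends the scraped Items (returned by the Spider) to the Item Pipeline and the Requests (returned by the Spider) to the Scheduler.
9. The process repeats (from step 2) until there are no more requests in the Scheduler, and the engine closes the website.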
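The loop described above can be imitated in a few lines of plain Python. This is only a toy, synchronous sketch under stated assumptions: it uses no Twisted and no middleware, the names crawl, SITE, fake_download, and fake_parse are made up for illustration, and the seen set is added just to keep the fake site from looping (in this project that job belongs to the dupefilter).

```python
from collections import deque


def crawl(start_urls, download, parse, pipeline):
    """Toy, synchronous imitation of the engine loop."""
    scheduler = deque(start_urls)      # the first requests are scheduled
    seen = set(start_urls)
    while scheduler:                   # stop once the scheduler is empty
        url = scheduler.popleft()      # engine asks the scheduler for the next request
        response = download(url)       # downloader turns the request into a response
        for result in parse(url, response):   # spider processes the response
            if isinstance(result, dict):
                pipeline(result)       # scraped items go to the item pipeline
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)   # new requests go back to the scheduler


# A fake three-page site stands in for the network.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b"],
    "/b": [],
}


def fake_download(url):
    return SITE[url]


def fake_parse(url, links):
    yield {"url": url, "link_count": len(links)}   # a scraped item
    for link in links:                             # requests to follow
        yield link


items = []
crawl(["/"], fake_download, fake_parse, items.append)
print([item["url"] for item in items])
```

Each page is visited once, in the order the scheduler hands it out. With scrapy-redis the scheduler queue and the seen set live in Redis, which is what lets several crawler processes share them.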

Scrapy architecture

Directory Structure

Running tree /f dqd in a Windows command prompt prints the following directory structure:

C:\Python27\Scripts>tree /f  dqd
Folder PATH listing
Volume serial number is A057-81B6
C:\PYTHON27\SCRIPTS\DQD
│  docker-compose.yml
│  Dockerfile
│  mongodb2mysql.py
│  process_items.py
│  scrapy.cfg
│
├─.idea
│      dqd.iml
│      misc.xml
│      modules.xml
│      workspace.xml
│
├─dqd
│  │  image_pipelines.py
│  │  image_pipelines.pyc
│  │  items.py
│  │  mongo_pipelines.py
│  │  mongo_pipelines.pyc
│  │  mysql_pipelines.py
│  │  mysql_pipelines.pyc
│  │  redis_pipelines.py
│  │  redis_pipelines.pyc
│  │  settings.py
│  │  settings.pyc
│  │  __init__.py
│  │  __init__.pyc
│  │
│  └─spiders
│          dqdspider.py
│          dqdspider.pyc
│          __init__.py
│          __init__.pyc
│
└─Image
    └─full
        │  full.rar
        │
        └─女球迷采訪：由萌yolanda
                480-150605104925433.jpg
                480-150605104940P1.jpg
                480-15060510522UT.jpg
                480-150605105242F9.jpg
                480-15060510525X18.jpg
                480-150605105312V0.jpg

Download and Storage Management

settings.py configuration

BOT_NAME = 'dqd'

SPIDER_MODULES = ['dqd.spiders']
NEWSPIDER_MODULE = 'dqd.spiders'

USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)'
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"

ITEM_PIPELINES = {
    # 'dqd.image_pipelines.DownloadImagesPipeline': 1,   # download images
    'dqd.redis_pipelines.DqdPipeline': 200,
    'scrapy_redis.pipelines.RedisPipeline': 300,
    'dqd.mongo_pipelines.MongoDBPipeline':400,
    'dqd.mysql_pipelines.MySQLPipeline': 1
}
IMAGES_STORE = r'.\Image'

# Redis is configured in process_items.py

#################    MONGODB     #############################
MONGODB_SERVER='localhost'
MONGODB_PORT=27017
MONGODB_DB='dqd_db'
MONGODB_COLLECTION='dqd_collection'

####################    MYSQL      #############################
MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'dqd_database'
MYSQL_USER = 'root'
MYSQL_PASSWD = '******'

LOG_LEVEL = 'DEBUG'
DEPTH_LIMIT=1
# Introduce an artificial delay to make use of parallelism and to speed up
# the crawl.
DOWNLOAD_DELAY = 0.2
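The numbers in ITEM_PIPELINES set the execution order: Scrapy passes each item through the pipelines from the lowest value to the highest. The short sketch below only sorts the dictionary from the settings above to make that order visible; run_order is a name chosen for illustration.

```python
ITEM_PIPELINES = {
    'dqd.redis_pipelines.DqdPipeline': 200,
    'scrapy_redis.pipelines.RedisPipeline': 300,
    'dqd.mongo_pipelines.MongoDBPipeline': 400,
    'dqd.mysql_pipelines.MySQLPipeline': 1,
}

# Lower values run first, so sort the class paths by their order value.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for position, path in enumerate(run_order, 1):
    print("%d %s" % (position, path.rsplit('.', 1)[-1]))
```

With these values MySQLPipeline runs first and MongoDBPipeline last, so the MySQL pipeline sees each item before DqdPipeline has added its crawled and spider fields.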

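After an Item has been collected in a Spider, it is passed to the Item Pipeline, where several components process it in a defined order.

Each item pipeline component (sometimes just called an "Item Pipeline") is a Python class that implements a simple method. It receives an Item, performs some action on it, and decides whether the Item continues through the pipeline or is dropped and processed no further. Typical uses of an item pipeline are:

- Cleaning HTML data
- Validating scraped data (checking that the item contains certain fields)
- Checking for duplicates (and dropping them)
- Storing the scraped results in a database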
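As a rough illustration of validation, duplicate checking, and clean-up in one place, here is a minimal pipeline sketch. It is not one of this project's pipelines: ValidateAndDedupePipeline and its required_fields are hypothetical, and DropItem is a local stand-in for scrapy.exceptions.DropItem so that the example runs without Scrapy installed.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""


class ValidateAndDedupePipeline(object):
    # Hypothetical choice of required fields; source_url doubles as the duplicate key.
    required_fields = ("source_url", "news_title")

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        # Validation: every required field must be present and non-empty.
        for field in self.required_fields:
            if not item.get(field):
                raise DropItem("Missing %s" % field)
        # Duplicate check: drop items whose URL was already processed.
        if item["source_url"] in self.seen_urls:
            raise DropItem("Duplicate item: %s" % item["source_url"])
        self.seen_urls.add(item["source_url"])
        # Minimal clean-up: strip surrounding whitespace from the title.
        item["news_title"] = item["news_title"].strip()
        return item


pipeline = ValidateAndDedupePipeline()
first = pipeline.process_item({"source_url": "u1", "news_title": " Title "}, spider=None)
print(first["news_title"])
```

Returning the item lets it continue to the next pipeline; raising DropItem stops it there.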

image_pipelines.py

# -*- coding: utf-8 -*-
from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline


class DownloadImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Issue one download request per image URL. The item and the position
        # of the URL in the list travel in meta so that file_path() can use
        # them when naming the file.
        for index, image_url in enumerate(item['image_urls']):
            yield Request(image_url, meta={'item': item, 'index': index})

    def file_path(self, request, response=None, info=None):
        item = request.meta['item']    # the item passed in through meta above
        index = request.meta['index']  # position of this image in image_urls

        # Image file name: the last segment of the URL
        image_guid = request.url.split('/')[-1]
        # Download path: one sub-directory per article title
        filename = u'full/{0}/{1}'.format(item['news_title'], image_guid)
        return filename
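The naming rule in file_path() can be tried on its own. The sketch below repeats that rule in a standalone function; image_path is a name chosen for illustration, and the title and URL are made up.

```python
def image_path(news_title, image_url):
    """Same naming rule as file_path(): full/<article title>/<last URL segment>."""
    image_guid = image_url.split('/')[-1]
    return u'full/{0}/{1}'.format(news_title, image_guid)


# Hypothetical title and URL, for illustration only.
path = image_path(u'some-article', 'http://example.com/uploads/480-150605104925433.jpg')
print(path)

# enumerate() gives each URL its own position even when a URL repeats,
# whereas list.index() always returns the position of the first match.
image_urls = ['a.jpg', 'b.jpg', 'a.jpg']
positions = [index for index, _ in enumerate(image_urls)]
first_matches = [image_urls.index(url) for url in image_urls]
print("%s %s" % (positions, first_matches))
```

Because the article title becomes a directory name, titles containing characters that the file system rejects would need sanitizing first; the rule above does not do that.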

The images below were downloaded by this pipeline.

Image downloads

redis_pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from datetime import datetime

class DqdPipeline(object):
    def process_item(self, item, spider):
        item["crawled"] = datetime.utcnow()
        item["spider"] = spider.name
        return item

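In the screenshot, dqdspider should have three keys in Redis, but the crawl had already finished, so the dqdspider:request queue had been removed automatically.

dqdspider:request: the queue of requests waiting to be crawled
dqdspider:dupefilter: used to filter out duplicate requests
dqdspider:items: the crawled item data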
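The idea behind the dupefilter key can be sketched without Redis: each request is reduced to a fingerprint, and a request is skipped when its fingerprint is already in the set. This is a simplified illustration, not the scrapy-redis implementation. The fingerprint here hashes only the method and the URL, and SetDupeFilter keeps the set in memory, whereas scrapy-redis stores the fingerprints in a Redis set so that all crawler processes share them.

```python
import hashlib


def fingerprint(method, url):
    """Simplified request fingerprint: SHA-1 over the method and the URL."""
    digest = hashlib.sha1()
    digest.update(method.upper().encode("utf-8"))
    digest.update(url.encode("utf-8"))
    return digest.hexdigest()


class SetDupeFilter(object):
    """In-memory stand-in for the set stored under dqdspider:dupefilter."""

    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, method, url):
        fp = fingerprint(method, url)
        if fp in self.fingerprints:
            return True            # already requested, skip it
        self.fingerprints.add(fp)
        return False               # first time this request is seen


dupefilter = SetDupeFilter()
print(dupefilter.request_seen("GET", "http://example.com/article/1"))
print(dupefilter.request_seen("GET", "http://example.com/article/1"))
```

Because SCHEDULER_PERSIST is True in the settings, the fingerprints stay in Redis after the crawl ends, so a later run does not fetch the same pages again.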

redis

mongo_pipelines.py

# -*- coding:utf-8 -*-
import pymongo
from scrapy.exceptions import DropItem
from scrapy.conf import settings


class MongoDBPipeline(object):
    # Connect to the MongoDB database
    def __init__(self):
        connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # Drop the item if any field is empty. Iterating over an item yields
        # its field names, so the value has to be looked up explicitly.
        for field in item:
            if not item[field]:
                raise DropItem('Missing {0}!'.format(field))
        # insert() is deprecated since pymongo 3.0; insert_one() replaces it there.
        self.collection.insert(dict(item))
        # spider.logger is available from Scrapy 1.0 on and needs no extra import.
        spider.logger.debug('Item added to MongoDB database!')
        return item

To show the data stored in MongoDB, the management tool Robomongo was used to browse the crawled content.

Robomongo.png

mysql_pipelines.py

# -*- coding:utf-8 -*-
from scrapy.conf import settings
import MySQLdb

_DEBUG=True

class MySQLPipeline(object):
    # Connect to the MySQL database
    def __init__(self):
        self.conn = MySQLdb.connect(
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWD'],
            db=settings['MYSQL_DBNAME'],
            host=settings['MYSQL_HOST'],
            charset='utf8',
            use_unicode=True
        )
        self.cursor = self.conn.cursor()
        # Empty both tables before the crawl. Note that TRUNCATE differs from
        # DELETE: it resets the table and cannot be rolled back.
        self.cursor.execute("truncate table news_main;")
        self.cursor.execute("truncate table news_comment;")
        self.conn.commit()

    def process_item(self, item, spider):
        try:
            self.insert_news(item)  # insert the article into the database
            self.insert_comment(item, item["source_url"])  # insert its comments
            self.conn.commit()
        except MySQLdb.Error as e:
            print("Error %d: %s" % (e.args[0], e.args[1]))
        return item

    # Insert the article into the database
    def insert_news(self, item):
        args = (item["source_url"], item["news_title"], item["news_author"],
                item["news_time"], item["news_content"], item["news_source"],
                item["news_allCommentAllCount"], item["news_hotCommentHotCount"])

        # Let the driver quote the values instead of formatting them into the
        # SQL text, so quotes inside a title or the content cannot break the
        # statement.
        newsSqlText = "insert into news_main(" \
                      "news_url,news_title,news_author,news_time,news_content,news_source," \
                      "news_commentAllCount,news_commentHotCount) " \
                      "values (%s,%s,%s,%s,%s,%s,%s,%s)"
        self.cursor.execute(newsSqlText, args)
        self.conn.commit()

    # Insert the comments into the database
    def insert_comment(self, item, url):
        # The comment fields are parallel lists, so iterate over them together
        for comment_content, comment_author, comment_time, comment_likeCount \
                in zip(item["news_hotCommentContent"], item["news_hotCommentAuthor"],
                       item["news_hotCommentTime"], item["news_hotCommentLikeCount"]):
            newsSqlText = "insert into news_comment(comment_content,comment_author,comment_time,comment_likeCount,source_url) " \
                          "values (\"%s\",'%s','%s','%s','%s')" % (comment_content, comment_author, comment_time, comment_likeCount[2:-1], url)
            # Debugging hook: uncomment to inspect the value of newsSqlText
            # if _DEBUG == True:
            #     import pdb
            #     pdb.set_trace()
            # self.cursor.execute(newsSqlText.encode().decode("unicode-escape").replace('\\','').replace('\']','').replace('[u\'','').strip())
            # Turn literal \uXXXX escapes back into characters and strip the
            # remaining backslashes before executing the statement.
            self.cursor.execute(newsSqlText.encode().decode("unicode-escape").replace('\\', ''))
            self.conn.commit()

Here the crawled data is cleaned and written to MySQL.
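Comment text often contains quotation marks, which is where SQL built by string formatting tends to break. A common alternative is to pass the values separately and let the driver bind them. The sketch below shows the idea with the standard-library sqlite3 module as a stand-in for MySQLdb, so it runs without a MySQL server. The sample content and author are made up, and sqlite3 uses ? placeholders where MySQLdb uses %s with cursor.execute(sql, args).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table news_comment (comment_content text, comment_author text)")

content = u'He said "it\'s a goal"'   # quotes that would break a formatted statement
author = u"O'Brien"                   # hypothetical author name

# The driver binds the values, so no manual quoting or escaping is needed.
conn.execute(
    "insert into news_comment (comment_content, comment_author) values (?, ?)",
    (content, author),
)
conn.commit()

row = conn.execute("select comment_content, comment_author from news_comment").fetchone()
print(row)
```

The stored row comes back exactly as it was passed in, quotes included.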

MySQL

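Data Analysis and Visualization

Every article on "Wo Dong" (my nickname for Dongqiudi, 懂球帝; it feels odd, since the site is rather reserved, and Old Wang next door is about to roast me, so in front of him let's call it "Ni Dong" instead~) has its own character. Articles are keyed by an auto-incrementing primary key, so every URL is unique, and I simply brute-force crawled every article on the site. To keep the data loading fast, though, the statistics here cover only a random sample of the articles stored in MySQL. As a rookie in the football world, I of course want to analyze the data carefully, learn from the veterans, earn my license soon, and drive safely.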
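Because article IDs increase one by one, the list of URLs to crawl can be generated rather than discovered, and a random subset can be drawn for quick statistics. The sketch below is only an illustration: URL_TEMPLATE is a hypothetical pattern (the real one is not given in this post), and article_urls and random_sample are names made up here.

```python
import random

# Hypothetical URL pattern; the site's real pattern is not given in this post.
URL_TEMPLATE = "http://example.com/article/{0}"


def article_urls(first_id, last_id):
    """Yield one URL per article ID in the inclusive range."""
    for article_id in range(first_id, last_id + 1):
        yield URL_TEMPLATE.format(article_id)


def random_sample(rows, size, seed=0):
    """Pick a reproducible random subset of rows for faster statistics."""
    rng = random.Random(seed)
    return rng.sample(rows, min(size, len(rows)))


urls = list(article_urls(1, 5))
print("%s ... %s" % (urls[0], urls[-1]))
print(len(random_sample(urls, 3)))
```

Fixing the seed keeps the sample the same between runs, which makes the charts reproducible.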

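Article Count

Number of crawled articles

Article Authors

Dongqiudi's rapid growth owes a great deal to the hard work of its staff and its fans. Here are the veterans who have driven that growth; follow them when you have time and enjoy their flair~

Articles per author

Running a startup is not easy, and it takes more than writing articles. On a Dongqiudi live stream I once watched the Dongqiudi football team, led by veteran Boss Chen, play against 寶坁碧水源: good with both the pen and the ball, and close to the fans.

Boss Chen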

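Content Sources

As the largest content-driven football media outlet in China, Dongqiudi is strong in its own right and also draws on the strengths of others, bringing in quality "foreign signings" from other sites.

Dongqiudi original content

Compare the number of original articles with the share reposted from foreign sources, and you will see why Dongqiudi attracted so many users in just a few years.

Repost sources

The charts above show that most articles are original, so this chart covers only articles that came from other sites. Reposts come mainly from Twitter, Xinhua News Agency, AS, and Sky Sports, which to some extent is an endorsement of the quality of those outlets.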

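Article Comment Analysis

All article comments

Beyond the number of articles each author publishes, user attention also matters when looking for the most valuable veterans. GreatWall, elfiemini, and 鷹旗百夫長 take the top three spots, separated by narrow margins. And so Dongqiudi's most popular veterans are freshly crowned.

Author	Total comments	Hot comments
GreatWall	238738	21873
elfiemini	224058	23014
鷹旗百夫長	200386	10337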
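A ranking like the one above comes down to grouping the articles by author and summing their comment counts. The post does not show the query that was used, so the following is only a sketch of one way to do it. It uses an in-memory sqlite3 database as a stand-in for MySQL, the column names from the news_main insert statement, and made-up rows that are not the crawled data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table news_main ("
    "news_author text, news_commentAllCount integer, news_commentHotCount integer)"
)
# Made-up rows for illustration only; they are not the crawled data.
conn.executemany(
    "insert into news_main values (?, ?, ?)",
    [("author_a", 120, 10), ("author_b", 300, 25), ("author_a", 80, 5)],
)

# Sum both counters per author and keep the three authors with the most comments.
query = (
    "select news_author, sum(news_commentAllCount) as all_comments, "
    "sum(news_commentHotCount) as hot_comments "
    "from news_main group by news_author order by all_comments desc limit 3"
)
top_authors = conn.execute(query).fetchall()
for author, all_comments, hot_comments in top_authors:
    print("%s %d %d" % (author, all_comments, hot_comments))
```

The same GROUP BY query should carry over to MySQL, provided the count columns hold numeric values.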

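Comments by year

The data cannot fully reflect how things actually developed, but it does not lie, and to some extent it reflects Dongqiudi's rapid growth. Next, let's single out 2016 and look at the figures month by month.

Comments in 2016

At first glance I was stunned! Why did the number of comments climb so sharply from June onward? In fact, Euro 2016 kicked off on June 10; fan attention rose and discussion threads kept multiplying, so the comment count shot up.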
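A month-by-month breakdown like this one can be produced by bucketing each article's comment count under the month of its publication time. The sketch below is an assumption-laden illustration: the rows are made up, and the 'YYYY-MM-DD HH:MM:SS' format of news_time is a guess, since the post does not show how the time is stored.

```python
from collections import Counter
from datetime import datetime

# Made-up (news_time, comment count) pairs; the time format is an assumption.
rows = [
    ("2016-05-20 10:00:00", 40),
    ("2016-06-11 21:30:00", 150),
    ("2016-06-25 18:05:00", 90),
    ("2016-07-02 09:15:00", 200),
]

comments_per_month = Counter()
for news_time, comment_count in rows:
    # Reduce the timestamp to a 'YYYY-MM' bucket and add the comments to it.
    month = datetime.strptime(news_time, "%Y-%m-%d %H:%M:%S").strftime("%Y-%m")
    comments_per_month[month] += comment_count

for month in sorted(comments_per_month):
    print("%s %d" % (month, comments_per_month[month]))
```

Sorting the 'YYYY-MM' keys as strings already gives chronological order, which is convenient for plotting.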

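There is plenty more to mine from this data, and the rest will have to wait for next time. To close, here are the veterans of the comment section.

Comment likes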

References

scrapy-redis documentation
https://scrapy-redis.readthedocs.io/en/stable/readme.html
redis-py documentation
http://redis-py.readthedocs.io/en/latest/
scrapy-redis repository
https://github.com/rolando/scrapy-redis
Building a crawler system with Scrapy and MongoDB in Python
http://www.cnblogs.com/rrxc/p/4478936.html?utm_source=tuicool&utm_medium=referral
Image downloading
http://doc.scrapy.org/en/latest/topics/item-pipeline.html
http://www.cnblogs.com/moon-future/p/5545828.html
