Python爬取騰訊視頻電影信息并繪制散點(diǎn)圖

??昨天用python爬了一堆電影信息并繪制成了散點(diǎn)圖,感覺(jué)很有意思,發(fā)上來(lái)分享一下。先上圖:


騰訊視頻電影評(píng)分-時(shí)長(zhǎng)散點(diǎn)圖

??橫軸是電影的時(shí)長(zhǎng)(分鐘),縱軸是電影的評(píng)分。簡(jiǎn)單畫(huà)的圖沒(méi)有做什么標(biāo)注,后續(xù)可以加上去。接下來(lái)是工程從頭到尾的過(guò)程:

找到目標(biāo)網(wǎng)站

??首先自然是需要找到目標(biāo)網(wǎng)站并觀察網(wǎng)頁(yè)結(jié)構(gòu):


騰訊視頻網(wǎng)頁(yè)結(jié)構(gòu)

Let's look at the URL: https://v.qq.com/x/list/movie?&offset=0. It follows a clear pattern, which should make writing the Requests easier.

??接下來(lái)看一下第二頁(yè)的網(wǎng)址:https://v.qq.com/x/list/movie?&offset=30;果然是這種有規(guī)律的。騰訊視頻的offset是每過(guò)一頁(yè)偏置量+30,最后一頁(yè)是4980:https://v.qq.com/x/list/movie?&offset=4980

爬蟲(chóng)的準(zhǔn)備部分

Next comes creating the Scrapy project. A project is created with:

scrapy startproject <project_name>

(the angle brackets are placeholders, not part of the command). Then cd into the project directory and run:

scrapy genspider <spider_name> <domain>

to create the spider file.

??首先我們先來(lái)定位我們需要爬取的信息。可以看到首頁(yè)中陳列電影的頁(yè)面是沒(méi)有電影的時(shí)間長(zhǎng)度信息的,需要進(jìn)入到每一個(gè)電影的播放頁(yè)面里面來(lái)爬取。不過(guò)電影的名稱和評(píng)分則可以在首頁(yè)上進(jìn)行爬取。

??不過(guò)在這里遇到了一個(gè)問(wèn)題就是沒(méi)辦法爬完所有首頁(yè)上電影的二級(jí)頁(yè)面之后再yield item,每次都是爬了一個(gè)電影名稱和評(píng)分信息的list,但是評(píng)分卻只有第一個(gè)的。怎么修改都出現(xiàn)了報(bào)錯(cuò)的情況,所以不得已只能都放在二級(jí)頁(yè)面進(jìn)行爬取。好在信息在二級(jí)頁(yè)面上都是全的hhh。

播放頁(yè)面信息展示

??可以看到頁(yè)面上的電影名稱、電影評(píng)分和電影的時(shí)長(zhǎng)信息都有。在二級(jí)頁(yè)面上則需要一個(gè)單獨(dú)的xpath來(lái)保存相應(yīng)的電影的url信息,方便我們進(jìn)行遍歷的Request。

爬蟲(chóng)的調(diào)試

??首先我們用shell進(jìn)行調(diào)試,從而找到最合適的xpath進(jìn)行信息的提取。一般來(lái)說(shuō)進(jìn)入shell進(jìn)行調(diào)試只需要scrapy shell + 網(wǎng)址就可以進(jìn)入了,但是騰訊視頻的網(wǎng)站則遇到了一點(diǎn)問(wèn)題,如下圖:


重定向問(wèn)題的shell界面

The shell hangs here and stops making progress; after pressing Enter:


重定向問(wèn)題的shell界面

就出現(xiàn)了stopped的情況。查詢后得知這是網(wǎng)頁(yè)遇到了重定向問(wèn)題,但是shell好像可以通過(guò)參數(shù)配置來(lái)解決,這時(shí)候就需要通過(guò)以下命令:

scrapy shell
from scrapy import Request
request = Request("https://v.qq.com/x/list/movie?&offset=4980", meta={'dont_redirect': True})
fetch(request)

??參數(shù)被配置為dont redirect:True之后,就可以繞過(guò)重定向問(wèn)題。這時(shí)候我們就可以成功進(jìn)入網(wǎng)站的shell調(diào)試頁(yè)面了。
接下來(lái)我們尋找url的xpath,觀察可知:

//*[@class="figure_title"]/a/@href

In XPath, an attribute is selected with /@ while a child element just uses /. When the expression is placed inside a Python double-quoted string, the inner double quotes have to be escaped with a backslash.


Xpath調(diào)試get到所有的URL

??可以看到我們通過(guò)這個(gè)xpath成功地拿到了第一頁(yè)所有電影的url信息。以此類推,拿到二級(jí)頁(yè)面上的幾個(gè)xpath并通過(guò)scrapy shell來(lái)確認(rèn)信息無(wú)誤即可。

爬蟲(chóng)編寫(xiě)

??在class的一開(kāi)始,我們首先知道domain就是騰訊視頻的網(wǎng)站,而網(wǎng)址又是這么的有規(guī)律,那我們的start_url寫(xiě)起來(lái)就很容易了:

allowed_domains = ['qq.com']
start_urls = []
start_urls_1 = ['https://v.qq.com/x/list/movie?&offset=']
for i in range(4981):
    if i % 30 == 0:
        post_url = start_urls_1[0] + str(i)
        start_urls.append(post_url)

通過(guò)這種方法可以輕而易舉的遍歷所有的網(wǎng)站。

??接下來(lái)為了防止href提供的網(wǎng)址不全的情況,我在第一個(gè)parse中進(jìn)行了urljoin的練習(xí),其實(shí)目前這個(gè)頁(yè)面并不需要這樣的操作,直接request就可以了。

def parse(self, response):
    post_url_1 = ''  # practising urljoin here, in case a page ever provides a relative href
    yield scrapy.Request(url=parse.urljoin(response.url, post_url_1), callback=self.parse_detail,
                              dont_filter=True)

Here dont_filter=True means the duplicate filter is bypassed for this request, i.e. the URL is not deduplicated. For something like scraping Taobao shop pages, where the same shop can appear under many products, you would leave it at the default False so the filter drops shops that have already been crawled.

??接下來(lái)我們?cè)诘谝粋€(gè)parse_deatil中對(duì)我們拿到的url進(jìn)行request:

cl = response.xpath("//*[@class=\"figure_title\"]/a/@href").extract()
for i in range(len(cl)):
    yield scrapy.Request(url="http://" + re.findall(r"http://(.*)", cl[i])[0], meta={'items': urlitem}, callback=self.parse_detail_2,
                    dont_filter=False)

This lets parse_detail_2 walk through the first layer of second-level pages, and deeper pages can be handled the same way. Just make sure yield item sits in the right place, otherwise the fields of one movie end up attached to another.

爬蟲(chóng)的設(shè)置

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

這一部分是一個(gè)君子協(xié)議,我們?cè)O(shè)置為False,可以繞過(guò)很多網(wǎng)站的封鎖。

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 100

Raising this to 100, together with the per-domain and per-IP concurrent-request settings below, speeds up the crawl.

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 100
CONCURRENT_REQUESTS_PER_IP = 100  #13.02
# Disable cookies (enabled by default)
COOKIES_ENABLED = False

這里據(jù)說(shuō)設(shè)置為False之后也可以繞過(guò)一些網(wǎng)站的封鎖,不過(guò)目前還沒(méi)有發(fā)現(xiàn)實(shí)際功效。

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'urls_10_2.middlewares.Urls102DownloaderMiddleware': 543,
   #'randoms.rotate_useragent.RotateUserAgentMiddleware': 400
   'urls_10_2.rotate_useragent.RotateUserAgentMiddleware': 400,
   #'urls_10_2.middlewares.ProxyMiddleware': 102,
}

The ProxyMiddleware referenced here is a class written in middlewares.py, included in the code below. Many sites ban your IP once they get tired of being crawled, so a class like this supplies proxy IPs. Since most of the free Chinese proxies I found online didn't work, it is commented out for now; if you're interested, take a look at the Xici (西刺) free proxy site.

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'urls_10_2.pipelines.Urls102Pipeline': 300,
   'urls_10_2.pipelines.UrlsPipeline': 1,
}

This section enables the pipelines that write the scraped data to a file. The numbers are priorities: pipelines with lower values run first, so items pass through them in ascending order of the value.

爬蟲(chóng)代碼-urls(爬蟲(chóng)部分)

# -*- coding: utf-8 -*-
import scrapy
from urllib import parse
from urls_10_2.items import UrlItem
import re


class UrlsSpider(scrapy.Spider):
    name = 'urls'
    allowed_domains = ['qq.com']
    start_urls = []
    start_urls_1 = ['https://v.qq.com/x/list/movie?&offset=']
    for i in range(4981):
        if i % 30 == 0:
            post_url = start_urls_1[0] + str(i)
            start_urls.append(post_url)

    def parse(self, response):
        post_url_1 = ''  # practising urljoin here, in case a page ever provides a relative href
        yield scrapy.Request(url=parse.urljoin(response.url, post_url_1), callback=self.parse_detail,
                                  dont_filter=True)

    def parse_detail(self, response):
        urlitem = UrlItem()
        # note: this single item instance is shared (via meta) by every request below;
        # creating a fresh UrlItem per request would be safer
        # urlitem["url_name"] = response.xpath("//*[@class=\"figure_title\"]/a/text()").extract()
        # urlitem["url"] = response.xpath("//*[@class=\"figure_title\"]/a/@href").extract()
        # urlitem["mark_1"] = response.xpath("//*[@class=\"score_l\"]/text()").extract()
        # urlitem["mark_2"] = response.xpath("//*[@class=\"score_s\"]/text()").extract()
        cl = response.xpath("//*[@class=\"figure_title\"]/a/@href").extract()
        for i in range(len(cl)):
            yield scrapy.Request(url="http://" + re.findall(r"http://(.*)", cl[i])[0], meta={'items': urlitem}, callback=self.parse_detail_2,
                            dont_filter=False)

    def parse_detail_2(self, response):
        urlitem = response.meta['items']
        urlitem["time"] = response.xpath("//*[@class=\"figure_count\"]/span/text()").extract()[0]
        urlitem["url_name"] = response.xpath("//*[@class=\"video_title _video_title\"]/text()").extract()[0].strip()
        urlitem["mark_1"] = response.xpath("//*[@class=\"units\"]/text()").extract()[0]
        urlitem["mark_2"] = response.xpath("//*[@class=\"decimal\"]/text()").extract()[0]
        yield urlitem

爬蟲(chóng)代碼-Middleware部分

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


# class ProxyMiddleware(object):
#     def process_request(self, request, spider):
#         request.meta['proxy'] = "http://118.190.95.35:9001"


class Urls102SpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class Urls102DownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

爬蟲(chóng)代碼-settings部分

# -*- coding: utf-8 -*-

# Scrapy settings for urls_10_2 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'urls_10_2'

SPIDER_MODULES = ['urls_10_2.spiders']
NEWSPIDER_MODULE = 'urls_10_2.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'urls_10_2 (+http://www.yourdomain.com)'
#user_agent_list = [ "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1", "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",       "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",       "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",       "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",       "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24","Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"  ]


# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 100

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 100
CONCURRENT_REQUESTS_PER_IP = 100  #13.02

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'urls_10_2.middlewares.Urls102SpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'urls_10_2.middlewares.Urls102DownloaderMiddleware': 543,
   #'randoms.rotate_useragent.RotateUserAgentMiddleware': 400
   'urls_10_2.rotate_useragent.RotateUserAgentMiddleware': 400,
   #'urls_10_2.middlewares.ProxyMiddleware': 102,

}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'urls_10_2.pipelines.Urls102Pipeline': 300,
   'urls_10_2.pipelines.UrlsPipeline': 1,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

爬蟲(chóng)代碼-Pipeline部分

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import re


class Urls102Pipeline(object):
    def process_item(self, item, spider):
        return item



class UrlsPipeline(object):
    def __init__(self):
        # open the CSV in append mode and write the header row
        # (columns: movie name, duration, rating)
        self.file = open("urls.csv", "a+")
        self.file.write("電影名稱,電影時長,電影評分\n")

    def process_item(self, item, spider):
        # intended logic: create a file when the class is loaded, check whether it
        # is empty, write the header if so, and otherwise append to it;
        # the check is currently short-circuited so every item is simply appended
        if 1:  # os.path.getsize("executive_prep.csv"):
            self.write_content(item)  # write this item out
        else:
            self.file.write("電影名稱,電影時長,電影評分\n")
        self.file.flush()
        return item

    def write_content(self, item):
        # url = item["url"]
        url_name = item["url_name"]
        mark_1 = item["mark_1"]
        mark_2 = item["mark_2"]
        time = item["time"]
        # replace ASCII and full-width commas in the title so they don't break the CSV
        if url_name.find(",") != -1:
            url_name = url_name.replace(",", "-")
        if url_name.find("，") != -1:
            url_name = url_name.replace("，", "-")

        # the duration comes as HH:MM:SS; convert it to minutes
        time_s = time.split(':')
        time_final = float(time_s[0]) * 60 + float(time_s[1]) + float(time_s[2]) / 60

        result_1 = url_name + ',' + str(time_final) + ',' + mark_1 + mark_2 + '\n'
        self.file.write(result_1)
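
For example, a duration string of 01:43:30 becomes 1*60 + 43 + 30/60 = 103.5 minutes in the CSV.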

爬蟲(chóng)代碼-主程序

from scrapy.cmdline import execute

import sys
import os

sys.path.append(os.path.dirname(os.path.abspath(__file__)))
execute(["scrapy", "crawl", "urls"])
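
Running this script from the project root is equivalent to running scrapy crawl urls on the command line; it just saves switching to a terminal every time.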

The scraped file

The scraped data is stored as a CSV file, which can be opened directly in PyCharm. Here is what it looks like:

CSV文件的內(nèi)容

The middle column is the duration converted to minutes, so each row has the format name, minutes, rating.

??接下來(lái)我們通過(guò)調(diào)用matplotlib庫(kù)中的

scatter(**)

函數(shù)的讀取文件并繪制的程序?qū)⑵洚?huà)成散點(diǎn)圖。注意在Mac下面Python2和Python3共存的情況下,安裝需要

pip3 install - -user xxxx

才可以被成功import進(jìn)來(lái)。

散點(diǎn)圖的繪制

import seaborn as sns
import matplotlib.pyplot as plt
import re

mark = []
time = []
name = []
ray = ''
with open("urls.csv") as file:
    next(file)  # skip the header row written by the pipeline
    for line in file:
        ray = re.findall(r"(.*)\n", line)[0]
        ray = ray.split(',')
        mark.append(ray[2])
        # time.append(ray[1])  # keep the decimals for a more precise plot
        time.append(ray[1].split(".")[0])
        name.append(ray[0])

for i in range(len(time)):
    plt.scatter(x=int(time[i]), y=float(mark[i]), s=5, c='r')
plt.show()
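
As a side note, a single scatter call over the whole arrays would be a bit more idiomatic than plotting point by point; a minimal sketch, assuming the mark and time lists built above:

plt.scatter([int(t) for t in time], [float(m) for m in mark], s=5, c='r')
plt.xlabel("duration (minutes)")
plt.ylabel("rating")
plt.show()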

The result

騰訊視頻電影評(píng)分-時(shí)長(zhǎng)散點(diǎn)圖

??這時(shí)候我們就成功地畫(huà)出來(lái)了一張騰訊視頻的電影評(píng)分-時(shí)長(zhǎng)散點(diǎn)圖。接下來(lái)可以用各種分類器算法通過(guò)get電影的時(shí)間長(zhǎng)度來(lái)預(yù)估今年后續(xù)出的電影評(píng)分什么的hhh。

Notes

The rotate_useragent module, used to avoid getting blocked by the site (it is enabled in settings):

# -*- coding: utf-8 -*-
import random
# note: scrapy.contrib.* is the legacy import path; on newer Scrapy versions this
# class lives at scrapy.downloadermiddlewares.useragent
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # pick a random user agent for each request
        ua = random.choice(self.user_agent_list)
        if ua:
            print('User-Agent:' + ua)
            request.headers.setdefault('User-Agent', ua)

    # the default user_agent_list covers Chrome, IE, Firefox, Mozilla, Opera and Netscape
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
