Scraping Tencent Video movie info with Python and plotting a scatter chart
Yesterday I used Python to scrape a batch of movie information and plotted it as a scatter chart. It turned out to be quite interesting, so I'm sharing it here. First, the chart:
The x-axis is the movie's running time (minutes) and the y-axis is its score. It's a quick plot with no labels yet; those can be added later. What follows is the whole project from start to finish:
Finding the target site
First, naturally, we need to find the target site and take a look at the page structure:
Hmm... let's look at the URL: https://v.qq.com/x/list/movie?&offset=0. It follows a clear pattern, which should make writing the requests a bit easier later on.
Now the second page: https://v.qq.com/x/list/movie?&offset=30. Sure enough, it's regular: Tencent Video's offset increases by 30 per page, and the last page is at 4980: https://v.qq.com/x/list/movie?&offset=4980.
Preparing the spider
Next comes setting up the Scrapy project. A project can be created with:
scrapy startproject <project_name>
(the angle brackets mark placeholders and are not typed). Then change into the newly created directory and generate the spider file with:
scrapy genspider <spider_name> <domain>
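For this project the names that appear in the code later are urls_10_2 for the project and urls for the spider, with qq.com as the allowed domain, so the concrete commands would look roughly like:
scrapy startproject urls_10_2
cd urls_10_2
scrapy genspider urls qq.com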
First, let's pin down the information we want to scrape. The listing page doesn't show each movie's running time, so that has to be taken from each movie's playback page; the name and score, however, are available right on the listing page.
One problem I ran into here: I couldn't find a way to crawl all the second-level pages belonging to a listing page first and only then yield the item. Every attempt produced a list of movie names and scores, but only the first score came through, and whatever I changed I got errors. So in the end I had no choice but to scrape everything from the second-level pages; fortunately all the information is available there, haha.
The playback page does contain the movie name, score and running time. To reach those second-level pages we still need a separate XPath on the listing page to collect each movie's URL, so we can iterate over them with Request.
Debugging the spider
First we use the Scrapy shell to find the most suitable XPaths for extracting the information. Normally running scrapy shell plus the URL is enough to get in, but Tencent Video's site gave me some trouble, as shown below:
It got stuck at this point and wouldn't go in any further; pressing Enter once then
showed that the shell had stopped. After some searching it turned out the page was being redirected, but the shell can apparently work around this with some request configuration, using the following commands:
scrapy shell
from scrapy import Request
req = Request("https://v.qq.com/x/list/movie?&offset=4980", meta={'dont_redirect': True})
fetch(req)
Once the request is configured with dont_redirect: True, the redirect is bypassed and we can get into the shell debugging session for the site.
Next we look for the XPath of the movie URLs. Inspecting the page gives:
"//*[@class=\"figure_title\"]/a/@href"
In XPath, attributes are selected with /@ while tags just use /. Inside a double-quoted Python string the inner double quotes have to be escaped with \.
With this XPath we successfully get the URLs of all the movies on the first page. In the same way we work out the XPaths for the second-level pages and confirm in the Scrapy shell that they extract the right information.
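For reference, the second-level (playback) page XPaths that the spider below relies on can be verified the same way: fetch one of the movie URLs obtained above in the shell and run, for example:
response.xpath("//*[@class=\"video_title _video_title\"]/text()").extract()   # movie title
response.xpath("//*[@class=\"figure_count\"]/span/text()").extract()          # running time, hh:mm:ss
response.xpath("//*[@class=\"units\"]/text()").extract()                      # integer part of the score
response.xpath("//*[@class=\"decimal\"]/text()").extract()                    # decimal part of the score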
Writing the spider
At the start of the class we know the allowed domain is Tencent Video's site, and since the URLs follow such a regular pattern, the start_urls are easy to generate:
allowed_domains = ['qq.com']
start_urls = []
start_urls_1 = ['https://v.qq.com/x/list/movie?&offset=']
for i in range(4981):
    if i % 30 == 0:
        post_url = start_urls_1[0] + str(i)
        start_urls.append(post_url)
This way we can easily enumerate all the listing pages.
Next, to guard against hrefs that don't give a complete URL, I practiced urljoin in the first parse; in fact this page doesn't really need it, a direct Request would be enough.
def parse(self, response):
    post_url_1 = ''  # practicing urljoin here, in case the hrefs on the page are not full URLs
    yield scrapy.Request(url=parse.urljoin(response.url, post_url_1), callback=self.parse_detail,
                         dont_filter=True)
Here dont_filter=True means the duplicate filter is not applied when the same URL comes up again, i.e. the URL is not deduplicated. Conversely, when scraping something like Taobao shop information, where the same shop's products show up over and over, you would leave it at False so the spider doesn't crawl the same shop repeatedly.
Next, in the first parse_detail we issue a Request for each of the URLs we collected:
cl = response.xpath("//*[@class=\"figure_title\"]/a/@href").extract()
for i in range(len(cl)):
    yield scrapy.Request(url="http://" + re.findall(r"//(.*)", cl[i])[0], meta={'items': urlitem},
                         callback=self.parse_detail_2, dont_filter=False)
With this, parse_detail_2 can easily walk through the first layer of second-level pages; deeper levels can be handled the same way. Just be careful to yield the item in the right place: here the item is passed along via meta and only yielded in the deepest callback, otherwise a movie's fields would not match up with each other.
Spider settings
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
This is the robots.txt "gentleman's agreement"; setting it to False lets the crawler ignore it and get past many sites' restrictions.
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 100
Setting this to 100, together with the per-domain and per-IP concurrent request settings below, speeds up the crawl.
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 100
CONCURRENT_REQUESTS_PER_IP = 100 #13.02
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
Supposedly setting this to False can also help get around some sites' blocking, though I haven't observed any concrete effect so far.
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'urls_10_2.middlewares.Urls102DownloaderMiddleware': 543,
    # 'randoms.rotate_useragent.RotateUserAgentMiddleware': 400
    'urls_10_2.rotate_useragent.RotateUserAgentMiddleware': 400,
    # 'urls_10_2.middlewares.ProxyMiddleware': 102,
}
The ProxyMiddleware mentioned above is a class written in middlewares.py; it appears (commented out) in the code further down. Many sites will ban your IP once they get tired of being crawled, so a class like this is needed to supply proxy IPs. However, most of the free domestic proxies I found online didn't work, so it is commented out for now. If you're interested, have a look at the Xici free proxy site.
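For reference, here is a minimal sketch of what such a class can look like when it rotates through several proxies. The addresses below are placeholders rather than working proxies, and the single-proxy version actually used in this project appears, commented out, in middlewares.py further down (together with its commented-out entry in DOWNLOADER_MIDDLEWARES):
import random

class ProxyMiddleware(object):
    # placeholder proxy addresses - replace them with proxies that actually work
    proxy_list = [
        "http://118.190.95.35:9001",
        "http://127.0.0.1:8888",
    ]

    def process_request(self, request, spider):
        # attach a randomly chosen proxy to each outgoing request
        request.meta['proxy'] = random.choice(self.proxy_list)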
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'urls_10_2.pipelines.Urls102Pipeline': 300,
    'urls_10_2.pipelines.UrlsPipeline': 1,
}
This part enables the pipelines that write the scraped data to file. The number after each pipeline controls the order in which items pass through them: lower numbers run first, so here UrlsPipeline (1) handles each item before Urls102Pipeline (300).
Spider code - urls (the spider)
# -*- coding: utf-8 -*-
import scrapy
from urllib import parse
from urls_10_2.items import UrlItem
import re


class UrlsSpider(scrapy.Spider):
    name = 'urls'
    allowed_domains = ['qq.com']
    start_urls = []
    start_urls_1 = ['https://v.qq.com/x/list/movie?&offset=']
    for i in range(4981):
        if i % 30 == 0:
            post_url = start_urls_1[0] + str(i)
            start_urls.append(post_url)

    def parse(self, response):
        post_url_1 = ''  # practicing urljoin, in case the hrefs on the page are not full URLs
        yield scrapy.Request(url=parse.urljoin(response.url, post_url_1), callback=self.parse_detail,
                             dont_filter=True)

    def parse_detail(self, response):
        urlitem = UrlItem()
        # urlitem["url_name"] = response.xpath("//*[@class=\"figure_title\"]/a/text()").extract()
        # urlitem["url"] = response.xpath("//*[@class=\"figure_title\"]/a/@href").extract()
        # urlitem["mark_1"] = response.xpath("//*[@class=\"score_l\"]/text()").extract()
        # urlitem["mark_2"] = response.xpath("//*[@class=\"score_s\"]/text()").extract()
        cl = response.xpath("//*[@class=\"figure_title\"]/a/@href").extract()
        for i in range(len(cl)):
            yield scrapy.Request(url="http://" + re.findall(r"//(.*)", cl[i])[0], meta={'items': urlitem},
                                 callback=self.parse_detail_2, dont_filter=False)

    def parse_detail_2(self, response):
        urlitem = response.meta['items']
        urlitem["time"] = response.xpath("//*[@class=\"figure_count\"]/span/text()").extract()[0]
        urlitem["url_name"] = response.xpath("//*[@class=\"video_title _video_title\"]/text()").extract()[0].strip()
        urlitem["mark_1"] = response.xpath("//*[@class=\"units\"]/text()").extract()[0]
        urlitem["mark_2"] = response.xpath("//*[@class=\"decimal\"]/text()").extract()[0]
        yield urlitem
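The spider imports UrlItem from urls_10_2/items.py, which isn't shown in this post. Based on the fields used above, a matching definition would look roughly like this (the url field is only needed by the commented-out lines in parse_detail):
# -*- coding: utf-8 -*-
import scrapy


class UrlItem(scrapy.Item):
    url_name = scrapy.Field()  # movie title
    url = scrapy.Field()       # movie page URL (only used by the commented-out extraction)
    mark_1 = scrapy.Field()    # integer part of the score
    mark_2 = scrapy.Field()    # decimal part of the score
    time = scrapy.Field()      # running time as hh:mm:ss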
Spider code - middlewares
# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals
# class ProxyMiddleware(object):
#     def process_request(self, request, spider):
#         request.meta['proxy'] = "http://118.190.95.35:9001"


class Urls102SpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class Urls102DownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
Spider code - settings
# -*- coding: utf-8 -*-
# Scrapy settings for urls_10_2 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'urls_10_2'
SPIDER_MODULES = ['urls_10_2.spiders']
NEWSPIDER_MODULE = 'urls_10_2.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'urls_10_2 (+http://www.yourdomain.com)'
#user_agent_list = [ "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1", "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6", "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6", "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5", "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24","Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24" ]
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 100
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 100
CONCURRENT_REQUESTS_PER_IP = 100 #13.02
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'urls_10_2.middlewares.Urls102SpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'urls_10_2.middlewares.Urls102DownloaderMiddleware': 543,
    # 'randoms.rotate_useragent.RotateUserAgentMiddleware': 400
    'urls_10_2.rotate_useragent.RotateUserAgentMiddleware': 400,
    # 'urls_10_2.middlewares.ProxyMiddleware': 102,
}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'urls_10_2.pipelines.Urls102Pipeline': 300,
    'urls_10_2.pipelines.UrlsPipeline': 1,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Spider code - pipelines
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import re


class Urls102Pipeline(object):
    def process_item(self, item, spider):
        return item


class UrlsPipeline(object):
    def __init__(self):
        # the output file is created when the pipeline is instantiated
        self.file = open("urls.csv", "a+")
        self.file.write("電影名稱,電影時(shí)長(zhǎng),電影評(píng)分\n")

    def process_item(self, item, spider):
        # intended logic: check whether the file is empty,
        # write the header if it is, otherwise just append;
        # for now the check is short-circuited and we always append
        if 1:  # os.path.getsize("executive_prep.csv"):
            self.write_content(item)  # write this item to the file
        else:
            self.file.write("電影名稱,電影時(shí)長(zhǎng),電影評(píng)分\n")
        self.file.flush()
        return item

    def write_content(self, item):
        # url = item["url"]
        time_s = []
        time_final = 0.0
        url_name = item["url_name"]
        mark_1 = item["mark_1"]
        mark_2 = item["mark_2"]
        time = item["time"]
        # commas would break the CSV format, so replace them (full-width and ASCII) with dashes
        if url_name.find(",") != -1:
            url_name = url_name.replace(",", "-")
        if url_name.find(",") != -1:
            url_name = url_name.replace(",", "-")
        # convert hh:mm:ss into minutes
        time_s = time.split(':')
        time_final = float(time_s[0]) * 60 + float(time_s[1]) + float(time_s[2]) / 60
        result_1 = url_name + ',' + str(time_final) + ',' + mark_1 + mark_2 + '\n'
        self.file.write(result_1)
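As a concrete example of the duration conversion in write_content: a running time of 01:35:30 is split into hours, minutes and seconds, giving 1*60 + 35 + 30/60 = 95.5 minutes, which is the value written into the second CSV column.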
Spider code - the main script
from scrapy.cmdline import execute
import sys
import os
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
execute(["scrapy", "crawl", "urls"])
The scraped file
The scraped data is saved as a CSV file, which can be opened directly in PyCharm; here is what it looks like:
The middle column is the running time converted to minutes, so each row has the format name, minutes, score.
Next we write a small program that reads the file and draws a scatter chart by calling matplotlib's scatter() function. Note that on a Mac where Python 2 and Python 3 coexist, the library needs to be installed with
pip3 install --user xxxx
before it can be imported successfully.
Drawing the scatter chart
import seaborn as sns
import matplotlib.pyplot as plt
import re

mark = []
time = []
name = []
ray = ''
with open("urls.csv") as file:
    for line in file:
        ray = re.findall(r"(.*)\n", line)[0]
        ray = ray.split(',')
        if ray[0] == "電影名稱":  # skip the header line(s) written by the pipeline
            continue
        mark.append(ray[2])
        # time.append(ray[1])  # plot the decimal value for more precision
        time.append(ray[1].split(".")[0])
        name.append(ray[0])

for i in range(len(time)):
    plt.scatter(x=int(time[i]), y=float(mark[i]), s=5, c='r')
plt.show()
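A quick follow-up on the plotting: calling scatter() once per point works, but matplotlib also accepts whole sequences, which is faster for thousands of points and makes it easy to add the axis labels mentioned at the beginning. A sketch using the same time and mark lists as above:
plt.scatter(x=[int(t) for t in time], y=[float(m) for m in mark], s=5, c='r')
plt.xlabel("running time (minutes)")
plt.ylabel("score")
plt.show()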
The result
At this point we have successfully drawn a score vs. running-time scatter chart for Tencent Video's movies. A next step could be to feed the running times into all sorts of classifier algorithms and try to predict the scores of the movies coming out later this year, haha.
Notes
The rotate_useragent module, brought in to avoid being blocked by the site (it is enabled in settings.py):
# -*- coding: utf-8 -*-
import random
# note: on newer Scrapy versions this class lives at scrapy.downloadermiddlewares.useragent instead
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # pick a random user agent for this request
        ua = random.choice(self.user_agent_list)
        if ua:
            print('User-Agent:' + ua)
            request.headers.setdefault('User-Agent', ua)

    # the default user_agent_list covers Chrome, IE, Firefox, Mozilla, Opera and Netscape
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]