Code:
import requests
from bs4 import BeautifulSoup
import pymongo
import re

# Connect to a local MongoDB instance and pick the target collection.
client = pymongo.MongoClient('localhost', 27017)
douban = client['douban']
top250 = douban['top250']

# The list is paginated 25 films per page: start=0, 25, ..., 225.
urls = ['https://movie.douban.com/top250?start={}'.format(str(i)) for i in range(0, 250, 25)]


def get_info(url):
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    # Each field is collected into its own page-wide list.
    names = soup.select('div.hd > a')
    times = re.findall('<br>(.*?) ', wb_data.text, re.S)
    places = re.findall(' / (.*?) / ', wb_data.text)
    levels = soup.select('span.rating_num')
    quotes = soup.select('span.inq')
    # zip stops at the shortest list, so a page where any field is missing
    # for some film yields fewer than 25 records.
    for name, time, place, level, quote in zip(names, times, places, levels, quotes):
        info = {
            'name': name.get_text().split('/')[0].split('\n')[1],
            'time': time.split('\n')[1].replace(' ', ''),
            'place': place,
            'level': level.get_text(),
            'quote': quote.get_text()
        }
        top250.insert_one(info)


for url in urls:
    get_info(url)
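One thing to watch in the script above: it gathers five page-wide lists and pairs them with zip, which stops at the shortest list. If a film has no span.inq quote, every later film on that page is paired with the wrong quote and the last one is dropped. The sketch below shows one possible way around this by parsing each film's own container. The div.item selector and the sample markup are assumptions made for illustration (they are not taken from the script), and html.parser is used only so the sketch does not need lxml.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; the second film deliberately has no span.inq.
SAMPLE_HTML = """
<div class="item">
  <div class="hd"><a href="#"><span class="title">Movie A</span></a></div>
  <span class="rating_num">9.5</span>
  <span class="inq">A quote</span>
</div>
<div class="item">
  <div class="hd"><a href="#"><span class="title">Movie B</span></a></div>
  <span class="rating_num">9.0</span>
</div>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_items(html):
    """Build one record per film container, so fields cannot drift apart."""
    soup = BeautifulSoup(html, 'html.parser')
    records = []
    # Assumption: each film sits in its own <div class="item"> container.
    for item in soup.select('div.item'):
        records.append({
            'name': text_or_none(item.select_one('div.hd > a')),
            'level': text_or_none(item.select_one('span.rating_num')),
            'quote': text_or_none(item.select_one('span.inq')),
        })
    return records


records = parse_items(SAMPLE_HTML)
for rec in records:
    print(rec)
```

A film without a quote still produces a record, with quote set to None, instead of shifting the rest of the page.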
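The crawl actually returned only 243 of the 250 films, so something went slightly wrong. I'd suggest crawling each film's own detail page instead, which is safer; I couldn't be bothered to rewrite the script here. I then exported the data to an Excel spreadsheet for analysis.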
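For the export step, one option that needs nothing beyond the standard library is to write the records to a CSV file, which Excel can open. This is only a sketch: the write_csv helper and the sample record are made up for illustration, and the field names are the ones used in the script above.

```python
import csv
import io

# Field names taken from the info dict built by the crawler.
FIELDS = ['name', 'time', 'place', 'level', 'quote']


def write_csv(records, fileobj):
    """Write an iterable of dicts as CSV; keys outside FIELDS (e.g. _id) are ignored."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS, extrasaction='ignore')
    writer.writeheader()
    for rec in records:
        writer.writerow(rec)


# Made-up record standing in for the documents stored in MongoDB.
sample = [
    {'name': 'Movie A', 'time': '1994', 'place': 'USA',
     'level': '9.5', 'quote': 'A quote'},
]

buf = io.StringIO()
write_csv(sample, buf)
print(buf.getvalue())

# With a running MongoDB, the same helper could be fed from the collection:
# with open('top250.csv', 'w', newline='', encoding='utf-8-sig') as f:
#     write_csv(top250.find(), f)
```

The utf-8-sig encoding in the commented usage writes a byte order mark, which helps Excel detect UTF-8 when the file contains Chinese text.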
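Simple analysis
1. Help yourself to the film list, no thanks needed. Just call me Lei Feng.
2. The United States, Japan, and China are the top three countries by number of films on the list.
3. Main themes of the films: faith, youth, science fiction, nostalgia, and so on.
4. The years with the most films on the list are 1995 to 2013, and recent years contribute fewer. My guess at the reason: production budgets and visual effects keep improving, but the content is not as good as it used to be.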
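The counts behind points 2 and 4 can also be reproduced in Python rather than in Excel. The sketch below is a minimal illustration with collections.Counter. The records are made up, and it assumes that the time field holds a year string and that co-productions list several countries separated by spaces in place. Neither assumption is confirmed by the crawled data.

```python
from collections import Counter

# Made-up records in the same shape as the documents the crawler stores.
sample = [
    {'name': 'Movie A', 'time': '1994', 'place': 'USA'},
    {'name': 'Movie B', 'time': '1994', 'place': 'USA UK'},
    {'name': 'Movie C', 'time': '2001', 'place': 'Japan'},
]


def count_places(records):
    """Count films per country; a co-production counts once for each country."""
    counts = Counter()
    for rec in records:
        # Assumption: multiple countries are separated by whitespace.
        counts.update(rec['place'].split())
    return counts


def count_years(records):
    """Count films per release year."""
    return Counter(rec['time'] for rec in records)


place_counts = count_places(sample)
year_counts = count_years(sample)
print(place_counts.most_common(3))
print(year_counts.most_common(3))
```

Counting a co-production once per country is a design choice; counting only the first listed country would give a different ranking.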