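Task: from the top 1000 videos in Bilibili's programming category, scrape the video tags and turn them into a word cloud, to see what Bilibili users are actually learning.
Starting from the txt file scraped in the previous lesson, visit every URL found in each line of data, run the tag-scraping script above on each page, and save the scraped tags to a new file.
The script below generates the tag file: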
import re

import bs4
import requests


def get_url():
    # Read the list scraped in the previous lesson and pull out every video URL.
    with open("./綜合排序.txt", "r", encoding="utf-8") as file:
        content = file.read()
    # Each URL is followed by a "+"; capture the part between "http://" and "+".
    return re.findall(r"http://(.*?)\+", content)


def get_html(url):
    # Send a browser-like user agent so the request looks like a normal page visit.
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36'}
    res = requests.get(url, headers=headers)
    return res.text


def get_tags(url):
    text = get_html(url)
    soup = bs4.BeautifulSoup(text, "html.parser")
    tags = soup.select("ul[class = 'tag-area clearfix'] > li")
    # v_tag > ul
    # Skip any <li> that has no <a> child instead of raising AttributeError.
    return [each.a.text for each in tags if each.a]


def main():
    urls = get_url()
    # Open the output file once and append the tags of every video to it.
    with open('span.txt', 'a', encoding="utf-8") as file:
        # Slicing takes at most 1000 URLs and cannot run past the end of the list.
        for url in urls[:1000]:
            for tag in get_tags("http://" + url):
                file.write(str(tag) + '\n')


if __name__ == '__main__':
    main()
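To make it easier to see what the regular expression in get_url() captures, here is a minimal sketch that runs the same pattern on a single made-up line. The line layout (fields separated by "+") is only an assumption inferred from the pattern itself; the sample title, URL and trailing number are hypothetical, and extract_urls is an illustrative helper, not part of the original script.

```python
import re

# Hypothetical line; the real layout of the file from the previous lesson may differ.
sample_line = "Python tutorial+http://www.bilibili.com/video/av0000001+12345\n"


def extract_urls(text):
    """Return the part of each URL between 'http://' and the next '+'."""
    return re.findall(r"http://(.*?)\+", text)


# The capture group drops the scheme, so it has to be added back before requesting.
hosts = extract_urls(sample_line)
full_urls = ["http://" + h for h in hosts]
print(full_urls)
```

Because the group excludes the "http://" prefix, main() has to prepend it again before each request.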
Word cloud script:
import wordcloud

# Read all scraped tags; the with-block closes the file afterwards.
with open("./span.txt", encoding="utf-8") as file:
    text = file.read()

# Generic tags that say nothing about the subject being studied.
stopwords = {
    '野生技術協會','編程','課程','教育','講座','編程技術宅','教學','電腦','技術','編程教育','編程入門','開發','科學',
    '演示','軟件','編程視頻教程','編程課程','教學視頻','經驗分享','IT','編程語言','互聯網','考試','考研','科技','語言',
    '技術宅','面試','自學','原創','公開課','程序員','學習','教程','計算機','線上課堂','視頻教程',
}
# font_path must point to a font that contains Chinese glyphs.
wc = wordcloud.WordCloud(font_path="/Users/liujin/project/spider/socker_test/simsun.ttf", stopwords=stopwords)
wc.generate(text)
image = wc.to_image()
image.show()
Output:
image.png
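A word cloud shows relative sizes but not exact numbers. As a complement, the following minimal sketch counts how often each tag occurs and lists the most frequent ones. It assumes the tag file holds one tag per line, as written by the scraping script; top_tags and the sample data are hypothetical and only illustrate the idea.

```python
from collections import Counter


def top_tags(lines, stopwords, n=10):
    """Count non-empty tags that are not stop words and return the n most common."""
    counts = Counter(
        tag
        for tag in (line.strip() for line in lines)
        if tag and tag not in stopwords
    )
    return counts.most_common(n)


# Made-up sample standing in for the lines of span.txt.
sample = ["Python\n", "Java\n", "Python\n", "教程\n", "\n", "C++\n", "Python\n", "Java\n"]
result = top_tags(sample, {"教程"}, n=2)
print(result)  # [('Python', 3), ('Java', 2)]
```

With the real data, an open file object for span.txt can be passed as lines together with the same stopwords set used for the word cloud.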