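Lately I've had some free time (read: I've been slacking), so I decided to review the Python web scraping I taught myself a while back. I've also been watching streams on Douyu every day (Sli's stream is great), so I wondered whether I could scrape each streamer's viewer count, plus the site-wide total.
So I started experimenting, and it turned out that Douyu's pages are built entirely with dynamic JavaScript, so plain urllib2 can't scrape them. Hhh, this is where selenium (browser automation) and phantomjs (a headless browser), which I learned a while ago, came in handy. Working out the approach and typing up the code took about an hour, but debugging took two hours... There were some baffling errors, for example: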
Some loop iterations raised NoSuchElementException.
An answer on Stack Overflow: Error is because fields you are locating are inside iFrame.
So, first you need to switch to iframe and then locate your elements. still, if didn't work, add time delay.
In short, the page element could not be located. I found a blog post that addresses this problem:
https://www.cnblogs.com/yufeihlf/p/5689042.html
According to that post, the likely cause is that the script operates on page elements before the page has finished loading. So I added a wait, and sure enough the errors dropped a lot, although one still
showed up... Still, I'd call the problem solved.
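The "add a wait" fix can also be done by polling for the element instead of sleeping for a fixed time. Below is a minimal sketch of that idea in plain Python. The helper `wait_until` and the simulated `next_button_present` check are hypothetical names made up for illustration, not part of Selenium or of the script in this post.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value or the timeout expires.

    Hypothetical helper for illustration only; it is not a Selenium API.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Simulated page: the "next page" button only becomes available on the third check.
checks = {"n": 0}


def next_button_present():
    checks["n"] += 1
    return "shark-pager-next" if checks["n"] >= 3 else None


found = wait_until(next_button_present, timeout=5.0, poll=0.01)
print(found, checks["n"])
```

In a real scraper the condition would be a lookup against the driver. Selenium itself provides this kind of explicit wait through WebDriverWait together with expected_conditions, which is probably the better choice over a hand-rolled loop.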
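After a test run that lasted more than ten minutes, the final result... Douyu's total viewer count is:
That's right, over 200 million hhh. China has countless people, and one sixth of them are watching streams on 斗奶TV 2333
Here's the code: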
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import traceback
import time


class Douyu():
    # Initializer
    def __init__(self):
        # The r prefix makes the path a raw string, so its backslashes are not treated as escapes.
        # This creates a browser object backed by PhantomJS (webdriver.PhantomJS needs
        # Selenium 3.x; it is not available in Selenium 4).
        self.driver = webdriver.PhantomJS(r'C:\Users\Administrator\AppData\Local\Programs\Python\Python37\Lib\site-packages\phantomjs-2.1.1-windows\bin\phantomjs.exe')
        self.num = 0    # rooms counted so far
        self.num2 = 1   # page number, used only by the debug print below
        self.count = 0  # running total of viewers

    def douyuSpider(self):
        # Load the page with get()
        self.driver.get("https://www.douyu.com/directory/all")
        while True:
            try:
                # print(f'Page {self.num2}')  # for testing
                soup = bs(self.driver.page_source, "html.parser")
                # Room names, as a list
                names = soup.find_all("span", {"class": "dy-name ellipsis fl"})
                # Viewer counts, as a list
                numbers = soup.find_all("span", {"class": "dy-num fr"})
                # zip() pairs each room name with the viewer count at the same position
                for name, number in zip(names, numbers):
                    print("Viewers: -" + number.get_text().strip() + "-\tRoom: " + name.get_text().strip())
                    self.num += 1
                    count = number.get_text().strip()
                    # A trailing "万" means the figure is in units of ten thousand
                    if count[-1] in ("万", "萬"):
                        countNum = float(count[:-1]) * 10000
                    else:
                        countNum = float(count)
                    self.count += countNum
                # find() returns -1 when "shark-pager-disable-next" is absent from the page source,
                # so any other value means the next-page button is disabled. Checking before the
                # click makes sure the last page is counted too.
                if self.driver.page_source.find("shark-pager-disable-next") != -1:
                    print('Reached the last page!')
                    break
                # class="shark-pager-next" is the next-page button; click() simulates a click
                self.driver.find_element_by_class_name("shark-pager-next").click()
                # self.num2 = self.num2 + 1
                # Give the next page time to load
                time.sleep(2)
            except Exception:
                # print_exc() prints the traceback itself and returns None.
                # The retry re-parses the current page, so its rooms can be counted twice.
                traceback.print_exc()
                continue


if __name__ == '__main__':
    d = Douyu()
    d.douyuSpider()
    # Report the totals accumulated during the crawl
    print(f'Rooms: {d.num}, total viewers: {d.count:.0f}')
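The unit handling in the loop above is easier to test when pulled out into its own function. Here is a small sketch of that. The function name `parse_viewers` and the sample labels are made up for illustration, and it assumes viewer counts appear either as a plain number or as a number followed by "万" (ten thousand).

```python
def parse_viewers(text):
    """Convert a viewer-count label such as '1.5万' or '8561' to a float.

    Assumes the only unit suffix is "万" (or its traditional form "萬"),
    meaning ten thousand. Hypothetical helper for illustration.
    """
    text = text.strip()
    if text and text[-1] in ("万", "萬"):
        return float(text[:-1]) * 10000
    return float(text)


# Made-up sample labels, not real data from the site.
labels = ["1.5万", "8561", " 20万 "]
total = sum(parse_viewers(label) for label in labels)
print(total)  # 223561.0
```

Keeping the conversion separate from the browser code means it can be checked without launching a driver at all.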