PhantomJS cannot fetch the content of an https URL
PhantomJS is a useful crawling tool: because it renders pages in a headless browser, it can scrape data that is loaded dynamically by JavaScript.
goods_url = "https://xueqiu.com/u/5832323914"
xpath0 = "(//div[@id='app']/div[contains(@class,'container')]/div[@class='profiles__main']/div[@class='profiles__timeline__bd']/article[@class='timeline__item']/div[@class='timeline__item__main']/div[@class='timeline__item__bd']/div[@class='timeline__item__content']/div[contains(@class,'content')]/div)[2]"
from selenium import webdriver
browser = webdriver.PhantomJS(executable_path = './phantomjs',service_args=['--ssl-protocol=any'])
browser.get(goods_url)
res = browser.find_element_by_xpath(xpath0)  # locate the target element
print(res.text)
print(browser.page_source)
browser.quit()
Today I ran into a strange problem. The Python crawler code above worked for many other URLs, but on the URL above it failed: res = browser.find_element_by_xpath(xpath0)
could not find the element,
and print(browser.page_source)
printed only
<html><head></head><body></body></html>
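That empty shell is the telltale symptom. A quick way to spot it programmatically is a small helper like the one below (a minimal sketch; the function name is my own, not part of the original code):

```python
def looks_blank(html):
    """Return True if the rendered page is the empty shell PhantomJS
    produces when it fails to load an https page."""
    stripped = html.replace(" ", "").replace("\n", "").lower()
    return stripped == "<html><head></head><body></body></html>"

print(looks_blank("<html><head></head><body></body></html>"))  # → True
```

Calling looks_blank(browser.page_source) right after browser.get() makes it easy to detect the failure and retry with different settings.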
After a round of searching on Baidu and Google, I finally found the fix.
2. The fix: PhantomJS settings
Plain http URLs never trigger this problem; only some https URLs do, because the SSL negotiation fails unless PhantomJS is configured for it.
Reference: https://www.cnblogs.com/fly-kaka/p/6656196.html
I stripped the redundant settings from that post.
(Many posts online say that setting service_args=['--ignore-ssl-errors=true'] is enough. In my own tests, however, the decisive factor was desired_capabilities=cap
(i.e. the request headers), and more precisely
cap["phantomjs.page.customHeaders.User-Agent"] = ua
. Different sites may behave differently, which looks like an anti-crawler mechanism, so setting all of these is the safest approach.)
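The header hypothesis can be checked independently of PhantomJS. The sketch below attaches the same two headers to a plain standard-library request object (it only builds the request, without sending it; the User-Agent string is a placeholder you would replace with a real browser's):

```python
import urllib.request

ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"  # placeholder User-Agent
req = urllib.request.Request(
    "https://xueqiu.com/u/5832323914",
    headers={"User-Agent": ua, "Referer": "http://tj.ac.10086.cn/login/"},
)
# urllib normalizes header names to capitalized form internally
print(req.get_header("User-agent"))
```

Sending this request with urllib.request.urlopen(req), with and without the User-Agent header, is a quick way to see whether a given site blocks header-less clients.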
'''Settings'''
from selenium import webdriver

# The original snippet used ua without defining it; set it to a realistic
# browser User-Agent string, e.g.:
ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"

cap = webdriver.DesiredCapabilities.PHANTOMJS
cap["phantomjs.page.settings.resourceTimeout"] = 2000
cap["phantomjs.page.settings.loadImages"] = True
cap["phantomjs.page.settings.disk-cache"] = True
cap["phantomjs.page.settings.userAgent"] = ua
cap["phantomjs.page.customHeaders.User-Agent"] = ua
cap["phantomjs.page.customHeaders.Referer"] = "http://tj.ac.10086.cn/login/"

# executable_path points at the phantomjs binary; omit it if phantomjs is on PATH
browser = webdriver.PhantomJS(desired_capabilities=cap, service_args=['--ignore-ssl-errors=true'])
'''Run the crawl'''
goods_url = "https://xueqiu.com/u/5832323914"
xpath0 = "(//div[@id='app']/div[contains(@class,'container')]/div[@class='profiles__main']/div[@class='profiles__timeline__bd']/article[@class='timeline__item']/div[@class='timeline__item__main']/div[@class='timeline__item__bd']/div[@class='timeline__item__content']/div[contains(@class,'content')]/div)[2]"
browser.get(goods_url)
res = browser.find_element_by_xpath(xpath0)  # locate the target element
print(res.text)
print(browser.page_source)
browser.quit()