PhantomJS cannot fetch the content of an https URL
PhantomJS is a useful crawling tool: because it renders pages in a headless browser, it can scrape data that is loaded dynamically by JavaScript.
goods_url = "https://xueqiu.com/u/5832323914"
xpath0 = "(//div[@id='app']/div[contains(@class,'container')]/div[@class='profiles__main']/div[@class='profiles__timeline__bd']/article[@class='timeline__item']/div[@class='timeline__item__main']/div[@class='timeline__item__bd']/div[@class='timeline__item__content']/div[contains(@class,'content')]/div)[2]"
from selenium import webdriver
browser = webdriver.PhantomJS(executable_path = './phantomjs',service_args=['--ssl-protocol=any'])
browser.get(goods_url)
res = browser.find_element_by_xpath(xpath0)  # locate the target element
print(res.text)
print(browser.page_source)
browser.quit()
Today I ran into a strange problem. The Python crawler code above worked for many other URLs, but on the URL above it failed: res = browser.find_element_by_xpath(xpath0)
could not find the element,
and print(browser.page_source)
printed only
<html><head></head><body></body></html>
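That empty shell is the telltale symptom. A quick way to spot it programmatically is a small helper like the one below (a minimal sketch; the function name is my own, not part of the original code):

```python
def looks_blank(html):
    """Return True if the rendered page is the empty shell PhantomJS
    produces when it fails to load an https page."""
    stripped = html.replace(" ", "").replace("\n", "").lower()
    return stripped == "<html><head></head><body></body></html>"

print(looks_blank("<html><head></head><body></body></html>"))  # → True
```

Calling looks_blank(browser.page_source) right after browser.get() makes it easy to detect the failure and retry with different settings.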
After a round of searching on Baidu and Google, I finally found the fix.
2. The fix: PhantomJS settings
Plain http URLs never trigger this problem; only some https URLs do, because the SSL negotiation fails unless PhantomJS is configured for it.
Reference: https://www.cnblogs.com/fly-kaka/p/6656196.html
I stripped the redundant settings from that post.
(Many posts online say that setting service_args=['--ignore-ssl-errors=true'] is enough. In my own tests, however, the decisive factor was desired_capabilities=cap
(i.e. the request headers), and more precisely
cap["phantomjs.page.customHeaders.User-Agent"] = ua
. Different sites may behave differently, which looks like an anti-crawler mechanism, so setting all of these is the safest approach.)
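The header hypothesis can be checked independently of PhantomJS. The sketch below attaches the same two headers to a plain standard-library request object (it only builds the request, without sending it; the User-Agent string is a placeholder you would replace with a real browser's):

```python
import urllib.request

ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"  # placeholder User-Agent
req = urllib.request.Request(
    "https://xueqiu.com/u/5832323914",
    headers={"User-Agent": ua, "Referer": "http://tj.ac.10086.cn/login/"},
)
# urllib normalizes header names to capitalized form internally
print(req.get_header("User-agent"))
```

Sending this request with urllib.request.urlopen(req), with and without the User-Agent header, is a quick way to see whether a given site blocks header-less clients.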
'''Settings'''
from selenium import webdriver

# The original snippet used ua without defining it; set it to a realistic
# browser User-Agent string, e.g.:
ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"

cap = webdriver.DesiredCapabilities.PHANTOMJS
cap["phantomjs.page.settings.resourceTimeout"] = 2000
cap["phantomjs.page.settings.loadImages"] = True
cap["phantomjs.page.settings.disk-cache"] = True
cap["phantomjs.page.settings.userAgent"] = ua
cap["phantomjs.page.customHeaders.User-Agent"] = ua
cap["phantomjs.page.customHeaders.Referer"] = "http://tj.ac.10086.cn/login/"

# executable_path points at the phantomjs binary; omit it if phantomjs is on PATH
browser = webdriver.PhantomJS(desired_capabilities=cap, service_args=['--ignore-ssl-errors=true'])
'''Run the crawl'''
goods_url = "https://xueqiu.com/u/5832323914"
xpath0 = "(//div[@id='app']/div[contains(@class,'container')]/div[@class='profiles__main']/div[@class='profiles__timeline__bd']/article[@class='timeline__item']/div[@class='timeline__item__main']/div[@class='timeline__item__bd']/div[@class='timeline__item__content']/div[contains(@class,'content')]/div)[2]"
browser.get(goods_url)
res = browser.find_element_by_xpath(xpath0)  # locate the target element
print(res.text)
print(browser.page_source)
browser.quit()