Automated Testing: Selenium
What is Selenium?
Selenium automates browsers. That's it! What you do with that power is entirely up to you. Primarily, it is for automating web applications for testing purposes, but is certainly not limited to just that. Boring web-based administration tasks can (and should!) be automated as well.
Selenium has the support of some of the largest browser vendors who have taken (or are taking) steps to make Selenium a native part of their browser. It is also the core technology in countless other browser automation tools, APIs and frameworks.
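Background
In many scenarios, testers need automation tools to test more efficiently, and Selenium is a tool built specifically for browser automation. It can simulate the full range of browser operations, which frees programmers from handling cookies, headers, requests, and similar details by hand.
Why did I need Selenium? I picked up a job on 小燈神的心愿: a junior schoolmate asked me to scrape all the papers (title, source, keywords) of one scholar from the IEEEXplore website. The site loads its content asynchronously, so an ordinary crawler cannot get the data. A search online suggested capturing the JS requests, but I have barely learned any JS, so I dropped that approach. I had heard that Selenium can fetch such data through browser automation, so I started learning Selenium.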
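Environment Setup
Download the driver that matches your browser from the Selenium website. I use Chrome, for example, so I downloaded chromedriver from https://sites.google.com/a/chromium.org/chromedriver/downloads. You may need a proxy to reach that site; set one up yourself or look for a domestic mirror.
Put chromedriver.exe in the project root directory. Next, let's see how to operate this driver.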
-
The official site has a getting started guide: https://sites.google.com/a/chromium.org/chromedriver/getting-started. Here is the Python version of its code:
# Python:
import time

from selenium import webdriver
import selenium.webdriver.chrome.service as service

service = service.Service('/path/to/chromedriver')
service.start()
capabilities = {'chrome.binary': '/path/to/custom/chrome'}
driver = webdriver.Remote(service.service_url, capabilities)
driver.get('http://www.google.com/xhtml');
time.sleep(5) # Let the user actually see something!
driver.quit()
-
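In practice you don't need anything as complicated as the official tutorial. The following code directly opens a Chrome instance controlled by the automation tool:
from selenium import webdriver
driver = webdriver.Chrome(executable_path='chromedriver.exe')
Run the two lines above with the exe file in the same folder, and a Chrome browser window opens:
[Image: 20171118-auto]
At this point, the environment is set up.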
Basic Selenium Operations
Someone has produced a Chinese translation of the docs, which is worth consulting: http://python-selenium-zh.readthedocs.io/zh_CN/latest/
-
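Open a web page:
driver.get("http://www.baidu.com")
The driver.get method opens the requested URL. WebDriver waits until the page has fully loaded (that is, until the onload event has fired) before returning, so the program continues only after the initial load is complete. Note: if the page uses a lot of Ajax, the program may not be able to tell when the content has really finished loading.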
-
Find a single page element:
find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector
Find a group of page elements:
find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector
Suppose there is an input box like this:
<input type="text" name="passwd" id="passwd-id" />
Any of the following will find it (though the match is not necessarily unique):
element = driver.find_element_by_id("passwd-id")
element = driver.find_element_by_name("passwd")
element = driver.find_element_by_tag_name("input")
element = driver.find_element_by_xpath("//input[@id='passwd-id']")
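The id, name, and tag lookups above can be tried out without a browser. This is a minimal sketch that assumes Python's standard-library xml.etree.ElementTree as a stand-in for the browser DOM (it supports only a subset of XPath); the surrounding form markup and the second input are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment wrapping the input box from the text above.
PAGE = """
<form>
  <input type="text" name="user" id="user-id" />
  <input type="text" name="passwd" id="passwd-id" />
</form>
"""

root = ET.fromstring(PAGE)

# Same idea as find_element_by_xpath("//input[@id='passwd-id']").
by_id = root.find(".//input[@id='passwd-id']")

# Same idea as find_element_by_name("passwd").
by_name = root.find(".//*[@name='passwd']")

# Same idea as find_elements_by_tag_name("input"): every match is returned.
all_inputs = root.findall(".//input")

print(by_id.get("name"))
print(by_id is by_name)
print(len(all_inputs))
```

The tag-name lookup matches both inputs, which is why a "find one element" call by tag name is rarely unique, while the id and name lookups land on the same node.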
-
Once you have an element, the element itself is of no value; what matters is the text or link it contains:
text = element.text
link = element.get_attribute('href')
-
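After getting an element, the next step is of course to type something into the text field, which you can do like this:
element.send_keys("some text")
You can also use the Keys class to simulate pressing a key:
from selenium.webdriver.common.keys import Keys
element.send_keys("and some", Keys.ARROW_DOWN)
Text you send is appended to whatever is already in the field. To clear the field's contents, use:
element.clear()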
-
Drop-down boxes can be handled with the Select class:
from selenium.webdriver.support.ui import Select

select = Select(driver.find_element_by_name('name'))
select.select_by_index(index)
select.select_by_visible_text("text")
select.select_by_value(value)
select.deselect_all()
all_selected_options = select.all_selected_options
-
Submit a form:
driver.find_element_by_id("submit").click()
-
Cookie handling:
cookie = {'name': 'foo', 'value': 'bar'}
driver.add_cookie(cookie)
driver.get_cookies()
-
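Waiting for pages:
This part is very important. More and more pages use Ajax, so the program cannot know exactly when a given element has finished loading. That makes elements harder to locate and raises the chance of an ElementNotVisibleException.
Selenium therefore provides two kinds of waits: implicit waits and explicit waits.
An implicit wait tells WebDriver to keep polling the DOM for up to a given amount of time when an element is not immediately available:
driver.implicitly_wait(10) # seconds
An explicit wait names a condition and continues only once that condition holds. Commonly used conditions:
title_is: the title is exactly the given text
title_contains: the title contains the given text
presence_of_element_located: the element is present in the DOM; pass a locator tuple such as (By.ID, 'p')
visibility_of_element_located: the element is visible; pass a locator tuple
visibility_of: the element is visible; pass an element object
presence_of_all_elements_located: at least one matching element is present; pass a locator tuple
text_to_be_present_in_element: the element's text contains the given text
text_to_be_present_in_element_value: the element's value contains the given text
frame_to_be_available_and_switch_to_it: the frame has loaded, and the driver switches to it
invisibility_of_element_located: the element is not visible
element_to_be_clickable: the element is clickable
staleness_of: the element is no longer attached to the DOM; useful for detecting that the page has refreshed
element_to_be_selected: the element is selected; pass an element object
element_located_to_be_selected: the element is selected; pass a locator tuple
element_selection_state_to_be: pass an element object and a state; returns True if they match, otherwise False
element_located_selection_state_to_be: pass a locator tuple and a state; returns True if they match, otherwise False
alert_is_present: an alert is showing
Official API: http://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.support.expected_conditions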
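To make the idea behind an explicit wait concrete, here is a small browser-free sketch of "poll a condition until it holds or a timeout expires". The wait_until helper and the fake element_loaded condition are illustrative names, not part of Selenium, and this is not Selenium's actual implementation:

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Stand-in for an Ajax page: the "element" only appears on the third check.
checks = {"count": 0}


def element_loaded():
    checks["count"] += 1
    return "element" if checks["count"] >= 3 else None


found = wait_until(element_loaded, timeout=2.0, poll_interval=0.01)
print(found, checks["count"])
```

With Selenium itself, the same pattern is written as WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'p'))), where WebDriverWait comes from selenium.webdriver.support.ui, EC is selenium.webdriver.support.expected_conditions, and By comes from selenium.webdriver.common.by.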
-
Browser back and forward:
driver.back()
driver.forward()
IEEEXplore in Practice
-
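[Image: 20171118-zhangbo]
The page shows the full list of articles by the scholar Zhang Bo (split across two pages). The first things to scrape are the paper titles, which is fairly simple, and the sources are fairly simple too. For example, the first article in the image above is titled Smale Horseshoes and Symbolic Dynamics in the Buck–Boost DC–DC Converter, and its source is IEEE Transactions on Industrial Electronics.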
-
A group of such elements can be found with find_elements_by_css_selector:
article_names, article_links, article_sources = [], [], []

# get the title elements of every article on this page
article_name_ele_list = driver.find_elements_by_css_selector("h2 a.ng-binding.ng-scope")
for article_name_ele in article_name_ele_list:
    # for each title element, extract the title text (a string) and the article url
    article_name = article_name_ele.text
    article_link = article_name_ele.get_attribute('href')
    article_names.append(article_name)
    print("article_name = ", article_name)
    article_links.append(article_link)
    print("article_link = ", article_link)

# get the source elements of every article on this page
article_source_ele_list = driver.find_elements_by_css_selector("div.description.u-mb-1 a.ng-binding.ng-scope")
for article_source_ele in article_source_ele_list:
    # for each source element, extract the source text (a string)
    article_source = article_source_ele.text
    article_sources.append(article_source)
    print("article_source =", article_source)
-
Turning pages on this site is a pain. There is a pagination toolbar at the bottom, but every button uses an on-click handler that calls a custom function, which is JS territory again and a bit of a hassle. Then I noticed a pattern in how the url changes.
The entry point (the first page) looks like this:
http://ieeexplore.ieee.org/search/searchresult.jsp?queryText=(%22Authors%22:Zhang%20Bo)&refinements=4224983357&matchBoolean=true&searchField=Search_All
The second page looks like this:
http://ieeexplore.ieee.org/search/searchresult.jsp?queryText=(%22Authors%22:Zhang%20Bo)&refinements=4224983357&matchBoolean=true&pageNumber=2&searchField=Search_All
The only difference is an extra pageNumber parameter. What happens if I manually set pageNumber to 3?
[Image: 20171118-notfound]
-
So I can ignore the pagination toolbar completely and turn pages just by navigating to the url.
import time

from selenium import webdriver

driver = webdriver.Chrome(executable_path='chromedriver.exe')
article_names, article_links, article_sources = [], [], []

pageNumber = 1
while True:
    driver.get(
        'http://ieeexplore.ieee.org/search/searchresult.jsp?queryText=(%22Authors%22:Zhang%20Bo)&refinements=4224983357&matchBoolean=true&pageNumber='
        + str(pageNumber) + '&searchField=Search_All')
    time.sleep(5)
    print("start to check if this is the last page !!!")
    try:
        # if this is NOT the last page, this will raise exception
        driver.find_element_by_css_selector("p.List-results-none--lg.u-mb-0")
    except Exception as e:
        print("This page is good to go !!!")
    else:
        print("The last page !!!")
        break

    article_name_ele_list = driver.find_elements_by_css_selector("h2 a.ng-binding.ng-scope")
    for article_name_ele in article_name_ele_list:
        article_name = article_name_ele.text
        article_link = article_name_ele.get_attribute('href')
        article_names.append(article_name)
        print("article_name = ", article_name)
        article_links.append(article_link)
        print("article_link = ", article_link)

    article_source_ele_list = driver.find_elements_by_css_selector("div.description.u-mb-1 a.ng-binding.ng-scope")
    for article_source_ele in article_source_ele_list:
        article_source = article_source_ele.text
        article_sources.append(article_source)
        print("article_source =", article_source)

    pageNumber += 1
-
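Explanation:
Start at the first page, then enter the while loop. Each iteration first checks whether the current page is the notfound page. If it is, the previous page was the last one, so the loop exits. If it is not, the loop collects the article titles, links, and sources, and finally increments pageNumber by one.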
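The paging logic can be separated from the browser and checked on its own. The sketch below is one possible arrangement, not the script used in this post: page_url, crawl, the max_pages safety cap, and the fake site dictionary are all hypothetical names introduced for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# Search url from the text above, up to (not including) the pageNumber value.
BASE = ("http://ieeexplore.ieee.org/search/searchresult.jsp"
        "?queryText=(%22Authors%22:Zhang%20Bo)"
        "&refinements=4224983357&matchBoolean=true")


def page_url(page_number):
    """Return the search-results url for the given 1-based page number."""
    return "%s&pageNumber=%d&searchField=Search_All" % (BASE, page_number)


def crawl(fetch_titles, max_pages=50):
    """Visit page 1, 2, ... until fetch_titles returns an empty list."""
    titles = []
    for page_number in range(1, max_pages + 1):
        page_titles = fetch_titles(page_url(page_number))
        if not page_titles:  # "no results" page: the previous page was the last
            break
        titles.extend(page_titles)
    return titles


# Fake site with two pages of results, standing in for the real browser calls.
FAKE_SITE = {page_url(1): ["paper A", "paper B"], page_url(2): ["paper C"]}
titles = crawl(lambda url: FAKE_SITE.get(url, []))

print(parse_qs(urlsplit(page_url(2)).query)["pageNumber"])
print(titles)
```

In the real script, fetch_titles would be the part that calls driver.get and reads the title elements; keeping it as a parameter makes the stop condition easy to test without opening Chrome.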
Getting the Article Keywords
-
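Good. The first step is always the hardest, and we now have the links to this scholar's 20 papers. We need to open these links one by one and get the keywords from each. But when we open the first article's link, the page shows "Abstract" by default, and we have to click "Keywords" to see them.
[Image: 20171118-abstract_url]
[Image: 20171118-Keywords_url]
But looking at the url, luck is on our side: we only need to append '/keywords'.
How do we actually get these keywords, though? Note that this article has two kinds of keywords: IEEE Keywords and Author Keywords. Some articles have more than these two and may also include INSPEC: Controlled Indexing and INSPEC: Non-Controlled Indexing.
Even once we know these four categories, the keywords themselves are not fixed. As far as I can tell, the only thing that ties the keywords to their category is their position in the element hierarchy.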
-
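Next, I need to introduce XPath.
XPath, the XML Path Language, is a language for addressing parts of an XML document.
XPath is based on the tree structure of XML and provides the ability to find nodes in that tree. It was originally proposed as a common syntax model between XPointer and XSL, but developers quickly adopted it as a small query language. Here, each keyword sits in the group of nodes that follows its keyword category, so the following-sibling axis can be used to get that group of keyword elements.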
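The following-sibling idea can be illustrated without a browser. The standard-library xml.etree.ElementTree does not implement the following-sibling axis, so this sketch walks the siblings by hand to show what strong/following-sibling::*/li/a selects. The markup and the keyword values are made up to mimic the structure described here; they are not copied from the real page:

```python
import xml.etree.ElementTree as ET

# Made-up markup: each category label (<strong>) is followed by a sibling
# <ul> that holds the keywords of that category.
PAGE = """
<ul>
  <li><strong>IEEE Keywords</strong>
    <ul><li><a>Bifurcation</a></li><li><a>Switches</a></li></ul>
  </li>
  <li><strong>Author Keywords</strong>
    <ul><li><a>chaos</a></li></ul>
  </li>
</ul>
"""


def keywords_by_category(root):
    """Map each category label to the keywords in its following siblings."""
    result = {}
    for item in root.findall("li"):
        children = list(item)
        label = item.find("strong")
        # Same idea as strong/following-sibling::*/li/a in XPath.
        siblings_after = children[children.index(label) + 1:]
        result[label.text] = [a.text for sib in siblings_after
                              for a in sib.findall("li/a")]
    return result


keywords = keywords_by_category(ET.fromstring(PAGE))
print(keywords)
```

Because the lookup depends only on the position of each keyword list relative to its label, it keeps working whether an article has two categories or four.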
-
The urls of all articles are already stored in article_links. We also need a regular expression to extract each article's article_id:
import re

article_keywords = []

# get into articles page
for article_link in article_links:
    driver.get(article_link + "keywords")
    article_id = re.findall("[0-9]+", article_link)[0]
    time.sleep(3)
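The two string operations in this step, extracting the id and building the keywords url, can be pulled into small helpers and checked against one of the links printed earlier. The helper names article_id_of and keywords_url are hypothetical, and the trailing-slash handling is an added safeguard rather than something the original script does:

```python
import re


def article_id_of(article_link):
    """Return the first run of digits in the link (the document number)."""
    match = re.search(r"[0-9]+", article_link)
    if match is None:
        raise ValueError("no numeric id in %r" % article_link)
    return match.group(0)


def keywords_url(article_link):
    """Append 'keywords' to the link, whether or not it ends with a slash."""
    return article_link.rstrip("/") + "/keywords"


link = "http://ieeexplore.ieee.org/document/7926377/"
print(article_id_of(link))
print(keywords_url(link))
print(keywords_url("http://ieeexplore.ieee.org/document/7926377"))
```

Plain concatenation with "keywords" works for the links collected here because they all end in "/"; normalising the slash just avoids a malformed url if a link ever arrives without it.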
-
Create a dictionary with four keys to hold the four keyword categories:
    # get into keywords page (still inside the for loop over article_links)
    dic = {}
    dic['IEEE Keywords'] = []
    dic['INSPEC: Controlled Indexing'] = []
    dic['INSPEC: Non-Controlled Indexing'] = []
    dic['Author Keywords'] = []
-
First find the elements for the keyword categories, then use following-sibling to find the keywords under each one:
    # still inside the for loop over article_links
    keywords_type_list = driver.find_elements_by_css_selector("li.doc-keywords-list-item.ng-scope strong.ng-binding")
    # ['IEEE Keywords', 'INSPEC: Controlled Indexing', 'INSPEC: Non-Controlled Indexing', 'Author Keywords']
    for i in range(len(keywords_type_list)):
        # locate each keyword category, then extract all keywords under it
        li = []
        keywords_ele_list = driver.find_elements_by_xpath(
            ".//*[@id=" + article_id + "]/div/ul/li[" + str(i+1) + "]/strong/following-sibling::*/li/a")
        for j in keywords_ele_list:
            li.append(j.text)
        dic[keywords_type_list[i].text] = li
    article_keywords.append(dic)
-
Finally, write everything out to a csv file:
import csv
from pprint import pprint

# already get all data, now output to the csv file
pprint(article_keywords)
with open("ieee_zhangbo_.csv", "w", newline="") as f:
    csvwriter = csv.writer(f, dialect="excel")
    csvwriter.writerow(['article_name', 'article_source', 'article_link',
                        'IEEE Keywords', 'INSPEC: Controlled Indexing',
                        'INSPEC: Non-Controlled Indexing', 'Author Keywords'])
    for i in range(len(article_names)):
        csvwriter.writerow([article_names[i], article_sources[i], article_links[i],
                            article_keywords[i]['IEEE Keywords'],
                            article_keywords[i]['INSPEC: Controlled Indexing'],
                            article_keywords[i]['INSPEC: Non-Controlled Indexing'],
                            article_keywords[i]['Author Keywords']])
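One detail worth knowing: when a Python list is passed as a csv cell, the csv module writes its string form, such as ['a', 'b']. If plain text cells are preferred, the lists can be joined first. The sketch below is a trimmed variant with only two keyword columns; the write_rows helper, the "; " separator, and the kw1/kw2 placeholder keywords are assumptions for illustration, and it writes to an in-memory buffer instead of a file:

```python
import csv
import io

FIELDS = ["article_name", "article_source", "article_link",
          "IEEE Keywords", "Author Keywords"]


def write_rows(out, names, sources, links, keywords):
    """Write one csv row per article; keyword lists become 'a; b' strings."""
    writer = csv.writer(out, dialect="excel")
    writer.writerow(FIELDS)
    for name, source, link, kw in zip(names, sources, links, keywords):
        writer.writerow([name, source, link,
                         "; ".join(kw.get("IEEE Keywords", [])),
                         "; ".join(kw.get("Author Keywords", []))])


buffer = io.StringIO()
write_rows(buffer,
           ["A Z-Source Half-Bridge Converter"],
           ["IEEE Transactions on Industrial Electronics"],
           ["http://ieeexplore.ieee.org/document/6494636/"],
           [{"IEEE Keywords": ["kw1", "kw2"]}])  # placeholder keywords
print(buffer.getvalue())
```

Using kw.get with an empty-list default also means an article that lacks one of the categories produces an empty cell instead of a KeyError.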
-
Output:
"C:\Program Files\Python36\python.exe" D:/PythonProject/immoc/IEEEXplorer_get_article.py
start to check if this is the last page !!!
This page is good to go !!!
article_name = Smale Horseshoes and Symbolic Dynamics in the Buck–Boost DC–DC Converter
article_link = http://ieeexplore.ieee.org/document/7926377/
article_name = A Novel Single-Input–Dual-Output Impedance Network Converter
article_link = http://ieeexplore.ieee.org/document/7827092/
article_name = A Z-Source Half-Bridge Converter
article_link = http://ieeexplore.ieee.org/document/6494636/
article_name = Design of Analogue Chaotic PWM for EMI Suppression
article_link = http://ieeexplore.ieee.org/document/5590287/
article_name = A novel H5-D topology for transformerless photovoltaic grid-connected inverter application
article_link = http://ieeexplore.ieee.org/document/7512376/
article_name = A Common Grounded Z-Source DC–DC Converter With High Voltage Gain
article_link = http://ieeexplore.ieee.org/document/7378484/
article_name = Frequency Splitting Phenomena of Magnetic Resonant Coupling Wireless Power Transfer
article_link = http://ieeexplore.ieee.org/document/6971783/
article_name = Modeling and analysis of the stable power supply based on the magnetic flux leakage transformer
article_link = http://ieeexplore.ieee.org/document/7037927/
article_name = On Thermal Impact of Chaotic Frequency Modulation SPWM Techniques
article_link = http://ieeexplore.ieee.org/document/7736981/
article_name = Extended Switched-Boost DC-DC Converters Adopting Switched-Capacitor/Switched-Inductor Cells for High Step-up Conversion
article_link = http://ieeexplore.ieee.org/document/7790823/
article_source = IEEE Transactions on Industrial Electronics
article_source = IEEE Journal of Emerging and Selected Topics in Power Electronics
article_source = IEEE Transactions on Industrial Electronics
article_source = IEEE Transactions on Electromagnetic Compatibility
article_source = 2016 IEEE 8th International Power Electronics and Motion Control Conference (IPEMC-ECCE Asia)
article_source = IEEE Transactions on Industrial Electronics
article_source = IEEE Transactions on Magnetics
article_source = 2014 International Power Electronics and Application Conference and Exposition
article_source = IEEE Transactions on Industrial Electronics
article_source = IEEE Journal of Emerging and Selected Topics in Power Electronics
start to check if this is the last page !!!
This page is good to go !!!
article_name = Common-Mode Electromagnetic Interference Calculation Method for a PV Inverter With Chaotic SPWM
article_link = http://ieeexplore.ieee.org/document/7120165/
article_name = Stability Analysis of the Coupled Synchronous Reluctance Motor Drives
article_link = http://ieeexplore.ieee.org/document/7460928/
article_name = A modified AGREE reliability allocation method research in power converter
article_link = http://ieeexplore.ieee.org/document/7107251/
article_name = A single-switch high step-up converter without coupled inductor
article_link = http://ieeexplore.ieee.org/document/7512635/
article_name = Hybrid Z-Source Boost DC–DC Converters
article_link = http://ieeexplore.ieee.org/document/7563395/
article_name = A study of hybrid control algorithms for buck-boost converter based on fixed switching frequency
article_link = http://ieeexplore.ieee.org/document/6566548/
article_name = Bifurcation and Border Collision Analysis of Voltage-Mode-Controlled Flyback Converter Based on Total Ampere-Turns
article_link = http://ieeexplore.ieee.org/document/5729352/
article_name = Frequency, Impedance Characteristics and HF Converters of Two-Coil and Four-Coil Wireless Power Transfer
article_link = http://ieeexplore.ieee.org/document/6783963/
article_name = Sneak circuit analysis for a DCM flyback DC-DC converter considering parasitic parameters
article_link = http://ieeexplore.ieee.org/document/7512450/
article_name = Detecting bifurcation types in DC-DC switching converters by duplicate symbolic sequence
article_link = http://ieeexplore.ieee.org/document/6572495/
article_source = IEEE Transactions on Magnetics
article_source = IEEE Transactions on Circuits and Systems II: Express Briefs
article_source = 2014 10th International Conference on Reliability, Maintainability and Safety (ICRMS)
article_source = 2016 IEEE 8th International Power Electronics and Motion Control Conference (IPEMC-ECCE Asia)
article_source = IEEE Transactions on Industrial Electronics
article_source = 2013 IEEE 8th Conference on Industrial Electronics and Applications (ICIEA)
article_source = IEEE Transactions on Circuits and Systems I: Regular Papers
article_source = IEEE Journal of Emerging and Selected Topics in Power Electronics
article_source = 2016 IEEE 8th International Power Electronics and Motion Control Conference (IPEMC-ECCE Asia)
article_source = 2013 IEEE International Symposium on Circuits and Systems (ISCAS2013)
start to check if this is the last page !!!
The last page !!!
-
The csv file:
[Image: 20171118-zhangbocsv]