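Tips and the big picture
- Look for a "print this page" link or a mobile version of the site; their markup is usually simpler and easier to parse.
- Look for the information inside the page's JavaScript.
- The information may already be in the URL itself.
- Try a different website that carries the same information.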
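As a small illustration of the tip that information may sit in the URL, the sketch below reads values out of a query string with the standard library. The URL, its path, and its parameter names are made up for the example.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical URL: host, path, and parameter names are illustrative only.
url = "http://example.com/products/list?category=books&page=3"

parts = urlparse(url)
params = parse_qs(parts.query)  # maps each parameter name to a list of values

category = params["category"][0]
page = int(params["page"][0])
print(parts.path, category, page)
```

When a site encodes the page number or item id this way, it is often possible to build the next URL directly instead of parsing navigation links.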
get_text()
Strips all tags and keeps only the text content. Because the tag structure is lost afterwards, save this call for the last step.
```python
nameList = bsObj.findAll("span", {"class": "green"})
for name in nameList:
    print(name.get_text())
```
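findAll()
```python
# Signature: findAll(tag, attributes, recursive, text, limit, keywords)

bsObj.findAll({"h1", "h2", "h3", "h4", "h5", "h6"})  # tags whose name is any of these
# A dict literal cannot repeat the "class" key, so pass the allowed values as a set.
bsObj.findAll("span", {"class": {"green", "red"}})   # span tags whose class is green or red
nameList = bsObj.findAll(text="the prince")  # text nodes equal to "the prince"; len(nameList) is the count
allText = bsObj.findAll(id="text")           # keyword argument: match on an attribute
allText = bsObj.findAll("", {"id": "text"})  # same meaning as the line above
bsObj.findAll(class_="green")  # write class_ because class is a reserved word in Python
bsObj.findAll(lambda tag: len(tag.attrs) == 2)  # a lambda as the filter: tags with exactly two attributes
```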
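To try the findAll() filters on something concrete, here is a minimal self-contained sketch. The HTML snippet is invented for illustration, and lists are used where the notes use sets.

```python
from bs4 import BeautifulSoup

# Made-up HTML, used only for this illustration.
html = (
    '<html><body><h1>Title</h1><h2>Subtitle</h2>'
    '<p id="text">Once <span class="green">Anna</span> met '
    '<span class="red">the prince</span> and '
    '<span class="green">Pierre</span>.</p></body></html>'
)
bsObj = BeautifulSoup(html, "html.parser")

headings = bsObj.findAll(["h1", "h2"])                        # match several tag names
colored = bsObj.findAll("span", {"class": ["green", "red"]})  # class is green or red
first_green = bsObj.findAll("span", {"class": "green"}, limit=1)  # stop after one hit
prince = bsObj.findAll(text="the prince")                     # text nodes, not tags
by_id = bsObj.findAll(id="text")                              # keyword argument
one_attr = bsObj.findAll(lambda tag: len(tag.attrs) == 1)     # tags with exactly one attribute

print([h.name for h in headings], len(colored), first_green[0].get_text())
```

Note that the text filter returns the matching strings themselves, while the other filters return tags.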
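children, descendants
```python
# Both are properties, not methods, so they take no parentheses.
bsObj.find("tr", {"id": "gift1"}).children     # only the direct (first-level) children of the matched tag
bsObj.find("tr", {"id": "gift1"}).descendants  # everything nested inside the matched tag, at any depth
```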
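The difference between direct children and all descendants is easier to see on a tiny document. This is a sketch on an invented table row; the id and cell contents are assumptions for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up table row, for illustration only.
html = (
    '<table><tr id="gift1"><td>Vegetable Basket</td>'
    '<td><span class="price">$15.00</span></td></tr></table>'
)
bsObj = BeautifulSoup(html, "html.parser")
row = bsObj.find("tr", {"id": "gift1"})

# Direct children: just the two <td> cells.
child_names = [node.name for node in row.children if isinstance(node, Tag)]

# Descendants: the cells plus the <span> nested inside the second cell.
# Text nodes also appear among the descendants; they are filtered out here.
descendant_names = [node.name for node in row.descendants if isinstance(node, Tag)]

print(child_names, descendant_names)
```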
next_siblings, previous_siblings
```python
bsObj.find("table", {"id": "giftList"}).tr.next_siblings  # siblings that come after the current tr tag
bsObj.find("table", {"id": "giftList"}).previous_siblings  # siblings that come before the current tag
```
parent
```python
# Go up to the parent of the current tag, then sideways to the previous sibling.
bsObj.find("img", {"src": "../img/gifts/img1.jpg"}).parent.previous_sibling.get_text()
```
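The sibling and parent navigation above can be sketched end to end on a small table. The markup below is invented for the example and written without whitespace between tags, so that no stray text nodes appear among the siblings.

```python
from bs4 import BeautifulSoup

# Made-up gift table, for illustration only.
html = (
    '<table id="giftList">'
    '<tr><th>Item</th><th>Cost</th><th>Image</th></tr>'
    '<tr id="gift1"><td>Vegetable Basket</td><td>$15.00</td>'
    '<td><img src="../img/gifts/img1.jpg"></td></tr>'
    '<tr id="gift2"><td>Russian Nesting Dolls</td><td>$10,000.52</td>'
    '<td><img src="../img/gifts/img2.jpg"></td></tr>'
    '</table>'
)
bsObj = BeautifulSoup(html, "html.parser")

# .tr is the first row (the header); next_siblings yields the rows after it.
header = bsObj.find("table", {"id": "giftList"}).tr
data_rows = [row["id"] for row in header.next_siblings]

# From the image, go up to its <td>, then to the <td> just before it.
img = bsObj.find("img", {"src": "../img/gifts/img1.jpg"})
price = img.parent.previous_sibling.get_text()

print(data_rows, price)
```

On real pages the whitespace between tags usually shows up as text nodes among the siblings, so a filter on the node type is often needed.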
regular expressions
```python
import re

# Pass a compiled regex as the attribute value; the raw string keeps the backslashes intact.
images = bsObj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
```
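Before handing a pattern to findAll(), it can help to check it against a few sample values using only the re module. The src strings below are invented for the test.

```python
import re

# Same pattern as in the notes, written as a raw string.
pattern = re.compile(r"\.\./img/gifts/img.*\.jpg")

# Made-up src values, for illustration only.
candidates = [
    "../img/gifts/img1.jpg",
    "../img/gifts/img12.jpg",
    "../img/gifts/logo.jpg",
    "../img/gifts/img3.png",
]

# Keep the values in which the pattern is found.
matches = [src for src in candidates if pattern.search(src)]
print(matches)
```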
Getting tag attributes (attrs)
```python
myImgTag.attrs         # a dict holding all attributes of this tag
myImgTag.attrs['src']  # the value of the src attribute
```
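A short runnable sketch of attribute access follows. The img tag is invented for the example; the subscript shorthand and get() are shown as alternatives to going through attrs.

```python
from bs4 import BeautifulSoup

# Made-up tag, for illustration only.
html = '<img src="../img/gifts/img1.jpg" alt="Vegetable Basket">'
myImgTag = BeautifulSoup(html, "html.parser").find("img")

all_attrs = myImgTag.attrs      # dict of every attribute on the tag
src = myImgTag.attrs["src"]     # value of one attribute
same_src = myImgTag["src"]      # shorthand for the line above
title = myImgTag.get("title")   # get() returns None when the attribute is missing

print(all_attrs, src, title)
```

Using get() avoids a KeyError on tags that lack the attribute, which is common when looping over many tags.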
Other options besides bs4
1. lxml: parses HTML and XML, and is very fast.
2. HTML Parser: Python's built-in parsing library.
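For comparison with bs4, here is a rough sketch of the built-in route using html.parser from the standard library. The class name GreenSpanCollector and the HTML are made up for the example, and nested spans are not handled.

```python
from html.parser import HTMLParser


class GreenSpanCollector(HTMLParser):
    """Hypothetical helper: collects the text of each <span class="green">."""

    def __init__(self):
        super().__init__()
        self.in_green = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "span" and dict(attrs).get("class") == "green":
            self.in_green = True
            self.names.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_green = False

    def handle_data(self, data):
        if self.in_green:
            self.names[-1] += data


# Made-up HTML, for illustration only.
html = (
    '<p><span class="green">Anna</span> and <span class="red">Boris</span> '
    'met <span class="green">Pierre</span>.</p>'
)
collector = GreenSpanCollector()
collector.feed(html)
collector.close()
names = collector.names
print(names)
```

The event-driven style needs more bookkeeping than findAll(), which is the trade-off for having no third-party dependency.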