
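The official documentation for bs4 is already very thorough (see the Beautiful Soup 4 official documentation). What follows are my own digested notes; you are better off reading the official docs, which are more complete and easier to scan. The notes cover these topics: parsers, creating the DOM tree, traversing the DOM tree, modifying the DOM tree, the differences between Beautiful Soup 3 and 4, and using Soup together with Requests. P.S. If you already know the DOM from JavaScript, soup will be easier to understand.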

Beautiful Soup

Beautiful Soup converts a complex HTML document into a tree structure (a DOM) in which every node is a Python object. All objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.
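To make the four object kinds concrete, here is a minimal sketch. The sample markup is my own assumption, and it uses the standard-library "html.parser" backend so that it runs without installing lxml.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical sample markup: one tag holding a text node and a comment.
soup = BeautifulSoup("<p>Hello<!-- a note --></p>", "html.parser")

tag = soup.p                # a Tag
text = tag.contents[0]      # a NavigableString
comment = tag.contents[1]   # a Comment (a subclass of NavigableString)

kinds = [type(obj).__name__ for obj in (soup, tag, text, comment)]
print(kinds)
```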

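A. Parsers for the DOM tree:

lxml is the recommended parser because it is faster. Another benefit of going through a parser is fault tolerance: it copes with broken markup such as a missing closing tag.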
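A small sketch of that fault tolerance. The broken markup is made up for illustration, and I use "html.parser" here only so the example has no extra dependency; different parsers can repair the same broken input in different ways.

```python
from bs4 import BeautifulSoup

# Hypothetical broken markup: neither <div> nor <p> is ever closed.
broken = "<div><p>no closing tags"

soup = BeautifulSoup(broken, "html.parser")
repaired = str(soup)

# The tree is still navigable, and serializing it emits the closing tags.
print(soup.p.get_text())
print(repaired)
```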

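B. Three ways to create the DOM tree:

1) From a local file: with open("foo.html", "r") as foo_file: soup_foo = BeautifulSoup(foo_file, 'lxml')

2) From a string: soup = BeautifulSoup("hello world", 'lxml'). The second argument names the parser; when the input is bytes, an encoding can optionally be given with from_encoding.

3) From a remote URL (Python 3): from urllib.request import urlopen; url = "http://www.packtpub.com/books"; page = urlopen(url); soup_packtpage = BeautifulSoup(page, 'lxml')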
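The first two approaches can be sketched offline as follows. The markup and the temporary file are assumptions standing in for a real foo.html, and "html.parser" is used so nothing beyond bs4 is required.

```python
import os
import tempfile

from bs4 import BeautifulSoup

markup = "<html><body><h1>hello world</h1></body></html>"

# From a string.
soup_from_string = BeautifulSoup(markup, "html.parser")

# From an open file handle (a temporary file stands in for foo.html).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "foo.html")
    with open(path, "w", encoding="utf-8") as out_file:
        out_file.write(markup)
    with open(path, "r", encoding="utf-8") as foo_file:
        soup_from_file = BeautifulSoup(foo_file, "html.parser")

print(soup_from_string.h1.string, soup_from_file.h1.string)
```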

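C. The DOM tree:

Nodes [element, attributes, content (string, NavigableString, contents, Comment)] and the node's family [parent, adjacent sibling, all following siblings, all preceding siblings, direct children, all descendants].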

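1) The tag node: similar to an element in JavaScript.

Access an element: soup = BeautifulSoup(html_tag1, 'lxml'); tag1 = soup.a

Access the element's name: tagname = tag1.name

Access the element's attributes: tag1['class'] for a single attribute, or tag1.attrs for the whole dict.

Modify an attribute: tag1['class'] = 'verybold'. Assign a multi-valued attribute with a list: rel_soup.a['rel'] = ['index', 'contents']
Check whether an attribute exists: tag1.has_attr('class')
Read the same attribute from every matching tag, for example the link of every a tag: for link in soup.find_all('a'): print(link.get('href'))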

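Get an element's text: soup.p.string. If a tag has more than one child, it is ambiguous which child .string should refer to, so .string is None. If a tag contains several strings, loop over them with .strings: for string in soup.strings: print(repr(string))

Get an element's direct children as a list: head_tag.contents, for example ['Hello', ' there'] (shown as [u'Hello', u' there'] under Python 2).

The text inside a tag is a NavigableString object. A tag has no .NavigableString attribute; reach the text through .string, .strings, or .contents.

A comment inside a tag is a Comment object, a subclass of NavigableString: use the Comment class when creating one, and soup.b.prettify() to print the tag with the comment included.

Get all text: soup.get_text()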
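The tag accessors above, tried on a small document. The markup is a made-up example and "html.parser" is chosen only to keep the sketch dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a <p> with two classes, an id, and two children.
soup = BeautifulSoup('<p class="title bold" id="t1">Hello <b>there</b></p>', "html.parser")
tag = soup.p

name = tag.name                   # the tag name
classes = tag["class"]            # class is multi-valued, so this is a list
has_id = tag.has_attr("id")
missing = tag.get("href")         # .get() returns None instead of raising

whole_string = tag.string         # None: <p> has two children
inner_string = tag.b.string       # <b> has exactly one child
pieces = list(tag.strings)        # every string below <p>
all_text = tag.get_text()

tag["class"] = "verybold"         # overwrite the attribute
print(name, classes, has_id, missing, whole_string, inner_string, pieces, all_text)
```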

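2) The node's family

Direct children: .children and .contents (both are attributes, not methods).

.contents is a list and .children is an iterator over the same nodes. Both include text and comment nodes as well as tags.

All descendants, recursively: .descendants

Text of the descendants: .strings and .stripped_strings (the latter skips whitespace-only strings and strips surrounding whitespace).

Parent: .parent

All ancestors: .parents

All following or all preceding siblings: .next_siblings and .previous_siblings (the single adjacent sibling is .next_sibling or .previous_sibling).

The next or previous object in parse order, which is not necessarily a sibling: .next_element and .previous_element

All following or all preceding objects in parse order: .next_elements and .previous_elements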
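A short sketch of the family attributes, mainly to show how .next_sibling differs from .next_element. The three-link markup is an assumption made up for this example.

```python
from bs4 import BeautifulSoup

html = '<div><a id="link1">one</a><a id="link2">two</a><a id="link3">three</a></div>'
soup = BeautifulSoup(html, "html.parser")

div = soup.div
first = soup.a
last = soup.find(id="link3")

same_children = list(div.children) == div.contents   # iterator vs list, same nodes
descendant_count = len(list(div.descendants))        # 3 tags + 3 strings

ancestors = [parent.name for parent in first.parents]
after_first = [sib["id"] for sib in first.next_siblings]
before_last = [sib["id"] for sib in last.previous_siblings]

next_sibling = first.next_sibling    # the <a id="link2"> tag
next_element = first.next_element    # the string "one" inside the first <a>
print(ancestors, after_first, before_last, repr(next_element))
```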

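D. Searching the DOM tree

You can search with methods or with CSS selectors.

Method: find_all(name, attrs, recursive, string, limit, **kwargs)

Filters: the accepted values are a string, a regular expression, a list, True, and a custom function that takes one argument, which may be a lambda (I cover that in my article on regular expressions).

For example, a string soup.find_all('b'), a regular expression soup.find_all(re.compile("^b")), a list soup.find_all(["a","b"]), True as in soup.find_all(True), and a one-argument function as in soup.find_all(has_class_but_no_id), where the function is def has_class_but_no_id(tag): return tag.has_attr('class') and not tag.has_attr('id').

name: the tag name. It can name one element, find('head'), or several. Several elements can be given as a list, find(['a','b']); as a lambda, find(lambda tag: len(tag.name) == 1), which matches tags whose name is one character long; or as a regular expression, find(re.compile('^p')). find(True) matches every tag. The dict form find({'head': True, 'body': True}) is a Beautiful Soup 3 idiom; in bs4 the list is the documented form.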

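attrs: the element's attributes.

When searching on a named attribute, the accepted values are a string, a regular expression, a list, and True.

Searching by attribute value: a single attribute, as in find(id='xxx'), soup.find_all("a", class_="sister"), or soup.find_all("a", attrs={"class": "sister"}). Several attributes can be combined, for example with a regular expression: find(attrs={'id': re.compile('xxx'), 'p': 'xxx'}) or soup.find_all(class_=re.compile("itl")). True matches any value and None requires the attribute to be absent: find(attrs={'id': True, 'align': None}). The attrs dict is also the way to search attribute names that are not valid Python keyword arguments: find_all(attrs={"data-foo": "value"}).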

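The recursive and limit arguments

recursive=False searches only direct children; otherwise the whole subtree is searched. The default is True. With find_all and the other methods that return a list, limit caps the number of results: find_all('p', limit=2) returns the first two matching tags.

soup.html.find_all("title", recursive=False)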
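The filter types and the recursive and limit arguments can be tried on one small document. The markup below is an assumption written for this sketch; only the function has_class_but_no_id comes from the notes above.

```python
import re

from bs4 import BeautifulSoup

html = (
    "<html><head><title>T</title></head><body>"
    '<p class="story" id="p1">'
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    "<b>bold</b></p>"
    '<p class="story">x</p>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")


def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")


by_regex = [t.name for t in soup.find_all(re.compile("^b"))]       # names starting with b
by_list = [t.name for t in soup.find_all(["a", "b"])]
by_function = soup.find_all(has_class_but_no_id)
by_lambda = [t.name for t in soup.find_all(lambda tag: len(tag.name) == 1)]
by_class = soup.find_all("a", class_="sister")
with_any_id = [t["id"] for t in soup.find_all(attrs={"id": True})]
limited = soup.find_all("p", limit=1)
# <title> sits under <head>, so it is not a direct child of <html>.
direct_only = soup.html.find_all("title", recursive=False)
print(by_regex, by_list, by_lambda, with_any_id, direct_only)
```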

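The string argument

The string argument searches the text content of the document. Like name, string (called text in older versions) accepts a string, a regular expression, a list, a custom function, or True.

soup.find_all(string="Elsie")  # string

soup.find_all(string=["Tillie", "Elsie", "Lacie"])  # list

soup.find_all(string=re.compile("Dormouse"))  # regular expression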

def is_the_only_string_within_a_tag(s):  # function
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)

soup.find_all(string=is_the_only_string_within_a_tag)
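To see what the string filters return, here they are run against a tiny document. The markup is assumed for illustration; the function is the one from the notes. Note that the results are strings, not tags.

```python
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Elsie</p><p>Lacie and <b>Tillie</b></p>", "html.parser")


def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return s == s.parent.string


exact = soup.find_all(string="Elsie")
from_list = soup.find_all(string=["Tillie", "Elsie"])
by_regex = soup.find_all(string=re.compile("Lacie"))
only_children = soup.find_all(string=is_the_only_string_within_a_tag)
print(exact, from_list, by_regex, only_children)
```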

For more detail on working with the Beautiful Soup tree, see the official documentation.

There are also methods similar to find_all:

find(name, attrs, recursive, string, **kwargs)

Difference from find_all: it returns only the first matching node, or None if nothing matches.

find_parents() and find_parent()

find_next_siblings() and find_next_sibling()

find_previous_siblings() and find_previous_sibling()

find_all_next() and find_next()

find_all_previous() and find_previous()
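A brief sketch of the find family on a made-up document. The plural methods return lists, and the singular ones return the first match or None.

```python
from bs4 import BeautifulSoup

html = '<div id="box"><p id="a">A</p><p id="b">B</p><span>S</span><p id="c">C</p></div>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("p")               # first match only
nothing = soup.find("table")         # no match gives None, not an empty list

parent_id = first.find_parent("div")["id"]
next_p = first.find_next_sibling("p")["id"]
later_ps = [t["id"] for t in first.find_next_siblings("p")]
prev_p = soup.span.find_previous_sibling("p")["id"]
next_span = first.find_next("span").string
following_ps = [t["id"] for t in first.find_all_next("p")]
print(first["id"], nothing, parent_id, next_p, later_ps, prev_p, next_span, following_ps)
```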

Besides the find methods, there are CSS selectors.

CSS selectors: soup.select()

Element lookup: by tag name soup.select('a'), by class soup.select('.sister'), by ID soup.select('#link1') or soup.select("#link1,#link2"). Combined lookups: soup.select('p #link1'), the direct-child selector soup.select("head > title"), all descendants soup.select("body a"), and positional filters soup.select("p:nth-of-type(3)") and soup.select("p > a:nth-of-type(2)").

Attribute lookup: soup.select('a[class="sister"]') and soup.select('a[href*=".com/el"]'). Sibling lookup: all following siblings soup.select("#link1 ~ .sister"), and the immediately following sibling soup.select("#link1 + .sister").

The first matching element: soup.select_one(".sister")
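The selectors above, run against a three-link document. The markup is an assumption modelled on the "sister" examples; select() relies on the soupsieve package, which current bs4 releases install as a dependency.

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>T</title></head><body><p class='story'>"
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
    '<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>'
    "</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")


def ids(tags):
    """Helper for this sketch only: collect the id attribute of each tag."""
    return [tag["id"] for tag in tags]


by_class = ids(soup.select(".sister"))
by_id_group = ids(soup.select("#link1,#link2"))
direct_child = [t.string for t in soup.select("head > title")]
second_link = ids(soup.select("p > a:nth-of-type(2)"))
later_siblings = ids(soup.select("#link1 ~ .sister"))
adjacent_sibling = ids(soup.select("#link1 + .sister"))
href_contains = ids(soup.select('a[href*=".com/el"]'))
first_match = soup.select_one(".sister")["id"]
print(by_class, second_link, later_siblings, adjacent_sibling, href_contains, first_match)
```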

E. Modifying the DOM tree:

Change a tag's name and attributes: tag.name = "blockquote"; tag['class'] = 'verybold'

Change a tag's content with .string: tag = soup.a; tag.string = "New link text."

Add content to a node: use append(), build a NavigableString object, or build a comment with new_string.

append(): soup.a.append("Bar"). Inspect the result with soup.a.contents; the tag changes from <a>Foo</a> to <a>FooBar</a>.

NavigableString: new_string = NavigableString(" there"); tag.append(new_string)

A comment built with new_string: from bs4 import Comment; new_comment = soup.new_string("Nice to see you.", Comment); tag.append(new_comment)

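Add a child node: new_tag(). The first argument is the name of the new tag and is required; the other arguments are optional.

soup = BeautifulSoup("<b></b>", 'lxml'); original_tag = soup.b

new_tag = soup.new_tag("a"); original_tag.append(new_tag)

# original_tag is now <b><a></a></b>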
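The three ways of adding content, sketched on a made-up fragment. The tag names and strings are assumptions, apart from the "Foo"/"Bar" and comment text taken from the notes.

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("<a>Foo</a><b></b>", "html.parser")

# append() a plain string: it becomes a second text node inside <a>.
soup.a.append("Bar")
after_append = str(soup.a)
contents_after_append = list(soup.a.contents)

# new_string() with Comment builds a comment node.
soup.a.append(soup.new_string("Nice to see you.", Comment))
after_comment = str(soup.a)

# new_tag() builds a tag that belongs to this soup; attach it with append().
new_tag = soup.new_tag("i")
new_tag.string = "x"
soup.b.append(new_tag)
after_new_tag = str(soup.b)
print(after_append, after_comment, after_new_tag)
```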

Insert a child node or content: Tag.insert(position index, node or content), Tag.insert_before(node or content), and Tag.insert_after(node or content)

tag = soup.p

insert(): tag.insert(1, " but did not ") changes "haha<a href='www.baidu.com'>i like fish</a>" into "haha but did not <a href='www.baidu.com'>i like fish</a>"

insert_before(): for example tag = soup.new_tag("i"); tag.string = "Don't"; soup.b.string.insert_before(tag)

insert_after(): for example soup.b.i.insert_after(soup.new_string(" ever "))
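A runnable version of the three insert calls. The surrounding &lt;p&gt; and &lt;b&gt; markup is my assumption, since the notes only show fragments.

```python
from bs4 import BeautifulSoup

# insert(): position 1 is between the string "haha" and the <a> tag.
soup = BeautifulSoup("<p>haha<a href='www.baidu.com'>i like fish</a></p>", "html.parser")
soup.p.insert(1, " but did not ")
after_insert = str(soup.p)

# insert_before(): put a new <i> tag right before the text inside <b>.
soup2 = BeautifulSoup("<b>stop</b>", "html.parser")
tag = soup2.new_tag("i")
tag.string = "Don't"
soup2.b.string.insert_before(tag)
after_before = str(soup2.b)

# insert_after(): put a new string right after that <i> tag.
soup2.b.i.insert_after(soup2.new_string(" ever "))
after_after = str(soup2.b)
print(after_insert, after_before, after_after)
```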

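Remove a node's contents: tag = soup.a; tag.clear()

Remove a node and get it back: tag = soup.i.extract(). The returned tag is detached from the soup tree and is now the root of its own small tree.

Remove a tag and destroy it completely, without returning anything: soup.i.decompose()
Replace a node with a new node or content: a_tag = soup.a; new_tag = soup.new_tag("b"); new_tag.string = "example.net"; a_tag.i.replace_with(new_tag)

Wrap a node in a new tag such as div: soup.p.wrap(soup.new_tag("div"))

Strip a tag such as a while keeping its contents: tag.a.unwrap()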
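The removal and replacement calls side by side, each on a fresh copy of a small assumed fragment so the effects do not interfere with one another.

```python
from bs4 import BeautifulSoup

SOURCE = "<a>I linked to <i>example.com</i></a>"   # hypothetical fragment


def fresh():
    """Helper for this sketch only: parse a new copy of SOURCE."""
    return BeautifulSoup(SOURCE, "html.parser")


s = fresh()
s.a.clear()                       # drop the contents, keep the tag
cleared = str(s.a)

s = fresh()
removed = s.i.extract()           # detach <i> and hand it back
after_extract = str(s.a)

s = fresh()
s.i.decompose()                   # detach <i> and destroy it
after_decompose = str(s.a)

s = fresh()
new_tag = s.new_tag("b")
new_tag.string = "example.net"
s.a.i.replace_with(new_tag)       # swap <i> for the new <b>
after_replace = str(s.a)

s = BeautifulSoup("<p>I <a>like</a> fish</p>", "html.parser")
s.p.a.unwrap()                    # remove the <a> tag, keep its text
after_unwrap = str(s.p)
s.p.wrap(s.new_tag("div"))        # wrap <p> in a new <div>
after_wrap = str(s)
print(cleared, after_extract, after_decompose, after_replace, after_unwrap, after_wrap)
```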

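Printing the DOM tree or one of its nodes:

Formatted, one tag per line with indentation: soup.prettify(), soup.p.prettify()

Unformatted: str(soup) (unicode() under Python 2) returns the markup without the added line breaks and indentation.

HTML entities such as &ldquo; are converted to the corresponding Unicode characters: soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.", 'lxml'); str(soup)

Get the text, optionally joined by a separator: get_text(separator optional, strip switch optional). For example soup.get_text(), soup.get_text("|") which gives something like 'i like |fish', and soup.get_text("|", strip=True)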

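Check whether two nodes have the same markup: first_b == second_b

Check whether two variables refer to the very same node object: first_b is second_b
Check whether a node is of a known type with isinstance, which works like isinstance(1, int) for built-in types; for soup objects pass a bs4 class, as in isinstance(node, Tag).

Copy a node: p_copy = copy.copy(soup.p)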
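The output, comparison, and copy operations in one sketch. All markup here is assumed for illustration, except the "Dammit!" entity string from the notes.

```python
import copy

from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<p>I want <b>pizza</b> and more <b>pizza</b>!</p>", "html.parser")
first_b, second_b = soup.find_all("b")

same_markup = first_b == second_b     # equal: same name, attributes, contents
same_object = first_b is second_b     # but two distinct objects
is_tag = isinstance(first_b, Tag)

p_copy = copy.copy(soup.p)            # a detached copy, not part of soup

pretty = soup.prettify()              # formatted with line breaks
plain = str(soup)                     # markup on one line

text_soup = BeautifulSoup("<a>i like <i>fish</i></a>", "html.parser")
joined = text_soup.get_text("|")
joined_stripped = text_soup.get_text("|", strip=True)

entity_soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.", "html.parser")
converted = str(entity_soup)
print(same_markup, same_object, joined, joined_stripped, converted)
```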

F. Differences between the Python 3 and Python 2 versions of Beautiful Soup:

G. Soup and Requests

Here is more reliable code for the network connection.

Change this code

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
    bsObj = BeautifulSoup(html.read(), 'lxml')
    print(bsObj.h1)

into the following:

    from urllib.request import urlopen
    from urllib.error import HTTPError
    from bs4 import BeautifulSoup

    def getTitle(url):
        try:
            html = urlopen(url)
        except HTTPError as e:
            # The server answered with an error status such as 404 or 500.
            return None
        try:
            bsObj = BeautifulSoup(html.read(), 'lxml')
            title = bsObj.body.h1
        except AttributeError as e:
            # Raised when bsObj.body is None, so .h1 cannot be looked up.
            return None
        return title

    title = getTitle("http://www.pythonscraping.com/pages/page1.html")
    if title is None:
        print("Title could not be found")
    else:
        print(title)
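One gap in the code above is that HTTPError only covers error responses from a server; an unreachable host raises URLError instead. The following is a sketch of one way to cover both, not the original author's code. The function name get_title, the opener parameter, and the fake_* helpers are all hypothetical and exist only so the three code paths can be exercised without a network connection.

```python
import io
from urllib.error import URLError
from urllib.request import urlopen

from bs4 import BeautifulSoup


def get_title(url, opener=urlopen):
    """Return the first <h1> inside <body>, or None if it cannot be fetched or found."""
    try:
        response = opener(url)
    except URLError:
        # HTTPError subclasses URLError, so this also covers 404/500 responses.
        return None
    try:
        soup = BeautifulSoup(response.read(), "html.parser")
        return soup.body.h1
    except AttributeError:
        # soup.body was None, so .h1 could not be looked up.
        return None


# Offline stand-ins for urlopen (hypothetical, for demonstration only).
def fake_ok(url):
    return io.BytesIO(b"<html><body><h1>An Interesting Title</h1></body></html>")


def fake_no_body(url):
    return io.BytesIO(b"just text")


def fake_down(url):
    raise URLError("server not reachable")


ok_title = get_title("http://example.invalid/", opener=fake_ok)
no_body_title = get_title("http://example.invalid/", opener=fake_no_body)
down_title = get_title("http://example.invalid/", opener=fake_down)
print(ok_title, no_body_title, down_title)
```

In real use, call get_title(url) without the opener argument so that urlopen performs the request.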

In the end, though, I went with XPath.

