
Python Web Scraping: Using Beautiful Soup

Study notes based on Cui Qingcai's personal blog, Jingmi (靜覓).
Original article: http://cuiqingcai.com/1319.html

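0. Introduction to Beautiful Soup and Environment Setup

Beautiful Soup is a Python library whose main job is extracting data from web pages. It parses HTML or XML that has already been downloaded, so it is a natural fit for the parsing half of a crawler.

Download: https://pypi.python.org/pypi/beautifulsoup4/4.3.2

After downloading, extract the archive to disk, open a command line, change into the extracted folder, and run python setup.py install. Alternatively, pip install beautifulsoup4 installs the latest release straight from PyPI.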
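A quick way to confirm the installation worked is to import the package and look at its version. This is only a minimal sketch; the variable name version is chosen here for illustration.

```python
import bs4

# bs4 exposes its release number as a plain string
version = bs4.__version__
print("Beautiful Soup version:", version)
```

If the import fails with ModuleNotFoundError, the install most likely went into a different Python interpreter than the one running the script.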

Beautiful Soup documentation (Chinese):

http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

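1. Creating a Beautiful Soup Object

Import the library: from bs4 import BeautifulSoup

Then create a BeautifulSoup object with soup = BeautifulSoup(html). The argument is either the text of a web page or a file-like object, so the results of open() and urlopen() both work. It is also best to name the parser explicitly as a second argument, for example 'lxml' (which requires the lxml package to be installed); without it, Beautiful Soup picks a parser on its own and emits a warning. In Python 3, where urlopen lives in urllib.request, the instantiation looks like this: soup = BeautifulSoup(urllib.request.urlopen('http://www.baidu.com').read(), 'lxml')

Next, print the contents of the soup object: print(soup.prettify())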
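To make the "text or file-like object" point concrete, here is a small sketch that builds the same soup both ways. The markup string is made up for illustration, io.StringIO stands in for open() or urlopen(), and Python's built-in 'html.parser' is used instead of 'lxml' so that nothing beyond bs4 needs to be installed.

```python
import io
from bs4 import BeautifulSoup

# Illustrative markup, not taken from a real page
markup = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# From a plain string
soup_from_text = BeautifulSoup(markup, "html.parser")

# From a file-like object; BeautifulSoup calls .read() on it internally
soup_from_file = BeautifulSoup(io.StringIO(markup), "html.parser")

print(soup_from_text.title.string)
print(soup_from_file.p.string)
```

Both objects end up holding the same tree, so which form to use mostly depends on where the HTML comes from.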

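2. The Four Kinds of Objects

Beautiful Soup converts an HTML document into a complex tree structure in which every node is a Python object. There are four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.

Let's get a feel for them with an example: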

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a  class="sister" id="link1"><!-- Elsie --></a>,
<a  class="sister" id="link2">Lacie</a> and
<a  class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
# The <title> tag, including its contents
print(soup.title)
# The <head> tag, including its contents
print(soup.head)
# soup.<tag name> returns the first tag with that name
# A Tag object has two key attributes: name and attrs
print(soup.head.name)
# Output: head
print(soup.p.attrs)
# Output is a dict: {'class': ['title'], 'name': 'dromouse'}
# Get a single attribute
print(soup.p['class'])
print(soup.p.get('class'))
# Modify an attribute
soup.p['class'] = 'newClass'
# Delete an attribute
del soup.p['class']
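The two ways of reading an attribute shown above differ when the attribute is missing, and class behaves differently from most other attributes. The following sketch uses a one-line made-up document (and 'html.parser' as an assumption, to avoid extra dependencies) to show both points.

```python
from bs4 import BeautifulSoup

# Illustrative markup with a multi-valued class attribute
soup = BeautifulSoup('<p class="title big" id="intro">Hi</p>', "html.parser")
p = soup.p

classes = p["class"]             # class is multi-valued, so this is a list
ident = p["id"]                  # id is single-valued, so this is a string
missing = p.get("style")         # get() returns None for a missing attribute
has_style = p.has_attr("style")  # explicit membership check

print(classes, ident, missing, has_style)
```

Indexing with p["style"] would raise KeyError here, so get() is the safer choice when an attribute may not be present.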

NavigableString

Use soup.p.string to get the text inside a tag:

print(soup.p.string)
# Output: The Dormouse's story
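One thing worth knowing about .string is that it only works when the tag has a single string inside it (directly, or through a single nested child). A short sketch with made-up markup, using 'html.parser' as an assumption:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>Bold</b> and plain</p>", "html.parser")

single = soup.b.string        # <b> contains exactly one string
mixed = soup.p.string         # <p> has two children, so .string is None
all_text = soup.p.get_text()  # joins every string found under <p>

print(repr(single), repr(mixed), repr(all_text))
```

When a tag mixes text and child tags, get_text() is usually the more convenient way to pull out everything at once.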

BeautifulSoup

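This object represents the entire contents of a document. In most situations it can be treated as a Tag object; it is simply a special Tag. For example:

print(type(soup.name))
# Output: <class 'str'> (Python 2 printed <type 'unicode'>)
print(soup.name)
# Output: [document]
print(soup.attrs)
# Output: {} (an empty dict)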
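The "special Tag" idea can be checked directly, since the BeautifulSoup class inherits from Tag. A minimal sketch, again with made-up markup and 'html.parser' as an assumption:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p>Hi</p>", "html.parser")

is_tag = isinstance(soup, Tag)  # the document object is itself a Tag
doc_name = soup.name            # special name used for the document root
doc_attrs = soup.attrs          # the root carries no attributes
first_p = soup.find("p")        # Tag methods such as find() work on it

print(is_tag, doc_name, doc_attrs, first_p)
```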

Comment

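A Comment object is a special type of NavigableString. Printing soup.a.string leaves out the comment markers, so the output looks like ordinary text. Before printing, check whether the string is a bs4.element.Comment and only then decide what to do with it.

from bs4.element import Comment
if isinstance(soup.a.string, Comment):  # the string came from an HTML comment
    print(soup.a.string)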
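The same check can be applied across a whole document. The sketch below uses a trimmed-down, made-up version of the sample markup and 'html.parser' as an assumption. It labels each link as comment or text, then collects every comment with find_all and its string argument (older Beautiful Soup releases call that argument text).

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(markup, "html.parser")

# Label each link by the kind of string it holds
kinds = {}
for link in soup.find_all("a"):
    kinds[link["id"]] = "comment" if isinstance(link.string, Comment) else "text"

# Collect every comment node in the document
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

print(kinds)
print(comments)
```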

3. Traversing the Document Tree

The .contents, .children, and .descendants attributes

Again, using the example above:

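# .contents returns a tag's direct children as a list
print(soup.head.contents)
# Each direct child, with everything nested inside it, is one list item
# For soup.html.contents the list holds the head and body tags
# (plus any whitespace text between them)
# Items can be picked out by list index
print(soup.head.contents[0])

# .children returns an iterator over the same direct children
for child in soup.body.children:
    print(child)

# .descendants walks through every node below a tag, at any depth
for child in soup.descendants:
    print(child)
# Note how each tag is peeled apart layer by layer:
# the tag itself first, then its children, then their children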
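The difference between the three attributes is easier to see on a tiny tree. The sketch below uses made-up markup and 'html.parser' as an assumption. It lists direct children and then separates the descendants into tags and strings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<div><p>One <b>two</b></p><p>Three</p></div>", "html.parser")
div = soup.div

# Direct children only: the two <p> tags
direct = [child.name for child in div.children]

# All nested tags, in document order
nested_tags = [node.name for node in div.descendants if isinstance(node, Tag)]

# All nested strings, in document order
nested_strings = [str(node) for node in div.descendants if not isinstance(node, Tag)]

print(len(div.contents), direct)
print(nested_tags)
print(nested_strings)
```

Here .contents and .children stop at the two paragraphs, while .descendants also reaches the <b> tag and every string inside them.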

To be continued.
