Python Modules (Packages): urllib

urllib is a package that collects several modules for working with URLs:

urllib.request for opening and reading URLs
urllib.error containing the exceptions raised by urllib.request
urllib.parse for parsing URLs
urllib.robotparser for parsing robots.txt files

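Roughly speaking, urllib is a package for requesting network resources, made up of four modules: request, error, parse, and robotparser. The request module issues requests for network resources; the error module defines the exceptions raised while a request is being made; the parse module processes URL strings; the robotparser module parses robots.txt files (not explored here).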

1. request

The urllib.request module defines functions and classes which help in opening URLs (mostly HTTP) in a complex world — basic and digest authentication, redirections, cookies and more.

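1.1 Functions

urllib.request.urlopen(url, data=None, [timeout, ], cafile=None, capath=None, cadefault=False, context=None):

url: either a URL string or a Request object (described below).
data: the data to send to the server; for HTTP requests it is typically a bytes object, and supplying it turns the request into a POST.
cafile, capath: a set of trusted CA certificates for HTTPS requests. cafile should point to a single file containing a bundle of CA certificates, while capath should point to a directory of hashed certificate files. Both are deprecated in favor of context and were removed in Python 3.13.
context: if specified, it must be an ssl.SSLContext object.
timeout: the timeout for the request, in seconds.

In practice, url, data, and timeout are the parameters used most often.

The function returns an object that works as a context manager and offers the following methods for inspecting the result:

geturl(): returns the URL of the resource retrieved, commonly used to determine whether a redirect was followed.
info(): returns the meta-information of the page, such as headers, in the form of an email.message_from_string() instance.
getcode(): returns the HTTP status code of the response.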
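The calls above can be sketched as follows. To stay runnable without network access, this sketch assumes a made-up `data:` URL (which urllib.request handles through its built-in DataHandler) instead of a real HTTP address.

```python
from urllib.request import urlopen

# Hypothetical data: URL so that the example needs no network access.
url = "data:text/plain;charset=utf-8,hello%20urllib"

# The returned object is a context manager, so `with` closes it for us.
with urlopen(url, timeout=5) as resp:
    final_url = resp.geturl()               # URL of the resource actually retrieved
    content_type = resp.info().get_content_type()  # metadata behaves like email headers
    code = resp.getcode()                   # HTTP status code; None for non-HTTP schemes
    body = resp.read().decode("utf-8")      # bytes -> str

print(final_url, content_type, code, body)
```

With a real http:// or https:// URL, getcode() would return the numeric status code instead of None.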

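For HTTP and HTTPS URLs, the returned object is a slightly modified http.client.HTTPResponse, so in addition to the methods above it offers everything that class provides; see the official documentation below for details.

HTTPResponse Objects official documentation

Commonly used members of HTTPResponse objects:

read(): reads the response body. The data is returned as bytes and has to be decoded with decode(), using the response's encoding, to obtain a str.
msg: on a plain http.client.HTTPResponse this is an http.client.HTTPMessage instance holding the response headers. On the object returned by urlopen() it holds the reason phrase instead; use the headers attribute for the response headers.
status: the status code returned by the server.
reason: the reason phrase returned by the server.
closed: True once the stream has been closed.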
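A minimal sketch of these members in use. It assumes a throwaway local HTTP server (the `HelloHandler` class and the `fetch_once` helper are hypothetical names invented for this example) so that nothing outside the machine is contacted, and it uses an empty ProxyHandler so that proxy settings from the environment are ignored.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical test handler: answers every GET with a short UTF-8 body."""

    def do_GET(self):
        body = "你好, urllib".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging


def fetch_once():
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = build_opener(ProxyHandler({}))
    url = "http://127.0.0.1:%d/" % server.server_port
    try:
        with opener.open(url, timeout=5) as resp:
            raw = resp.read()  # bytes
            charset = resp.headers.get_content_charset() or "utf-8"
            result = {
                "status": resp.status,
                "reason": resp.reason,
                "content_type": resp.headers.get_content_type(),
                "text": raw.decode(charset),
            }
        result["closed"] = resp.closed  # checked after the with-block has exited
        return result
    finally:
        server.shutdown()
        server.server_close()


result = fetch_once()
print(result)
```

Reading the charset from the response headers rather than hard-coding it keeps the decode step correct when a server answers in a different encoding.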

Looking into request.py in the urllib source, the open method of the OpenerDirector class contains these lines:

if isinstance(fullurl, str):
    req = Request(fullurl, data)
else:
    req = fullurl
    if data is not None:
        req.data = data

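The description of urllib.request.urlopen() says that the resource to request can be either a URL string or a Request object. In the code above, isinstance(fullurl, str) checks whether fullurl is a str; if it is, fullurl (together with data) is wrapped in a Request instance.

So a URL string and a Request object lead to the same place: the request is always carried out with a Request instance. The Request class is described in more detail in the section on classes below.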
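The equivalence can be checked with a short sketch; the `data:` URL is a made-up value chosen so that the example runs offline.

```python
from urllib.request import Request, urlopen

url = "data:text/plain;charset=utf-8,same%20result"  # hypothetical offline URL

# Pass the URL as a plain string...
with urlopen(url) as resp:
    from_string = resp.read()

# ...or wrap it in a Request first; both take the same path internally.
with urlopen(Request(url)) as resp:
    from_request = resp.read()

print(from_string == from_request)
```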

The source of urlopen() with its comments stripped is attached below.

def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
            *, cafile=None, capath=None, cadefault=False, context=None):
    global _opener
    if cafile or capath or cadefault:
        import warnings
        warnings.warn("cafile, cpath and cadefault are deprecated, use a "
                      "custom context instead.", DeprecationWarning, 2)
        if context is not None:
            raise ValueError(
                "You can't pass both context and any of cafile, capath, and "
                "cadefault"
            )
        if not _have_ssl:
            raise ValueError('SSL support not available')
        context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH,
                                             cafile=cafile,
                                             capath=capath)
        https_handler = HTTPSHandler(context=context)
        opener = build_opener(https_handler)
    elif context:
        https_handler = HTTPSHandler(context=context)
        opener = build_opener(https_handler)
    elif _opener is None:
        _opener = opener = build_opener()
    else:
        opener = _opener
    return opener.open(url, data, timeout)

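Note a few names here.

a. https_handler: call it a resource handler for now. It is a handler object for one kind of network resource, that is, an instance of a class such as HTTPHandler, HTTPSHandler, FileHandler, FTPHandler, or UnknownHandler.

b. opener = build_opener(https_handler): call it a resource key for now. It is an instance of the OpenerDirector class built from the handlers above, and this key can open any resource that one of its handlers supports.

The remaining functions of the request module follow.

urllib.request.build_opener([handler, ...]): builds the resource key, an OpenerDirector instance, from the handlers passed in plus the default handlers.

handler: instances (or classes) of HTTPHandler, HTTPSHandler, FileHandler, FTPHandler, UnknownHandler, and so on.

urllib.request.install_opener(opener): installs the resource key, that is, makes an OpenerDirector instance the default global opener used by urlopen().

opener: usually the OpenerDirector instance returned by build_opener([handler, ...]).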
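A small sketch of both functions, assuming a made-up User-Agent string and offline `data:` URLs so that no network access is needed.

```python
import urllib.request
from urllib.request import HTTPHandler, OpenerDirector, build_opener, install_opener

# build_opener() adds the default handlers; handlers passed in are added too,
# replacing defaults of the same class.
opener = build_opener(HTTPHandler(debuglevel=0))

# Headers listed here are sent with every request made through this opener.
opener.addheaders = [("User-Agent", "urllib-notes-example/0.1")]  # made-up value

is_director = isinstance(opener, OpenerDirector)

# Use the opener directly...
with opener.open("data:text/plain,direct") as resp:
    direct = resp.read()

# ...or install it so that plain urlopen() calls go through it from now on.
install_opener(opener)
with urllib.request.urlopen("data:text/plain,installed") as resp:
    installed = resp.read()

print(is_director, direct, installed)
```

Calling opener.open() keeps the customization local, while install_opener() changes the behavior of every later urlopen() call in the process, so the first form is usually the safer default.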

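1.2 Commonly used classes in request

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None): an abstraction of a URL request.

url: the URL string of the resource.
data: the data sent with the request, commonly POST form data.
headers: a dictionary of request headers, such as the usual HTTP headers.
method: the request method, such as GET, POST, or PUT.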
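A sketch of how these parameters fit together. The endpoint, form fields, and User-Agent value are hypothetical, and the requests are only constructed and inspected, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and form fields; nothing is sent in this sketch.
form = urlencode({"user": "alice", "lang": "python"}).encode("utf-8")
req = Request(
    "http://www.example.com/login",
    data=form,  # form data encoded to bytes
    headers={"User-Agent": "urllib-notes-example/0.1"},
)

method_with_data = req.get_method()      # data present, so the default is POST
explicit = Request("http://www.example.com/item/1", method="PUT")
method_explicit = explicit.get_method()  # an explicit method wins
plain = Request("http://www.example.com/")
method_plain = plain.get_method()        # no data, so the default is GET

# Header names are stored capitalized, hence "User-agent".
user_agent = req.get_header("User-agent")

print(req.full_url, req.data, user_agent)
print(method_with_data, method_explicit, method_plain)
```

A Request built this way can be passed to urlopen() or to an opener's open() method in place of the URL string.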

urllib.request.HTTPCookieProcessor(cookiejar=None): handles HTTP cookies.

cookiejar: usually an http.cookiejar.CookieJar instance in which the cookies are stored (the module was named cookielib in Python 2).
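A minimal sketch of cookie handling. It assumes a throwaway local server (the `CookieEchoHandler` class and `two_requests` helper are hypothetical names for this example) that sets a cookie and echoes back whatever Cookie header it received; an empty ProxyHandler keeps environment proxy settings out of the way.

```python
import threading
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import HTTPCookieProcessor, ProxyHandler, build_opener


class CookieEchoHandler(BaseHTTPRequestHandler):
    """Hypothetical test handler: sets a cookie and echoes the one it received."""

    def do_GET(self):
        body = (self.headers.get("Cookie") or "no cookie").encode("utf-8")
        self.send_response(200)
        self.send_header("Set-Cookie", "session=abc123; Path=/")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging


def two_requests():
    server = HTTPServer(("127.0.0.1", 0), CookieEchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    jar = CookieJar()
    # The cookie processor stores cookies in `jar` and replays them later.
    opener = build_opener(ProxyHandler({}), HTTPCookieProcessor(jar))
    url = "http://127.0.0.1:%d/" % server.server_port
    try:
        with opener.open(url, timeout=5) as resp:
            first = resp.read().decode("utf-8")   # no cookie sent yet
        with opener.open(url, timeout=5) as resp:
            second = resp.read().decode("utf-8")  # cookie from the first reply
    finally:
        server.shutdown()
        server.server_close()
    return first, second, [(c.name, c.value) for c in jar]


first, second, stored = two_requests()
print(first, second, stored)
```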

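urllib.request.ProxyHandler(proxies=None): sends requests through a proxy.

proxies: a dictionary mapping the scheme of the request to a proxy address, for example {'http': 'localhost:1080'} or {'https': '192.168.8.8:2365'}. urllib has no built-in SOCKS support, so a key such as 'sock5' does not route HTTP requests through a SOCKS proxy.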
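A sketch of building openers with and without a proxy. The proxy address is hypothetical and no request is actually sent, so the example only shows the wiring.

```python
from urllib.request import OpenerDirector, ProxyHandler, build_opener

# Hypothetical proxy address; keys are the schemes of the requests to proxy.
proxy_handler = ProxyHandler({
    "http": "http://192.168.8.8:2365",
    "https": "http://192.168.8.8:2365",
})
proxied_opener = build_opener(proxy_handler)

# An empty dict disables proxies, including any taken from the environment.
direct_opener = build_opener(ProxyHandler({}))

# proxied_opener.open("http://www.example.com/") would go through the proxy,
# direct_opener.open("http://www.example.com/") would connect directly.
print(type(proxied_opener).__name__, type(direct_opener).__name__)
```

When proxies is left as None, ProxyHandler reads the proxy settings from the environment, which is why the empty dict is the explicit way to opt out.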

urllib.request.FileHandler(): opens local files referenced by file:// URLs. It reads files; it is not a mechanism for uploading them.
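A short sketch of reading a local file through a file:// URL. The temporary file and its contents are made up for the example; FileHandler is among the default handlers, so plain urlopen() is enough.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlopen

# Write a small temporary file, then read it back through a file:// URL.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("read via FileHandler")
    tmp_path = Path(tmp.name)

try:
    file_url = tmp_path.resolve().as_uri()  # absolute path -> file:// URL
    with urlopen(file_url) as resp:
        file_text = resp.read().decode("utf-8")
finally:
    os.remove(tmp_path)  # clean up the temporary file

print(file_text)
```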

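These classes have methods of their own, most of which are not covered here; the official documentation for the methods of the Request class is linked below.

Request methods official documentation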

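2. error

This module defines the exceptions raised by requests made with urllib.request.

urllib.error.URLError: raised when a request fails, for example because the address cannot be reached; it is the base class of the module's exceptions. It has the following attribute:

reason: the reason for the error, either a message string or another exception instance.

urllib.error.HTTPError: a subclass of URLError, raised when the server answers with an HTTP error status. It has the following attributes:

code: the HTTP status code.
reason: the reason for the error.
headers: the response headers.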
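A sketch of catching both exceptions. It assumes a throwaway local server that always answers 404 (the `NotFoundHandler` class and the `describe` helper are hypothetical names for this example) and a made-up URL scheme that no handler supports.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError, URLError
from urllib.request import ProxyHandler, build_opener


class NotFoundHandler(BaseHTTPRequestHandler):
    """Hypothetical test handler: every GET is answered with 404."""

    def do_GET(self):
        self.send_error(404)

    def log_message(self, format, *args):
        pass  # silence per-request logging


def describe(opener, url):
    """Open url and summarize the outcome as a short string."""
    try:
        with opener.open(url, timeout=5) as resp:
            return "ok %d" % resp.status
    except HTTPError as exc:  # must come first: HTTPError subclasses URLError
        return "HTTPError %d %s" % (exc.code, exc.reason)
    except URLError as exc:
        return "URLError %s" % exc.reason


server = HTTPServer(("127.0.0.1", 0), NotFoundHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
opener = build_opener(ProxyHandler({}))  # ignore proxies from the environment
try:
    http_result = describe(opener, "http://127.0.0.1:%d/missing" % server.server_port)
finally:
    server.shutdown()
    server.server_close()

# A scheme with no handler never reaches a server and raises a plain URLError.
url_result = describe(opener, "nosuchscheme://example")

print(http_result)
print(url_result)
```

The order of the except clauses matters: with URLError listed first, the HTTPError branch would never run.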

3. parse

The urllib.parse module defines functions that fall into two broad categories: URL parsing and URL quoting. These are covered in detail in the following sections.

This module provides a standard interface for working with URLs, in two groups: parsing and quoting.

3.1 URL Parsing

The URL parsing functions focus on splitting a URL string into its components, or on combining URL components into a URL string.

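urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True): parses a URL.

A URL generally has the form scheme://netloc/path;parameters?query#fragment; see http.md for a detailed description. The return value is a named tuple of 6 elements.

urlstring: the URL string to parse.
scheme: the default scheme, used only when the URL does not specify one.
allow_fragments: when False, fragment identifiers are not recognized and are treated as part of the path, parameters, or query instead.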
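A short sketch of the three parameters, using a made-up URL that contains every component.

```python
from urllib.parse import urlparse

# Hypothetical URL containing all six components.
url = "http://www.example.com:8080/docs/page;type=a?q=python&lang=en#intro"

parts = urlparse(url)
six = tuple(parts)  # (scheme, netloc, path, params, query, fragment)

# The result is a named tuple with a few extra convenience attributes.
host, port = parts.hostname, parts.port

# With allow_fragments=False the "#intro" part stays attached to the query.
no_frag = urlparse(url, allow_fragments=False)

# The default scheme only applies when the URL itself has none.
defaulted = urlparse("//www.example.com/docs", scheme="https")

print(six)
print(host, port)
print(no_frag.query, repr(no_frag.fragment))
print(defaulted.scheme, defaulted.netloc, defaulted.path)
```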

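urllib.parse.parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace'): parses a query string into a dictionary that maps each name to a list of values.

qs: the query string.
keep_blank_values: whether blank values in percent-encoded queries are kept as empty strings; by default they are dropped.
strict_parsing: what to do with parsing errors; False (the default) ignores them silently, True raises ValueError.
encoding, errors: optional parameters that specify how percent-encoded sequences are decoded into Unicode characters.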
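A sketch of how the flags change the result, using a made-up query string with a repeated name, a blank value, and a percent-encoded value.

```python
from urllib.parse import parse_qs

# Hypothetical query string: repeated name, blank value, percent-encoded text.
qs = "q=python&lang=en&lang=zh&empty=&name=%E4%B8%AD%E6%96%87"

default = parse_qs(qs)                            # blank values are dropped
with_blanks = parse_qs(qs, keep_blank_values=True)

# A field without "=" is silently ignored by default...
lenient = parse_qs("q=python&novalue")

# ...but raises ValueError when strict_parsing is enabled.
try:
    parse_qs("q=python&novalue", strict_parsing=True)
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)

print(default)
print(with_blanks["empty"])
print(lenient)
print(strict_error)
```

Every value is a list because a name may appear more than once, as lang does here.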

3.2 URL Quoting

urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus): converts a mapping, or a sequence of two-element tuples, into a percent-encoded query string.
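A sketch of the most common variations; all keys and values are made up for illustration.

```python
from urllib.parse import quote, urlencode

# A mapping: spaces become "+" because quote_plus is the default quoting function.
basic = urlencode({"q": "python urllib", "page": 2})

# A sequence of two-element tuples allows repeated names.
pairs = urlencode([("tag", "a"), ("tag", "b")])

# doseq=True expands a sequence value into one name=value pair per element.
seq = urlencode({"tag": ["a", "b"]}, doseq=True)

# quote_via=quote encodes spaces as %20 instead of "+".
spaces = urlencode({"q": "python urllib"}, quote_via=quote)

# Non-ASCII text is encoded as UTF-8 and then percent-encoded.
unicode_value = urlencode({"name": "中文"})

# The result is a str; encode it to bytes before using it as POST data.
post_body = basic.encode("ascii")

print(basic, pairs, seq, spaces, unicode_value, post_body)
```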

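4. Closing notes

The concepts are rather abstract on their own; real understanding comes from practice. For a simple example, see threading_douban.py.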
