Learning Python Web Scraping (3): Crawling Random External Links

In the previous two chapters, everything we did was based on parsing the HTML structure of a single page. A real web crawler, however, follows one link to another, building up a "map" of the web, so this time we will crawl external links.
Example: http://oreilly.com

First, test whether we can retrieve an external link:

from urllib.parse import urlparse
import random
import datetime
import re

pages = set()
# Seed with a number; random.seed() rejects datetime objects on newer Pythons
random.seed(datetime.datetime.now().timestamp())

# Get all internal links found on a page
def getInternalLinks(bsObj, includeUrl):
    includeUrl = urlparse(includeUrl).scheme + "://" + urlparse(includeUrl).netloc
    internalLinks = []
    # Find links that begin with "/" or contain the site's own domain
    for link in bsObj.findAll("a", href=re.compile("^(/|.*" + includeUrl + ")")):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internalLinks:
                if link.attrs['href'].startswith("/"):
                    internalLinks.append(includeUrl + link.attrs['href'])
                else:
                    internalLinks.append(link.attrs['href'])
    return internalLinks
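The normalization step inside getInternalLinks (prepending the site root to relative hrefs) can be checked in isolation. A minimal sketch, where normalize is a hypothetical helper and the URLs are made-up examples, not crawled data:

```python
from urllib.parse import urlparse

def normalize(includeUrl, href):
    # Rebuild the site root from scheme + netloc, as getInternalLinks does
    root = urlparse(includeUrl).scheme + "://" + urlparse(includeUrl).netloc
    # Relative hrefs get the root prepended; absolute ones pass through unchanged
    return root + href if href.startswith("/") else href

print(normalize("https://en.wikipedia.org/wiki/Main_Page", "/wiki/Python"))
# https://en.wikipedia.org/wiki/Python
```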

def followExtrenalOnly(startingPage):
    # Hard-coded link for this test; the real version picks one at random
    externalLink = "https://en.wikipedia.org/wiki/Intelligence_agency"
    print("Random external link is " + externalLink)
    followExtrenalOnly(externalLink)

# def main():
#     followExtrenalOnly("http://en.wikipedia.org")
#     print('End')
#
# if __name__ == '__main__':
#     main()
followExtrenalOnly("http://en.wikipedia.org")

Console output: the recursion iterated through 56 external links in total.

[screenshot: console output]

A site's home page is not guaranteed to contain any external links; from the console output experiment in Chapter 2, we know that a page's HTML structure may contain no external links at all.
Compare the HTML structure of https://en.wikipedia.org/wiki/Main_Page with that of https://en.wikipedia.org/wiki/Auriscalpium_vulgare below:

[screenshots: HTML structure of the two pages]

The DFS logic for finding an external link on a page is as follows:
To collect a page's external links, we search recursively. When we encounter an external link, we treat it as reaching a leaf node. If none is found, we switch to one of the page's internal links, end the current recursion, and backtrack to continue the search from that page.
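That recursion can be sketched with a stubbed-out link source. SITE and randomExternal below are hypothetical stand-ins for the real page fetch, and none of the URLs are real data:

```python
import random

# Hypothetical link table standing in for fetched pages: each page maps to
# a pair (external links, internal links)
SITE = {
    "/main": ([], ["/a", "/b"]),             # home page: no external links
    "/a": (["http://other.example/x"], []),  # leaf: external link found
    "/b": (["http://other.example/y"], []),
}

def randomExternal(page):
    external, internal = SITE[page]
    if external:
        # External link found: a leaf node, the recursion stops here
        return random.choice(external)
    # None found: fall back to a random internal link and recurse
    return randomExternal(random.choice(internal))

print(randomExternal("/main"))
```

Starting from "/main", which has no external links, the search descends into "/a" or "/b" and returns the external link found there.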

from urllib.request import urlopen
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import random
import datetime
import re

pages = set()
# Seed with a number; random.seed() rejects datetime objects on newer Pythons
random.seed(datetime.datetime.now().timestamp())

# Get all internal links found on a page
def getInternalLinks(bsObj, includeUrl):
    includeUrl = urlparse(includeUrl).scheme + "://" + urlparse(includeUrl).netloc
    internalLinks = []
    # Find links that begin with "/" or contain the site's own domain
    for link in bsObj.findAll("a", href=re.compile("^(/|.*" + includeUrl + ")")):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internalLinks:
                if link.attrs['href'].startswith("/"):
                    internalLinks.append(includeUrl + link.attrs['href'])
                else:
                    internalLinks.append(link.attrs['href'])
    return internalLinks

# Get all external links found on a page
def getExtrenalLinks(bsObj, excludeurl):
    extrenalLinks = []
    # Find links that start with http or www and do not contain our own domain
    for link in bsObj.findAll("a", href=re.compile("^(http|www)((?!" + excludeurl + ").)*$")):
        if link.attrs['href'] is not None:
            # Skip links we have already collected
            if link.attrs['href'] not in extrenalLinks:
                extrenalLinks.append(link.attrs['href'])
    return extrenalLinks

def getRandomExtrnalLink(startingPage):
    html = urlopen(startingPage)
    bsObj = BeautifulSoup(html, "html.parser")
    extrenalLinks = getExtrenalLinks(bsObj, urlparse(startingPage).netloc)
    if len(extrenalLinks) == 0:
        print("No external links found")
        # Fall back to a random internal link and search again from there
        domain = urlparse(startingPage).scheme + "://" + urlparse(startingPage).netloc
        internalLinks = getInternalLinks(bsObj, domain)
        return getRandomExtrnalLink(internalLinks[random.randint(0, len(internalLinks) - 1)])
    else:
        return extrenalLinks[random.randint(0, len(extrenalLinks) - 1)]

def followExtrenalOnly(startingPage):
    externalLink = getRandomExtrnalLink(startingPage)
    # externalLink = "https://en.wikipedia.org/wiki/Intelligence_agency"
    print("Random external link is " + externalLink)
    followExtrenalOnly(externalLink)

# def main():
#     followExtrenalOnly("http://en.wikipedia.org")
#     print('End')
#
# if __name__ == '__main__':
#     main()
followExtrenalOnly("https://en.wikipedia.org/wiki/Main_Page")

Console output:

[screenshot: console output]
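The negative-lookahead pattern used in getExtrenalLinks can be tested on its own. A small check, where the hrefs are made-up examples:

```python
import re

excludeurl = "en.wikipedia.org"
# Same shape as the pattern in getExtrenalLinks: the href starts with http
# or www, and the rest of it never contains our own domain
pattern = re.compile("^(http|www)((?!" + excludeurl + ").)*$")

print(bool(pattern.search("http://oreilly.com")))                    # True: external
print(bool(pattern.search("https://en.wikipedia.org/wiki/Python")))  # False: our own domain
print(bool(pattern.search("/wiki/Python")))                          # False: relative internal link
```

Note that the dots in excludeurl are regex wildcards here; for a production crawler they should be escaped with re.escape().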

Tips: in the spirit of following random links, readers may also want to look at blockchain, currently a very popular topic:

A beginner-friendly introduction to blockchain: http://python.jobbole.com/88248/
Ruan Yifeng's introduction to blockchain: http://www.ruanyifeng.com/blog/2017/12/blockchain-tutorial.html
