Goal: if a scraping run is interrupted partway through by a network problem, the program should be designed so that the data it collects contains no duplicates.
Among Python's data structures, a set cannot contain duplicate elements, so a set is used here.
The code is as follows:
import requests
from bs4 import BeautifulSoup

url = 'http://bj.58.com/ershouche/pn2/'
L = []  # every href found on the page, duplicates included

web_data = requests.get(url)
soup = BeautifulSoup(web_data.text, 'lxml')
links = soup.select('td.t a.t')  # anchor tags of the individual listings
for link in links:
    real_link = link.get('href')
    L.append(real_link)

single_link = set(L)  # converting the list to a set drops duplicate links
# print(single_link)
The code above collects the links to the individual used-car listings on the page http://bj.58.com/ershouche/pn2/
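A set held in memory only removes duplicates within a single run. To keep a restarted run from saving links that were already saved, one option is to persist the links seen so far and check new ones against them. The following is a minimal sketch of that idea, not part of the original code: the helper names `load_seen` and `save_new_links`, the file name `seen_links.txt`, and the sample `/car/...` links are all assumptions made for illustration, and the network request is left out so the example runs offline.

```python
import os
import tempfile


def load_seen(path):
    """Return the set of links already saved in `path` (empty if no file yet)."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}


def save_new_links(path, links):
    """Append only links not yet stored in `path`; return the newly added ones."""
    seen = load_seen(path)
    new_links = []
    for link in links:
        if link and link not in seen:
            seen.add(link)          # also filters duplicates within this batch
            new_links.append(link)
    with open(path, 'a', encoding='utf-8') as f:
        for link in new_links:
            f.write(link + '\n')
    return new_links


# Simulate two runs with overlapping results, as would happen after a restart.
# The links are made-up placeholders, not real listing URLs.
demo_path = os.path.join(tempfile.mkdtemp(), 'seen_links.txt')
first_run = save_new_links(demo_path, ['/car/1', '/car/2', '/car/1'])
second_run = save_new_links(demo_path, ['/car/2', '/car/3'])
print(first_run)   # ['/car/1', '/car/2']
print(second_run)  # ['/car/3']
```

In the scraper above, the list `L` could be passed to `save_new_links` in place of the sample lists, so that only links not seen in earlier runs are written out.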