Date:2016-10-7
By:Black Crow
前言:
本次作業為第二周第二節、第三節的作業合并,爬取的是58的手機號。
因為作業分為兩部分:第一部分是爬取頁面里的URL,第二部分爬取單個頁面的詳情。
第三節的斷點續傳使用的是find_one(),先檢查數據庫里是否存在,如果存在跳過,不存在寫入。
作業效果:
手機urls.png
手機詳情.png
我的代碼:
20161007代碼PART1:爬取列表
from bs4 import BeautifulSoup
import requests,time
from pymongo import MongoClient
# Sample listing-page URL (page 2 of the phone-number category); kept for manual testing.
p = 'http://bj.58.com/shoujihao/pn2/'
# Local MongoDB: database 'tongcheng', collection 'mobile_pages' holds harvested ad URLs.
client = MongoClient('localhost',27017)
tongcheng = client['tongcheng']
mobile_pages = tongcheng['mobile_pages']
def counter(i=[0]):
    """Return the next value of a process-wide counter: 1, 2, 3, ...

    The mutable default list is used deliberately as persistent state
    across calls.  Unlike the original append-based version (which grew
    the list by one element per call, forever), this increments the
    single stored value in place, keeping memory constant while
    returning the exact same sequence.
    """
    i[0] += 1
    return i[0]
def get_shouji_urls(page_url):
    """Scrape one 58.com listing page and store each phone-number ad.

    For every ad anchor on *page_url*, the phone-number text and the
    ad's detail-page URL (query string stripped) are inserted into the
    ``mobile_pages`` collection.  Redirect/promotion links — those with
    a ``jump`` label in the host part of the URL — are skipped.
    Prints a running count of inserted records.
    """
    # timeout keeps one stalled request from hanging the whole crawl
    wb_data = requests.get(page_url, timeout=30)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    phone_numbers = soup.select('a.t > strong')
    phone_urls = soup.select('a.t')
    for phone_number, phone_url in zip(phone_numbers, phone_urls):
        data = {
            'phone_number': phone_number.get_text(),
            'phone_url': phone_url.get('href').split('?')[0],
        }
        # Guard clause replaces the original `if ...: pass / else:` shape.
        # 'jump' among the dot-separated tokens after '//' marks 58's
        # redirect links (same membership test, without the redundant list()).
        if 'jump' in data['phone_url'].split('//')[1].split('.'):
            continue
        mobile_pages.insert_one(data)
        print(counter())
def page_get():
    """Walk listing pages pn0..pn199 and harvest ads from non-empty ones.

    The '#infocont > span > b' element carries the page's ad count;
    pages reporting '0' are skipped.  Sleeps one second between pages
    to stay polite to the server.

    NOTE(review): 58 listing pages normally start at pn1 — pn0 likely
    duplicates pn1; confirm before relying on exact record counts.
    """
    for page_number in range(0, 200):
        page = 'http://bj.58.com/shoujihao/pn{}/'.format(page_number)
        # timeout prevents a single dead page from stalling the loop
        wb_data = requests.get(page, timeout=30)
        soup = BeautifulSoup(wb_data.text, 'lxml')
        for page_check in soup.select('#infocont > span > b'):
            # Guard clause instead of the original `if ...'0': pass / else:`
            if page_check.get_text() != '0':
                get_shouji_urls(page)
        time.sleep(1)
# Guard the crawl kickoff so importing this module for its helpers
# does not immediately start hitting the network.
if __name__ == '__main__':
    page_get()
##### 2016-10-07 代碼 PART 2:爬取詳情
from bs4 import BeautifulSoup
import requests,time
from pymongo import MongoClient
# Local MongoDB: reads URLs from 'mobile_pages' (filled by PART 1),
# writes scraped ad details into 'mobile_info1'.
client = MongoClient('localhost',27017)
tongcheng = client['tongcheng']
mobile_info1 = tongcheng['mobile_info1']
mobile_pages = tongcheng['mobile_pages']
# Sample detail-page URL, kept for manual single-page testing:
# path= 'http://bj.58.com/shoujihao/27614539752242x.shtml'
def counter(i=[0]):
    """Return the next value of a process-wide counter: 1, 2, 3, ...

    The mutable default list is used deliberately as persistent state
    across calls.  Unlike the original append-based version (which grew
    the list by one element per call, forever), this increments the
    single stored value in place, keeping memory constant while
    returning the exact same sequence.
    """
    i[0] += 1
    return i[0]
def get_shouji_info(url):
    """Scrape one ad detail page and insert it into ``mobile_info1``.

    Extracts the title (h1), price (span.price) and posting date
    (li.time).  The page *url* is stored with the record and doubles as
    the de-duplication key, so re-running the crawler resumes where it
    stopped instead of writing duplicates.
    """
    # timeout keeps one dead detail page from hanging the crawl
    wb_data = requests.get(url, timeout=30)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    titles = soup.select('h1')
    prices = soup.select('span.price')
    ymds = soup.select('li.time')
    for title, price, ymd in zip(titles, prices, ymds):
        data = {
            'title': title.get_text().strip(),
            'price': price.get_text().strip(),
            'ymd': ymd.get_text(),
            'url': url,
        }
        # Resume-friendly de-dup: skip records whose URL is already stored.
        # NOTE(review): a unique index on 'url' plus update_one(upsert=True)
        # would avoid this find-then-insert race and query faster.
        if mobile_info1.find_one({'url': data['url']}):
            print('already exists')  # fixed typo: was 'already exsist'
        else:
            mobile_info1.insert_one(data)
            print(counter())
        time.sleep(1)
# Manual single-page test: get_shouji_info(path)
# Guard the crawl so importing this module does not start scraping.
if __name__ == '__main__':
    # Replay every URL harvested by PART 1 through the detail scraper.
    for item in mobile_pages.find():
        get_shouji_info(item['phone_url'])
總結:
- pool()函數尚未添加進去,速度有點慢;
- find_one()的效率如何?尚未測算。
- 爬取的結果中有空值,還需要檢查問題在哪。