視頻重點
練習代碼
擴展
1. 視頻重點
- 操作與客戶端(本地的mongodb)進行鏈接(激活); 給客服打同電話
client = pymongo.MongoClint('localhost',27017)
創建數據庫命名 = 給excel命名;告訴客服要定一個酒店hotel = client['hotel']
給數據庫的文件創建表單(頁) = 給Excel創建sheet;而且定具體的房間room = hotel['room']
房間是存放東西的,而酒店可以有很多房間,客服又可以聯系到很多酒店,但每次都要聯系客服。 -
enumerate:會將數組或列表組成一個索引序列。
enumerate.jpg - 添加數據:room.insert_one()
- 查找:room.find() $lt/$gt/$lte/$gte/$!=分別是<,><=,>=,不等于
2.練習代碼
from bs4 import BeautifulSoup
import requests
import time
import pymongo
client = pymongo.MongoClient('localhost', 27017)
dz = client['dz']
dzinfo = dz['dzinfo']
headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A",
"Cookie": "abtest_ABTest4SearchDate=b; xzuuid=d8302b2c; OZ_1U_2282=vid=v742d196332c56.0&ctime=1464069891<ime=1464069890; OZ_1Y_2282=erefer=-&eurl=http%3A//bj.xiaozhu.com/search-duanzufang-p1-0%3Ft%3D1463996817649&etime=1463996822&ctime=1464069891<ime=1464069890&compid=2282; _ga=GA1.2.679671644.1463996823; _gat_UA-33763849-7=1; __utmt=1; __utma=29082403.679671644.1463996823.1463996824.1464069893.2; __utmb=29082403.1.10.1464069893; __utmc=29082403; __utmz=29082403.1463996824.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)"
}
urls = []
# url = 'http://bj.xiaozhu.com/search-duanzufang-p1-0/'
# urla = 'http://bj.xiaozhu.com/fangzi/1580034935.html'
def each_href(url):
wb_data = requests.get(url, headers=headers)
soup = BeautifulSoup(wb_data.text, 'lxml')
xiangqings = soup.select('a.resule_img_a')
for xiangqing in xiangqings:
urls.append(xiangqing.get('href'))
return (print(urls,len(urls)))
def each_info(urla):
wb_data = requests.get(urla)
time.sleep(2)
soup = BeautifulSoup(wb_data.text, 'lxml')
titles = soup.select('div.pho_info > h4 > em')
prices = soup.select('#pricePart > div.day_l > span')
adds = soup.select('span[class="pr5"]')
img_rooms = soup.select('#detailImageBox > div.pho_show_r > div > ul > li > img')
img_hosts = soup.select('#floatRightBox > div.js_box.clearfix > div.member_pic > a > img')
ffs = soup.select('#floatRightBox > div.js_box.clearfix > div.w_240 > h6 > span')
names = soup.select('#floatRightBox > div.js_box.clearfix > div.w_240 > h6 > a')
for title, price, add, img_room, img_host, ff, name in zip(titles, prices, adds, img_rooms, img_hosts, ffs, names):
if str(ff) == '<span class="member_girl_ico"></span>':
ff = "女性"
else:
ff = "男性"
data = {
'title': title.get_text(),
'price': price.get_text(),
'add': add.get_text().split("\n")[0],
'img_room': img_room.get('data-bigimg'),
'img_host': img_host.get('src'),
'ff': ff,
'name': name.get_text()
}
# print(data)
dzinfo.insert_one(data)
i = input("想爬取多少頁的信息?")
for k in range(1, int(i)+1):
for url in ['http://bj.xiaozhu.com/search-duanzufang-p{}-0/'.format(k)]:
each_href(url)
print(urls, len(urls))
for urla in urls:
each_info(urla)
for info in dzinfo.find():
if int(info['price'])>=500:
print(info)
運行結果
運行結果.jpg
運行結果.jpg
大于500的房源.jpg
大于500的房源.jpg
注:在pycharm中的pymongo插件中查看爬到的結果,最多看到0~299,共300條信息