A Lanfw (藍房網) crawler based on bs4 + requests (advanced version)

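1. The code runs as-is. Download and install Anaconda; running the script in Spyder makes it easy to inspect the variables.
Alternatively, just open the Excel file it generates.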
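2. Dependencies. Run the following in a command prompt (on Windows 10, press Windows+X, then A, to open one):
pip install BeautifulSoup4
pip install requests
The code also imports pandas, uses the lxml parser, and writes an .xlsx file, which pandas does through an Excel engine such as openpyxl. Anaconda normally ships with all of these; on a plain Python install, add: pip install lxml pandas openpyxl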
3. The site being crawled is the second-hand housing section of Lanfw (Xiamen). Open http://xm.esf.lanfw.com/sell_zhuzhai/p1?keyword=/ to look at the page structure.
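4. How to tell whether code is Python 2 or Python 3: print('') is Python 3 style, while print '' (no parentheses) only works in Python 2.
In short, if print is always called with parentheses, the code targets Python 3, as the code below does.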
5. The script fetches and parses 538 pages, so expect to wait roughly 500 seconds after starting it.
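
Before starting a run of that length, it may be worth confirming that the interpreter and the libraries are in place. The following is a minimal sketch, not part of the original script: the helper names is_python3 and missing_modules are made up for illustration, and the module list assumes that pandas writes the .xlsx file through openpyxl.

```python
import importlib.util
import sys

# Modules imported by the crawler, plus openpyxl, which pandas commonly
# uses to write .xlsx files (an assumption about the local setup).
REQUIRED_MODULES = ["requests", "bs4", "lxml", "pandas", "openpyxl"]


def is_python3():
    """Return True when the running interpreter is Python 3 or newer."""
    return sys.version_info[0] >= 3


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python 3:", is_python3())
    print("Missing modules:", missing_modules(REQUIRED_MODULES))
```

If the second line prints an empty list, every dependency is importable; otherwise install the listed names with pip (note that the bs4 module comes from the BeautifulSoup4 package).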

# -*- coding: utf-8 -*-
"""
Created on Mon Jan 15 23:30:28 2018

@author: Administrator
"""

def getHousesDetails(url):
  """Fetch one listing page and return a list of dicts, one per house."""
  import requests
  from bs4 import BeautifulSoup
  # A timeout keeps one stalled request from hanging the whole crawl.
  response = requests.get(url, timeout=10)
  response.encoding = 'utf-8'
  soup = BeautifulSoup(response.text,'lxml')
  houses = soup.select('.houseTxt')
  housesDetails = []
  for house in houses:
    title = house.select('.txtLeft h2 a')[0].text
    # First <p>: community name, optionally an address, then the map link label.
    # strip() removes any of these characters from both ends, not the exact phrase.
    communityNameAndAddress = house.select('.txtLeft p')[0].text.strip('查看地圖').split()
    communityName = communityNameAndAddress[0]
    if len(communityNameAndAddress) == 2:
      address = communityNameAndAddress[1]
    else:
      address = ''
    # Second <p>: 4 to 6 fields separated by ' | '.
    details = house.select('.txtLeft p')[1].text.split(' | ')
    print(details)
    if len(details) < 4:
      # Unexpected layout: skip this listing instead of raising IndexError.
      continue
    houseSizeType = details[0]
    houseFloor = details[1]
    houseDecoration = details[2]
    houseBuiltTime = details[3]
    # Default, so a value never leaks in from the previous listing.
    houseOrientation = ''
    if len(details) >= 6:
      houseOrientation = details[4]
      houseUnitPrice = details[5]
    elif len(details) == 5:
      houseUnitPrice = details[4]
    else:
      # Four fields: decoration and orientation are both missing.
      houseDecoration = ''
      houseBuiltTime = details[2]
      houseUnitPrice = details[3]
    price = house.select('.housePrice')[0].text
    squaremeter = house.select('.squaremeter')[0].text
    keywords = house.select('.houseTab')[0].text
    # The house's fields were collected above; pack them into a dict below.
    houseDetails = {
        'title' : title,
        'communityName' : communityName,
        'address' : address,
        'houseSizeType': houseSizeType,
        'houseFloor' : houseFloor,
        'houseDecoration' : houseDecoration,
        'houseBuiltTime' : houseBuiltTime,
        'houseOrientation' : houseOrientation,
        'houseUnitPrice' : houseUnitPrice,
        'price' : price,
        'squaremeter' : squaremeter,
        'keywords' : keywords
        }
    housesDetails.append(houseDetails)
  return housesDetails

def getAllHousesDetails():
  """Crawl every listing page and return all houses as a pandas DataFrame."""
  import pandas
  maxPageNumber = 538  # number of listing pages to crawl
  urlBefore = 'http://xm.esf.lanfw.com/sell_zhuzhai/p{}?keyword='
  allHousesDetails = []
  for i in range(1,maxPageNumber+1):
    url = urlBefore.format(i)
    allHousesDetails.extend(getHousesDetails(url))
  # One row per house, one column per dict key.
  dataFrame = pandas.DataFrame(allHousesDetails)
  return dataFrame

if __name__ == '__main__':
  allHousesDetails = getAllHousesDetails()
  allHousesDetails.to_excel('lanfwSecondHandHouseDetails2.xlsx')
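
Because the page loop above stops at the first exception, a single failed request late in the run would discard everything collected so far. One possible way around this, shown here only as a sketch, is to catch errors per page and record the URLs that failed. The names crawlPages, fakeFetch and delaySeconds are hypothetical and not part of the original script; fakeFetch stands in for getHousesDetails so the example runs without network access.

```python
import time

URL_TEMPLATE = 'http://xm.esf.lanfw.com/sell_zhuzhai/p{}?keyword='


def crawlPages(fetchPage, firstPage, lastPage, delaySeconds=0.0):
    """Call fetchPage(url) for each page; return (records, failedUrls).

    fetchPage is any callable that takes a URL and returns a list of
    dicts, for example getHousesDetails from the script above.
    """
    records = []
    failedUrls = []
    for pageNumber in range(firstPage, lastPage + 1):
        url = URL_TEMPLATE.format(pageNumber)
        try:
            records.extend(fetchPage(url))
        except Exception as error:
            # Keep going: one bad page should not lose the other pages.
            print('failed: {} ({})'.format(url, error))
            failedUrls.append(url)
        if delaySeconds > 0:
            # Optional pause between requests to go easy on the site.
            time.sleep(delaySeconds)
    return records, failedUrls


def fakeFetch(url):
    """Hypothetical stand-in for getHousesDetails, used only in this demo."""
    if url.endswith('p2?keyword='):
        raise ValueError('simulated network error')
    return [{'title': 'sample listing from ' + url}]


if __name__ == '__main__':
    records, failedUrls = crawlPages(fakeFetch, 1, 3)
    print(len(records), 'records;', len(failedUrls), 'failed pages')
```

With the real script, the call would look like crawlPages(getHousesDetails, 1, 538, delaySeconds=1), after which the failed URLs can be retried. A one-second delay adds about 538 seconds to the total run time.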