Results
- The two functions each seem to work when tested on their own.
- During a real crawl the site bans me very aggressively, so I could not fetch many pages or run many tests.
- path_detail = './resut_detail.txt' stores the detailed fields scraped from every listing page.
- path_links = './resut_links.txt' stores the URLs of all the listing pages that were collected.
- with open(path_detail,'a+') as text: I am not sure this is the right way to write it, i.e. whether it guarantees the existing path_detail file is not overwritten, and I have not yet verified that its position inside the loop is correct (see the small sketch after this list).
- with open(path_links,'a+') as text: same question as above.
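To answer the 'a+' question above: mode 'w' truncates the file every time it is opened, while 'a' / 'a+' only ever append, so existing content stays intact. A minimal sketch, where mode_test.txt is just a throwaway file for the experiment:

path = './mode_test.txt'  # throwaway file just for this test

with open(path, 'w') as f:   # 'w' truncates the file every time it is opened
    f.write('first line\n')

with open(path, 'a') as f:   # 'a' (and 'a+') only append; nothing is overwritten
    f.write('second line\n')

with open(path) as f:
    print(f.read())          # both lines come back, so append mode kept the old content

Opening the file once before the loop (as get_moreurls does for path_links) also works and avoids reopening it on every iteration; opening it inside the loop with 'a+' (as get_detail does) is fine too, just a little slower.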
Code
from bs4 import BeautifulSoup
import requests  # note the trailing "s" in the module name
import time

path_detail = './resut_detail.txt'
path_links = './resut_links.txt'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
    'Cookie': 'abtest_ABTest4SearchDate=b; OZ_1U_2282=vid=v7f3c69fed80eb.0&ctime=1475593887&ltime=0; OZ_1Y_2282=erefer=-&eurl=http%3A//gz.xiaozhu.com/fangzi/2303611027.html&etime=1475593887&ctime=1475593887&ltime=0&compid=2282; _ga=GA1.2.1488476801.1475593889; gr_user_id=13bbe192-e386-4074-8ca0-a4a882ba66aa; gr_session_id_59a81cc7d8c04307ba183d331c373ef6=8d7a3db1-e35f-4f23-9ce3-e73afd78b45a; __utma=29082403.1488476801.1475593889.1475594056.1475594056.1; __utmb=29082403.1.10.1475594056; __utmc=29082403; __utmz=29082403.1475594056.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)'
}

def get_detail(url_detail='http://gz.xiaozhu.com/fangzi/2303611027.html'):
    time.sleep(15)  # slow down to make a ban less likely
    web_content = requests.get(url_detail, headers=headers)  # remember to pass headers
    print(web_content)  # shows the response status, e.g. <Response [200]>
    soup = BeautifulSoup(web_content.text, 'lxml')
    titles = soup.select('div.pho_info h4 em')
    addresses = soup.select('body > div.wrap.clearfix.con_bg > div.con_l > div.pho_info > p')
    rentals = soup.select('div.day_l')
    images = soup.select('img#curBigImage')  # "#" selects by id, so this selector is fine
    landlord_photos = soup.select('div.member_pic > a > img')
    landlord_genders = soup.select('#floatRightBox > div.js_box.clearfix > div.member_pic > div')
    landlord_names = soup.select('#floatRightBox > div.js_box.clearfix > div.w_240 > h6 > a')
    for title, address, rental, image, landlord_photo, landlord_gender, landlord_name in zip(
            titles, addresses, rentals, images, landlord_photos, landlord_genders, landlord_names):
        gender_class = landlord_gender.get('class')  # already a list like ['member_ico'], no str() needed
        if gender_class == ['member_ico']:
            landlord_gender = '男'
        elif gender_class == ['member_ico1']:
            landlord_gender = '女'
        else:
            landlord_gender = '未知'
        date = {
            'title': title.get_text(),
            'address': address.get('title'),
            'rental': rental.get_text(),
            'image': image.get('src'),
            'landlord_photo': landlord_photo.get('src'),
            'landlord_gender': landlord_gender,
            'landlord_name': landlord_name.get_text()
        }
        list_value = list(date.values())
        # 'a+' keeps appending new results; how to write specific columns in a fixed order? (see the csv sketch after the code)
        with open(path_detail, 'a+') as text:
            text.write(str(list_value) + '\n')
        print(date)

# get_detail()  # single-page test call

url_list = ['http://gz.xiaozhu.com/tianhe-duanzufang-p{}-8/'.format(i) for i in range(1, 2)]  # range(1, 2) yields only page 1; widen it for more pages

def get_moreurls():
    with open(path_links, 'a+') as text:
        for link in url_list:
            time.sleep(2)
            web_content = requests.get(link, headers=headers)  # pass headers here as well
            soup = BeautifulSoup(web_content.text, 'lxml')
            link_lists = soup.select('#page_list ul.pic_list.clearfix li a.resule_img_a')
            for detail_link in link_lists:
                print(detail_link.get('href'))
                text.write(detail_link.get('href') + '\n')  # record every collected link
                get_detail(url_detail=detail_link.get('href'))  # then scrape the detail page itself

get_moreurls()
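On the question in the comment inside get_detail about writing specific columns in a fixed order: csv.DictWriter takes the column order once as fieldnames and then appends rows as dicts, which also makes the dict more useful than a bare list. A minimal sketch, assuming an output file name results.csv and a helper append_row that are only examples:

import csv
import os

csv_path = './results.csv'  # example output file, not one of the paths above
fieldnames = ['title', 'address', 'rental', 'image',
              'landlord_photo', 'landlord_gender', 'landlord_name']

def append_row(row):
    # Append one scraped record; write the header only the first time.
    write_header = not os.path.exists(csv_path)
    with open(csv_path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerow(row)  # columns always come out in fieldnames order

# Inside get_detail this would replace the text.write(...) line:
# append_row(date)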
Summary and questions
1. So far this still feels less convenient than using the 火车头 collector tool, although 火车头 struggles with dynamically loaded data and I have only scratched its surface anyway; perhaps Python is better suited to crawls that are an order of magnitude larger and to other automated processing?
For example, what I actually want to collect is every article that certain media accounts publish on 今日头条, 一点资讯 and 微博 within one week, plus the repost, read and interaction counts on 微博; those pages all load their content dynamically, so Python is probably the better fit (see the JSON sketch at the end of this post).
2. The course example so far only "prints" the results instead of recording them in a txt or csv file, so I should experiment more with that myself. I am also not sure what using a dict buys us here; a list feels more convenient for the follow-up processing we want (sorting, filtering). (For now I am being lazy and just dumping the values as one string, without choosing any particular column order; the csv sketch after the code above is one way to fix that.)
3. This assignment is not really usable in practice yet; it does not cope well with sites that actively block scrapers:
- How can a failed fetch be retried automatically? Or at least carry on with the next link in the loop and record the links that failed (try, except, finally?) — see the try/except sketch below.
- How do we deal with the anti-scraping measures? Will time.sleep conflict with the multithreaded crawling that comes later? 小猪网 is really extreme about this: crawl just a little and requests already fail with 404s, so it feels less like a good example for continuous crawling and more like material for a later anti-blocking exercise.
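For the retry / skip-and-log question in item 3, a minimal sketch built around try / except; fetch_with_retry and failed_links.txt are just illustrative names, and it reuses the headers dict from the code above:

import random
import time
import requests

failed_path = './failed_links.txt'  # illustrative log of links that never succeeded

def fetch_with_retry(url, retries=3):
    # Try a URL a few times; on repeated failure record it and let the caller move on.
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(url, headers=headers, timeout=10)
            resp.raise_for_status()  # turns 404 / 503 responses into exceptions
            return resp
        except requests.RequestException as e:
            print('attempt {} failed for {}: {}'.format(attempt, url, e))
            time.sleep(random.uniform(5, 15))  # a random pause looks less robotic than a fixed sleep
    with open(failed_path, 'a') as f:  # give up, log the link, keep the loop going
        f.write(url + '\n')
    return None

# Usage inside get_moreurls:
# resp = fetch_with_retry(detail_link.get('href'))
# if resp is None:
#     continue

As for time.sleep versus multithreading: each thread sleeps independently, so they do not conflict as such; the sleep just limits how fast any single thread hits the site.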
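And on the dynamic-loading point in item 1: pages like the 今日头条 or 微博 feeds usually fill themselves from a background JSON (XHR) request, which requests can call directly once the endpoint is copied out of the browser's network panel. A minimal sketch that reuses the headers dict from the code above; the URL, parameters, and field names are all placeholders, not any real API:

import requests

api_url = 'https://example.com/api/articles'  # placeholder; copy the real XHR URL from devtools
params = {'page': 1, 'count': 20}             # hypothetical paging parameters

resp = requests.get(api_url, params=params, headers=headers)
data = resp.json()  # dynamically loaded pages usually answer with JSON, not HTML

for item in data.get('items', []):            # 'items', 'title', 'read_count' are made-up field names
    print(item.get('title'), item.get('read_count'))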