Python Crawler Basics: Scraping Lagou with Scrapy

I tried scraping Lagou once before, ran into errors I could not solve, and gave up. Today I tried again. For a beginner there really are pitfalls everywhere, so I am writing them down for future reference.

First, open Lagou, type Python into the search box, open the developer tools with F12, and refresh the page:

The response to the initial page request does not contain the data we want. In that case I usually switch to the XHR tab and look there:

The JSON data we want turns up under the URL https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&isSchoolJob=0. So we can imitate the browser's request to this URL and parse the returned JSON to get the results.
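To make the request shape concrete, here is a small sketch, using only the standard library, that builds the URL and the form body for one search page. The query parameters come from the URL above; the form fields first, pn, and kd are the ones the spider in this post sends; the helper name build_search_request is made up for illustration.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.lagou.com/jobs/positionAjax.json'


def build_search_request(keyword, page=1):
    """Return (url, body) for one search page; hypothetical helper, not part of Scrapy."""
    # The query string is attached to the URL after a '?'.
    query = urlencode({'needAddtionalResult': 'false', 'isSchoolJob': 0})
    # The form fields travel in the POST body, encoded the same way.
    body = urlencode({'first': 'true', 'pn': page, 'kd': keyword})
    return f'{BASE_URL}?{query}', body


url, body = build_search_request('python')
print(url)
print(body)
```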

So I started writing the code in the project's spider.py:

import json

import scrapy


class LagouSpider(scrapy.Spider):
    name = 'lagou'

    def start_requests(self):
        url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&isSchoolJob=0'
        # The search endpoint is queried with a POST carrying the page number (pn) and keyword (kd).
        yield scrapy.FormRequest(url, formdata={'first': 'true', 'pn': '1', 'kd': 'python'},
                                 method='POST', meta={'pn': 1}, callback=self.parse)

    def parse(self, response):
        data = json.loads(response.text)
        if data:
            content = data.get('content')
            positionResult = content.get('positionResult')
            results = positionResult.get('result')
            for result in results:
                companyFullName = result.get('companyFullName')
                print(companyFullName)

In settings.py I kept the default DEFAULT_REQUEST_HEADERS and added a random User-Agent. Then I ran the code, and it failed with:

File "E:\Python\pycharm\lagouposition\lagouposition\spiders\lagou.py", line 60, in parse

content=data['content']

KeyError: 'content'
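A KeyError like this means the JSON that came back has no 'content' key at all, which suggests the server answered with something other than the search results. Below is a minimal sketch of surfacing that more clearly. It assumes nothing about what the server actually returns: the two sample payloads and the helper name extract_results are invented for illustration.

```python
import json


def extract_results(text):
    """Return the list of job dicts, or raise with the raw payload if 'content' is absent."""
    data = json.loads(text)
    content = data.get('content')
    if content is None:
        # Include the start of the payload so the log shows what the server really sent.
        raise ValueError(f'no "content" in response: {text[:200]}')
    return content.get('positionResult', {}).get('result', [])


# Hypothetical payloads for illustration only.
ok = '{"content": {"positionResult": {"result": [{"companyFullName": "Example Co"}]}}}'
blocked = '{"success": false, "msg": "blocked"}'

print([r['companyFullName'] for r in extract_results(ok)])
try:
    extract_results(blocked)
except ValueError as exc:
    print(exc)
```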

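The code looked fine, so why did this error keep coming up? It nearly drove me crazy. Later I saw an answer on Zhihu saying that all of the request headers need to be sent (the person answering said they did not know exactly why either), so I set the following in settings.py: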

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Cookie': 'LGUID=20170624104910-b3421612-5887-11e7-805a-525400f775ce; user_trace_token=20170624104912-161b9c7475a6448381c393fd68935f6b; index_location_city=%E5%85%A8%E5%9B%BD; JSESSIONID=ABAAABAAAFCAAEGF2DB2AA232B68C2B16743FE83939C1E9; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; TG-TRACK-CODE=index_search; _gid=GA1.2.705404459.1505118253; _ga=GA1.2.1378071003.1498273550; LGSID=20170911225046-98307e76-9700-11e7-8f76-525400f775ce; LGRID=20170911225056-9dbaf56b-9700-11e7-9168-5254005c3644; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1504697344,1504751304,1504860546,1505142452; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1505142462; SEARCH_ID=1875185cf5904051845b74a20b82bebd',
    'Host': 'www.lagou.com',
    'Origin': 'https://www.lagou.com',
    'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
    # 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'X-Anit-Forge-Code': '0',
    'X-Anit-Forge-Token': 'None',
    'X-Requested-With': 'XMLHttpRequest',
}

I ran it again. The error above was gone, but now there was an encoding error (I am on Windows 7):

Again I searched online and tried several fixes without success; the error kept appearing. In the end I found a solution, which was to add the following to spider.py:

import io
import sys

# Re-wrap standard output so that print() encodes text as GBK.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gbk')

That solved the encoding problem.
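The error output itself is not reproduced in this post, so the following is only a guess at the mechanism: on a Chinese-locale Windows console, print can fail when a string contains a character that the console encoding cannot represent. The sketch below assumes that situation and uses an in-memory buffer in place of sys.stdout.buffer. The errors='replace' option is an extra that the fix above does not use.

```python
import io

text = '拉勾\xa0网'  # U+00A0 (no-break space) has no GBK encoding

# Strict encoding raises, which is what print would hit on a GBK console.
try:
    text.encode('gbk')
except UnicodeEncodeError as exc:
    print('strict failed:', exc.reason)

# Wrapping a byte stream with an explicit encoding and errors='replace'
# substitutes '?' for anything that cannot be encoded.
buffer = io.BytesIO()  # stand-in for sys.stdout.buffer
wrapped = io.TextIOWrapper(buffer, encoding='gbk', errors='replace')
wrapped.write(text)
wrapped.flush()
print(buffer.getvalue())
```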

Then I carried on coding. In items.py:

from scrapy import Item, Field


class LagoupositionItem(Item):
    # One field per key of a job record in the JSON response.
    companyFullName = Field()
    companyId = Field()
    companyLabelList = Field()
    companyLogo = Field()
    companyShortName = Field()
    companySize = Field()
    createTime = Field()
    deliver = Field()
    district = Field()
    education = Field()
    explain = Field()
    financeStage = Field()
    firstType = Field()
    formatCreateTime = Field()
    gradeDescription = Field()
    industryField = Field()
    industryLables = Field()
    isSchoolJob = Field()
    jobNature = Field()
    positionAdvantage = Field()
    positionId = Field()
    positionLables = Field()
    positionName = Field()
    salary = Field()
    secondType = Field()
    workYear = Field()
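With this many fields, copying each one by hand from the JSON record gets repetitive. Here is a sketch of copying them by name instead. It uses a plain dict as a stand-in for the item so the snippet runs without Scrapy; FIELD_NAMES is shortened, and both the helper fill_item and the sample record are invented for illustration.

```python
# A few of the field names declared above; the full tuple would list all of them.
FIELD_NAMES = ('companyFullName', 'positionName', 'salary', 'workYear')


def fill_item(item, result, names=FIELD_NAMES):
    """Copy each named key from the JSON record into the item (None if missing)."""
    for name in names:
        item[name] = result.get(name)
    return item


sample = {'companyFullName': 'Example Co', 'salary': '10k-20k', 'unrelated': 1}
item = fill_item({}, sample)
print(item)
```

A Scrapy Item accepts the same item[name] = value assignment for its declared fields, so the helper should also work with LagoupositionItem() in place of the dict.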

In spider.py:

# Requires "from ..items import LagoupositionItem" (adjust to your project layout) at the top of the file.
def parse(self, response):
    data = json.loads(response.text)
    if data:
        content = data.get('content')
        positionResult = content.get('positionResult')
        totalCount = positionResult.get('totalCount')
        # 15 results per page; crawl at most 30 pages.
        pages = min(int(totalCount / 15), 30)
        results = positionResult.get('result')
        for result in results:
            item = LagoupositionItem()
            item['companyFullName'] = result.get('companyFullName')
            item['companyId'] = result.get('companyId')
            item['companyLabelList'] = result.get('companyLabelList')
            item['companyLogo'] = result.get('companyLogo')
            item['companyShortName'] = result.get('companyShortName')
            item['companySize'] = result.get('companySize')
            item['createTime'] = result.get('createTime')
            item['deliver'] = result.get('deliver')
            item['district'] = result.get('district')
            item['education'] = result.get('education')
            item['explain'] = result.get('explain')
            item['financeStage'] = result.get('financeStage')
            item['firstType'] = result.get('firstType')
            item['formatCreateTime'] = result.get('formatCreateTime')
            item['gradeDescription'] = result.get('gradeDescription')
            item['industryField'] = result.get('industryField')
            item['industryLables'] = result.get('industryLables')
            item['isSchoolJob'] = result.get('isSchoolJob')
            item['jobNature'] = result.get('jobNature')
            item['positionAdvantage'] = result.get('positionAdvantage')
            item['positionId'] = result.get('positionId')
            item['positionLables'] = result.get('positionLables')
            item['positionName'] = result.get('positionName')
            item['salary'] = result.get('salary')
            item['secondType'] = result.get('secondType')
            item['workYear'] = result.get('workYear')
            yield item
        # Request the next page once per response, after all items have been yielded.
        pn = int(response.meta.get('pn')) + 1
        if pn <= pages:
            yield scrapy.FormRequest(response.url, formdata={'first': 'False', 'pn': str(pn), 'kd': 'python'},
                                     method='POST', meta={'pn': pn}, callback=self.parse)
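The page count above rounds down, so a final, partly filled page would be skipped. Here is a sketch of a variant that rounds up, assuming 15 results per page and the 30-page cap used in parse; page_count is a hypothetical helper.

```python
import math


def page_count(total_count, per_page=15, max_pages=30):
    """Number of pages to request: ceil(total / per_page), capped at max_pages."""
    return min(math.ceil(total_count / per_page), max_pages)


for total in (0, 15, 16, 1000):
    print(total, page_count(total))
```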

I expected this to fetch the first 30 pages, but after just one page the earlier error came back:

File "E:\Python\pycharm\lagouposition\lagouposition\spiders\lagou.py", line 60, in parse

content=data['content']

KeyError: 'content'

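Since the same error had shown up at the very start, I suspected that the follow-up request

yield scrapy.FormRequest(response.url, formdata={'first': 'False', 'pn': str(pn), 'kd': 'python'},
                         method='POST', meta={'pn': pn}, callback=self.parse)

was going out without the headers. So I made the following change: comment out DEFAULT_REQUEST_HEADERS in settings.py and add this to the spider in spider.py: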

# Class attribute on the spider, so that requests can pass headers=self.headers.
headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Cookie': 'LGUID=20170624104910-b3421612-5887-11e7-805a-525400f775ce; user_trace_token=20170624104912-161b9c7475a6448381c393fd68935f6b; index_location_city=%E5%85%A8%E5%9B%BD; JSESSIONID=ABAAABAAAFCAAEGF2DB2AA232B68C2B16743FE83939C1E9; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; TG-TRACK-CODE=index_search; _gid=GA1.2.705404459.1505118253; _ga=GA1.2.1378071003.1498273550; LGSID=20170911225046-98307e76-9700-11e7-8f76-525400f775ce; LGRID=20170911225056-9dbaf56b-9700-11e7-9168-5254005c3644; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1504697344,1504751304,1504860546,1505142452; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1505142462; SEARCH_ID=1875185cf5904051845b74a20b82bebd',
    'Host': 'www.lagou.com',
    'Origin': 'https://www.lagou.com',
    'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
    # 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'X-Anit-Forge-Code': '0',
    'X-Anit-Forge-Token': 'None',
    'X-Requested-With': 'XMLHttpRequest',
}
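Scrapy requests also accept a cookies argument that takes a dict, which can be easier to maintain than one long Cookie header. Here is a sketch of splitting a header string into such a dict. parse_cookie_header is a hypothetical helper, and the sample string is shortened from the header above. Whether Lagou accepts cookies sent this way was not tested.

```python
def parse_cookie_header(header):
    """Split 'a=1; b=2' into {'a': '1', 'b': '2'}, keeping empty values."""
    cookies = {}
    for part in header.split(';'):
        name, sep, value = part.strip().partition('=')
        if sep:  # skip fragments without '='
            cookies[name] = value
    return cookies


sample = 'index_location_city=%E5%85%A8%E5%9B%BD; _gat=1; PRE_UTM=; TG-TRACK-CODE=index_search'
print(parse_cookie_header(sample))
```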

Then pass the headers in the first request:

yield scrapy.FormRequest(url, formdata={'first': 'true', 'pn': '1', 'kd': 'python'},
                         method='POST', meta={'pn': 1}, headers=self.headers, callback=self.parse)

and in the follow-up request as well:

yield scrapy.FormRequest(response.url, formdata={'first': 'False', 'pn': str(pn), 'kd': 'python'},
                         method='POST', meta={'pn': pn}, headers=self.headers, callback=self.parse)
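Because the first request and every follow-up request must carry the same headers, one way to avoid forgetting them is to build the arguments in a single place. This is a sketch only: request_kwargs is a hypothetical helper, and the form values mirror the ones sent in this post.

```python
def request_kwargs(pn, headers, keyword='python'):
    """Keyword arguments shared by every page request (hypothetical helper)."""
    return {
        # Values as sent in this post: 'true' for the first page, 'False' afterwards.
        'formdata': {'first': 'true' if pn == 1 else 'False', 'pn': str(pn), 'kd': keyword},
        'method': 'POST',
        'meta': {'pn': pn},
        'headers': headers,
    }


demo_headers = {'X-Requested-With': 'XMLHttpRequest'}
print(request_kwargs(1, demo_headers))
print(request_kwargs(2, demo_headers))
```

Inside the spider this could be used as scrapy.FormRequest(url, callback=self.parse, **request_kwargs(pn, self.headers)).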

Then I ran it, and it finally worked and fetched all 30 pages. There were plenty of pitfalls along the way.

Last edited: 2017.12.10 07:58:41
