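I tried crawling Lagou once before, but I ran into errors I could not solve and gave up. Today I tried again. For a beginner there really are pitfalls everywhere, so I am writing this post to record some of them as a reference for my later learning.
First, open Lagou, type Python into the search box, open the developer tools with F12, and refresh the page:
The response to this original request does not contain the data we want. In this situation I usually switch to the XHR tab and look there:
At the URL https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&isSchoolJob=0 we can find the JSON data we want. So we can imitate the browser's request to this URL and then parse the returned JSON to get the results we want.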
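To make "imitating the browser's request" a little more concrete, here is a minimal sketch using only the standard library. It builds the POST request without sending it, so the form-encoded body can be inspected. The helper name build_search_request is my own, and the form fields (first, pn, kd) are the ones used in the spider code later in this post.

```python
from urllib.parse import urlencode
from urllib.request import Request

URL = ('https://www.lagou.com/jobs/positionAjax.json'
       '?needAddtionalResult=false&isSchoolJob=0')


def build_search_request(keyword):
    """Build (but do not send) the POST request for page 1 of a search.

    Hypothetical helper for illustration only.
    """
    form = {'first': 'true', 'pn': '1', 'kd': keyword}
    # urlencode produces the application/x-www-form-urlencoded body
    body = urlencode(form).encode('ascii')
    headers = {
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    }
    return Request(URL, data=body, headers=headers, method='POST')


req = build_search_request('python')
print(req.get_method())
print(req.data)
```

Nothing is sent over the network here; the point is only to see that the request is a form POST, which is what scrapy.FormRequest builds for us.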
So I started writing the code in the Scrapy project's spider.py:
import json

import scrapy


class LagouSpider(scrapy.Spider):
    name = 'lagou'

    def start_requests(self):
        url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&isSchoolJob=0'
        # POST the search form: first page, keyword "python"
        yield scrapy.FormRequest(url, formdata={'first': 'true', 'pn': '1', 'kd': 'python'}, method='POST', meta={'pn': 1}, callback=self.parse)

    def parse(self, response):
        # The endpoint returns JSON, so decode the response body directly
        html = response.text
        data = json.loads(html)
        if data:
            content = data.get('content')
            positionResult = content.get('positionResult')
            results = positionResult.get('result')
            for result in results:
                companyFullName = result.get('companyFullName')
                print(companyFullName)
In settings.py I kept the default DEFAULT_REQUEST_HEADERS and added a random User-Agent to it. Then I ran the code, and it failed with this error:
File "E:\Python\pycharm\lagouposition\lagouposition\spiders\lagou.py", line 60, in parse
content=data['content']
KeyError: 'content'
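A KeyError like this only says that the key is missing, not what the server sent instead. Below is a minimal sketch of a guard that returns the results when the expected structure is present and otherwise reports the start of the raw reply. The function name extract_results and both sample bodies are made up for illustration; the real rejection payload from the site is not shown in this post.

```python
import json


def extract_results(text):
    """Return (results, error) for a positionAjax.json response body.

    Hypothetical helper: error is None when the expected
    content -> positionResult -> result structure is present.
    """
    data = json.loads(text)
    content = data.get('content') if isinstance(data, dict) else None
    if not content:
        # No 'content' key: surface what the server actually replied
        return [], 'unexpected reply: %s' % text[:200]
    position_result = content.get('positionResult') or {}
    return position_result.get('result') or [], None


# Made-up sample bodies, only to exercise both branches
ok_body = json.dumps(
    {'content': {'positionResult': {'result': [{'companyFullName': 'Example Co'}]}}}
)
rejected_body = json.dumps({'success': False, 'msg': 'hypothetical rejection'})

print(extract_results(ok_body))
print(extract_results(rejected_body))
```

Logging the error string from the second branch would have shown right away that the server was answering with something other than job data.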
The code looked fine, so why did it keep raising this error? It really drove me crazy. Later I saw an answer on Zhihu saying that all of the request headers need to be included (the person who answered said they did not know exactly why either), so I set the following in settings.py:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Cookie': 'LGUID=20170624104910-b3421612-5887-11e7-805a-525400f775ce; user_trace_token=20170624104912-161b9c7475a6448381c393fd68935f6b; index_location_city=%E5%85%A8%E5%9B%BD; JSESSIONID=ABAAABAAAFCAAEGF2DB2AA232B68C2B16743FE83939C1E9; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; TG-TRACK-CODE=index_search; _gid=GA1.2.705404459.1505118253; _ga=GA1.2.1378071003.1498273550; LGSID=20170911225046-98307e76-9700-11e7-8f76-525400f775ce; LGRID=20170911225056-9dbaf56b-9700-11e7-9168-5254005c3644; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1504697344,1504751304,1504860546,1505142452; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1505142462; SEARCH_ID=1875185cf5904051845b74a20b82bebd',
    'Host': 'www.lagou.com',
    'Origin': 'https://www.lagou.com',
    'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
    # 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'X-Anit-Forge-Code': '0',
    'X-Anit-Forge-Token': 'None',
    'X-Requested-With': 'XMLHttpRequest',
}
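Then I ran it again. The error above disappeared, but an encoding error appeared instead (I am on Windows 7):
Again I searched online a lot and tried several approaches without success; the error kept appearing. Eventually I found a fix, which was to add the following code to spider.py: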
import sys,io
sys.stdout=io.TextIOWrapper(sys.stdout.buffer,encoding='gbk')
That solved the encoding problem.
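The exact error message is not reproduced in this post. Assuming it was a UnicodeEncodeError raised when print() met a character the console encoding cannot represent, the sketch below shows the same stdout wrapping with errors='replace' added, so that an unencodable character is printed as ? instead of raising. The function names are my own.

```python
import io
import sys


def gbk_safe(text, encoding='gbk'):
    """Round-trip text through the console encoding, replacing
    characters that encoding cannot represent with '?'."""
    return text.encode(encoding, errors='replace').decode(encoding)


def wrap_stdout(encoding='gbk'):
    """Return a stdout wrapper that never raises on unencodable text.

    Same idea as the two lines above, with errors='replace' added.
    Assumes sys.stdout has a .buffer, which is true for a normal console.
    """
    return io.TextIOWrapper(
        sys.stdout.buffer,
        encoding=encoding,
        errors='replace',
        line_buffering=True,
    )
```

To use it, assign sys.stdout = wrap_stdout() once at the top of spider.py. gbk_safe is only there to show what the replacement does to a single string.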
Then I went on coding. In items.py:
from scrapy import Item, Field


class LagoupositionItem(Item):
    # One field per key of a job entry in the JSON response
    companyFullName = Field()
    companyId = Field()
    companyLabelList = Field()
    companyLogo = Field()
    companyShortName = Field()
    companySize = Field()
    createTime = Field()
    deliver = Field()
    district = Field()
    education = Field()
    explain = Field()
    financeStage = Field()
    firstType = Field()
    formatCreateTime = Field()
    gradeDescription = Field()
    industryField = Field()
    industryLables = Field()
    isSchoolJob = Field()
    jobNature = Field()
    positionAdvantage = Field()
    positionId = Field()
    positionLables = Field()
    positionName = Field()
    salary = Field()
    secondType = Field()
    workYear = Field()
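Because every field name here is identical to the key in the JSON result, the copy from JSON to item can be driven by the names themselves. The sketch below shows the idea with a plain dict standing in for the Item so that it runs without Scrapy; FIELD_NAMES is a shortened, illustrative subset and copy_fields is a name I made up.

```python
# Illustrative subset of the field names defined in items.py
FIELD_NAMES = ('companyFullName', 'positionName', 'salary', 'workYear')


def copy_fields(result, field_names=FIELD_NAMES):
    """Copy the named keys from one JSON result entry.

    Missing keys become None, the same as result.get(name).
    """
    return {name: result.get(name) for name in field_names}


sample = {'salary': '10k-20k', 'positionName': 'Python', 'unused': 1}
print(copy_fields(sample))
```

With a real Scrapy Item the same loop could iterate over item.fields instead of a hand-written tuple, which avoids listing every name twice.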
In spider.py (with LagoupositionItem imported from items.py):
def parse(self, response):
    html = response.text
    data = json.loads(html)
    if data:
        content = data.get('content')
        positionResult = content.get('positionResult')
        totalCount = positionResult.get('totalCount')
        # 15 results per page; crawl at most the first 30 pages
        pages = min(int(totalCount / 15), 30)
        results = positionResult.get('result')
        for result in results:
            item = LagoupositionItem()
            item['companyFullName'] = result.get('companyFullName')
            item['companyId'] = result.get('companyId')
            item['companyLabelList'] = result.get('companyLabelList')
            item['companyLogo'] = result.get('companyLogo')
            item['companyShortName'] = result.get('companyShortName')
            item['companySize'] = result.get('companySize')
            item['createTime'] = result.get('createTime')
            item['deliver'] = result.get('deliver')
            item['district'] = result.get('district')
            item['education'] = result.get('education')
            item['explain'] = result.get('explain')
            item['financeStage'] = result.get('financeStage')
            item['firstType'] = result.get('firstType')
            item['formatCreateTime'] = result.get('formatCreateTime')
            item['gradeDescription'] = result.get('gradeDescription')
            item['industryField'] = result.get('industryField')
            item['industryLables'] = result.get('industryLables')
            item['isSchoolJob'] = result.get('isSchoolJob')
            item['jobNature'] = result.get('jobNature')
            item['positionAdvantage'] = result.get('positionAdvantage')
            item['positionId'] = result.get('positionId')
            item['positionLables'] = result.get('positionLables')
            item['positionName'] = result.get('positionName')
            item['salary'] = result.get('salary')
            item['secondType'] = result.get('secondType')
            item['workYear'] = result.get('workYear')
            yield item
        # Request the next page once, after all items of this page are yielded
        pn = int(response.meta.get('pn')) + 1
        if pn <= pages:
            yield scrapy.FormRequest(response.url, formdata={'first': 'False', 'pn': str(pn), 'kd': 'python'}, method='POST', meta={'pn': pn}, callback=self.parse)
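One detail in the page calculation: int(totalCount/15) rounds down, so a partly filled last page is dropped whenever the total is below the 30-page cap. A small sketch of a rounding-up version follows; the function name page_count is my own, and the 15-per-page and 30-page values are the ones used in the code above.

```python
def page_count(total_count, per_page=15, max_pages=30):
    """Number of pages to crawl: ceiling division, capped at max_pages."""
    # -(-a // b) is integer ceiling division without floats
    pages = -(-total_count // per_page)
    return min(pages, max_pages)


print(page_count(16))    # 16 results need two pages
print(page_count(5000))  # capped at the 30-page limit
```

For a popular keyword such as python the total is far above 450, so the cap applies and the difference does not matter; it only shows up for rarer keywords.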
I had expected this to fetch the first 30 pages, but after scraping just one page it raised the earlier error again:
File "E:\Python\pycharm\lagouposition\lagouposition\spiders\lagou.py", line 60, in parse
content=data['content']
KeyError: 'content'
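Since the same error had appeared at the very beginning, I suspected that the follow-up request:
yield scrapy.FormRequest(response.url, formdata={'first': 'False', 'pn': str(pn), 'kd': 'python'},
                         method='POST', meta={'pn': pn}, callback=self.parse)
was failing because it carried no headers. So I made the following change: I commented out DEFAULT_REQUEST_HEADERS in settings.py and added the following to spider.py: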
# Defined inside LagouSpider so that it can be used as self.headers
headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Cookie': 'LGUID=20170624104910-b3421612-5887-11e7-805a-525400f775ce; user_trace_token=20170624104912-161b9c7475a6448381c393fd68935f6b; index_location_city=%E5%85%A8%E5%9B%BD; JSESSIONID=ABAAABAAAFCAAEGF2DB2AA232B68C2B16743FE83939C1E9; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; TG-TRACK-CODE=index_search; _gid=GA1.2.705404459.1505118253; _ga=GA1.2.1378071003.1498273550; LGSID=20170911225046-98307e76-9700-11e7-8f76-525400f775ce; LGRID=20170911225056-9dbaf56b-9700-11e7-9168-5254005c3644; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1504697344,1504751304,1504860546,1505142452; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1505142462; SEARCH_ID=1875185cf5904051845b74a20b82bebd',
    'Host': 'www.lagou.com',
    'Origin': 'https://www.lagou.com',
    'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
    # 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'X-Anit-Forge-Code': '0',
    'X-Anit-Forge-Token': 'None',
    'X-Requested-With': 'XMLHttpRequest',
}
Then I changed the first request to:
yield scrapy.FormRequest(url, formdata={'first': 'true', 'pn': '1', 'kd': 'python'}, method='POST',
                         meta={'pn': 1}, headers=self.headers, callback=self.parse)
and likewise changed the next-page request to:
yield scrapy.FormRequest(response.url, formdata={'first': 'False', 'pn': str(pn), 'kd': 'python'},
                         method='POST', meta={'pn': pn}, headers=self.headers, callback=self.parse)
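The two requests above differ only in the page number and the value of first, so the shared arguments can be built in one place. This reduces the chance of forgetting headers on one of them, which is what happened earlier. A minimal sketch follows; the helper name request_kwargs is hypothetical, and the 'true' / 'False' values simply mirror the ones used in this post.

```python
def request_kwargs(pn, headers, keyword='python'):
    """Keyword arguments shared by the first and the follow-up FormRequest."""
    return {
        'formdata': {
            # Mirrors the values used in the two requests above
            'first': 'true' if pn == 1 else 'False',
            'pn': str(pn),  # form values must be strings
            'kd': keyword,
        },
        'method': 'POST',
        'meta': {'pn': pn},
        'headers': headers,
    }


first_page = request_kwargs(1, {'Host': 'www.lagou.com'})
next_page = request_kwargs(2, {'Host': 'www.lagou.com'})
print(first_page['formdata'])
print(next_page['formdata'])
```

Inside the spider this would be used as scrapy.FormRequest(url, callback=self.parse, **request_kwargs(pn, self.headers)).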
Then I ran it, and it finally worked and scraped all 30 pages. I stepped into quite a few pitfalls along the way.