Lagou Crawler in Practice

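0 Introduction

A simple Python web-scraping exercise: given a target city and a target position, scrape the matching job listings from Lagou (Lagou's display mechanism caps the results at 30 pages, 450 records in total), clean the data, and finish with simple descriptive statistics and a regression analysis.
This exercise draws mainly on 闲庭信步's Lagou crawler code.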

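1 Analyzing the page

First, open Lagou in Chrome, set the city to Guangzhou, and search for product manager positions. This leads to the job listing page shown below:

Figure: Lagou job listing page

Next, inspect the page with the developer tools. The job data shown on the page actually comes from positionAjax.json. Its Headers panel provides the information the crawler needs:

Figure: Analyzing the page with the developer tools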

# Request URL
url = 'https://www.lagou.com/jobs/positionAjax.json?city=%E5%B9%BF%E5%B7%9E&needAddtionalResult=false'

# Request Headers
my_headers = {  
  'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36',  
  'Host':'www.lagou.com',  
  'Referer':'https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86?city=%E5%B9%BF%E5%B7%9E&cl=false&fromSearch=true&labelWords=&suginput=',  
  'X-Anit-Forge-Code':'0',  
  'X-Anit-Forge-Token': 'None',  
  'X-Requested-With':'XMLHttpRequest'  
}

# Form Data
my_data = {  
  'first': 'true',  
  'pn': 1,  
  'kd': '产品经理'  # search keyword: product manager
}

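With this information we can fetch the page with a POST request. To make the program more convenient, so that we do not have to copy and paste all of this every time, we should separate the parts that never change from the parts the program can fill in by itself. Ideally, entering a target city and a target position is all it takes to start scraping.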
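As a minimal sketch of that separation, the snippet below treats everything except the city and the position as fixed and fills in the rest with urllib.parse.quote. The helper name build_request_parts is hypothetical. It also encodes the position into the Referer path, which is an assumption about what the site expects and not something confirmed here.

```python
from urllib.parse import quote

BASE = 'https://www.lagou.com/jobs/'

def build_request_parts(city, position, page=1):
    """Hypothetical helper: derive the URL, Referer and form data from two inputs."""
    encode_city = quote(city)            # percent-encode the UTF-8 bytes
    encode_position = quote(position)
    url = BASE + 'positionAjax.json?city=' + encode_city + '&needAddtionalResult=false'
    # Assumption: the Referer path carries the encoded search keyword
    referer = (BASE + 'list_' + encode_position + '?city=' + encode_city
               + '&cl=false&fromSearch=true&labelWords=&suginput=')
    form_data = {'first': 'true', 'pn': page, 'kd': position}
    return url, referer, form_data

url, referer, form_data = build_request_parts('广州', '产品经理')
print(url)
print(referer)
```

The printed URL should match the Request URL copied from the developer tools above, which is a quick way to check the encoding step before sending anything.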

2 Scraping the page

2.1 Importing packages

Because this exercise also includes simple descriptive statistics and a regression analysis, it needs a fairly long list of imports:

# Data processing, import and export
import pandas as pd
# Crawler
import requests
import math
import time
import urllib.parse  # provides quote(), used to percent-encode the city name
# Data visualization
import seaborn as sns
import matplotlib.pyplot as plt
# Statistical modeling
import statsmodels.api as sm
# Word cloud
from wordcloud import WordCloud
from imageio import imread
import jieba

2.2 Building the crawler functions

Four functions are defined here:

get_json(url, num, encode_city, position): fetch JSON from the site with a POST request that carries the header information
get_page_num(count): compute the number of pages to fetch from the number of positions (30 at most)
get_page_info(jobs_list): parse the JSON holding one page of job data and return a list
lagou_spider(city, position): take the target city and position entered at the keyboard, call the other three functions, scrape the Lagou job listings, and export them to a CSV file

def get_json(url, num, encode_city, position):
    '''
        Fetch JSON from the site with a POST request that carries the header information
    '''
    # Request Headers
    # Note: the keyword in the Referer path is hard-coded (数据产品经理); only the city is filled in
    my_headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36',
        'Host':'www.lagou.com',
        'Referer':'https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86?city=' + encode_city + '&cl=false&fromSearch=true&labelWords=&suginput=',
        'X-Anit-Forge-Code':'0',
        'X-Anit-Forge-Token': 'None',
        'X-Requested-With':'XMLHttpRequest'
    }

    # Form Data
    my_data = {
        'first': 'true',
        'pn': num,
        'kd': position
    }

    # A timeout keeps a stalled connection from hanging the crawler
    res = requests.post(url, headers=my_headers, data=my_data, timeout=10)
    res.raise_for_status()
    res.encoding = 'utf-8'

    # Dictionary containing the job data
    page = res.json()
    return page

def get_page_num(count):
    '''
        Compute the number of pages to fetch
    '''
    # 15 positions per page, rounded up
    res = math.ceil(count / 15)
    # Lagou shows at most 30 pages of results
    if res > 30:
        return 30
    else:
        return res

def get_page_info(jobs_list):  
    '''
        Parse the job data of one page and return a list
    '''
    page_info_list = []  
    
    for i in jobs_list:  
        job_info = []  
        job_info.append(i['companyFullName'])  
        job_info.append(i['companyShortName'])  
        job_info.append(i['companySize']) 
        job_info.append(i['financeStage'])  
        job_info.append(i['district'])  
        job_info.append(i['industryField'])  
        job_info.append(i['positionName'])  
        job_info.append(i['jobNature']) 
        job_info.append(i['firstType']) 
        job_info.append(i['secondType']) 
        job_info.append(i['workYear'])  
        job_info.append(i['education'])  
        job_info.append(i['salary'])  
        job_info.append(i['positionAdvantage'])  
        page_info_list.append(job_info)  
    
    return page_info_list

def lagou_spider(city, position):
    '''
        Scrape the Lagou job listing data
    '''
    # Percent-encode the city name
    encode_city = urllib.parse.quote(city)

    # Request URL
    url = 'https://www.lagou.com/jobs/positionAjax.json?city=' + encode_city + '&needAddtionalResult=false'

    # Request page 1 first to get the total number of positions
    first_page = get_json(url, 1, encode_city, position)
    total_count = first_page['content']['positionResult']['totalCount']
    num = get_page_num(total_count)
    total_info = []
    time.sleep(20)
    print('Total positions: {}, pages: {}'.format(total_count, num))

    for n in range(1, num + 1):
        # Fetch the JSON for each page and extract its rows
        page = get_json(url, n, encode_city, position)
        jobs_list = page['content']['positionResult']['result']
        page_info = get_page_info(jobs_list)
        total_info += page_info
        print('Fetched page {}, positions so far: {}'.format(n, len(total_info)))
        # Pause after every request so the server does not block the crawler
        time.sleep(30)

    # Convert all rows to a DataFrame and export it.
    # Columns, in order: company full name, company short name, company size, financing stage,
    # district, industry, position name, job nature, first-level category, second-level category,
    # work experience, education requirement, salary, benefits
    df = pd.DataFrame(data = total_info,columns = ['公司全名', '公司簡(jiǎn)稱', '公司規(guī)模', '融資階段', '區(qū)域', '行業(yè)',
                                                   '職位名稱', '工作性質(zhì)', '一級(jí)類別', '二級(jí)類別', '工作經(jīng)驗(yàn)', '學(xué)歷要求', '工資','職位福利'])
    data_output = 'data\\' + city + '—' + position + '.csv'
    df.to_csv(data_output, index = False)

    print('Saved as a CSV file.')
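The parsing and paging logic can be tried offline, without sending a single request. The sketch below applies the same ideas to a hand-made dictionary shaped like the response used above (content, then positionResult, then result and totalCount). The sample values are invented for illustration, only three of the fourteen fields are shown, and FIELDS, parse_jobs and page_count are hypothetical names.

```python
import math

# Assumption: a shortened field list; the crawler above reads fourteen fields
FIELDS = ['companyShortName', 'positionName', 'salary']

def parse_jobs(jobs_list, fields=FIELDS):
    """Turn a list of job dicts into rows; a missing key becomes None."""
    return [[job.get(f) for f in fields] for job in jobs_list]

def page_count(total, per_page=15, max_pages=30):
    """Pages needed for `total` records, capped at max_pages."""
    return min(math.ceil(total / per_page), max_pages)

# Invented sample payload with the same nesting as the real response
sample = {'content': {'positionResult': {'totalCount': 16, 'result': [
    {'companyShortName': 'ExampleCo', 'positionName': 'PM', 'salary': '10k-15k'},
    {'companyShortName': 'DemoInc', 'positionName': 'PM'},
]}}}

result = sample['content']['positionResult']
rows = parse_jobs(result['result'])
pages = page_count(result['totalCount'])
print(rows, pages)
```

Using dict.get here is a design choice that differs from get_page_info: indexing with i['salary'] raises KeyError when a field is absent, while get leaves a None in the row that can be handled during cleaning.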

2.3 Running the crawler

With the crawler functions in place, we can run the crawler. Enter the work city and the target position at the keyboard and it starts:

# Enter the work city and target position
city = input("Work city:\n")
position = input("Target position:\n")

Work city:
广州
Target position:
产品经理

# Run the crawler
lagou_spider(city, position)

Total positions: 1176, pages: 30
Fetched page 1, positions so far: 15
Fetched page 2, positions so far: 30
Fetched page 3, positions so far: 45
Fetched page 4, positions so far: 60
Fetched page 5, positions so far: 75
Fetched page 6, positions so far: 90
Fetched page 7, positions so far: 105
Fetched page 8, positions so far: 120
Fetched page 9, positions so far: 135
Fetched page 10, positions so far: 150
Fetched page 11, positions so far: 165
Fetched page 12, positions so far: 180
Fetched page 13, positions so far: 195
Fetched page 14, positions so far: 210
Fetched page 15, positions so far: 225
Fetched page 16, positions so far: 240
Fetched page 17, positions so far: 255
Fetched page 18, positions so far: 270
Fetched page 19, positions so far: 285
Fetched page 20, positions so far: 300
Fetched page 21, positions so far: 315
Fetched page 22, positions so far: 330
Fetched page 23, positions so far: 345
Fetched page 24, positions so far: 360
Fetched page 25, positions so far: 375
Fetched page 26, positions so far: 390
Fetched page 27, positions so far: 405
Fetched page 28, positions so far: 420
Fetched page 29, positions so far: 435
Fetched page 30, positions so far: 450
Saved as a CSV file.
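The crawler pauses for a fixed interval after every request. One common variation is to add a random component so the requests are not evenly spaced. The sketch below only computes such a delay and does not sleep. The name next_delay and the 10-second jitter are assumptions, and whether jitter helps against blocking on this site is not something this exercise tested.

```python
import random

def next_delay(base=30.0, jitter=10.0, rng=random):
    """Return a pause in seconds, drawn uniformly from [base, base + jitter]."""
    return base + rng.uniform(0, jitter)

rng = random.Random(42)  # seeded only so the example is reproducible
delays = [next_delay(rng=rng) for _ in range(3)]
print([round(d, 1) for d in delays])
```

Inside the page loop, time.sleep(next_delay()) would take the place of the fixed pause.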

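To avoid being blocked, the crawler pauses after every request, so a full run is slow: with 30 pages, the 30-second pauses alone add up to about 15 minutes. Once the crawler finishes, the data has been saved as a CSV file, which we open with Pandas:

# Read the data (passing the path directly lets pandas open and close the file)
df = pd.read_csv('data\\' + city + '—' + position + '.csv', encoding = 'utf-8')

Look at the first 5 rows:

df.head()

Figure: First 5 rows of the scraped data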

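3 Data cleaning

First, drop the positions that are not full-time:

# Drop internship positions ('实习' means internship; the column is job nature)
df.drop(df[df['工作性質(zhì)'].str.contains('实习')].index, inplace=True)

Next, convert work experience and salary from strings to lists of numbers:

pattern = r'\d+'
# New column: the numbers found in the work experience text
df['工作年限'] = df['工作經(jīng)驗(yàn)'].str.findall(pattern)

avg_work_year = []
for i in df['工作年限']:
    # '不限' (no requirement) or '应届毕业生' (fresh graduate): nothing matches, so 0 years
    if len(i) == 0:
        avg_work_year.append(0)
    # A single number: use it as is
    elif len(i) == 1:
        avg_work_year.append(int(i[0]))
    # A range: take the mean
    else:
        num_list = [int(j) for j in i]
        avg_year = sum(num_list) / len(num_list)
        avg_work_year.append(avg_year)

# New column: required experience in years
df['經(jīng)驗(yàn)要求'] = avg_work_year

# Extract the numbers from the salary text, then take the point 25% of the way
# into the range, which is closer to what is actually offered
df['salary'] = df['工資'].str.findall(pattern)

avg_salary = []
for k in df['salary']:
    int_list = [int(n) for n in k]
    # int_list[-1] equals int_list[0] when the text holds a single number
    low, high = int_list[0], int_list[-1]
    avg_wage = low + (high - low) / 4
    avg_salary.append(avg_wage)

# New column: monthly salary in thousand yuan
df['月工資'] = avg_salary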
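The two conversions can be checked on a few strings before running them over the whole table. This is a minimal sketch that assumes experience looks like '3-5年' and salary like '10k-15k'; those sample formats, and the names parse_years and parse_salary, are assumptions for illustration.

```python
import re

def parse_years(text):
    """Mean of the numbers in an experience string; 0 when there are none."""
    nums = [int(n) for n in re.findall(r'\d+', text)]
    if not nums:
        return 0
    return sum(nums) / len(nums)

def parse_salary(text, position_in_range=0.25):
    """Point inside the salary range, 25% of the way up by default."""
    nums = [int(n) for n in re.findall(r'\d+', text)]
    if not nums:
        return None  # no number to work with
    low, high = nums[0], nums[-1]
    return low + (high - low) * position_in_range

print(parse_years('3-5年'))      # 4.0
print(parse_years('不限'))        # 0
print(parse_salary('10k-15k'))   # 11.25
```

For '10k-15k' the result is 10 + (15 - 10) / 4 = 11.25, in thousand yuan, which is the same arithmetic the loop above applies to every row.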

Save the cleaned data, then load it again:

# Save the cleaned data ('（已清洗）' in the file name means "cleaned")
df.to_csv('data\\' + city + '—' + position + '（已清洗）' + '.csv', index = False)

# Read the cleaned data
df2 = pd.read_csv('data\\' + city + '—' + position + '（已清洗）' + '.csv', encoding = 'utf-8')

4 Descriptive statistics

# Descriptive statistics of the monthly salary column (thousand yuan)
print('Salary summary:\n{}'.format(df2['月工資'].describe()))

Salary summary:
count    450.000000
mean      12.885556
std        4.888166
min        1.250000
25%        9.750000
50%       12.000000
75%       16.000000
max       31.250000
Name: 月工資, dtype: float64

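plt.rcParams['font.sans-serif'] = ['SimHei']  # Set the default font (SimHei can render Chinese)
plt.rcParams['axes.unicode_minus'] = False  # Keep the minus sign '-' from showing as a box in saved figures
sns.set(font='SimHei')  # Fix Chinese text rendering in Seaborn

# Plot the salary histogram and save it
# (sns.distplot is deprecated since seaborn 0.11; histplot plus rugplot replaces it)
sns.histplot(df2['月工資'], bins=15, kde=True)
sns.rugplot(df2['月工資'])
plt.title('Salary histogram')
plt.xlabel('Salary (thousand yuan)')
plt.ylabel('Count')
plt.savefig('output/histogram.jpg')  # forward slash avoids the invalid escape '\h'
plt.show()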

histogram.jpg

count = df2['區(qū)域'].value_counts()

# Plot a pie chart of positions by district and save it
fig = plt.figure()
ax = fig.add_subplot(111)
ax.pie(count, labels = count.keys(), labeldistance=1.4, autopct='%2.1f%%')
plt.axis('equal')  # Draw the pie as a circle
plt.legend(loc='upper left', bbox_to_anchor=(-0.1, 1))
plt.savefig('output/pie_chart.jpg')  # forward slash avoids the invalid escape '\p'
plt.show()

pie_chart.jpg

# Build the word cloud: join the benefit strings, separated by spaces so that
# the last word of one row does not fuse with the first word of the next
text = ' '.join(df2['職位福利'].dropna().astype(str))
# Use jieba to split the text into words
cut_text = ' '.join(jieba.cut(text))
color_mask = imread('img/jobs.jpg')  # Mask image that shapes the cloud
cloud = WordCloud(
    font_path = 'fonts/FZBYSK.ttf',
    background_color = 'white',
    mask = color_mask,
    max_words = 1000,
    max_font_size = 100
)

word_cloud = cloud.generate(cut_text)

# Save the word cloud image
word_cloud.to_file('output/word_cloud.jpg')
plt.imshow(word_cloud)
plt.axis('off')
plt.show()

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\gaiusyao\AppData\Local\Temp\jieba.cache
Loading model cost 0.796 seconds.
Prefix dict has been built succesfully.

word_cloud.jpg
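A word cloud sizes each word by how often it occurs, so the same information can be read as a plain frequency table. The sketch below counts terms with collections.Counter. It splits on punctuation and whitespace as a simple stand-in for jieba, which is not equivalent to real word segmentation, and the three benefit strings are invented examples.

```python
import re
from collections import Counter

# Invented sample rows in the style of the benefits column
benefits = ['五险一金,带薪年假', '弹性工作 五险一金', '带薪年假；五险一金']

tokens = []
for line in benefits:
    # Split on ASCII and full-width separators, dropping empty pieces
    tokens += [t for t in re.split(r'[,，;；、\s]+', line) if t]

freq = Counter(tokens)
print(freq.most_common(2))
```

A table like this is also easier to carry into later analysis than an image.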

5 Empirical analysis

# Treat positions with no education requirement ('不限') as requiring the lowest level, junior college ('大专')
df['學(xué)歷要求'] = df['學(xué)歷要求'].replace('不限', '大专')

# Education has three levels: 大专 (junior college), 本科 (bachelor's), 硕士 (master's); encode them as dummy variables
dummy_edu = pd.get_dummies(df['學(xué)歷要求'], prefix = '学历')

# Assemble the regression data: monthly salary, required experience, education dummies
df_with_dummy = pd.concat([df['月工資'], df['經(jīng)驗(yàn)要求'], dummy_edu], axis = 1)

# Fit the multiple regression model
y = df_with_dummy['月工資']
X = df_with_dummy[['經(jīng)驗(yàn)要求','学历_大专','学历_本科','学历_硕士']]
X = sm.add_constant(X)
model = sm.OLS(y, X.astype(float))
results = model.fit()
print('Regression parameters:\n{}\n'.format(results.params))
print('Regression results:\n{}'.format(results.summary()))

Regression parameters:
const        7.302655
經(jīng)驗(yàn)要求    1.192419
学历_大专      0.159035
学历_本科      2.069740
学历_硕士      5.073880
dtype: float64

Regression results:
                            OLS Regression Results
==============================================================================
Dep. Variable:                  月工資   R-squared:                       0.240
Model:                            OLS   Adj. R-squared:                  0.235
Method:                 Least Squares   F-statistic:                     46.89
Date:                Fri, 29 Jun 2018   Prob (F-statistic):           2.35e-26
Time:                        20:57:54   Log-Likelihood:                -1290.4
No. Observations:                 450   AIC:                             2589.
Df Residuals:                     446   BIC:                             2605.
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          7.3027      0.569     12.824      0.000       6.184       8.422
經(jīng)驗(yàn)要求      1.1924      0.116     10.321      0.000       0.965       1.419
学历_大专        0.1590      0.567      0.281      0.779      -0.954       1.272
学历_本科        2.0697      0.532      3.891      0.000       1.024       3.115
学历_硕士        5.0739      1.444      3.515      0.000       2.237       7.911
==============================================================================
Omnibus:                       97.185   Durbin-Watson:                   2.145
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              195.765
Skew:                           1.166   Prob(JB):                     3.09e-43
Kurtosis:                       5.236   Cond. No.                     1.23e+16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 4.61e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
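Warning [2] and the very large condition number are what one would expect from this design matrix: it holds a constant together with a dummy for every education level, so in each row the dummies add up to exactly the constant column (the dummy-variable trap). This also fits Df Model being 3 although four regressors were passed. The sketch below shows the effect on a few invented rows in plain Python, without pandas, along with the usual remedy of leaving one level out as the baseline. The helper make_row and the sample data are hypothetical.

```python
LEVELS = ['大专', '本科', '硕士']  # junior college, bachelor's, master's

def make_row(years, education, drop_baseline=None):
    """Return [const, years, dummies...] for one observation."""
    levels = [lv for lv in LEVELS if lv != drop_baseline]
    dummies = [1 if education == lv else 0 for lv in levels]
    return [1, years] + dummies

# Invented observations: (years of experience, education level)
data = [(1, '大专'), (3, '本科'), (5, '硕士'), (2, '本科')]

full = [make_row(y, e) for y, e in data]
# With all three dummies, they sum to the constant in every row
trap = all(sum(row[2:]) == row[0] for row in full)

reduced = [make_row(y, e, drop_baseline='大专') for y, e in data]
# With the baseline level dropped, that identity no longer holds
still_trapped = all(sum(row[2:]) == row[0] for row in reduced)

print(trap, still_trapped)
```

In pandas, get_dummies accepts drop_first=True for the same purpose. The remaining coefficients are then read as differences from the omitted level, so the numbers in the table above would not carry over unchanged.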

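6 Summary

This was a very simple scraping and data-analysis exercise. The final regression results are unsatisfying (R-squared is only 0.240), which has to do with product manager roles valuing work experience over formal education.
But the required years of experience alone do not truly reflect what companies want from a product manager. Their more precise and detailed requirements sit in the text of the job description and job requirements. In fact, not only for product managers but for the vast majority of positions, structured data alone cannot capture what companies really need from candidates. Structured data is only the part of the iceberg above the water; the text data below the surface is a much larger mine, waiting to be dug into.
To improve this exercise, beyond refining the crawler and the statistical analysis, text mining is an important direction. Scraping job data on a schedule, adding, removing and updating records in the dataset, and generating automated reports would also be an interesting experiment.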
