Building a crawler with Node.js and Python to scrape job listings from a website (Zhilian Zhaopin)

<small>
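I recently looked into web crawlers and found that Python and Node.js each have their strengths, so I decided to build a crawler that uses Python to fetch a page's source and the Node.js cheerio module to extract data from that source. Since I plan to change jobs next year, I chose to crawl the Zhilian Zhaopin (智聯招聘) recruitment site.
Code: https://github.com/duan602728596/ZhiLianUrllib</small>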

1. Making an HTTP request with Python

# coding: utf-8
# http.py (Python 2: relies on urllib2 and types.StringType)

import sys
import types
import urllib
import urllib2

# Read the command-line arguments
# @param argv[0]{string}: script name
# @param argv[1]{string}: request method, get or post
# @param argv[2]{string}: request URL
# @param argv[3]{string}: request data
argv = {
    'filename': sys.argv[0],
    'method': sys.argv[1],
    'url': sys.argv[2],
    'data': sys.argv[3],
}


class Http:
    # Initialize the request
    def __init__(self, method, url, data = ''):
        self.method = method            # request method
        self.url = url                  # request URL
        self.data = self.getData(data)  # request data
        # Request headers
        self.header = {
            'Accept-Encoding': 'deflate',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
            'cache-control': 'no-cache',
        }
    # Normalize the request data to a string
    def getData(self, data):
        if type(data) is types.StringType:
            gd = data
        elif type(data) is types.DictionaryType:
            gd = urllib.urlencode(data)
        else:
            gd = ''
        return gd
    # get
    def get(self):
        if self.data == '':
            u = self.url
        else:
            u = self.url + '?' + self.data
        request = urllib2.Request(u)
        response = urllib2.urlopen(request)
        return response.read()
    # post
    def post(self):
        request = urllib2.Request(self.url, self.data, self.header)
        response = urllib2.urlopen(request)
        return response.read()
    # init
    def init(self):
        if self.method == 'get':
            self.result = self.get()
        elif self.method == 'post':
            self.result = self.post()
        else:
            self.result = ''

# Create the request and run it
http = Http(argv['method'], argv['url'], argv['data'])
http.init()
text = http.result

# Print the response
print(text)

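The script uses the sys library to read the command-line arguments, the types library to check data types, and the urllib and urllib2 libraries to fetch the page. The arguments are the request method, the request URL, and the request data. After initialization, the script performs a GET or POST request depending on the method passed in, then prints the result, which is how the result gets back to the Node.js program.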

2. Communicating between Node.js and Python

/**
 * pyhttp.js
 *
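 * Communicates with the Python script to make a request
 * @param info{object}: configuration for calling the Python script
 * @param callback{function}: called when the request completes, with the returned data as its argument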
 */

const childProcess = require('child_process');


function pyhttp(info, callback){
    /* Send the request */
    return new Promise((resolve, reject)=>{
        // cmd
        const cps = childProcess.spawn('python', [
            // args
            info.file,
            info.method,
            info.url,
            info.data
        ]);
        // Accumulated output text
        let txt = '';

        // Errors
        cps.stderr.on('data', function(data){
            reject(data);
        });

        // Collect data
        cps.stdout.on('data', function(data){
            txt += data;
        });

        // All data received: 'close' fires only after the stdio streams have ended
        cps.on('close', function(code){
            resolve(txt);
        });

    }).then(callback).catch((error)=>{
        console.log(error);
    });
}

module.exports = pyhttp;

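To run another script from Node.js and get its output back, use the child_process module: child_process.spawn(command, [args], [options]), where command is the command and args are its arguments. I hit a small pitfall here. I first used child_process.exec(command, [options], callback), but exec buffers the output and caps its size (the maxBuffer option), and the site's page source was large enough to trigger an error. Either switching to child_process.spawn(command, [args], [options]) or raising that limit solves the problem. pyhttp.js takes two arguments: the first is the configuration for running the Python script, and the second is a callback that receives the script's output.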
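
As a rough illustration of the second fix mentioned above (raising the output limit instead of switching to spawn), the sketch below uses child_process.execFile with its maxBuffer option. The helper name runWithLargeOutput, the 50 MiB limit, and the demo child script are assumptions made for this example and are not part of the original project.

```javascript
const { execFile } = require('child_process');

// Hypothetical helper: run a command and resolve with its full stdout,
// allowing up to `maxBuffer` bytes of output before Node reports an error.
function runWithLargeOutput(command, args, maxBuffer = 50 * 1024 * 1024) {
    return new Promise((resolve, reject) => {
        execFile(command, args, { maxBuffer }, (error, stdout) => {
            if (error) {
                reject(error);
            } else {
                resolve(stdout);
            }
        });
    });
}

// Demo: a child Node process that prints 2 MiB of text, standing in for
// a Python script that prints a large page source.
const script = "process.stdout.write('x'.repeat(2 * 1024 * 1024))";
runWithLargeOutput(process.execPath, ['-e', script]).then((out) => {
    console.log(out.length);
});
```

With spawn, as used in pyhttp.js, no such limit applies because the output is consumed chunk by chunk through the stdout stream.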

3. Processing the page source


/**
 * deal.js
 *
 * Processes the data
 * @param dealText{string}: the fetched page source
 * @param ishref{boolean}: whether to extract the next-page URL; defaults to false (not extracted)
 */

const cheerio = require('cheerio');


/* Extract the text after the full-width colon */
const mhtext = text => text.replace(/.+：/, '');

function each($, ishref = false){
    const a = [];
    // Get the tables
    const $table = $('#newlist_list_content_table').children('table');
    for(let i = 0, j = $table.length; i < j; i++){
        const $this = $table.eq(i);
        const $tr = $this.children('tr'),
            $tr0 = $tr.eq(0),
            $tr1 = $tr.eq(1);
        const $span =  $tr1.children('td').children('div').children('div').children('ul').children('li').children('span');

        if($this.children('tr').children('th').length <= 0){
            a.push({
                // Job title
                'zwzp': $tr0.children('.zwmc').children('div').children('a').html(),
                // Job posting URL
                'zpdz': $tr0.children('.zwmc').children('div').children('a').prop('href'),
                // Response rate
                'fklv': $tr0.children('.fk_lv').children('span').html(),
                // Company name
                'gsmc': $tr0.children('.gsmc').children('a').html(),
                // Work location
                'gzdd': $tr0.children('.gzdd').html(),
                // Company page URL
                'zldz': $tr0.children('.gsmc').children('a').prop('href'),
                // Company type
                'gsxz': mhtext($span.eq(1).html()),
                // Company size
                'gsgm': mhtext($span.eq(2).html())
            });
        }
    }

    const r = {};
    r['list'] = a;
    if(ishref != false){
        r['href'] = $('.pagesDown').children('ul').children('li').children('a').eq(2).prop('href').replace(/&p=\d+/, '');
    }
    return r;
}

function deal(dealText, ishref = false){
    const $ = cheerio.load(dealText, {
        decodeEntities: false
    });


    return each($, ishref);
}

module.exports = deal;

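deal.js uses the cheerio module to process the fetched source. The dealText parameter is the page source, and ishref controls whether the pagination URL is extracted as well.
Note one thing to watch for when selecting elements with cheerio: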

const cheerio = require('cheerio');
const html = `<div id="demo">
                <ul>
                  <li>1</li>
                  <li>2</li>
                  <li>3</li>
                </ul> 
              </div>`;
const $ = cheerio.load(html);
/* Get the li elements */
$('#demo').children('li');                // does not match the li elements
$('#demo').children('ul').children('li'); // matches the li elements

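cheerio's syntax mirrors jQuery's, and as in jQuery, .children() only matches direct children, so when you use it you have to walk down one level at a time and cannot skip a level. (To match descendants at any depth, .find('li') works in both libraries.)
Data clean-up: for the company type and company size, the full-width colon (：) and the label text before it are removed; for the next-page URL, the &p= paging parameter is removed.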
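
The two clean-up rules described above can be tried in isolation. The sketch below is a minimal illustration; the function names afterColon and stripPageParam, the sample label text, and the example.com URL are made up for this example. It uses \d+ so that multi-digit page numbers are removed in full, whereas a pattern of &p=\d would leave the trailing digits of a page number such as 12 behind.

```javascript
// Drop everything up to and including the last full-width colon.
const afterColon = (text) => text.replace(/.+：/, '');

// Remove the paging parameter, whatever its number of digits.
const stripPageParam = (href) => href.replace(/&p=\d+/, '');

console.log(afterColon('公司性质：民营'));
// 民营
console.log(stripPageParam('http://example.com/search?kw=web&p=12&et=2'));
// http://example.com/search?kw=web&et=2
```

Once the parameter is stripped, a fresh &p=N can be appended to the base URL for each page to request.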

4. Wiring Node.js and Python together in app.js

/* app.js */
const fs = require('fs');
const pyhttp = require('./pyhttp');
const deal = require('./deal');
const _result = {};

/**
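 * Request URL and parameters
 *
 * jl: location
 * kw: job keyword
 * sf: salary range lower bound
 * st: salary range upper bound
 * el: education
 * et: job type
 * pd: posting date
 * p:  page number
 * ct: company type
 * sb: relevance
 * we: work experience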
 *
 */

const info = (url, method = 'get', data = '')=>{
    return {
        // Python script
        file: 'http.py',
        // Request method
        method: method,
        // Request URL
        url: url,
        // Request data
        data: data
    }
};

const page = 4; // Number of additional pages to request

// Callback
const callback = (text)=>{
    return new Promise((resolve, reject)=>{
        resolve(text);
    });
};

pyhttp(info(encodeURI('http://sou.zhaopin.com/jobs/searchresult.ashx?' +
                       'jl=北京&kw=web前端&sm=0&sf=10001&st=15000&el=4&we=0103&isfilter=1&p=1&et=2')), function(text){

    const p0 = deal(text, true);
    _result.list = p0.list;

    const n = [];
    for(let i = 0; i < page; i++){
        n.push(pyhttp(info(`${p0.href}&p=${i + 2}`), callback));
    }

    Promise.all(n).then((result)=>{
        for(let i in result){
            _result.list = _result.list.concat(deal(result[i]).list);
        }
    }).then(()=>{
        fs.writeFile('./result/result.js', `window._result = ${JSON.stringify(_result, null, 4)};`, (error)=>{
            if(error){
                console.log(error);
            }else{
                console.log('Data written successfully!');
            }
        });
    });
});

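After requiring pyhttp.js and deal.js, the script first requests Zhilian's search page. The callback processes the returned source, adds the first page's data to the array, and extracts the pagination URL. Promise.all then requests pages 2 through n in parallel, and its callback processes that data and appends it to the array. Finally, the data is written to result.js (a .js file rather than .json, so that the data is easy to display in an HTML page).
The data obtained: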

1.jpg
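
The parallel-request step can be sketched without Python or the network. In the example below, fakeFetch is a hypothetical stand-in for pyhttp() that resolves with a made-up string after a short delay; fetchPages and the example.com URL are likewise assumptions for illustration only.

```javascript
// Hypothetical stand-in for pyhttp(): resolves with fake page source.
const fakeFetch = (url) => new Promise((resolve) => {
    setTimeout(() => resolve(`source of ${url}`), 10);
});

// Request pages 2 .. (extraPages + 1) in parallel.
// Promise.all resolves with the results in the same order as the requests.
function fetchPages(baseHref, extraPages) {
    const requests = [];
    for (let i = 0; i < extraPages; i++) {
        requests.push(fakeFetch(`${baseHref}&p=${i + 2}`));
    }
    return Promise.all(requests);
}

fetchPages('http://example.com/search?kw=web', 4).then((pages) => {
    pages.forEach((page) => console.log(page));
});
```

Every element pushed into the array must be a promise for a page's source, because each result is later passed to deal(). Note also that Promise.all rejects as soon as any single request rejects.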

5. Displaying the data on a page

/* Render a single row */
const Group = React.createClass({
    // Strip a tags
    dela: str => str.replace(/<a.*>.*<\/a>/g, ''),
    // Strip leftover single-letter tags such as <b>
    delb: str => str.replace(/<\/?[^<>]>/g, ''),
    render: function(){
        return (<tr key={this.props.key}>
            <td><a href={this.props.obj.zpdz} target='_blank'>{this.delb(this.props.obj.zwzp)}</a></td>
            <td>{this.props.obj.fklv}</td>
            <td>{this.dela(this.props.obj.gsmc)}</td>
            <td>{this.props.obj.gzdd}</td>
            <td><a href={this.props.obj.zldz} target='_blank'>{decodeURI(this.props.obj.zldz)}</a></td>
            <td>{this.props.obj.gsxz}</td>
            <td>{this.props.obj.gsgm}</td>
        </tr>);
    }
});

/* Table component */
const Table = React.createClass({
    // Render the rows
    group: function(){
        return window._result.list.map((object, index)=>{
            return (<Group key={index} obj={object} />);
        });
    },
    render: function(){
        return (
            <table className='table table-bordered table-hover table-condensed table-striped'>
                <thead>
                    <tr className='info'>
                        <th className='td0'>Position</th>
                        <th className='td1'>Response rate</th>
                        <th className='td2'>Company</th>
                        <th className='td3'>Location</th>
                        <th className='td4'>Zhilian URL</th>
                        <th className='td5'>Company type</th>
                        <th className='td6'>Company size</th>
                    </tr>
                </thead>
                <tbody>{this.group()}</tbody>
            </table>
        );
    }
});

ReactDOM.render(
    <Table />,
    document.getElementById('result')
);

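The page displays the data with React and Bootstrap. While rendering, I found that the company name contains a useless a tag and the job title contains b tags, so regular expressions strip them out.
The resulting page: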

2.jpg
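
To see what the two regular expressions do outside React, they can be run on sample strings. This is a minimal sketch: the names stripAnchors and stripShortTags and the sample inputs are invented for the example, while the patterns are the ones used in dela and delb.

```javascript
// Remove an <a ...>...</a> element together with its contents.
const stripAnchors = (str) => str.replace(/<a.*>.*<\/a>/g, '');

// Remove opening and closing tags whose name is a single character, e.g. <b> and </b>.
const stripShortTags = (str) => str.replace(/<\/?[^<>]>/g, '');

console.log(stripAnchors('Example Co.<a href="#">badge</a>'));
// Example Co.
console.log(stripShortTags('<b>web</b>前端'));
// web前端
```

Both patterns are deliberately narrow. The first is greedy, so it removes everything from the first <a to the last </a> in the string, and the second leaves tags with longer names, such as <span>, untouched.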
