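I've recently been tinkering with an open-source Jianshu clone, building everything from the front end to the back end in one go. It needs data, and mocking that data by hand has been tedious. Much of Jianshu's data is rendered on the server, so it is hard to grab quickly from API requests. Being lazy, I decided to write a simple crawler instead.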

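Project setup

Install Node.js from the official site or the Chinese site, picking the build for your system. I'll skip the details and assume Node.js is already installed.

Pick an editor you are comfortable with for writing the code. I recommend the latest WebStorm.

Create a project in WebStorm and give it a name you like. The project needs a package.json file. WebStorm has a shortcut for creating one, but let's use the command line: open Terminal, which starts in the project root by default, run npm init, and accept the defaults all the way through.

See also: npm tips you should know.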

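Main tech stack

superagent: downloads page data
cheerio: parses page data

These are two npm packages. Start the install now and keep reading, since the download takes a while.

npm install superagent cheerio --save

Next, a brief look at what these two packages are.

superagent: downloading page data

superagent is a very convenient HTTP client module for Node.js. It is a lightweight, progressive ajax API that is readable and easy to learn. Under Node.js it is built on top of the native request API.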

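Request methods

get (default)
post
put
delete
head

Syntax: request(RequestType, RequestUrl).end(callback(err, res));

Usage (request here is the superagent module, for example const request = require('superagent')):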

request
    .get('/login')
    .end(function(err, res){
        // code
    });

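Setting Content-Type

application/json (default)
form
json
png
xml
...

How to set it:

1. 
request
    .get('/login')
    .set('Content-Type', 'application/json');
2. 
request
    .get('/login')
    .type('application/json');
3. 
request
    .get('/login')
    .accept('application/json');

The first two have the same effect: both set the request's Content-Type header. The third is different: accept() sets the Accept header, which tells the server what response type the client wants.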

Setting parameters

query
send

query

Sets the query-string parameters; accepts either an object or a string.

Object form {key: value}

You can pass several key/value pairs.


request
    .get('/login')
    .query({
        username: 'jiayi',
        password: '123456'
    });

String form key=value

You can pass several key=value pairs, separated by &.


request
    .get('/login')
    .query('username=jiayi&password=123456');
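
As a rough illustration of why the object form and the string form are interchangeable, the sketch below serializes the same pairs with Node's built-in URLSearchParams. It is only a stand-in to show the encoding; that superagent produces the same string for simple values like these is an assumption, since it does its own serialization internally.

```javascript
// Illustration only: URLSearchParams is a Node built-in, not part of superagent.
const fromObject = new URLSearchParams({
    username: 'jiayi',
    password: '123456'
}).toString();

// The hand-written string form used in the example above.
const fromString = 'username=jiayi&password=123456';

console.log(fromObject);                // username=jiayi&password=123456
console.log(fromObject === fromString); // true
```

The object form is usually easier to maintain, because special characters in values are percent-encoded for you.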

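send

Sets the request body; accepts either an object or a string. Because a body normally goes with a POST request, the examples below use post.

Object form {key: value}

You can pass several key/value pairs.


request
    .post('/login')
    .send({
        username: 'jiayi',
        password: '123456'
    });

String form key=value

You can pass several key=value pairs, separated by &.


request
    .post('/login')
    .send('username=jiayi&password=123456');

query and send can be used together:


request
    .post('/login')
    .query({
        id: '100'
    })
    .send({
        username: 'jiayi',
        password: '123456'
    });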

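Response properties

Response text

res.text holds the unparsed response body as a string. On Node.js it is only provided by default when the MIME type matches text/*, */json, or x-www-form-urlencoded. This saves memory, because buffering large bodies such as files or images would hurt performance.

Response header fields

res.header holds the parsed response headers, with field names lowercased the way Node does it, for example res.header['content-length'].

Response Content-Type

The Content-Type response header is a special case: superagent exposes it as res.type. res.charset is empty by default and is filled in automatically when a charset is present. For example, for a Content-Type of text/html; charset=utf8, res.type is text/html and res.charset is utf8.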
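
To make the split between res.type and res.charset concrete, here is a minimal sketch of how such a header value can be taken apart. parseContentType is a hypothetical helper written for this illustration; it is not superagent's implementation and ignores edge cases such as quoted parameter values.

```javascript
// Hypothetical helper: splits a Content-Type value into media type and charset.
function parseContentType(header) {
    const [type, ...params] = header.split(';').map(part => part.trim());
    const result = { type: type.toLowerCase(), charset: '' };
    for (const param of params) {
        const [key, value] = param.split('=');
        if (key && key.trim().toLowerCase() === 'charset' && value) {
            result.charset = value.trim();
        }
    }
    return result;
}

const parsed = parseContentType('text/html; charset=utf8');
console.log(parsed.type);    // text/html
console.log(parsed.charset); // utf8
```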

Response status

res.status holds the HTTP status code, as defined by the HTTP specification.

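cheerio: parsing page data

cheerio is a Node.js library that can be thought of as jQuery for Node.js. It extracts data from web pages using CSS selectors and is used in much the same way as jQuery.

Familiar syntax: Cheerio implements a subset of core jQuery. It removes all the DOM inconsistencies and browser cruft from the jQuery library, revealing its truly elegant API.
Blazingly fast: Cheerio works with a very simple, consistent DOM model, so parsing, manipulating, and rendering are very efficient. Basic end-to-end benchmarks show Cheerio to be about 8x faster than JSDOM.
Very flexible: Cheerio wraps a forgiving htmlparser and can parse nearly any HTML or XML document.

You first load the HTML document; after that you can work with the page just as you would with jQuery.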

const cheerio = require('cheerio');
const $ = cheerio.load('<ul id="fruits">...</ul>');
$('#fruits').addClass('newClass');

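Almost all selectors work the same as in jQuery, so I won't list them here. See the official site for details.

That covers the basics of the tools we'll use. Our usage is fairly simple, so let's start writing code and scraping data.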

Scraping the 20 articles in the home page list

Create an app.js file in the project root.

Implementation steps

Import dependencies
Define the URL
Send the request
Parse the page
Analyze the page data
Generate the data

1. Import dependencies:

const superagent = require('superagent');
const cheerio = require('cheerio');

2. Define the URL

const reptileUrl = "http://www.lxweimin.com/";

3. Send the request

superagent.get(reptileUrl).end(function (err, res) {
    // Bail out on error
    if (err) {
        throw err;
    }
    // code to come
});

This sends a request to the Jianshu home page. As long as no error is thrown (that is, the if branch is not taken), we can carry on.

4. Parse the page

superagent.get(reptileUrl).end(function (err, res) {
    // Bail out on error
    if (err) {
        throw err;
    }
    /**
     * res.text holds the unparsed response body.
     * cheerio's load method parses the whole document, i.e. the entire HTML page;
     * you can inspect it in the console with console.log($.html());
     */
    let $ = cheerio.load(res.text);
});

The comment explains that line, so I won't repeat it. What comes next is harder.

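5. Analyze the page data

Open the Jianshu site in a browser. Jianshu renders the initially visible data on the server; later data is fetched via ajax and filled in by JS. When scraping, we can generally only get the server-rendered part, not what JS renders. For ajax data you can scrape the API endpoint directly, but that is a topic for another day.

Back to the point. The home page article list loads 20 items by default, which is enough for me. Each refresh picks up any updates, with the newest always on top.

These 20 items live in a ul with the class .note-list, and each item is an li. The ul's parent has the id list-container. Anyone who has learned HTML knows that ids are unique, so to be safe I search downward from the id.

$('#list-container .note-list li')

That is cheerio fetching all the article-list li elements we need, just like writing jQuery. To get at the contents of each li we iterate with .each(function(i, elem) {}), again just like jQuery.

$('#list-container .note-list li').each(function(i, elem) {
   // We now have everything under the current li; time to get to work
});

Everything so far is fairly simple. The complicated part is the data structure below. We need to decide how to assemble the data; I looked over the page and, based on experience, came up with a structure that seems reasonable.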

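{
    id: article id
    slug: id used in the article URL (an obfuscated id)
    title: title
    abstract: summary
    thumbnails: thumbnail (the first image if the article has one; otherwise the field is absent)
    collection_tag: collection tag
    reads_count: read count
    comments_count: comment count
    likes_count: like count
    author: {    author info
        id: not found
        slug: id used in the user URL (an obfuscated id)
        avatar: user avatar
        nickname: user nickname (the one entered at sign-up)
        sharedTime: publish date
    }
}

With the basic structure in place, first define an array named data to hold the assembled items for later use.

Here is one article entry picked at random: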

<li id="note-12732916" data-note-id="12732916" class="have-img">
    <a class="wrap-img" href="/p/b0ea2ac2d5c4" target="_blank">
      <img src="//upload-images.jianshu.io/upload_images/1996705-7e00331b8f3dbc5d.jpg?imageMogr2/auto-orient/strip|imageView2/1/w/375/h/300">
    </a>
  <div class="content">
    <div class="author">
      <a class="avatar" target="_blank" href="/u/652fbdd1e7b3">
        <img src="//upload.jianshu.io/users/upload_avatars/1996705/738ba2908445?imageMogr2/auto-orient/strip|imageView2/1/w/96/h/96">
</a>      <div class="name">
        <a class="blue-link" target="_blank" href="/u/652fbdd1e7b3">xxx</a>
        <span class="time" data-shared-at="2017-05-24T08:05:12+08:00"></span>
      </div>
    </div>
    <a class="title" target="_blank" href="/p/b0ea2ac2d5c4">xxxxxxx</a>
    <p class="abstract">
     xxxxxxxxx...
    </p>
    <div class="meta">
        <a class="collection-tag" target="_blank" href="/c/8c92f845cd4d">xxxx</a>
      <a target="_blank" href="/p/b0ea2ac2d5c4">
        <i class="iconfont ic-list-read"></i> 414
</a>        <a target="_blank" href="/p/b0ea2ac2d5c4#comments">
          <i class="iconfont ic-list-comments"></i> 2
</a>      <span><i class="iconfont ic-list-like"></i> 16</span>
        <span><i class="iconfont ic-list-money"></i> 1</span>
    </div>
  </div>
</li>

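Now we compare the structure we defined against the actual page DOM, field by field, to get the data we want.

id: article id

The li has a data-note-id="12732916" attribute, which is the article id.
To get it: $(elem).attr('data-note-id'), and that's it.

slug: id used in the article URL (an obfuscated id)

Clicking an article title or its thumbnail opens a new page at a URL like http://www.lxweimin.com/p/xxxxxx. The title is an a link whose href attribute contains /p/xxxxxx, where /p/ marks an article detail page and xxxxxx identifies the article. Our slug is that xxxxxx, so the href needs a little processing: $(elem).find('.title').attr('href').replace(/\/p\//, "") yields xxxxxx.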
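
The slug extraction can be tried on its own with the href values from the sample entry above. The first line mirrors the replace call from the text; slugFromHref is a hypothetical, slightly more defensive variant that also copes with a missing href or a trailing #comments fragment.

```javascript
// Same approach as in the text: strip the "/p/" prefix.
const simpleSlug = '/p/b0ea2ac2d5c4'.replace(/\/p\//, '');

// Hypothetical helper: prefix is assumed to be a plain letter such as 'p' or 'u'.
function slugFromHref(href, prefix) {
    const match = new RegExp('^/' + prefix + '/([^/?#]+)').exec(href || '');
    return match ? match[1] : undefined;
}

const articleSlug = slugFromHref('/p/b0ea2ac2d5c4#comments', 'p');
const authorSlug = slugFromHref('/u/652fbdd1e7b3', 'u');
const missingSlug = slugFromHref(undefined, 'p');

console.log(simpleSlug);  // b0ea2ac2d5c4
console.log(articleSlug); // b0ea2ac2d5c4
console.log(authorSlug);  // 652fbdd1e7b3
console.log(missingSlug); // undefined
```

The defensive variant matters because attr('href') returns undefined when the selector matches nothing, and calling replace on undefined throws.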

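title: title

This one is easy: $(elem).find('.title').text().

abstract: summary

Also easy: $(elem).find('.abstract').text().

thumbnails: thumbnail (the first image if the article has one; otherwise the field is absent)

It lives in the img inside the .wrap-img a tag, which is not rendered when the article has no image. $(elem).find('.wrap-img img').attr('src') returns undefined when nothing matches, which is exactly what I want.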
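
The reason undefined works well here is that JSON.stringify omits object properties whose value is undefined, so the field simply does not appear in the output. A small sketch with made-up values:

```javascript
// Made-up item for illustration; thumbnails is undefined as it would be for an article without an image.
const item = {
    title: 'xxxxxxx',
    thumbnails: undefined
};

const json = JSON.stringify(item);
console.log(json); // {"title":"xxxxxxx"}
```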

The next four are all inside the .meta div (I skip the reward count because I don't need it).

collection_tag: collection tag

It has its own class: $(elem).find('.collection-tag').text()

reads_count: read count

This one is more awkward. Its structure looks like this:

<a target="_blank" href="/p/b0ea2ac2d5c4">
        <i class="iconfont ic-list-read"></i> 414
</a>

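Luckily there is an icon-font class to hook onto, otherwise this would be no fun. To get the value, use $(elem).find('.ic-list-read').parent().text(): find the icon's i tag, go up to its parent a tag, and read its text. The tags are dropped and only the number (plus surrounding whitespace) remains.

The next two are handled the same way.

comments_count: comment count

$(elem).find('.ic-list-comments').parent().text()

likes_count: like count

$(elem).find('.ic-list-like').parent().text()

Next comes the author info, all inside the .author div.

id: not found

slug: id used in the user URL (an obfuscated id)

Handled like the article slug: $(elem).find('.avatar').attr('href').replace(/\/u\//, ""). The only difference is that p becomes u.

avatar: user avatar

$(elem).find('.avatar img').attr('src')

nickname: user nickname (the one entered at sign-up)

The nickname is in the element with the class .blue-link: $(elem).find('.blue-link').text()

sharedTime: publish date

On the page the publish date is shown as a relative time such as "xx hours ago", and scraping that directly would be a pain. The .time span carries data-shared-at="2017-05-24T08:05:12+08:00", which is the real timestamp. You'll notice the span starts out empty; JS formats it afterwards. $(elem).find('.time').attr('data-shared-at')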
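
The data-shared-at value is an ISO 8601 timestamp with a UTC offset, so it can be handed straight to JavaScript's Date if you later need to sort or reformat it. A short sketch using the sample value from the page (the crawler itself just stores the string):

```javascript
// Sample value copied from the data-shared-at attribute shown above.
const sharedAt = '2017-05-24T08:05:12+08:00';

const date = new Date(sharedAt);
const utc = date.toISOString();

console.log(utc); // 2017-05-24T00:05:12.000Z (08:05 at UTC+8 is 00:05 UTC)
console.log(Number.isNaN(date.getTime())); // false, so the string parsed correctly
```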

Those are the sources for all the fields. Now for an annoyance: what text() returns contains newlines (\n) and other whitespace (\s), so we need a helper to strip them.

function replaceText(text){
    return text.replace(/\n/g, "").replace(/\s/g, "");
}

The assembled code:

let data = [];
// As with jQuery: select the elements, iterate, assemble the data we need, and push it into the array
$('#list-container .note-list li').each(function(i, elem) {
    let _this = $(elem);
    data.push({
       id: _this.attr('data-note-id'),
       slug: _this.find('.title').attr('href').replace(/\/p\//, ""),
       author: {
           slug: _this.find('.avatar').attr('href').replace(/\/u\//, ""),
           avatar: _this.find('.avatar img').attr('src'),
           nickname: replaceText(_this.find('.blue-link').text()),
           sharedTime: _this.find('.time').attr('data-shared-at')
       },
       title: replaceText(_this.find('.title').text()),
       abstract: replaceText(_this.find('.abstract').text()),
       thumbnails: _this.find('.wrap-img img').attr('src'),
       collection_tag: replaceText(_this.find('.collection-tag').text()),
       reads_count: replaceText(_this.find('.ic-list-read').parent().text()) * 1,
       comments_count: replaceText(_this.find('.ic-list-comments').parent().text()) * 1,
       likes_count: replaceText(_this.find('.ic-list-like').parent().text()) * 1
   });
});

let _this = $(elem); stores $(elem) in a variable first, a habit from writing jQuery.

The several * 1 convert numeric strings to numbers, a small JS trick.
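
To see the whitespace stripping and the numeric conversion together, here is a small sketch using a string shaped like the text of the read-count link. Number() is used instead of * 1; for these inputs the two behave the same. Note that \s already matches \n, so a single replace is enough.

```javascript
// Same idea as the replaceText helper above; \s covers spaces, tabs, and newlines.
function replaceText(text) {
    return text.replace(/\s/g, '');
}

// Text as it might come back from .parent().text(): the number padded with whitespace.
const rawCount = '\n        414\n';
const count = Number(replaceText(rawCount));

// An element that was not found yields an empty string, which converts to 0.
const emptyCount = Number(replaceText(''));

console.log(count);      // 414
console.log(emptyCount); // 0
```

One consequence worth knowing: a counter that is missing from the page ends up as 0 rather than as an error.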

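6. Generate the data

The data is now available in the data array: 20 items, exactly what we wanted. But it only lives in memory inside the Node process, where we can't use it. There are two options: store it in a database (I haven't got that far, and haven't even thought about the table design yet) or store it in a local JSON file.

A local JSON file it is. Node.js runs on the server side, which means it can access the local disk to create and read files. For that we need Node's built-in fs module.

const fs = require('fs');

I won't go over its other uses, only the one that creates a file and writes content into it.

The method for writing a file:

fs.writeFile(filename, data, [options], callback);
/**
 * filename   required, the file name
 * data       the data to write, a string or a Buffer
 * [options]  flag defaults to 'w', mode (permissions) defaults to 0o666, encoding defaults to 'utf8'
 * callback   receives only an error argument (err), which is set when the write fails
 */

Here is how we use it:

// Write the data; the file is created if it does not exist
fs.writeFile(__dirname + '/data/article.json', JSON.stringify({
    status: 0,
    data: data
}), function (err) {
    if (err) throw err;
    console.log('Write complete');
});

Notes

To keep the data organized I put it in a data folder. If you do the same, be sure to create the data folder in the project root first, otherwise the write will fail.
The default encoding is utf-8.
When writing a JSON file you must run the object through JSON.stringify(), otherwise you get [object Object].
A bare file name such as article.json is created in the current working directory (the project root when you run node from there). To put the file in a folder such as data, be sure to write __dirname + '/data/article.json'. Never write '/data/article.json': that is an absolute path from the root of the drive your project is on, for example G:/data/article.json, so the folder is not found and an error is thrown.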
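
If you would rather not create the data folder by hand, the folder can be created from code before writing. The sketch below is one way to do it, assuming Node.js 10.12 or later for the recursive option. It uses the synchronous fs calls for brevity, and a temporary directory stands in for the project root (__dirname in app.js) so the example can run anywhere.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Stand-in for __dirname so the sketch does not touch a real project.
const projectRoot = fs.mkdtempSync(path.join(os.tmpdir(), 'crawler-'));

// Create the data folder; with recursive: true this does not fail if it already exists.
const dataDir = path.join(projectRoot, 'data');
fs.mkdirSync(dataDir, { recursive: true });

// path.join builds the file path with the right separator for the platform.
const outFile = path.join(dataDir, 'article.json');
fs.writeFileSync(outFile, JSON.stringify({ status: 0, data: [] }));

// Read it back to confirm the round trip.
const saved = JSON.parse(fs.readFileSync(outFile, 'utf8'));
console.log(saved.status); // 0
```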

That completes scraping one list page. Here is the full code:

/**
 * Dependencies
 * @type {*}
 */
const superagent = require('superagent');
const cheerio = require('cheerio');
const fs = require('fs');
/**
 * Request URL
 * @type {*}
 */
const reptileUrl = "http://www.lxweimin.com/";
/**
 * Strip spaces and newlines
 * @param text
 * @returns {string}
 */
function replaceText(text) {
  return text.replace(/\n/g, "").replace(/\s/g, "");
}
/**
 * Core logic:
 * send the request, parse the data, generate the data
 */
superagent.get(reptileUrl).end(function (err, res) {
    // Bail out on error
    if (err) {
        throw err;
    }
    // Parse the page
    let $ = cheerio.load(res.text);
    /**
     * Container for the data
     * @type {Array}
     */
    let data = [];
    // Collect the data
    $('#list-container .note-list li').each(function (i, elem) {
        let _this = $(elem);
        data.push({
            id: _this.attr('data-note-id'),
            slug: _this.find('.title').attr('href').replace(/\/p\//, ""),
            author: {
                slug: _this.find('.avatar').attr('href').replace(/\/u\//, ""),
                avatar: _this.find('.avatar img').attr('src'),
                nickname: replaceText(_this.find('.blue-link').text()),
                sharedTime: _this.find('.time').attr('data-shared-at')
            },
            title: replaceText(_this.find('.title').text()),
            abstract: replaceText(_this.find('.abstract').text()),
            thumbnails: _this.find('.wrap-img img').attr('src'),
            collection_tag: replaceText(_this.find('.collection-tag').text()),
            reads_count: replaceText(_this.find('.ic-list-read').parent().text()) * 1,
            comments_count: replaceText(_this.find('.ic-list-comments').parent().text()) * 1,
            likes_count: replaceText(_this.find('.ic-list-like').parent().text()) * 1
        });
    });
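    // Generate the data
    // Write the data; the file is created if it does not exist
    fs.writeFile(__dirname + '/data/article.json', JSON.stringify({
        status: 0,
        data: data
    }), function (err) {
        if (err) throw err;
        console.log('Write complete');
    });
});

And with that, a crawler for the Jianshu home page article list is done. To run it, open Terminal and run node app.js (node app works too). Alternatively, add "dev": "node app.js" under scripts in package.json and run it from WebStorm's npm panel.

Where there is an article list there are detail pages, so next we'll scrape those.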

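Scraping the 20 detail pages for the home page article list

With the experience from scraping the list, the rest is much easier; the first step is always the hardest.

Implementation steps

Import dependencies
Define the URL
Send the request
Parse the page
Analyze the page data
Generate the data

1. Import dependencies

Nothing new to import: everything lives in one file, since the code is short and simple and I couldn't be bothered to split it up. Importing and exporting modules is a hassle, and I'm lazy.

We do need a function to handle scraping a detail page.

function getArticle(item){
   // code to come
}

2. Define the URL

This URL follows a pattern. Open any article and look at the address bar: http://www.lxweimin.com/p/xxxxxx. We defined reptileUrl = "http://www.lxweimin.com/"; so the URL has to be assembled. Remember where we stored xxxxxx? In slug. The request URL is reptileUrl + 'p/' + item.slug.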
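
Because reptileUrl already ends with a slash, it is easy to end up with a double slash when concatenating. As an alternative sketch, Node's built-in URL class can resolve the path against the base. articleUrl is a hypothetical helper name, and the slug is a sample value.

```javascript
const reptileUrl = 'http://www.lxweimin.com/';

// Hypothetical helper: resolve "p/<slug>" against the base URL.
function articleUrl(slug) {
    return new URL('p/' + encodeURIComponent(slug), reptileUrl).href;
}

const good = articleUrl('b746f17a8d90');

// What naive concatenation with a leading slash would produce.
const doubled = reptileUrl + '/p/' + 'b746f17a8d90';

console.log(good);    // http://www.lxweimin.com/p/b746f17a8d90
console.log(doubled); // http://www.lxweimin.com//p/b746f17a8d90
```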

3. Send the request

superagent.get(reptileUrl + 'p/' + item.slug).end(function (err, res) {
    // Bail out on error
    if (err) {
        throw err;
    }
});

You get the idea.

4. Parse the page

superagent.get(reptileUrl + 'p/' + item.slug).end(function (err, res) {
    // Bail out on error
    if (err) {
        throw err;
    }
    /**
     * res.text holds the unparsed response body.
     * cheerio's load method parses the whole document, i.e. the entire HTML page;
     * you can inspect it in the console with console.log($.html());
     */
    let $ = cheerio.load(res.text);
});

5. Analyze the page data

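You might follow the approach above: open a page and go looking for classes and ids on the tags. I fell into a trap here at first. The page shows three numbers: reads, comments, and likes. I assumed they were in the HTML as loaded, but when I tried to read them they were empty. Puzzled, I hit F5 a few times and spotted the problem: these numbers only appear after the page has finished loading, which suggests they are filled in by JS. So my code wasn't wrong.

Still, the problem had to be solved. If JS fills the numbers in, either there is a network request or the data is already in the page. Several refreshes showed no request carrying this data, so it had to be in the page, in an inline script tag. I viewed the source, pressed Ctrl+F, and searched for one of the read, comment, or like numbers. It turned up at the very bottom, in a script tag with data-name="page-data", which holds a JSON object whose fields look a lot like the ones I defined for the article list. That's the one, and it saves me a pile of string slicing.

Parsing the script data

let note = JSON.parse($('script[data-name=page-data]').text());

The data inside the script: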

{"user_signed_in":false,"locale":"zh-CN","os":"windows","read_mode":"day","read_font":"font2","note_show":{"is_author":false,"is_following_author":false,"is_liked_note":false,"uuid":"7219e299-034d-4051-b995-a6a4344038ef"},"note":{"id":12741121,"slug":"b746f17a8d90","user_id":6126137,"notebook_id":12749292,"commentable":true,"likes_count":59,"views_count":2092,"public_wordage":1300,"comments_count":29,"author":{"total_wordage":37289,"followers_count":221,"total_likes_count":639}}}

Take the full contents of the script, then use JSON.parse to turn the string into an object.
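
JSON.parse throws when the script tag is missing or empty, which would abort the whole crawl. The sketch below wraps the call in a hypothetical parsePageData helper and reads a few fields from a shortened copy of the sample data above; returning null on failure is a choice made for this illustration.

```javascript
// Shortened copy of the page-data sample shown above.
const pageData = '{"note":{"id":12741121,"slug":"b746f17a8d90","user_id":6126137,' +
    '"likes_count":59,"views_count":2092,"comments_count":29,' +
    '"author":{"total_wordage":37289,"followers_count":221,"total_likes_count":639}}}';

// Hypothetical helper: returns the parsed object, or null if the text is not valid JSON.
function parsePageData(text) {
    try {
        return JSON.parse(text);
    } catch (e) {
        return null;
    }
}

const note = parsePageData(pageData);
console.log(note.note.views_count);            // 2092
console.log(note.note.author.followers_count); // 221
console.log(parsePageData(''));                // null
```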

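Next, as before, define the data structure:

article: {   article info
    id: article id
    slug: id used in the article URL (an obfuscated id)
    title: title
    content: body (keep the HTML tags)
    publishTime: update time
    wordage: word count
    views_count: read count
    comments_count: comment count
    likes_count: like count
},
author: {
    id: user id
    slug: id used in the user URL (an obfuscated id)
    avatar: user avatar
    nickname: user nickname (the one entered at sign-up)
    signature: user signature
    total_wordage: total word count
    followers_count: total follower count
    total_likes_count: total like count
}

There are also the topic categories and the comment list, which I haven't listed here; have a go at scraping them yourself if you're interested. They come from separate API endpoints, so I define no data structure for them.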

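With the note object, much of the data is easy to get. Again, here is each source in turn.

article: article info

id: article id

The main information is in note.note, so the article id is note.note.id.

slug: id used in the article URL (an obfuscated id)

note.note.slug

title: title

The whole article body is under .article inside .post, so the title is $('div.post').find('.article .title').text()

content: body (keep the HTML tags)

Note that for the body we want the HTML, not the plain text, so use html() instead of text(): $('div.post').find('.article .show-content').html(). What comes back is full of escaped characters. The front end will deal with that at display time; we may not be able to read it, but the browser can.

publishTime: update time

This time is displayed as-is rather than as a relative time, so read it directly: $('div.post').find('.article .publish-time').text()

wordage: word count

This sits in an element whose text looks like <字數 1230>. We don't want it in that form, so we extract the number: $('div.post').find('.article .wordage').text().match(/\d+/g)[0]*1 uses a regex to grab the digit string and then converts it to a number.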
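
One caveat with match(/\d+/g)[0] is that match returns null when the text contains no digits, so indexing it throws. A hypothetical extractNumber helper that guards against this might look like the sketch below; returning null for "no number found" is an assumption made here, not something the original code does.

```javascript
// Hypothetical helper: first run of digits in the text as a number, or null if there is none.
function extractNumber(text) {
    const match = /\d+/.exec(text || '');
    return match ? Number(match[0]) : null;
}

const wordage = extractNumber('字數 1230');
const nothing = extractNumber('');

console.log(wordage); // 1230
console.log(nothing); // null
```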

views_count: read count

note.note.views_count

comments_count: comment count

note.note.comments_count

likes_count: like count

note.note.likes_count

author: user info

id: user id

The article list never gave us a user id, but note.note contains a user_id. Whether or not it is the right one, store it for now rather than leave the field empty: note.note.user_id

slug: id used in the user URL (an obfuscated id)

Fetched the same way as in the article list: $('div.post').find('.avatar').attr('href').replace(/\/u\//, "")

avatar: user avatar

$('div.post').find('.avatar img').attr('src')

nickname: user nickname (the one entered at sign-up)

$('div.post').find('.author .name a').text()

signature: user signature

The signature sits below the article body and above the comments and rewards, in the grey box with the big follow button; it is the first piece of text there. $('div.post').find('.signature').text()

total_wordage: total word count

note.note.author.total_wordage

followers_count: total follower count

note.note.author.followers_count

total_likes_count: total like count

note.note.author.total_likes_count

Some of the field names are taken from the note.note JSON object; at first I didn't know what to call them.

The final assembled data:

        /**
         * Container for the data
         * @type {Object}
         */
        let data = {
            article: {
                id: note.note.id,
                slug: note.note.slug,
                title: replaceText($post.find('.article .title').text()),
                content: replaceText($post.find('.article .show-content').html()),
                publishTime: replaceText($post.find('.article .publish-time').text()),
                wordage: $post.find('.article .wordage').text().match(/\d+/g)[0]*1,
                views_count: note.note.views_count,
                comments_count: note.note.comments_count,
                likes_count: note.note.likes_count
            },
            author: {
                id: note.note.user_id,
                slug: $post.find('.avatar').attr('href').replace(/\/u\//, ""),
                avatar: $post.find('.avatar img').attr('src'),
                nickname: replaceText($post.find('.author .name a').text()),
                signature: replaceText($post.find('.signature').text()),
                total_wordage: note.note.author.total_wordage,
                followers_count: note.note.author.followers_count,
                total_likes_count: note.note.author.total_likes_count
            }
        };

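6. Generate the data

Much the same as for the list, with one difference: the file name needs an identifier, article_ + item.slug (the id used in the article URL).

        // Write the data; the file is created if it does not exist
        fs.writeFile(__dirname + '/data/article_' + item.slug + '.json', JSON.stringify({
            status: 0,
            data: data
        }), function (err) {
            if (err) throw err;
            console.log('Write complete');
        });

That's basically it. Here is the full code for fetching the details: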

function getArticle(item) {
    // Build the request URL (reptileUrl already ends with a slash)
    let url = reptileUrl + 'p/' + item.slug;
    /**
     * Core logic:
     * send the request, parse the data, generate the data
     */
    superagent.get(url).end(function (err, res) {
        // Bail out on error
        if (err) {
            throw err;
        }
        // Parse the page
        let $ = cheerio.load(res.text);
        // Keep the container in a variable for convenience
        let $post = $('div.post');
        // Read the JSON data from the script tag
        let note = JSON.parse($('script[data-name=page-data]').text());
        /**
         * Container for the data
         * @type {Object}
         */
        let data = {
            article: {
                id: note.note.id,
                slug: note.note.slug,
                title: replaceText($post.find('.article .title').text()),
                content: replaceText($post.find('.article .show-content').html()),
                publishTime: replaceText($post.find('.article .publish-time').text()),
                wordage: $post.find('.article .wordage').text().match(/\d+/g)[0]*1,
                views_count: note.note.views_count,
                comments_count: note.note.comments_count,
                likes_count: note.note.likes_count
            },
            author: {
                id: note.note.user_id,
                slug: $post.find('.avatar').attr('href').replace(/\/u\//, ""),
                avatar: $post.find('.avatar img').attr('src'),
                nickname: replaceText($post.find('.author .name a').text()),
                signature: replaceText($post.find('.signature').text()),
                total_wordage: note.note.author.total_wordage,
                followers_count: note.note.author.followers_count,
                total_likes_count: note.note.author.total_likes_count
            }
        };
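        // Generate the data
        // Write the data; the file is created if it does not exist
        fs.writeFile(__dirname + '/data/article_' + item.slug + '.json', JSON.stringify({
            status: 0,
            data: data
        }), function (err) {
            if (err) throw err;
            console.log('Write complete');
        });
    });
}

You're bound to ask where this gets called.
Anywhere near the bottom of the end callback of the article-list request above, add:

    data.forEach(function (item) {
        getArticle(item);
    });

Run it and you'll see 21 JSON files in the data folder: one for the list and 20 for the details. See the source files; bug reports are welcome.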
