Collecting Jianshu Users' Article Statistics with a Ruby Crawler

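The 思沃大講堂 training program requires us to publish what we learn on Jianshu (簡書), and the company tallies everyone's posts: article count, comments, likes, and so on. With this many people, counting by hand is tedious, so naturally the programmers handed this glorious, arduous, and fiddly job to code and wrote a crawler. I (極客人) happened to be learning Ruby at the time, so on a whim I wrote a Ruby crawler that collects Jianshu users' article statistics, as a way to push my own Ruby learning along.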

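If I were asked to scrape a site's content, my first instinct would be to fetch its HTML, but I would also ask myself whether the site offers an RSS feed. An RSS feed is XML, which is leaner and more efficient than HTML. Its structure is also more stable (with HTML, a CSS change one day could break my crawler), so it is easier to parse. After confirming that Jianshu provides no RSS feed, I decided to scrape its HTML by brute force.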

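Analyzing Jianshu URLs

Home page: http://www.lxweimin.com/
User home page: http://www.lxweimin.com/users/<user ID> (my tentative name for this value)

From this page we can get the user's following count, follower count, article count, word count, and likes received.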

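User's latest articles: http://www.lxweimin.com/users/<user ID>/latest_articles

From this page we can get the user's article list and use it to tally comments, reads, and so on: iterating over the list and summing each article's comment and read counts gives the comment total and the read total.

Note that the article list page does not show all of a user's articles. It lists 10, and more load automatically when the user scrolls to the bottom in a browser: JavaScript requests data from the backend and the front end stitches the results together over several requests. A single fetch therefore cannot yield the comment and read totals, because the list is paginated, so I fetch it page by page. How do we know how many pages a user has? The user home page gives the total article count, so the page count is that total divided by 10, rounded up (simply adding 1 to the integer quotient would request one page too many whenever the count is an exact multiple of 10).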

The article list is paginated at 10 articles per page. The URL of page m is:

http://www.lxweimin.com/users/<user ID>/latest_articles?page=m
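
The page arithmetic above can be sketched in a few lines of Ruby. This is only an illustration: `PAGE_SIZE`, `page_count`, and `page_paths` are names invented here, not taken from the project, and the sketch assumes that pages are numbered from 1.

```ruby
# Hypothetical helpers (not from the project) for the pagination arithmetic.
PAGE_SIZE = 10 # articles per list page, as observed above

# Ceiling division: 9 articles -> 1 page, 10 -> 1 page, 11 -> 2 pages.
def page_count(article_count)
  (article_count + PAGE_SIZE - 1) / PAGE_SIZE
end

# Build the request path of every list page for one user.
# Assumes page numbering starts at 1.
def page_paths(user_id, article_count)
  (1..page_count(article_count)).map do |page|
    "/users/#{user_id}/latest_articles?page=#{page}"
  end
end
```

Integer ceiling division avoids the off-by-one that a plain "divide by 10 and add 1" produces for counts such as 10 or 20.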

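Fetching the page and getting its HTML

Ruby's HTTP facilities are concise and efficient. There is more than one way to make a request; if you are curious about the others, please Google them yourself. Here is my code: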

require "net/http"
# Fetch the user's latest-articles page over plain HTTP (port 80)
h = Net::HTTP.new("www.lxweimin.com", 80)
resp = h.get "/users/#{authorInfo.id}/latest_articles"
latest_articles_html = resp.body.to_s

The code is self-explanatory, so I won't walk through it.
Following the URL rules described above, this code fetches the HTML of the corresponding page.
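
For readers who want a slightly more defensive version, here is a minimal sketch of the same request wrapped in a helper. `fetch_html` is a hypothetical name, not part of the project. It assumes the `http` argument behaves like `Net::HTTP` (it responds to `get(path)` and returns an object with `code` and `body`), and it assumes the page is UTF-8 encoded.

```ruby
require "net/http"

# Hypothetical helper (not from the project): fetch one page and return
# its body as a UTF-8 string, raising on any non-2xx status code.
# `http` is anything that responds to #get(path), such as
# Net::HTTP.new("www.lxweimin.com", 80).
def fetch_html(http, path)
  resp = http.get(path)
  unless resp.code.to_s.start_with?("2")
    raise "GET #{path} failed with status #{resp.code}"
  end
  # Net::HTTP returns a binary string; tag a copy as UTF-8 so that
  # regexes containing Chinese characters can be matched against it.
  resp.body.to_s.dup.force_encoding("UTF-8")
end
```

Passing the connection in as an argument keeps the helper easy to try out with a stub object instead of a live request.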

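Analyzing the structure of the fetched content

Once we have the HTML of a page, the next step is to analyze its content and structure. We can easily read a web page with our eyes, but a crawler sees only the HTML source. Below are the useful fragments I picked out of the fetched HTML:

Used to extract the user's following count (關注), follower count (粉絲), article count (文章), word count (字數), and likes received (收獲喜歡):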

<div class="user-stats">
  <ul class="clearfix">
    <li>
      <a href="/users/ef49e6b7ec1e/subscriptions"><b>38</b><span>關注</span></a>
    </li>
    <li>
      <a href="/users/ef49e6b7ec1e/followers"><b>22</b><span>粉絲</span></a>
    </li>
    <br>
    <li>
      <a href="/users/ef49e6b7ec1e"><b>9</b><span>文章</span></a>
    </li>
    <li>
      <a><b>9938</b><span>字數</span></a>
    </li>
    <li>
      <a><b>41</b><span>收獲喜歡</span></a>
    </li>
  </ul>
</div>

Used to extract the comment (評論) and read (閱讀) totals across the user's articles:

<ul class="article-list latest-notes"><li>
    <div>
      <p class="list-top">
        <a class="author-name blue-link" target="_blank" href="/users/ef49e6b7ec1e">極客人</a>
        <em>·</em>
        <span class="time" data-shared-at="2016-12-12T15:39:07+08:00">4天之前</span>
      </p>
      <h4 class="title"><a target="_blank" href="/p/f11d1fca16c6">Hello,Ruby!</a></h4>
      <div class="list-footer">
        <a target="_blank" href="/p/f11d1fca16c6">
          閱讀 23
</a>        <a target="_blank" href="/p/f11d1fca16c6#comments">
           · 評論 4
</a>        <span> · 喜歡 1</span>
        
      </div>
    </div>
  </li>
  <li class="have-img">
      <a class="wrap-img" href="/p/3d43727e04a5"><img src="http://upload-images.jianshu.io/upload_images/2154287-86190de5fd3071f7.png?imageMogr2/auto-orient/strip%7CimageView2/1/w/300/h/300" alt="300"></a>
    <div>
      <p class="list-top">
        <a class="author-name blue-link" target="_blank" href="/users/ef49e6b7ec1e">極客人</a>
        <em>·</em>
        <span class="time" data-shared-at="2016-12-08T00:02:08+08:00">9天之前</span>
      </p>
      <h4 class="title"><a target="_blank" href="/p/3d43727e04a5">Html5語義化標簽的啟示</a></h4>
      <div class="list-footer">
        <a target="_blank" href="/p/3d43727e04a5">
          閱讀 182
</a>        <a target="_blank" href="/p/3d43727e04a5#comments">
           · 評論 1
</a>        <span> · 喜歡 10</span>
        
      </div>
    </div>
  </li>
.....
  
  <li>
    <div>
      <p class="list-top">
        <a class="author-name blue-link" target="_blank" href="/users/ef49e6b7ec1e">極客人</a>
        <em>·</em>
        <span class="time" data-shared-at="2016-11-28T19:35:45+08:00">18天之前</span>
      </p>
      <h4 class="title"><a target="_blank" href="/p/114c27b6456c">網站自動跳轉到Cjb.Net的驚險之旅</a></h4>
      <div class="list-footer">
        <a target="_blank" href="/p/114c27b6456c">
          閱讀 21
</a>        <a target="_blank" href="/p/114c27b6456c#comments">
           · 評論 3
</a>        <span> · 喜歡 3</span>
        
      </div>
    </div>
  </li>

</ul>

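Regex matching: pulling out the key information

Above I picked out the useful HTML by hand; now the crawler has to do the same thing, so I used regular expressions.

Matching 粉絲 (followers), 關注 (following), 文章 (articles), 字數 (words), and 收獲喜歡 (likes received) with regular expressions: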

# Load the basic user info from the HTML
def loadAuthorBaseInfoFromHtml(authorInfo, latest_articles_html)
  infoKeys = ["粉絲", "關注", "文章", "字數", "收獲喜歡"]
  infoValues = Array.new(infoKeys.length)
  # Narrow the search to the <ul class="clearfix"> block that holds the stats
  if /<ul class=\"clearfix\">([\s\S]*?)<\/ul>/ =~ latest_articles_html
    authorInfoHtml = $1.force_encoding("UTF-8")
    for i in 0 .. infoKeys.length-1
      # Each stat looks like <b>NUMBER</b><span>LABEL</span>
      if /#{"<b>([0-9]*)</b><span>#{infoKeys[i]}</span>".force_encoding("UTF-8")}/ =~ authorInfoHtml
        infoValues[i] = $1
      end
    end
  end
  authorInfo.setBaseInfo(infoValues[0], infoValues[1], infoValues[2], infoValues[3], infoValues[4])
end
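
As a variation on the per-key loop above (not the author's method), a single `scan` can collect every `<b>number</b><span>label</span>` pair into a hash keyed by label. `parse_user_stats` is a hypothetical name, and the sketch assumes the markup matches the user-stats fragment shown earlier and that the HTML string is already UTF-8.

```ruby
# Hypothetical alternative (not the project's code): collect all
# <b>NUMBER</b><span>LABEL</span> pairs in one pass.
# Returns a hash such as { "粉絲" => 22, ... } keyed by the page's labels.
def parse_user_stats(html)
  html.scan(/<b>(\d+)<\/b><span>([^<]+)<\/span>/)
      .map { |value, label| [label, value.to_i] }
      .to_h
end

sample = '<li><a href="/users/ef49e6b7ec1e/followers"><b>22</b><span>粉絲</span></a></li>' \
         '<li><a><b>9938</b><span>字數</span></a></li>'
stats = parse_user_stats(sample)
puts stats["粉絲"] # follower count taken from the sample fragment
puts stats["字數"] # word count taken from the sample fragment
```

Looking values up by label rather than by array position means the caller does not depend on the order of the keys.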

For the rest of the matching code, see the project source.
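
The project's own code for the article totals is in the repository and is not reproduced here. As a rough sketch of the same idea, assuming the `list-footer` markup shown earlier and a UTF-8 string, the read and comment counts on one list page could be summed like this (`sum_article_stats` is a name invented for this illustration):

```ruby
# Hypothetical helper (not the project's code): sum the read (閱讀) and
# comment (評論) counts found in one page of the article-list HTML.
# Assumes `list_html` is a UTF-8 string; a binary string would raise an
# encoding error when matched against these regexes.
def sum_article_stats(list_html)
  reads = list_html.scan(/閱讀\s*(\d+)/).flatten.map(&:to_i).sum
  comments = list_html.scan(/評論\s*(\d+)/).flatten.map(&:to_i).sum
  { reads: reads, comments: comments }
end

# Two list-footer fragments copied from the HTML shown earlier.
sample = <<~HTML
  <a target="_blank" href="/p/f11d1fca16c6">
    閱讀 23
  </a>        <a target="_blank" href="/p/f11d1fca16c6#comments">
     · 評論 4
  </a>        <span> · 喜歡 1</span>
  <a target="_blank" href="/p/3d43727e04a5">
    閱讀 182
  </a>        <a target="_blank" href="/p/3d43727e04a5#comments">
     · 評論 1
  </a>        <span> · 喜歡 10</span>
HTML

totals = sum_article_stats(sample)
puts totals[:reads]    # 23 + 182
puts totals[:comments] # 4 + 1
```

Running this once per list page and adding up the results gives the user's overall totals.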

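Combining the information and producing flexible output

With the users' article statistics collected, the remaining step is to output them. To make the output richer and more customizable, I render it through a template, which separates the data from the presentation.
Template file: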

<body>
<section>
    <header>
        <h1>@{title}</h1>
        <section>Generated at: @{time}</section>
    </header>
    <section id="content">
    <table>
        <thead>
        <tr>
            <th>No.</th>
            <th>Name</th>
            <th>Articles</th>
            <th>Words</th>
            <th>Reads</th>
            <th>Comments received</th>
            <th>Likes received</th>
            <th>Buddy name</th>
        </tr>
        </thead>
        <tbody>
        @{content}
        </tbody>
    </table>
    </section>
    <footer>@{footer}</footer>
</section>
</body>

The Ruby code then loads the template file and replaces @{title}, @{time}, @{content}, and @{footer} with the actual statistics:

  # Render the statistics into the HTML template and write the result
  # to output/. Needs `require "pathname"` at the top of the file.
  def out2Html(title)
    tplContent = File.read(@tpl)
    content = ""
    # Build one table row per author
    @authorList.each_with_index do |author, i|
      content += format(" <tr>
            <td>%s</td>
            <td><a target= \"_blank\" href=\"http://jianshu.com/users/%s\">%s</a></td>
            <td>%s</td>
            <td>%s</td>
            <td>%s</td>
            <td>%s</td>
            <td>%s</td>
            <td>%s</td>
        </tr>", i, author.id, author.name, author.post_count, author.word_count, author.read_count, author.comment_count, author.liked_count, author.buddy)
    end

    today = Time.new
    timeStr = today.strftime("(%Y-%m-%d %H:%M:%S)")
    footer = "Powered By <a target=\"_blank\" href=\"http://wangbaiyuan.cn\">BrainWang@ThoughtWorks</a>"
    # Fill in the template placeholders
    out = tplContent.gsub(/@\{title\}/, title)
    out = out.gsub(/@\{content\}/, content)
    out = out.gsub(/@\{footer\}/, footer)
    out = out.gsub(/@\{time\}/, timeStr)
    # The output file name carries only the date, not the time of day
    timeStr = today.strftime("(%Y-%m-%d)")
    file = open("output/#{title+timeStr}.html", "w")
    file.write out
    print "\nOutput file written to ", Pathname.new(File.dirname(__FILE__)).realpath, "/", file.path
    file.close
  end
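
The chain of `gsub` calls can also be folded into a single pass. The sketch below is one possible way to do it, not the project's code: `render_template` is a hypothetical helper that looks up each `@{name}` placeholder in a hash and leaves unknown placeholders as they are.

```ruby
# Hypothetical helper (not from the project): replace every @{name}
# placeholder in one pass, taking the values from a hash.
# Unknown placeholders are left untouched so that typos stay visible.
def render_template(template, values)
  template.gsub(/@\{(\w+)\}/) do
    match = Regexp.last_match
    values.fetch(match[1], match[0])
  end
end

tpl = "<h1>@{title}</h1><section>@{time}</section><footer>@{footer}</footer>"
puts render_template(tpl, "title" => "Jianshu stats", "time" => "(2016-12-16)")
```

With the block form of `gsub`, the returned string is inserted literally, so backslashes in the generated content are not treated as back-references.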

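Of course, someday just adding an out2json would make it easy to build an API on top of this and allow even more customization.

Project home page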
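
The project does not include an out2json. As a minimal sketch of what one could look like, assuming the author fields used in out2Html above, the statistics can be serialized with Ruby's standard `json` library. The `Author` struct and the method signature here are assumptions made for illustration.

```ruby
require "json"

# Hypothetical stand-in for the project's author object, using the
# field names that out2Html reads.
Author = Struct.new(:id, :name, :post_count, :word_count,
                    :read_count, :comment_count, :liked_count, :buddy)

# Hypothetical out2json: serialize the statistics as a JSON string.
def out2json(title, authors, time = Time.now)
  JSON.pretty_generate(
    "title" => title,
    "time" => time.strftime("%Y-%m-%d %H:%M:%S"),
    "authors" => authors.map(&:to_h)
  )
end

# Sample values: the read count, comment count, and buddy are made up.
author = Author.new("ef49e6b7ec1e", "極客人", 9, 9938, 100, 8, 41, "nobody")
puts out2json("Jianshu stats", [author])
```

Returning a string rather than writing a file leaves it to the caller to decide whether to save the result or serve it over HTTP.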

https://github.com/geekeren/jianshu_spider

Usage

Download the project code and run it:

cd jianshu_spider/
 ruby main.rb

For a more detailed description of the project, see its GitHub home page.
