How to scrape HTML with Ruby and convert it to PDF

Background

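Here is how it started. I am a member of Fullstack Camp (全棧營) and can study on its website. However, the site will only be maintained until February of this year, which means that after February I will no longer be able to study there.

The material on that site is extremely valuable, so I had the idea of scraping the content and turning it into PDFs for later study.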

Action

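Since this seemed well worth doing, I started thinking about it on Monday. I work during the day on weekdays, so I could only search Google for approaches in my spare time in the evenings. Fortunately, after a full day of tinkering today (Saturday), I finally found a way. Hooray!

First, take a look at the results. Screenshots below: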

Snip20180113_3.png

Snip20180114_4.png

Snip20180114_5.png

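Method

Before going into the specific steps, here is the general idea:

Step 1: use a crawler to fetch the page content and write it to a local file;
Step 2: use an online tool to convert the fetched HTML file to PDF.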

I. Fetch the page HTML and write it to a file

A first script
First, install two gems:
gem install rest-client    # for sending requests
gem install nokogiri       # for parsing HTML

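Because the Fullstack Camp site requires a login, the simplest way to handle authentication is to send your session cookie along with each request.
Some readers said they did not know where to find the cookie, so here is a brief walkthrough: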

In the browser, right-click and choose Inspect:

Snip20180115_11.png

Snip20180115_12.png

Snip20180115_14.png
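
Once you have copied the cookie value, you may prefer not to paste it straight into the script. The following is a minimal sketch, assuming the value is stored in an environment variable; the variable name QUANZHAN_SESSION and the helper cookie_header are made up for this example and are not part of the original script.

```ruby
# Hypothetical helper: build the Cookie header from an environment variable
# so the session value is not hard-coded in the script.
# QUANZHAN_SESSION is an assumed variable name, not something the site defines.
def cookie_header(env = ENV)
  value = env['QUANZHAN_SESSION'].to_s.strip
  raise ArgumentError, 'QUANZHAN_SESSION is not set' if value.empty?

  { Cookie: "_quanzhan_session=#{value}" }
end

# Demo with a plain hash standing in for ENV.
headers = cookie_header('QUANZHAN_SESSION' => 'abc123')
puts headers[:Cookie]
```

The returned hash has the same shape as the {Cookie: cookie} argument that the scripts in this post pass to RestClient.get.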

Now create a script file (any .rb file) and see whether we can fetch the data.

# test.rb
require 'rest-client'
require 'nokogiri'

url       = 'https://fullstack.qzy.camp/posts/860'              # any URL to test with
cookie    = '_quanzhan_session=paste-your-cookie-value-here'    # the cookie value from your logged-in browser session
response  = RestClient.get url, {Cookie: cookie}                # the cookie must be sent (if the page requires a login)
doc       = Nokogiri::HTML.parse(response.body)                 # parse the HTML
puts doc

Run in the terminal: ruby test.rb
If you see output like the following, it worked.

Snip20180113_5.png

Analyze the page source and decide what to scrape
Let's look at the source of a Fullstack Camp page:

# Fragment 1: the course title (Web API 設計實作)
<div class="left-block hidden-xs">
   <h1><a href="/courses/38/syllabus">Web API 設計實作</a></h1>
</div>

# Fragment 2: the chapter heading (所屬章節：7. Jbuilder 用法)
 <div class="des-text">
    <h4>所屬章節：7. Jbuilder 用法</h4>
    <p><p>本章預計學習時間: 1小時半以內</p></p>
    <p><p>再學習5節就可以完成本章了</p></p>
 </div>
 
# Fragment 3: the main content
<div class="post group">
    <div class="post-content markdown">
      <p>新增 <code>app/views/api/v1/trains/show.json.jbuilder</code> 檔案，這就是 JBuilder 樣板，用來定義 JSON 長什么樣子：</p>
.... (omitted)

            <p>用瀏覽器瀏覽 <code>http://localhost:3000/api/v1/trains/0822</code> 確認正常。</p>
    </div>
</div>

Now that we know which parts to scrape, we can move on to the next step: extend the script and write the result to a file.

require 'rest-client'
require 'nokogiri'

url       = 'https://fullstack.qzy.camp/posts/860'              # any URL to test with
cookie    = '_quanzhan_session=paste-your-cookie-value-here'    # the cookie value from your logged-in browser session
response  = RestClient.get url, {Cookie: cookie}                # the cookie must be sent (if the page requires a login)
doc       = Nokogiri::HTML.parse(response.body)                 # parse the HTML

+ # Split up the HTML
+ them      = doc.css("h1")[0].to_s            # course title
+ chapt     = doc.css(".des-text h4").to_s     # chapter heading
+ post      = doc.css(".post").to_s            # main content
+ content   = them + chapt + post              # combine

+ # Write to a file
+ file = File.new("page.erb", 'w')
+ file.write(content)
+ file.close                                   # flush the data and release the file handle
- puts doc

Run in the terminal: ruby test.rb
If the HTML has been written to your local file page.erb, everything is working.

Snip20180113_7.png
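
As a side note on the file-writing step, Ruby's block form of File.open closes the file automatically, even if the write raises. The sketch below shows this on a temporary directory; save_page is a helper name made up for this example.

```ruby
require 'tmpdir'

# Write content to path. The block form of File.open closes the file
# when the block finishes, even if an exception is raised inside it.
def save_page(path, content)
  File.open(path, 'w') { |f| f.write(content) }
end

# Demo in a throwaway directory so nothing is left behind.
Dir.mktmpdir do |dir|
  path = File.join(dir, 'page.erb')
  save_page(path, '<h1>demo</h1>')
  puts File.read(path)
end
```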

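Scraping multiple pages in a batch
We can now scrape a single page, but what I really want is to scrape many pages in one go. How?
Let's look at the pattern in the Fullstack Camp URLs: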

Snip20180113_9.png

Snip20180113_10.png

Snip20180113_8.png

A quick comparison shows that we only need to change the number at the end of the URL to scrape in bulk.
For example, to scrape the Web API section:

require 'rest-client'
require 'nokogiri'

basic_url   = 'https://fullstack.qzy.camp/posts/'           # base URL
cookie      = '_quanzhan_session=paste-your-cookie-value-here'

# Open the file once, before the loop, so the pages accumulate
# instead of each page overwriting the previous one
file = File.new("page.erb", 'w')

(825..865).each do |p|
  url = basic_url + p.to_s
  response  = RestClient.get url, {Cookie: cookie}          # the cookie must be sent (if the page requires a login)
  doc       = Nokogiri::HTML.parse(response.body)

  # Split up the HTML
  them      = doc.css("h1")[0].to_s            # course title
  chapt     = doc.css(".des-text h4").to_s     # chapter heading
  post      = doc.css(".post").to_s            # main content
  content   = them + chapt + post              # combine

  # Write to the file
  file.write(content)
  puts "#{url}------fetched successfully"
end
file.close

After running it, the output looks like this:

Snip20180113_11.png
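
One thing the loop above does not handle is a request that fails partway through the range. Below is a rough sketch of one way to skip such pages and carry on. post_urls and fetch_all are names invented for this example, and the block passed to fetch_all is a stand-in for the real HTTP call.

```ruby
# Build the post URLs for a range of ids.
def post_urls(base_url, ids)
  ids.map { |id| "#{base_url}#{id}" }
end

# Call the given block for every URL and collect the results keyed by URL.
# A page whose block raises (for example a missing post) is reported and skipped.
def fetch_all(urls)
  urls.each_with_object({}) do |url, pages|
    begin
      pages[url] = yield(url)
    rescue StandardError => e
      warn "skipped #{url}: #{e.message}"
    end
  end
end

urls = post_urls('https://fullstack.qzy.camp/posts/', 825..827)

# Stand-in fetcher for the demo: pretends that post 826 does not exist.
pages = fetch_all(urls) do |url|
  raise 'not found' if url.end_with?('826')
  "<p>content of #{url}</p>"
end
puts pages.keys
```

In the real script, the block body would presumably be the RestClient.get url, {Cookie: cookie} call followed by .body; rest-client raises an exception for 4xx and 5xx responses, which the rescue above would catch.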

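Polishing the script
At this point you can indeed scrape the data, but three things still need work:

1. The pages currently have no styling and need to be prettified (by defining CSS).
2. Batch fetching by entering a consecutive range of post numbers does not always work (in some places the post numbers are not consecutive), so the script now follows each page's "next page" link instead.
3. The exported PDF should have a working table of contents.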

require 'rest-client'
require 'nokogiri'

style = "<style>.frame {
    margin-left: 30px;
    margin-right: 30px;
}

h1, h2, h3, h4, h5, h6 {
    font-weight: normal;
}

.view-count {
    float: right;
    margin-top: -54px;
    color: #9B9B9B;
}

.markdown h2, .markdown h3, .markdown h4 {
    text-align: left;
    font-weight: 800;
    font-size: 16px !important;
    line-height: 100%;
    margin: 0;
    color: #555;
    margin-top: 16px;
    margin-bottom: 16px;
    border-bottom: 1px solid #eee;
    padding-bottom: 5px;
}

  .markdown .figure-code figcaption {
    background-color: #e6e6e6;

    font: 100%/2.25 Monaco, Menlo, Consolas, 'Courier New', monospace;
    text-indent: 10.5px;
    
    -moz-border-radius: 0.25em 0.25em 0 0;
    -webkit-border-radius: 0.25em;
    border-radius: 0.25em 0.25em 0 0;
    -moz-box-shadow: inset 0 0 0 1px #d9d9d9;
    -webkit-box-shadow: inset 0 0 0 1px #d9d9d9;
    box-shadow: inset 0 0 0 1px #d9d9d9;
}

.markdown {
    position: relative;
    line-height: 1.8em;
    font-size: 14px;
    text-overflow: ellipsis;
    word-wrap: break-word;
    font-family: 'PT Serif', Georgia, Times, 'Times New Roman', serif !important;
}

.markdown ol li, .markdown ul li {
    line-height: 1.6em;
    padding: 2px 0;
    color: #333;
    font-size: 16px;
}

.markdown .figure-code {
    margin: 20px 0;
}

.post-content {
    padding-top: 5px;
    padding-bottom: 5px;
}

.markdown code {
    background-color: #ececec;
    color: #d14;
    font-size: 85%;
    text-shadow: 0 1px 0 rgba(255,255,255,0.9);
    border: 1px solid #d9d9d9;
    padding: 0.15em 0.3em;
}

div {
    display: block;
}

.markdown figure.code pre {
    background-color: #ffffcc !important;
}

.code .gi {
    color: #859900;
    line-height: 1.2em;
}

.code .err {
    color: #93A1A1;
}

.markdown a:link, .markdown a:visited {
    color: #0069D6 !important;
    text-decoration: none !important;
}

.markdown p {
    font-size: 16px;
    line-height: 1.5em;
}

.markdown blockquote {
    margin-left: 0 !important;
    margin-right: 0 !important;
    padding: 12px;
    border-left: 5px solid #50AF51;
    background-color: #F3F8F3;
    clear: both;
    display: block;
}

.markdown blockquote>*:first-child {
    margin-top: 0 !important;
}

.markdown blockquote>*:last-child {
    margin-bottom: 0 !important;
}

.markdown blockquote p {
    color: #222;
}

* {
    outline: none !important;
}

a:active, a:hover, a:link, a:visited {
    text-decoration: none;
}

pre {
    margin: 0;
}

.markdown img {
    vertical-align: top;
    max-width:100%;
    height:auto;
}

h1 a {
  color: #071A52;
}

h4 {
  color: #734488;
}

hr {
  border-color: #DEDEDE;
  border-width: 0.8px;
  margin-bottom: auto;
}

.end {
  height: 400px;
}
.end img {
  clear: both; 
  display: block; 
  margin:auto;
  margin-top: -70px; 
}

.end p {
  margin-left: 300px;
  margin-top: -100px;
  color: #FF9D76;
}
</style>"

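print "-----------Enter the post number of the first page: "
# Read the starting post number and the output file name
# (chomp removes only the trailing newline; chop would eat a real character if there were none)
start_page = gets.chomp
print "--------------------Enter the file name to save as: "
file_name  = gets.chomp

# Closing banner appended after each post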
page_end = "<div class='end'>
              <img src='https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1515845318684&di=399b355dd05f4eeb015b087061656115&imgtype=0&src=http%3A%2F%2Fimgsrc.baidu.com%2Fforum%2Fw%253D580%2Fsign%3Dc775b978013b5bb5bed720f606d2d523%2F248ea813632762d018421c6ca2ec08fa503dc64c.jpg'>
              <p>Another lesson finished, so happy!</p>
            </div>"

# Write the styles
file = File.new("#{file_name}.html", 'w')
file.write(style)

# Base URL
basic_url = 'https://fullstack.qzy.camp'
url       = basic_url + '/posts/' + start_page
cookie    = '_quanzhan_session=paste-your-cookie-value-here'

puts "---------------------------Scraping started, please be patient"
while (url != 'end')
  # Request the page
  response = RestClient.get url, {Cookie: cookie}
  doc      = Nokogiri::HTML.parse(response.body)

  # Parse only when the post exists
  if !doc.css(".post").to_s.empty?
    title              = doc.css(".post-title-h1.markdown h1").to_s
    chapt              = doc.css(".des-text h4").to_s + '<hr>'
    post               = doc.css(".post").to_s + page_end
    content            = title + chapt + post
    page               = "<div class='frame'>#{content}</div>"
    # Write this page's data
    file.write(page)
    puts "#{url}----------fetched successfully"

    # Work out the next URL to request
    next_link          = doc.css("li.next a")[0]
    next_relative_path = next_link ? next_link['href'].to_s : '/dashboard'
    # A link to /dashboard (or no next link at all) means the course is finished
    url = next_relative_path == '/dashboard' ? 'end' : (basic_url + next_relative_path)
  else
    # No post on this page: stop instead of requesting the same URL forever
    url = 'end'
  end
end
file.close
puts "---------------------------All data for this course has been scraped"
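
The "follow the next link" rule used in the loop can also be pulled out into a small function, which makes it easy to check on its own. This is only a sketch: next_url is a name made up for this example, and it returns nil where the script above uses the string 'end'.

```ruby
require 'uri'

BASE_URL = 'https://fullstack.qzy.camp'

# Turn the href of the "next" link into an absolute URL.
# Returns nil when there is no link or when it points at /dashboard,
# which the script above treats as the end of the course.
def next_url(href, base = BASE_URL)
  return nil if href.nil? || href.empty? || href == '/dashboard'

  URI.join(base, href).to_s
end

puts next_url('/posts/861')
p next_url('/dashboard')
p next_url(nil)
```

With a helper like this, the loop condition could become a simple check for nil instead of comparing against a sentinel string.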

II. Convert the scraped HTML to PDF

Here I use an online conversion tool.
Upload the scraped HTML file to the online tool, convert it to PDF, and download the result.
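
If you would rather convert locally than upload the file, one possible alternative is a command-line converter. The sketch below assumes wkhtmltopdf is installed and on your PATH; this is not the tool used in this post, and pdf_command is a helper name invented for the example. It only builds the command and does not run it.

```ruby
# Build the argument list for a local wkhtmltopdf run.
# Assumption: wkhtmltopdf is installed; its basic usage is
# "wkhtmltopdf <input.html> <output.pdf>".
def pdf_command(html_path, pdf_path = html_path.sub(/\.html\z/, '.pdf'))
  ['wkhtmltopdf', html_path, pdf_path]
end

cmd = pdf_command('web_api.html')
puts cmd.join(' ')

# To actually convert, pass the list to system:
#   system(*cmd)   # true on success, false on a non-zero exit, nil if the binary cannot be run
```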

PS: If you find the way Jianshu (簡書) displays code unfriendly, you are welcome to visit my blog:
http://dmy-blog.logdown.com/posts/4739981-how-does-ruby-crawl-web-content-and-make-it-into-pdf
