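A First Look at Web Crawlers - Scrapy

Scrapy resources

The official documentation is always the first choice; it is worth working through the whole tutorial.
URL: https://doc.scrapy.org/en/latest/intro/tutorial.html

Crawling steps

We are going to crawl every photo album of the goddesses on the nvshens.com (宅男女神) ranking. First, let's see what the entry page looks like.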

20171024-goddessrank
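The ranking has 5 pages with 20 goddesses per page, so the crawler logic should be:
- Iterate over the 20 goddesses on a page, open each one's home page, and collect the personal information shown there.
  - From the home page, go to her album listing page (if she has only a few albums, enter them directly from the home page) and download the pictures in each album. Note that an album spans many pages with several pictures on each, so every page has to be visited here as well.
- Iterate over the 5 pages and repeat the steps above.
Open the home page of the first goddess, 夏美醬: it shows some personal information and her albums.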

20171024-xiameijiang
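Good, we now have the rough picture. Let's start with the easy part, the personal information.

Crawling personal information

We start simple and crawl the personal information of every goddess on the ranking, such as name, birthday, age, measurements, and birthplace. On a goddess's home page, Chrome's developer tools show markup like this: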

20171024-xiameijianginfo

So the spider code that crawls the personal information looks like this:

  import scrapy
  import re

  class GoddessSpider(scrapy.Spider):
      name = "goddess"
      start_urls = ['https://www.nvshens.com/rank/sum/']

      def parse(self, response):
          # follow links to goddess pages
          for href in response.css('div.rankli_imgdiv a::attr(href)'):
              yield response.follow(href, self.parse_goddess)

          # follow pagination links
          # ...

      def parse_goddess(self, response):
          # the td texts alternate: label, value, label, value, ...
          cells = response.css('div.infodiv td::text').extract()
          dic = dict(zip(cells[0::2], cells[1::2]))
          # the name lives in the h1 of the home page
          dic['姓名'] = response.css('div.div_h1 h1::text').extract_first()
          yield dic

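Explanation: the selector 'div.infodiv td::text' matches both the labels (such as "年齡", age) and the values (such as "20(屬牛)"), stored in page order. After extract() turns the Selector objects into a list of strings, the items at even indices (0, 2, 4, and so on) are the keys (年齡, 生日, 星座, that is age, birthday, zodiac sign) and the items at odd indices are the values (20(屬牛), 1997-09-22, 處女座). zip then pairs the two slices, and dict turns the pairs into a dictionary. A very handy trick.
The name is in the h1 of the home page and is added to the dictionary at the end.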
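
The slicing and zip step can be tried on its own, without Scrapy. This is a minimal sketch; the cells list is an assumed stand-in for what extract() returns, using values copied from the sample output in this post.

```python
# Flat list in the order the td cells appear: label, value, label, value, ...
cells = ['年 齡：', '22 (屬豬)', '生 日：', '1995-10-01', '星 座：', '天秤座']

keys = cells[0::2]    # items at indices 0, 2, 4
values = cells[1::2]  # items at indices 1, 3, 5
info = dict(zip(keys, values))
print(info)
```

If the page ever yields an odd number of cells, zip silently drops the unpaired last label, so the dictionary stays well-formed.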

The output looks something like this:

  2017-10-24 13:43:16 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.nvshens.com/girl/24410/>
  {'年 齡：': '22 (屬豬)', '生 日：': '1995-10-01', '星 座：': '天秤座', '身 高：': '165', '三 圍：': 'B88 W60 H86', '出 生：': '中國 上海徐匯區', '職 業：': '平面模特、主播', '興 趣：': '旅游、時尚、文藝、美食', '姓名': '周于希dummy(Dummy Zhou)'}
  2017-10-24 13:43:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.nvshens.com/girl/19705/> (referer: https://www.nvshens.com/rank/sum/)
  2017-10-24 13:43:16 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.nvshens.com/girl/20440/>
  {'年 齡：': '22 (屬狗)', '生 日：': '1994-12-24', '星 座：': '魔羯座', '身 高：': '165', '三 圍：': 'B90(F75) W60 H88', '出 生：': '中國 浙江杭州', '職 業：': '鋼管舞老師、模特', '興 趣：': '舞蹈', '姓名': '于姬(Una)'}

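So far only the goddess pages linked from the first ranking page have been crawled; 4 more pages remain. The pager markup on the entry page is not as simple as in the official Scrapy tutorial. There, the "next page" button is uniquely identified by a class or id; here it is not, as shown below: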

20171024-goddesspages
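As you can see, the "next page" button has no class or id of its own, so it is mixed in with the other page-number buttons. How do we identify the next page, then? The current page (page 1 in the figure) is special: it is uniquely marked with class='cur', and after jumping to page 2 the class='cur' marker moves to the page 2 link. That is clearly the way in.

Open the Scrapy shell and experiment: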

  ## Read all buttons (that is, the a links) in the pagination bar
  >>> response.css('div.pagesYY a::attr(href)')
  [<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' pagesYY ')]/descendant-or-self::*/a/@href" data='1.html'>,
   <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' pagesYY ')]/descendant-or-self::*/a/@href" data='2.html'>,
   <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' pagesYY ')]/descendant-or-self::*/a/@href" data='3.html'>,
   <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' pagesYY ')]/descendant-or-self::*/a/@href" data='4.html'>,
   <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' pagesYY ')]/descendant-or-self::*/a/@href" data='5.html'>,
   <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' pagesYY ')]/descendant-or-self::*/a/@href" data='2.html'>,
   <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' pagesYY ')]/descendant-or-self::*/a/@href" data='5.html'>]

  ## Read the current-page button in the pagination bar
  >>> response.css('div.pagesYY a.cur::attr(href)')
  [<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' pagesYY ')]/descendant-or-self::*/a[@class and contains(concat(' ', normalize-space(@class), ' '), ' cur ')]/@href" data='1.html'>]

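The Selector object for the current page is not equal to any of the objects above (the selectors differ), so a plain in test (as in if 'a' in 'abc') does not work. Instead, use a regular expression to pull X.html out of each one and compare the results. Here is the code: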

  def parse(self, response):
      # follow links to goddess pages
      for href in response.css('div.rankli_imgdiv a::attr(href)'):
          yield response.follow(href, self.parse_goddess)

      # follow pagination links
      next_page = None
      links = response.css('div.pagesYY a::attr(href)')
      # the current page is the only link marked with class 'cur'
      cur_page = re.findall(r"[1-5]\.html", str(response.css('div.pagesYY a.cur::attr(href)')))
      for i in range(len(links)):
          tmp_page = re.findall(r"[1-5]\.html", str(links[i]))
          if tmp_page == cur_page and cur_page != ['5.html']:
              next_page = links[i + 1]  # Attention: next_page = cur_page + 1
              break
      if next_page is not None:
          yield response.follow(next_page, self.parse)
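
The same "link right after the current one" rule can be sketched on plain strings, which makes it easy to check outside a crawl. This is only an illustration: find_next_href is a hypothetical helper, and the pager list mirrors the href values from the shell session above.

```python
def find_next_href(hrefs, cur_href, last_href='5.html'):
    """Return the href that follows the current page's link, or None on the last page."""
    if cur_href == last_href or cur_href not in hrefs:
        return None
    i = hrefs.index(cur_href)  # first occurrence, like the break in the loop
    return hrefs[i + 1] if i + 1 < len(hrefs) else None


# hrefs of the pager on page 1: five page links, then "next" and "last"
pager = ['1.html', '2.html', '3.html', '4.html', '5.html', '2.html', '5.html']
print(find_next_href(pager, '1.html'))
print(find_next_href(pager, '5.html'))
```

Working on extracted strings instead of Selector objects would also make a direct comparison possible, so the regular expression is one option rather than a requirement.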

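Crawling the pictures

The overall framework is in place, so now it is time to enter the albums on each goddess's home page, which is a little exciting :)
There is a catch: some goddesses have many albums, and the home page shows only the 6 most recent ones. A button at the bottom right of the album div leads to the album listing page (for example, 夏美醬's album listing URL is 'https://www.nvshens.com/girl/21501/album/'):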
```
  <span class='archive_more'><a style='text-decoration: none' href='/girl/21501/album/' title='全部圖片' class='title'>共50冊</a></span>
```
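However, some goddesses have only a few albums, or even a single one, and no such button. Typing XXXX/album/ into the address bar by hand then returns a 404 error. The crawler has to tell these two cases apart so that everything gets crawled.

A simple if is enough; one of the branches needs an extra callback: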

  def parse_goddess(self, response):
      # get goddess info, like name, age, birthday ...
      cells = response.css('div.infodiv td::text').extract()
      dic = dict(zip(cells[0::2], cells[1::2]))
      dic['姓名'] = response.css('div.div_h1 h1::text').extract_first()
      yield dic

      # get to the album page (before photo page) or photo page directly
      # css() returns an empty list, never None, so test for emptiness
      archive_more_links = response.css('span.archive_more a::attr(href)')
      if archive_more_links:
          for archive_more in archive_more_links:
              yield response.follow(archive_more, self.parse_goddess_album)
      else:
          for album_link in response.css('a.igalleryli_link::attr(href)'):
              yield response.follow(album_link, self.parse_goddess_photo)

  def parse_goddess_album(self, response):
      for album_link in response.css('a.igalleryli_link::attr(href)'):
          yield response.follow(album_link, self.parse_goddess_photo)
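
The branch depends on whether the archive_more selector matched anything. response.css() returns a SelectorList, which behaves like a list and is empty rather than None when nothing matches, so truthiness is the test that separates the two cases. A minimal sketch with plain lists standing in for selector results (pick_branch and the '/g/...' album links are hypothetical placeholders):

```python
def pick_branch(archive_more_links, album_links):
    """Decide which links to follow from a goddess home page."""
    if archive_more_links:  # non-empty: an album listing page exists
        return 'album_listing', archive_more_links
    return 'albums_on_home_page', album_links


many = pick_branch(['/girl/21501/album/'], ['/g/1/', '/g/2/'])
few = pick_branch([], ['/g/3/'])

# An empty list is still "not None", so "is not None" cannot tell the cases apart
print([] is not None, bool([]))
```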

Now it is time to write parse_goddess_photo. Open any album and inspect it with the developer tools to find a way in.

20171025-goddessphotohtml
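The image URLs are plain to see: select the enclosing ul#hgallery tag and loop over the images inside it. Two things need attention when saving the images locally:
- Path: create a folder under the project directory named after the album being crawled and store all of its pictures there. The album's description can be crawled too and saved to a txt file in the same folder. Each local image also needs a file name; here a regular expression takes the XX.jpg at the end of a URL like http://../..//XX.jpg as the local name.
- With the current urllib, downloading an image to disk looks like this: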
```
import re
import urllib.request

# path and img_src come from the surrounding spider code
with open(path + "".join(re.findall(r"[0-9]{1,4}\.jpg", img_src)), 'wb') as f_img:
    conn = urllib.request.urlopen(img_src)
    f_img.write(conn.read())
```
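
The naming rule from the first point can be checked in isolation. A small sketch, assuming image URLs end in up to four digits followed by .jpg; image_filename is a hypothetical helper and the example.com URL is a placeholder, not a real image address.

```python
import re


def image_filename(img_src):
    """Return the trailing 'digits.jpg' part of an image URL, or None if absent."""
    match = re.search(r"([0-9]{1,4}\.jpg)$", img_src)
    return match.group(1) if match else None


print(image_filename('https://example.com/gallery/21501/012.jpg'))
```

Escaping the dot and anchoring the pattern at the end of the string keeps the expression from matching somewhere in the middle of the URL.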
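The "next page" button on this page does have a class. Both the "previous page" and the "next page" button use class a1, but both are always present no matter which page is open: on the first page, "previous page" points to the first page itself, and on the last page, "next page" points to the last page itself. Because Scrapy does not crawl duplicate requests by default, the code is easy to write.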
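
Why the self-pointing "next page" link on the last page does not cause an endless loop can be shown with a toy model. This is a simplified sketch of duplicate filtering with a set of seen URLs, not Scrapy's actual implementation (which fingerprints requests); crawl, next_link, and the page names are made up for the illustration.

```python
def crawl(start, next_link):
    """Follow 'next page' links, skipping URLs that were already requested."""
    seen = set()
    order = []
    queue = [start]
    while queue:
        url = queue.pop(0)
        if url in seen:  # duplicate request: dropped, as a duplicate filter would
            continue
        seen.add(url)
        order.append(url)
        queue.append(next_link[url])
    return order


# On the last page, "next page" points back to the last page itself
next_link = {'p1.html': 'p2.html', 'p2.html': 'p3.html', 'p3.html': 'p3.html'}
visited = crawl('p1.html', next_link)
print(visited)
```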

Combined with the download step, the new parse_goddess_photo function follows the same pattern:

  # requires "import os" and "import urllib.request" at the top of the spider file
  def parse_goddess_photo(self, response):
      # NOW U ARE IN PHOTO PAGE!
      # download photo
      album_title = response.css('h1#htilte::text').extract_first()
      album_desc = response.css('div#ddesc::text').extract_first() or ''
      info_span = response.css('div#dinfo span::text').extract_first() or ''
      info_text = response.css('div#dinfo::text').extract()
      album_info = info_span + (info_text[1] if len(info_text) > 1 else '')
      path = 'goddess_photo/' + album_title + '/'
      if not os.path.exists(path):
          os.makedirs(path)
      with open(path + 'album_info.txt', 'a+', encoding='utf-8') as f:
          f.write(album_desc)
          f.write(album_info)
      for img_src in response.css('ul#hgallery img::attr(src)').extract():
          # name the local file after the trailing XX.jpg of the image URL
          with open(path + "".join(re.findall(r"[0-9]{1,4}\.jpg", img_src)), 'wb') as f_img:
              f_img.write(urllib.request.urlopen(img_src).read())
          print("DOWNLOADING img_src:" + img_src)

      # follow pagination links: the second a.a1 link is "next page"
      a1_links = response.css('a.a1::attr(href)')
      if len(a1_links) > 1:
          next_page = a1_links[1]
          yield response.follow(next_page, self.parse_goddess_photo)

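Final result

This was my first crawler and I did not think about efficiency, so the whole crawl took about 5 hours. Each album's pictures are saved in their own folder, which makes them easy to manage.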

20171025-xiaoguo
All the pictures together come to more than 7 GB.

20171025-goddessspace
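I actually wrote this note while writing the code, and the crawler was still running at this point. Earlier I had not noticed that ROBOTSTXT_OBEY in settings.py needs to be set to False, so the crawl got stuck near the end; I restarted it, looked up the cause, and only then changed the setting.
For the restart I entered from page 4 of the ranking. It has been running for quite a while without reaching any new goddess and keeps re-crawling ones already done, so the downloaded pictures have not grown. Scrapy issues requests concurrently (CONCURRENT_REQUESTS defaults to 16), so the goddesses that have not been crawled yet are not reached in order.
In short, this note records my first crawling experience. It was fun, and there is plenty to watch out for. Next time I want to work out how to speed the crawler up and learn how to deal with sites' anti-crawling measures.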
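
For reference, the setting mentioned above lives in the project's settings.py. The fragment below is a sketch: only ROBOTSTXT_OBEY comes from this post, while the other two values are assumptions added to show where crawl speed is usually tuned.

```python
# settings.py (fragment)

# The change described in this post
ROBOTSTXT_OBEY = False

# Assumed tuning values, not taken from the original project
CONCURRENT_REQUESTS = 16  # Scrapy's default number of concurrent requests
DOWNLOAD_DELAY = 0.5      # seconds to wait between requests to the same site
```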
