Chapter 3: Starting to Crawl
- urlparse module
1.Key:
Think First.
What data am I trying to gather? Can this be accomplished by scraping just a few predefined websites (almost always the easier option), or does my crawler need to be able to discover new websites I might not know about?
When my crawler reaches a particular website, will it immediately follow the next outbound link to a new website, or will it stick around for a while and drill down into the current website?
Are there any conditions under which I would not want to scrape a particular site? Am I interested in non-English content?
How am I protecting myself against legal action if my web crawler catches the attention of a webmaster on one of the sites it runs across?
urlparse module
The urlparse module splits a URL into six components and returns them as a tuple; it can also reassemble those components back into a URL. Its main functions are urljoin, urlsplit, urlunsplit, and urlparse.
urlparse function
>>> from urlparse import urlparse
>>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
>>> o
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', params='', query='', fragment='')
>>> o.scheme
'http'
>>> o.port
80
>>> o.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'
This parses the URL into six components: (scheme, netloc, path, params, query, fragment).
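The related helpers mentioned above work the same way; a quick illustration in the same Python 2 REPL (the FAQ.html link target is just an example):

>>> from urlparse import urljoin, urlsplit, urlunsplit
>>> urljoin('http://www.cwi.nl:80/%7Eguido/Python.html', 'FAQ.html')
'http://www.cwi.nl:80/%7Eguido/FAQ.html'
>>> parts = urlsplit('http://www.cwi.nl:80/%7Eguido/Python.html')
>>> parts
SplitResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', query='', fragment='')
>>> urlunsplit(parts)
'http://www.cwi.nl:80/%7Eguido/Python.html'

Note that urlsplit returns only five parts: unlike urlparse, it does not separate out params.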
scrapy
Scrapy uses Item objects to determine which pieces of information it should save from the pages it visits. This information can be saved by Scrapy in a variety of ways, such as CSV, JSON, or XML files, using the following commands:
$ scrapy crawl article -o articles.csv -t csv
$ scrapy crawl article -o articles.json -t json
$ scrapy crawl article -o articles.xml -t xml
Of course, we can also define our own Item objects and write the results to any file or database we need; we only have to add the corresponding code to the spider's parse method, as in the sketch below.
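A minimal sketch of that idea, assuming Scrapy 1.x import paths; the Article item, its fields, the h1 selector, the start page, and the articles.jl path are all illustrative, not the book's code:

# items.py -- define which fields we want to keep
from scrapy import Item, Field

class Article(Item):
    title = Field()
    url = Field()

# in the spider, parse() can persist items itself instead of relying on -o/-t
import json
import scrapy

class ArticleSpider(scrapy.Spider):
    name = 'article'
    start_urls = ['https://en.wikipedia.org/wiki/Main_Page']  # assumed start page

    def parse(self, response):
        item = Article()
        item['title'] = response.xpath('//h1/text()').extract()  # assumed selector
        item['url'] = response.url
        with open('articles.jl', 'a') as f:          # hypothetical output file
            f.write(json.dumps(dict(item)) + '\n')   # one JSON object per line
        yield item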
Scrapy is a powerful tool for web-crawling problems. It can automatically collect all URLs and compare them against the rules you specify, make sure every URL is unique, normalize relative URLs where needed, and recurse into deeper pages; a sketch of this follows.
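Those rule-matching and recursion features live in CrawlSpider. A sketch of the same spider rewritten that way, again assuming Scrapy 1.x import paths and a made-up allow pattern:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from wikiSpider.items import Article   # the Item sketched above

class ArticleSpider(CrawlSpider):
    name = 'article'
    allowed_domains = ['en.wikipedia.org']               # stay on one site
    start_urls = ['https://en.wikipedia.org/wiki/Main_Page']
    # follow /wiki/ links recursively; Scrapy deduplicates URLs for us
    rules = [Rule(LinkExtractor(allow=r'/wiki/[^:]*$'),
                  callback='parse_items', follow=True)]

    def parse_items(self, response):
        item = Article()
        item['title'] = response.xpath('//h1/text()').extract()
        item['url'] = response.url
        return item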
2.Need to know:
For the "Crawling with Scrapy" section:
We need to install the scrapy package. (At the time of writing, this package supports neither Python 3.x nor Python 2.6; it only runs on Python 2.7.)
My attempt:
$ sudo pip install scrapy
Could not find function xmlCheckVersion in library libxml2. Is libxml2 installed?
Perhaps try: xcode-select --install
This means libxml2 is missing, so I ran:
$ xcode-select --install
This pops up the Xcode command line tools installer, which includes libxml2. After it finished, I ran sudo pip install scrapy again and got a new error:
>>> from six.moves import xmlrpc_client as xmlrpclib
ImportError: cannot import name xmlrpc_client
Searching for the cause on Stack Overflow:
- six.moves is a virtual namespace. It provides access to packages that were renamed between Python 2 and 3. As such, you shouldn't be installing anything.
- By importing from six.moves.xmlrpc_client the developer doesn't have to handle the case where it is located at xmlrpclib in Python 2, and at xmlrpc.client in Python 3. Note that these are part of the standard library.
- The mapping was added to six version 1.5.0; make sure you have that version or newer.
- Mac comes with six version 1.4.1 pre-installed in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python and this will interfere with any version you install in site-packages (which is listed last in the sys.path).
- The best work-around is to use a virtualenv and install your own version of six into that, together with whatever else you need for this project. Create a new virtualenv for new projects.
- If you absolutely have to install this at the system level, then for this specific project you'll have to remove the /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python path:
>>> import sys
>>> sys.path.remove('/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python')
- This will remove various OS X-provided packages from your path for just that run of Python; Apple installs these for their own needs.
The six bundled with the Mac is too old; Scrapy needs six >= 1.5.0. The recommended fix is a Python virtual environment (sketched below); if you really must make the change at the system level, you need to reinstall six.
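The virtualenv route would look roughly like this (assuming virtualenv itself is already installed; the environment name is arbitrary):

$ virtualenv scrapyenv                 # create an isolated environment
$ source scrapyenv/bin/activate        # use it for this shell session
(scrapyenv) $ pip install scrapy       # installs its own six >= 1.5, untouched by /System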
So I first tried one of the suggested fixes:
$ sudo rm -rf /Library/Python/2.7/site-packages/six*
$ sudo rm -rf /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/six*
$ sudo pip install six
Unfortunately, when sudo rm -rf tried to delete the files it failed with Operation not permitted.
Digging further into the cause:
- This is because OS X El Capitan ships with six 1.4.1 installed already and when it attempts to uninstall it (because scrapy depends on six >= 1.5) it doesn't have permission to do so because System Integrity Protection doesn't allow even root to modify those directories.
- Ideally, pip should just skip uninstalling those items since they aren't installed to site-packages they are installed to a special Apple directory. However, even if pip skips uninstalling those items and installs six into site-packages we'll hit another bug where Apple puts their pre-installed stuff earlier in the sys.path than site-packages. I've talked to Apple about this and I'm not sure if they're going to do anything about it or not.
My Mac OS X version is 10.11.4. Since 10.11, the new System Integrity Protection (SIP) mechanism prevents even root from modifying or deleting anything under /System (it can still be done from the recovery environment).
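You can confirm whether SIP is active from a normal terminal:

$ csrutil status
System Integrity Protection status: enabled.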
So I kept going with another approach:
$ sudo pip uninstall six
$ easy_install six
The result was the same: Operation not permitted. (This approach should work on versions before 10.11.)
Later I tried to solve it with a Python virtual environment, but my skills weren't up to it and I failed.
I also tried installing Python 2.7.11 from python.org instead of using the system's bundled 2.7.10 (some people reported that a self-installed Python 2.7 solves the problem). After half a day of fiddling it still ended in failure, and I nearly broke pip's ability to install packages at all. The rescue was:
$ brew link python
$ brew unlink python
In the end, just as I was about to give up, another answer on Stack Overflow turned things around:
This is a known issue on Mac OSX for Scrapy. You can refer to this link.
Basically the issue is with the PYTHONPATH in your system. To solve the issue, change the current PYTHONPATH to point to the newer or non-Mac OS X version of Python. Before running Scrapy, try:
export PYTHONPATH=/Library/Python/2.7/site-packages:$PYTHONPATH
If that works, you can make the change permanent in your .bashrc file:
$ echo "export PYTHONPATH=/Library/Python/2.7/site-packages:$PYTHONPATH" >> ~/.bashrc
If none of this works, take a look at the link above.
Now start python from the command line and run:
>>> import scrapy
No error, which means scrapy can be imported.
Trying the command from the book:
$ scrapy startproject wikiSpider
Output:
New Scrapy project 'wikiSpider' created in:
/Users/randolph/PycharmProjects/Scraping/wikiSpider
You can start your first spider with:
cd wikiSpider
scrapy genspider example example.com
Success! Scrapy is ready to go!
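For reference, startproject generates roughly this layout (Scrapy 1.x):

wikiSpider/
    scrapy.cfg           # deploy/run configuration
    wikiSpider/          # the project's Python module
        __init__.py
        items.py         # Item definitions go here
        pipelines.py
        settings.py
        spiders/         # spider classes go here
            __init__.py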
3.Errata in the book:
- None so far.
4.Remaining questions:
- None so far.