1. Goose Extractor
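1.1 Introduction to Python Goose
Goose Extractor is an open-source article extraction library for Python. It can extract an article's text content, images, videos, metadata, and tags. Goose was originally a Java library written by Gravity.com and was more recently converted to Scala.
The Goose Extractor website describes it as follows:
Goose Extractor is a complete rewrite in Python. The goal is, given any news article or article-type web page, to extract not only the main body of the article but also all of its metadata and images.
Goose Extractor builds on NLTK and Beautiful Soup, the leaders in text processing and HTML parsing respectively. Python Goose is one option for article extraction in Python.
Goose currently supports only Python 2.
1.2 Installing Python Goose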
pip install goose-extractor
Example: extracting directly from a URL:
from goose import Goose
url = 'https://www.fireeye.com/blog/executive-perspective/2017/08/fireeye-provides-update-on-allegations-of-breach.html'
g = Goose()
article = g.extract(url=url)
print article.title
print article.meta_description
print article.cleaned_text[:150]
print article.top_image.src
Example: extracting from an HTML document:
# -*- coding: utf-8 -*-
import goose,urllib2,sys
reload(sys)
sys.setdefaultencoding("utf-8")
#url = "https://www.fireeye.com/blog/executive-perspective/2017/08/anti-encryption-and-cyber-sovereignty.html"
url = "https://krebsonsecurity.com/2017/09/equifax-hackers-stole-200k-credit-card-accounts-in-one-fell-swoop/"
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
response = opener.open(url)
raw_html = response.read()
g = goose.Goose()
article = g.extract(raw_html=raw_html)
print article.title.encode('gbk', 'ignore')
print article.meta_description.encode('gbk', 'ignore')
print article.cleaned_text.encode('gbk', 'ignore')
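The encode('gbk', 'ignore') calls above silently drop any character that GBK cannot represent, which keeps printing to a GBK console from raising an error at the cost of losing those characters. A minimal sketch of that behavior, written for Python 3 (the examples above are Python 2); the sample string is made up for illustration:

```python
# Hypothetical sample text: ASCII, one Chinese character, and one emoji.
text = "ok \u4e2d \U0001F600"

# With errors="ignore", characters outside GBK (here the emoji) are dropped.
encoded = text.encode("gbk", "ignore")
roundtrip = encoded.decode("gbk")

print(repr(roundtrip))

# Without "ignore", the same call raises UnicodeEncodeError.
try:
    text.encode("gbk")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)
```

If the lost characters matter, errors="replace" substitutes a question mark instead of dropping them.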
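1.3 Garbled HTML when fetching pages with urllib2
The page may have been compressed; check whether the response headers include Content-Encoding: xxx.
If it is compressed, you need to decompress it manually; urllib2 will not do this for you.
Code for the fix: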
#-*- encoding: utf-8 -*-
import urllib2,gzip,StringIO
url = r'https://krebsonsecurity.com/2017/09/equifax-hackers-stole-200k-credit-card-accounts-in-one-fell-swoop/'
response = urllib2.urlopen(url)
stream = StringIO.StringIO(response.read())
with gzip.GzipFile(fileobj=stream) as f:
    data = f.read()
print(data)
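The snippet above decompresses unconditionally, so it fails on a response that was not gzip-compressed. A minimal sketch of checking the Content-Encoding value first, written for Python 3 and exercised on locally generated bytes rather than a live request; decode_body is a hypothetical helper name, and the header value is assumed to have been read from the response beforehand (for example with response.info().get('Content-Encoding') in urllib2):

```python
import gzip
import io


def decode_body(body, content_encoding):
    """Return body decompressed if the header says gzip, otherwise unchanged."""
    encoding = (content_encoding or "").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        with gzip.GzipFile(fileobj=io.BytesIO(body)) as f:
            return f.read()
    # No header, or an encoding this sketch does not handle: pass through.
    return body


# Demo on bytes built locally, so no network access is needed.
original = b"<html><body>hello</body></html>"
compressed = gzip.compress(original)

print(decode_body(compressed, "gzip") == original)
print(decode_body(original, None) == original)
```

Other encodings such as deflate are deliberately left untouched here; they would need their own branch.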
Further reading on Python encodings: the article "More on Handling Chinese Encoding in Python" (也談Python的中文編碼處理).
2. Boilerpipe
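Open-source code on GitHub: Boilerpipe
Among open-source systems, Boilerpipe achieves better precision and recall than Goose, and even outperforms the paid Alchemy API. Boilerpipe is written in Java; calling it from Python requires the python-boilerpipe wrapper, which uses JPype under the hood. It can also be called through JCC. The code is as follows:
Installation: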
git clone https://github.com/misja/python-boilerpipe.git
cd python-boilerpipe
pip install -r requirements.txt
python setup.py install
Usage:
from boilerpipe.extract import Extractor
url = "https://krebsonsecurity.com/2017/09/equifax-hackers-stole-200k-credit-card-accounts-in-one-fell-swoop/"
extractor = Extractor(extractor='ArticleExtractor', url=url)
print extractor.getText().encode('gbk', 'ignore')
Or pass HTML text as the argument:
extractor = Extractor(extractor='ArticleExtractor', html=myWebPage)
Use getText() or getHTML() to get back the processed plain text or the HTML with the main content highlighted:
processed_plaintext = extractor.getText()
highlighted_html = extractor.getHTML()
Alternatively, use JCC to compile the Java package into a package that Python can call:
wget http://boilerpipe.googlecode.com/files/boilerpipe-1.2.0-bin.tar.gz
tar xvzf boilerpipe-*.tar.gz
cd boilerpipe-1.2.0
sudo python -m jcc \
    --jar boilerpipe-1.2.0.jar \
    --classpath lib/nekohtml-1.9.13.jar \
    --classpath lib/xerces-2.9.1.jar \
    --package java.net \
    java.net.URL \
    --python boilerpipe --build --install
Usage:
import boilerpipe
jars = ':'.join(('lib/nekohtml-1.9.13.jar', 'lib/xerces-2.9.1.jar'))
boilerpipe.initVM(boilerpipe.CLASSPATH+':'+jars)
extractor = boilerpipe.ArticleExtractor.getInstance()
url = boilerpipe.URL('http://readthedocs.org/docs/jcc')
extractor.getText(url)
3. Comparison of Python content extraction tools
See the article "Comparison of Python content extraction tools" (各種Python正文抽取工具比較).
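Section 2 compares Boilerpipe and Goose by precision and recall but does not say how those figures were measured. One simple way to score an extractor, assuming token-level overlap against a hand-labelled reference text (an assumption for illustration, not necessarily the method behind the figures cited), is sketched below for Python 3; token_precision_recall and the sample strings are hypothetical:

```python
from collections import Counter


def token_precision_recall(extracted, reference):
    """Token-level precision and recall of extracted text against a reference."""
    extracted_tokens = Counter(extracted.split())
    reference_tokens = Counter(reference.split())
    # Multiset intersection: tokens counted in both texts.
    overlap = sum((extracted_tokens & reference_tokens).values())
    n_extracted = sum(extracted_tokens.values())
    n_reference = sum(reference_tokens.values())
    precision = overlap / n_extracted if n_extracted else 0.0
    recall = overlap / n_reference if n_reference else 0.0
    return precision, recall


# Made-up example: the extractor kept a stray "menu" token and missed "jumps".
reference = "the quick brown fox jumps"
extracted = "menu the quick brown fox"
precision, recall = token_precision_recall(extracted, reference)
print(precision, recall)  # 0.8 0.8
```

Precision falls when boilerplate such as navigation text leaks into the output; recall falls when part of the article body is cut off. Whitespace tokenization as used here does not suit Chinese text, which would need a word segmenter or character-level tokens.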