This article walks through text analysis with NLTK in Python. Let's first look at the overall workflow:
Raw text - tokenization - POS tagging - lemmatization - stopword removal - special-character removal - case conversion - text analysis
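The examples below rely on several NLTK data packages. If they are not already installed, a one-time download along these lines should be enough (package names assume a reasonably recent NLTK version):
import nltk
nltk.download('punkt')                       # tokenizer models used by word_tokenize
nltk.download('averaged_perceptron_tagger')  # tagger model used by pos_tag
nltk.download('wordnet')                     # WordNet data for WordNetLemmatizer
nltk.download('omw-1.4')                     # extra WordNet data required by newer NLTK releases
nltk.download('stopwords')                   # English stopword list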
I. Tokenization
As the example text, we'll use an English description of the DBSCAN clustering algorithm:
from nltk import word_tokenize
sentence = "DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density "
token_words = word_tokenize(sentence)
print(token_words)
Tokenization output:
['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'of', 'Applications', 'with', 'Noise', '.', 'Finds', 'core', 'samples', 'of', 'high', 'density', 'and', 'expands', 'clusters', 'from', 'them', '.', 'Good', 'for', 'data', 'which', 'contains', 'clusters', 'of', 'similar', 'density']
II. Part-of-speech tagging
Why do we need POS tagging? Let's first see what happens if we skip it and normalize word forms directly from the tokenization result of step one.
There are two common approaches to word-form normalization: stemming and lemmatization.
1. Stemming
from nltk.stem.lancaster import LancasterStemmer
lancaster_stemmer = LancasterStemmer()
words_stemmer = [lancaster_stemmer.stem(token_word) for token_word in token_words]
print(words_stemmer)
Output:
['dbscan', '-', 'density-based', 'spat', 'clust', 'of', 'apply', 'with', 'nois', '.', 'find', 'cor', 'sampl', 'of', 'high', 'dens', 'and', 'expand', 'clust', 'from', 'them', '.', 'good', 'for', 'dat', 'which', 'contain', 'clust', 'of', 'simil', 'dens']
Note: stemming cuts each word down to its stem, which often produces strings that are not real words, e.g. "Spatial" becomes "spat" and "Noise" becomes "nois" above. Such stems are of little use in ordinary text analysis, although stemming works well in information retrieval.
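LancasterStemmer is one of the more aggressive stemmers shipped with NLTK; PorterStemmer and SnowballStemmer are milder alternatives. A quick comparison sketch (the sample words are just for illustration):
from nltk.stem import PorterStemmer, SnowballStemmer
porter = PorterStemmer()
snowball = SnowballStemmer('english')
for w in ['Spatial', 'Clustering', 'Applications', 'density']:
    # print the original word next to the two stems for comparison
    print(w, '->', porter.stem(w), '|', snowball.stem(w))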
2. Lemmatization (restoring inflected forms)
from nltk.stem import WordNetLemmatizer
wordnet_lematizer = WordNetLemmatizer()
words_lematizer = [wordnet_lematizer.lemmatize(token_word) for token_word in token_words]
print(words_lematizer)
Output:
['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'of', 'Applications', 'with', 'Noise', '.', 'Finds', 'core', 'sample', 'of', 'high', 'density', 'and', 'expands', 'cluster', 'from', 'them', '.', 'Good', 'for', 'data', 'which', 'contains', 'cluster', 'of', 'similar', 'density']
Note: lemmatization restores inflected forms (past tense, third person, plurals, and so on) to their dictionary form, so it does not produce meaningless stems. Some words are still not restored, though: "Finds", "expands" and "contains" remain in the third-person form. The reason is that wordnet_lematizer.lemmatize treats every word as a noun by default and therefore assumes these already are base forms; if we tell the function the word is a verb, "contains" becomes "contain". That is why we first need POS tagging to obtain each word's part of speech (details below).
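A minimal illustration of the effect of the pos argument:
wordnet_lematizer.lemmatize('contains')           # 'contains' -- treated as a noun by default
wordnet_lematizer.lemmatize('contains', pos='v')  # 'contain'  -- correctly reduced as a verb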
3. POS tagging
Tokenize first, then tag each token:
from nltk import word_tokenize,pos_tag
sentence = "DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density"
token_word = word_tokenize(sentence)  # tokenization
token_words = pos_tag(token_word)     # POS tagging
print(token_words)
Output:
[('DBSCAN', 'NNP'), ('-', ':'), ('Density-Based', 'JJ'), ('Spatial', 'NNP'), ('Clustering', 'NNP'), ('of', 'IN'), ('Applications', 'NNP'), ('with', 'IN'), ('Noise', 'NNP'), ('.', '.'), ('Finds', 'NNP'), ('core', 'NN'), ('samples', 'NNS'), ('of', 'IN'), ('high', 'JJ'), ('density', 'NN'), ('and', 'CC'), ('expands', 'VBZ'), ('clusters', 'NNS'), ('from', 'IN'), ('them', 'PRP'), ('.', '.'), ('Good', 'JJ'), ('for', 'IN'), ('data', 'NNS'), ('which', 'WDT'), ('contains', 'VBZ'), ('clusters', 'NNS'), ('of', 'IN'), ('similar', 'JJ'), ('density', 'NN')]
Note: the second element of each tuple is the word's POS tag. For an explanation of each tag, run nltk.help.upenn_tagset() or consult the Penn Treebank POS tag documentation.
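To look up a single tag, the same helper accepts a tag name; the tagset help data usually has to be downloaded once first:
import nltk
nltk.download('tagsets')       # help text for the Penn Treebank tagset
nltk.help.upenn_tagset('VBZ')  # prints the definition and examples for the VBZ tag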
III. Lemmatization (with explicit POS)
from nltk.stem import WordNetLemmatizer
words_lematizer = []
wordnet_lematizer = WordNetLemmatizer()
for word, tag in token_words:
    if tag.startswith('NN'):
        word_lematizer = wordnet_lematizer.lemmatize(word, pos='n')  # n: noun
    elif tag.startswith('VB'):
        word_lematizer = wordnet_lematizer.lemmatize(word, pos='v')  # v: verb
    elif tag.startswith('JJ'):
        word_lematizer = wordnet_lematizer.lemmatize(word, pos='a')  # a: adjective
    elif tag.startswith('R'):
        word_lematizer = wordnet_lematizer.lemmatize(word, pos='r')  # r: adverb
    else:
        word_lematizer = wordnet_lematizer.lemmatize(word)
    words_lematizer.append(word_lematizer)
print(words_lematizer)
Output:
['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'of', 'Applications', 'with', 'Noise', '.', 'Finds', 'core', 'sample', 'of', 'high', 'density', 'and', 'expand', 'cluster', 'from', 'them', '.', 'Good', 'for', 'data', 'which', 'contain', 'cluster', 'of', 'similar', 'density']
Note: the inflected forms have now been restored to their base forms, e.g. "expands" and "contains" have become "expand" and "contain". ("Finds" is left unchanged because the tagger labelled it NNP, a proper noun, at the start of the sentence.)
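The tag-to-POS mapping above can also be wrapped into a small helper so the whole step becomes one list comprehension (a sketch; the helper name get_wordnet_pos is just an illustrative choice):
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to the corresponding WordNet POS constant."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # default to noun, as lemmatize() itself does

lemmatizer = WordNetLemmatizer()
words_lematizer = [lemmatizer.lemmatize(w, get_wordnet_pos(t)) for w, t in token_words]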
IV. Removing stopwords
After tokenization and lemmatization we have the base form of every word, but the text still contains prepositions, determiners and other words that carry little weight in text analysis (these are called stopwords), and they need to be removed.
from nltk.corpus import stopwords
cleaned_words = [word for word in words_lematizer if word not in stopwords.words('english')]
print('Original words:', words_lematizer)
print('After stopword removal:', cleaned_words)
Output:
Original words: ['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'of', 'Applications', 'with', 'Noise', '.', 'Finds', 'core', 'sample', 'of', 'high', 'density', 'and', 'expand', 'cluster', 'from', 'them', '.', 'Good', 'for', 'data', 'which', 'contain', 'cluster', 'of', 'similar', 'density']
After stopword removal: ['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'Applications', 'Noise', '.', 'Finds', 'core', 'sample', 'high', 'density', 'expand', 'cluster', '.', 'Good', 'data', 'contain', 'cluster', 'similar', 'density']
Note: stopwords such as of, for and and have been removed.
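For longer texts it is worth building the stopword set once, since the comprehension above re-reads the stopword list for every token (a small efficiency sketch):
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # build the set once instead of per token
cleaned_words = [word for word in words_lematizer if word not in stop_words]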
V. Removing special characters
Punctuation is also unnecessary for text analysis, so we remove it as well. Here we filter with a list comprehension against a hand-written list, which makes it easy to customize: specific words to drop can be added alongside the punctuation marks. For example, we remove "DBSCAN" together with the punctuation.
characters = [',', '.','DBSCAN', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%','-','...','^','{','}']
words_list = [word for word in cleaned_words if word not in characters]
print(words_list)
Output:
['Density-Based', 'Spatial', 'Clustering', 'Applications', 'Noise', 'Finds', 'core', 'sample', 'high', 'density', 'expand', 'cluster', 'Good', 'data', 'contain', 'cluster', 'similar', 'density']
Note: the processed word list no longer contains special characters such as "-" or ".".
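An alternative is to filter against Python's built-in punctuation characters instead of maintaining the list by hand (a sketch; hyphenated tokens such as 'Density-Based' survive because they are not made up purely of punctuation):
import string
# Drop tokens that consist entirely of punctuation ('-', '.', '...', etc.);
# domain-specific words such as 'DBSCAN' still need their own explicit filter.
words_list = [word for word in cleaned_words
              if not all(ch in string.punctuation for ch in word) and word != 'DBSCAN']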
VI. Case conversion
To prevent the same word from being counted twice because it appears in both upper and lower case, we also normalize the case (here, to lower case).
words_lists = [x.lower() for x in words_list ]
print(words_lists)
Output:
['density-based', 'spatial', 'clustering', 'applications', 'noise', 'finds', 'core', 'sample', 'high', 'density', 'expand', 'cluster', 'good', 'data', 'contain', 'cluster', 'similar', 'density']
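One practical note: the NLTK stopword list is all lowercase, so capitalized stopwords (e.g. "The" at the start of a sentence) would slip through the filter used earlier. Lowercasing before the stopword step avoids this; a sketch of that ordering:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
lower_words = [w.lower() for w in words_lematizer]               # lowercase first
cleaned_words = [w for w in lower_words if w not in stop_words]  # then drop stopwords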
VII. Text analysis
After the six preprocessing steps above we have a clean word list, ready for text analysis or text mining (it can also be converted to a DataFrame first).
Counting word frequencies (our example analysis here):
from nltk import FreqDist
freq = FreqDist(words_lists)
for key, val in freq.items():
    print(str(key) + ':' + str(val))
Output:
density-based:1
spatial:1
clustering:1
applications:1
noise:1
finds:1
core:1
sample:1
high:1
density:2
expand:1
cluster:2
good:1
data:1
contain:1
similar:1
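As mentioned above, the frequency counts can also be loaded into a pandas DataFrame for further analysis (a minimal sketch, assuming pandas is available):
import pandas as pd
freq_df = pd.DataFrame(list(freq.items()), columns=['word', 'count'])
freq_df = freq_df.sort_values('count', ascending=False).reset_index(drop=True)
print(freq_df.head())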
Visualization (line plot):
freq.plot(20,cumulative=False)
Visualization (word cloud):
To draw a word cloud, the word list first has to be joined into a single string:
words = ' '.join(words_lists)
words
Output:
'density-based spatial clustering applications noise finds core sample high density expand cluster good data contain cluster similar density'
Drawing the word cloud:
from wordcloud import WordCloud
from imageio import imread
import matplotlib.pyplot as plt
pic = imread('./picture/china.jpg')  # mask image that defines the shape of the cloud
wc = WordCloud(mask=pic, background_color='white', width=800, height=600)
wwc = wc.generate(words)
plt.figure(figsize=(10,10))
plt.imshow(wwc)
plt.axis("off")
plt.show()
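If no mask image is at hand, the word cloud can also be generated without one (a minimal sketch):
wc = WordCloud(background_color='white', width=800, height=600)
plt.imshow(wc.generate(words))
plt.axis('off')
plt.show()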
Conclusion: from the line plot or the word cloud we can see at a glance that "density" and "cluster" occur most often; in the word cloud, the more frequent a word, the larger its font.