Analyzer
https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers,+Tokenizers,+and+Filters
The tokenizer splits a sentence into individual terms, and filters then screen the resulting tokens. For example, in Chinese, words such as "的" and "呀" that contribute little to the main meaning of a sentence are removed; the English equivalents are stop words such as "is" and "a".
An analyzer consists of two parts: a tokenizer and one or more filters (token filters, which are applied in the order they are listed). Separate analyzer chains can also be declared for index time and query time, for example:
<fieldType name="text_ik_analysis" class="solr.TextField" sortMissingLast="true" omitNorms="true" autoGeneratePhraseQueries="false">
<analyzer type="index">
<tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="true"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
<filter class="solr.LengthFilterFactory" min="2" max="20" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="true"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
<filter class="solr.LengthFilterFactory" min="2" max="20" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
</fieldType>
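The field type above only defines the analysis chain; to take effect it must be referenced from a field in the same schema. A minimal sketch, assuming a hypothetical field named content:

<!-- Hypothetical field: text indexed into "content" goes through the IK
     tokenizer and the synonym/length/duplicate filters defined above. -->
<field name="content" type="text_ik_analysis" indexed="true" stored="true"/>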
Tokenizer
Common tokenizers include (a configuration sketch follows the list):
- KeywordTokenizerFactory: treats the entire input as a single token, regardless of its content
- LetterTokenizerFactory: splits on letters and discards non-letter characters, e.g. "I can't" ==> "I", "can", "t"
- WhitespaceTokenizerFactory: splits on whitespace, e.g. "I do" ==> "I", "do"
- IKTokenizerFactory: the IK tokenizer for Chinese text
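Swapping tokenizers only means changing the <tokenizer> element of the field type; the rest of the analyzer stays the same. A minimal sketch, assuming a hypothetical field type named text_ws that uses WhitespaceTokenizerFactory:

<fieldType name="text_ws" class="solr.TextField">
  <analyzer>
    <!-- split on whitespace only; punctuation stays attached to the word -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>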
Filter
Common filters include (see the sketch after this list):
- LowerCaseFilterFactory: converts uppercase letters to lowercase; non-letter characters are left unchanged
- SynonymFilterFactory: synonym expansion, driven by a synonyms file (synonyms.txt in the example above)
- LengthFilterFactory: keeps only tokens within a given character-length range (min/max)
- RemoveDuplicatesTokenFilterFactory: removes duplicate tokens
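Filters are declared after the tokenizer and run in the order listed. A minimal sketch, assuming a hypothetical field type named text_lower that chains the lower-case, length, and duplicate-removal filters behind Solr's StandardTokenizerFactory:

<fieldType name="text_lower" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- convert letters to lower case; non-letter characters are untouched -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- keep only tokens between 2 and 20 characters long -->
    <filter class="solr.LengthFilterFactory" min="2" max="20"/>
    <!-- drop tokens whose text repeats at the same position -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>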