Using word 1.2 for Chinese Word Segmentation in Lucene 6.1.0

Why word 1.2?

The latest release of the word segmenter is 1.3, but that version has some bugs that can trigger a java.lang.OutOfMemoryError, so this article sticks with the more stable 1.2.

In Lucene 6.1.0, building your own Analyzer means subclassing Analyzer and implementing createComponents(String fieldName). word 1.2 does not implement this method, because it targets Lucene 4.x, where the signature was createComponents(String fieldName, Reader reader); Lucene 5.0 dropped the Reader parameter, so word's old override no longer matches the abstract method. Running ChineseWordAnalyzer therefore fails with:

Exception in thread "main" java.lang.AbstractMethodError: org.apache.lucene.analysis.Analyzer.createComponents(Ljava/lang/String;)Lorg/apache/lucene/analysis/Analyzer$TokenStreamComponents;
at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:140)

So ChineseWordAnalyzer needs a few modifications.

Implementing the createComponents(String fieldName) method

Create a subclass of Analyzer, MyWordAnalyzer, adapted from ChineseWordAnalyzer:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apdplat.word.segmentation.Segmentation;
import org.apdplat.word.segmentation.SegmentationAlgorithm;
import org.apdplat.word.segmentation.SegmentationFactory;

public class MyWordAnalyzer extends Analyzer {
    private Segmentation segmentation;

    public MyWordAnalyzer() {
        // Default to bidirectional maximum matching.
        segmentation = SegmentationFactory.getSegmentation(
                SegmentationAlgorithm.BidirectionalMaximumMatching);
    }

    public MyWordAnalyzer(Segmentation segmentation) {
        this.segmentation = segmentation;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new MyWordTokenizer(segmentation);
        return new TokenStreamComponents(tokenizer);
    }
}

The segmentation field selects the algorithm used for segmentation; the default is bidirectional maximum matching.
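To pick a different algorithm, pass a Segmentation into the second constructor. A minimal sketch, assuming SegmentationAlgorithm.MaximumMatching is among the enum constants word 1.2 exposes:

Analyzer analyzer = new MyWordAnalyzer(
        SegmentationFactory.getSegmentation(
                SegmentationAlgorithm.MaximumMatching)); // assumed constant: forward maximum matching

Next, implement MyWordTokenizer, modeled on ChineseWordTokenizer: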

import java.io.BufferedReader;
import java.io.IOException;
import java.util.Queue;
import java.util.concurrent.LinkedTransferQueue;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apdplat.word.recognition.StopWord;
import org.apdplat.word.segmentation.Segmentation;
import org.apdplat.word.segmentation.SegmentationAlgorithm;
import org.apdplat.word.segmentation.SegmentationFactory;
import org.apdplat.word.segmentation.Word;

public class MyWordTokenizer extends Tokenizer {
    private final CharTermAttribute charTermAttribute =
            addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAttribute =
            addAttribute(OffsetAttribute.class);
    private final PositionIncrementAttribute positionIncrementAttribute =
            addAttribute(PositionIncrementAttribute.class);

    private Segmentation segmentation;
    private BufferedReader reader;
    private final Queue<Word> words = new LinkedTransferQueue<>();
    private int startOffset = 0;

    public MyWordTokenizer() {
        segmentation = SegmentationFactory.getSegmentation(
                SegmentationAlgorithm.BidirectionalMaximumMatching);
    }

    public MyWordTokenizer(Segmentation segmentation) {
        this.segmentation = segmentation;
    }

    private Word getWord() throws IOException {
        Word word = words.poll();
        if (word == null) {
            // Queue exhausted: read the remaining input line by line
            // and segment each line into the queue.
            String line;
            while ((line = reader.readLine()) != null) {
                words.addAll(segmentation.seg(line));
            }
            startOffset = 0;
            word = words.poll();
        }
        return word;
    }

    @Override
    public final boolean incrementToken() throws IOException {
        // input only carries the text after reset() has been called,
        // so it is wrapped here instead of in the constructor (see below).
        reader = new BufferedReader(input);
        Word word = getWord();
        if (word != null) {
            int positionIncrement = 1;
            // Skip stop words, accumulating the position increment.
            while (StopWord.is(word.getText())) {
                positionIncrement++;
                startOffset += word.getText().length();
                word = getWord();
                if (word == null) {
                    return false;
                }
            }
            charTermAttribute.setEmpty().append(word.getText());
            offsetAttribute.setOffset(startOffset,
                    startOffset + word.getText().length());
            positionIncrementAttribute.setPositionIncrement(positionIncrement);
            startOffset += word.getText().length();
            return true;
        }
        return false;
    }
}

incrementToken() is the one method that must be implemented: returning true means more tokens follow, returning false means analysis is finished. The first line of incrementToken() assigns input to reader. input is the Reader field of Tokenizer; Tokenizer also holds a second Reader, inputPending. From the Tokenizer source:

public abstract class Tokenizer extends TokenStream {  
  /** The text source for this Tokenizer. */
  protected Reader input = ILLEGAL_STATE_READER;
  
  /** Pending reader: not actually assigned to input until reset() */
  private Reader inputPending = ILLEGAL_STATE_READER;

input holds the text to be analyzed, but the text is first stored in inputPending and only handed over to input once reset() is called. reset() is defined as follows:

 @Override
  public void reset() throws IOException {
    super.reset();
    input = inputPending;
    inputPending = ILLEGAL_STATE_READER;
  }

Before reset() is called, input does not yet contain the text to analyze, which is why input is wrapped into reader (a BufferedReader) only after reset(), inside incrementToken().
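To make the required call order concrete, here is a minimal sketch of the TokenStream contract as it applies to MyWordTokenizer (using the same test sentence as below):

Tokenizer tokenizer = new MyWordTokenizer();
tokenizer.setReader(new java.io.StringReader("乒乓球拍賣完了")); // goes into inputPending
// Calling incrementToken() at this point would throw IllegalStateException,
// because input is still ILLEGAL_STATE_READER.
tokenizer.reset();                 // now input = inputPending
while (tokenizer.incrementToken()) {
    // read the attributes here
}
tokenizer.end();
tokenizer.close();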
  
With these changes in place, the segmentation algorithms provided by word 1.2 can be used from Lucene 6.1.0:

Test class MyWordAnalyzerTest:

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class MyWordAnalyzerTest {

    public static void main(String[] args) throws IOException {
        String text = "乒乓球拍賣完了";
        Analyzer analyzer = new MyWordAnalyzer();
        TokenStream tokenStream = analyzer.tokenStream("text", text);
        // Prepare to consume tokens.
        tokenStream.reset();
        // Consume tokens one by one.
        while (tokenStream.incrementToken()) {
            // The term itself.
            CharTermAttribute charTermAttribute =
                    tokenStream.getAttribute(CharTermAttribute.class);
            // Start and end offsets of the term in the text.
            OffsetAttribute offsetAttribute =
                    tokenStream.getAttribute(OffsetAttribute.class);
            // Position increment relative to the previous term.
            PositionIncrementAttribute positionIncrementAttribute =
                    tokenStream.getAttribute(PositionIncrementAttribute.class);

            System.out.println(charTermAttribute.toString() + " "
                    + "(" + offsetAttribute.startOffset() + " - "
                    + offsetAttribute.endOffset() + ") "
                    + positionIncrementAttribute.getPositionIncrement());
        }
        // Done consuming.
        tokenStream.close();
    }
}

The result:

[Image: segmentation result using word 1.2]

Because incrementToken() removes stop words, the particle "了" does not appear in the output. The result also shows that word segments the sentence into "乒乓球拍" (table-tennis paddle) and "賣完" (sold out). Compare this with SmartChineseAnalyzer:

[Image: segmentation result using SmartChineseAnalyzer]
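For reference, the comparison can be reproduced with the same consumption loop; a minimal sketch (SmartChineseAnalyzer ships in the lucene-analyzers-smartcn module, package org.apache.lucene.analysis.cn.smart):

Analyzer smart = new SmartChineseAnalyzer();
TokenStream ts = smart.tokenStream("text", "乒乓球拍賣完了");
ts.reset();
while (ts.incrementToken()) {
    System.out.println(ts.getAttribute(CharTermAttribute.class));
}
ts.end();
ts.close();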

All in all, word's segmentation quality holds up well.
