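Introduction
Solr is an open-source search engine whose core is built on Lucene.
Since that brings up search engines: Solr in Action explains in detail which scenarios a search engine suits and how it differs from a database. I learned a lot from it, so here is an excerpt: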
1. Search engine
Search engines like Solr are optimized to handle data exhibiting four main characteristics:
- Text-centric
- Read-dominant
- Document-oriented
- Flexible schema
1.1 Text-centric
A search engine supports non-text data such as dates and numbers, but its primary strength is handling text data based on natural language.
If users aren’t interested in the information in the text, a search engine may not be the best solution for your problem.
Think about whether your data is text-centric. The main consideration is whether or not the text fields in your data contain information that users will want to query.
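Search engines such as Solr are optimized for searching text that contains natural language. Email, web pages, résumés, PDF documents, and social content such as tweets, Weibo posts, and blog posts are all a good fit for Solr.
1.2 Read-dominant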
Think of read-dominant as meaning that documents are read far more often than they’re created or updated.
If you must update existing data in a search engine often, that could be an indication that a search engine might not be the best solution for your needs. Another NoSQL technology, like Cassandra, might be a better choice when you need fast random writes to existing data.
1.3 Document-oriented
In a search engine, a document is a self-contained collection of fields, in which each field only holds data (and can have multiple values) and doesn’t contain subfields.
A search engine isn’t the place to store data unless it’s useful for search or displaying results.
In general, you should store the minimal set of information for each document needed to satisfy search requirements.
1.4 Flexible schema
In a relational database, every row in a table has the same structure. In Solr, documents can have different fields.
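To make the flexible-schema point concrete, here is a minimal sketch of two documents in Solr's XML update format that carry different sets of fields. The field names (`title_txt`, `author_txt`, `summary_txt`) and values are hypothetical, and the sketch assumes the schema has a dynamic field pattern such as `*_txt` that accepts them.

```xml
<!-- Two documents in one update message; each has its own set of fields. -->
<add>
  <doc>
    <field name="id">doc-1</field>
    <field name="title_txt">Getting started with Solr</field>
  </doc>
  <doc>
    <field name="id">doc-2</field>
    <!-- No title here, but two fields that doc-1 does not have. -->
    <field name="author_txt">Jane Doe</field>
    <field name="summary_txt">Notes on search engines</field>
  </doc>
</add>
```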
2. Solr vs Lucene
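The differences between the two are:
- Lucene is essentially a search library, not a standalone application, whereas Solr is one
- Lucene focuses on the low-level building blocks of search, whereas Solr focuses on enterprise applications
- Lucene does not handle the administration needed to run a search service, whereas Solr does
So, in one sentence: Solr is an extension of Lucene aimed at enterprise search applications.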
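Solr provides faceted search and hit highlighting, supports multiple output formats (including XML/XSLT and JSON), and ships with an HTTP-based administration interface. Solr's features include:
- Advanced full-text search capabilities
- A real data schema with dynamic fields and a unique key
- Optimization for high-volume web traffic
- Standards-based open interfaces (XML and HTTP)
- A comprehensive HTML administration interface
- Scalability: efficient replication to other Solr search servers
- Flexibility and adaptability through XML configuration
- An extensible plugin architecture
- Dynamic grouping and filtering of results
- A highly configurable and extensible caching mechanism
Because Solr wraps and extends Lucene, the two share much of the same terminology.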
2. Solr configuration
2.1 solrconfig.xml
Defines Solr's request handlers and various extension components. There are many settings in this file, but most of them can be left at their defaults.
- dataDir: where the index is stored
- autoCommit: while indexing, Solr does not write incoming updates to the index files immediately. It buffers them first and writes the buffered data to the index files only when a commit is issued. autoCommit issues that commit automatically, based on the settings below.
  - maxDocs: Maximum number of documents to add since the last commit before automatically triggering a new commit.
  - maxTime: Maximum amount of time in ms that is allowed to pass since a document was added before automatically triggering a new commit.
  - openSearcher: If false, the commit causes recent index changes to be flushed to stable storage, but does not cause a new searcher to be opened to make those changes visible.
- autoSoftCommit: like autoCommit, except it causes a 'soft' commit, which only ensures that changes are visible but does not ensure that data is synced to disk. This is faster and more near-realtime friendly than a hard commit.
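The settings above could be put together in solrconfig.xml roughly as follows. This is only a sketch: the numeric values are illustrative assumptions rather than recommendations, and the right numbers depend on how quickly changes need to be durable and visible.

```xml
<!-- Index location; an empty default lets Solr fall back to its standard data directory. -->
<dataDir>${solr.data.dir:}</dataDir>

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit: flush to stable storage after 10000 docs or 15 s, whichever comes first. -->
  <autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>15000</maxTime>
    <!-- Do not open a new searcher on hard commits; visibility is left to soft commits. -->
    <openSearcher>false</openSearcher>
  </autoCommit>

  <!-- Soft commit: make changes visible to searches within about 1 s, without syncing to disk. -->
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
</updateHandler>
```

Pairing a less frequent hard commit (with `openSearcher` set to false) with a more frequent soft commit is a common way to get both durability and near-real-time visibility.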
2.2 managed-schema
Defines the fields and field types of the index.
2.2.1 fieldType: field types (int, float, string, ik...)
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
<fieldType name="text_ik" class="solr.TextField" sortMissingLast="true" omitNorms="true" autoGeneratePhraseQueries="false">
  <analyzer type="index" isMaxWordLength="false" class="org.wltea.analyzer.lucene.IKAnalyzer"/>
  <analyzer type="query" isMaxWordLength="true" class="org.wltea.analyzer.lucene.IKAnalyzer"/>
</fieldType>
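An analyzer can also chain token filters after the tokenizer. As a small sketch, the hypothetical `text_lower` type below (the name is made up for illustration) adds Solr's lowercase filter at both index and query time, so that a search for "solr" also matches "Solr".

```xml
<!-- Hypothetical field type: standard tokenization followed by lowercasing. -->
<fieldType name="text_lower" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercase every token before it is written to the index. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Apply the same lowercasing to query terms so both sides match. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Keeping the index-time and query-time chains symmetric is the simplest choice; they only need to differ when one side requires extra processing.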
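2.2.2 field: a field, defining the field name and its type
- name: the field name
- type: the field type
- indexed: whether the field is indexed
- stored: whether the field value is stored; if it is not stored, the field can still be searched, but its content cannot be displayed
- required: whether the field is mandatory; if so, the field must have a value, otherwise indexing fails with an error
- multiValued: whether multiple values are allowed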
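- docValues: whether to also keep the field's values in a column-oriented structure, which is used for sorting and faceting
- sortMissingLast: whether documents with no value for the field are sorted after documents that have one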
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
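To show how these attributes combine, here is a sketch with a few hypothetical fields (the names `title`, `tags`, `body`, and `raw_payload` are assumptions for illustration). They reuse the `string` and `text_general` types defined earlier.

```xml
<!-- Searchable and returned in results. -->
<field name="title" type="text_general" indexed="true" stored="true"/>

<!-- A document may carry several tags, so the field is multi-valued. -->
<field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>

<!-- Searchable but not stored: matches queries, yet cannot be displayed. -->
<field name="body" type="text_general" indexed="true" stored="false"/>

<!-- Stored but not indexed: returned with results, yet cannot be searched. -->
<field name="raw_payload" type="string" indexed="false" stored="true"/>
```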
2.2.3 dynamicFields
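A dynamic field means that if no explicit definition for a field name is found in the schema, Solr looks for a matching dynamic field pattern instead.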
<dynamicField name="*_txt" type="text_general" indexed="true" stored="true" multiValued="true"/>
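Other suffix patterns can be declared the same way. The sketch below assumes the `int` and `string` field types shown earlier; the suffixes `_i` and `_s` are only a naming convention, not something Solr requires.

```xml
<!-- Any field whose name ends in _i is handled as an int, e.g. page_count_i. -->
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>

<!-- Any field whose name ends in _s is handled as an untokenized string, e.g. category_s. -->
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
```

With these in place, a document can introduce a field such as `page_count_i` without any change to the schema.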
2.2.4 copyField
Copies the source field into the destination field; maxChars limits the maximum number of characters copied.
<copyField source="body" dest="teaser" maxChars="300"/>
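A common use of copyField is to gather several fields into one catch-all field so that a single query can search all of them. The following is a sketch under the assumption that fields named `title` and `body` exist; the destination field `text` is declared multi-valued because more than one source is copied into it.

```xml
<!-- Catch-all destination: indexed for search, not stored since the sources hold the originals. -->
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

<copyField source="title" dest="text"/>
<copyField source="body" dest="text"/>
<!-- A wildcard source copies every field matching the pattern. -->
<copyField source="*_txt" dest="text"/>
```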
2.2.5 uniqueKey
Equivalent to a primary key in a database. If a document with a duplicate key is indexed, it overwrites the earlier record.
<uniqueKey>id</uniqueKey>
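The overwrite behaviour can be illustrated with an update message in which two documents share the same `id`. This is a sketch with made-up values, assuming a `*_txt` dynamic field is available; after both documents are processed, only the second version remains in the index.

```xml
<!-- overwrite="true" is the default; it is spelled out here for clarity. -->
<add overwrite="true">
  <doc>
    <field name="id">doc-1</field>
    <field name="title_txt">First version</field>
  </doc>
  <doc>
    <!-- Same unique key as above, so this document replaces the previous one. -->
    <field name="id">doc-1</field>
    <field name="title_txt">Second version</field>
  </doc>
</add>
```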
2.2.6 defaultSearchField
If the search parameters do not specify a particular field, this is the default field that is searched.
<defaultSearchField>text</defaultSearchField>
2.2.7 solrQueryParser
Configures the logical operator applied between the terms of a query; it can be either AND or OR.
<solrQueryParser defaultOperator="OR" />
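Note that, to my knowledge, both `<defaultSearchField>` and `<solrQueryParser defaultOperator="..."/>` were deprecated in later Solr releases and removed from the schema in Solr 7, in favor of the `df` and `q.op` request parameters. On a newer version, the equivalent defaults could be set on the request handler in solrconfig.xml, roughly like this (the handler name `/select` and the field name `text` are assumptions carried over from the examples above):

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Default field to search when the query names none. -->
    <str name="df">text</str>
    <!-- Default operator between query terms: AND or OR. -->
    <str name="q.op">OR</str>
  </lst>
</requestHandler>
```

Both parameters can also be passed per request, which overrides the handler defaults.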