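I had never worked on anything search-engine related before. Recently the literature and guideline search in 口袋 (Koudai) needed tuning, so I dove into Solr. To my surprise, there is very little material on Solr (let alone in Chinese), and the official documentation is very dry, piling up one example after another. I happened to find the book Solr in Action (which uses Solr 4 for its examples), and after a few chapters I had already learned a lot. So this biweekly sharing is a collection of the passages from Solr in Action that gave me an "aha" moment.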
Why do I need a search engine?
Search engines like Solr are optimized to handle data exhibiting four main characteristics:
- Text-centric
- Read-dominant
- Document-oriented
- Flexible schema
Text-centric
We think text-centric is more appropriate for describing the type of data Solr handles.
Of course, a search engine also supports non-text data such as dates and numbers, but its primary strength is handling text data based on natural language.
A search engine is mainly meant for searching over large chunks of text.
Read-dominant
Think of read-dominant as meaning that documents are read far more often than they’re created or updated.
If you must update existing data in a search engine often, that could be an indication that a search engine might not be the best solution for your needs. Another NoSQL technology, like Cassandra, might be a better choice when you need fast random writes to existing data.
A search engine is read-dominant; if you need frequent updates, Solr is not going to be a good choice.
Document-oriented
In a search engine, a document is a self-contained collection of fields, in which each field only holds data and doesn’t contain nested fields.
In general, you should store the minimal set of information for each document needed to satisfy search requirements.
The data model of a search engine is document-oriented: each document contains a set of fields.
Flexible schema
In a relational database, every row in a table has the same structure. In Solr, documents can have different fields.
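Documents do not share a fixed structure: different documents can be made up of completely different fields, as long as each field is defined in managed-schema.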
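To make the last two points concrete, here is a small Python sketch of two flat documents with different fields sitting in the same index. The field names and the `DECLARED_FIELDS` set are made up for illustration; in a real setup the declared fields would come from the Solr schema, not from a Python set.

```python
# Hypothetical set of field names, standing in for what the schema declares.
DECLARED_FIELDS = {"id", "title", "abstract", "journal", "issuing_body", "year"}


def check_document(doc):
    """Raise ValueError unless doc is flat and uses only declared fields."""
    for name, value in doc.items():
        if name not in DECLARED_FIELDS:
            raise ValueError(f"field {name!r} is not declared in the schema")
        if isinstance(value, dict):
            raise ValueError(f"field {name!r} holds a nested document")
    return doc


# Two documents in the same index with different sets of fields.
paper = {"id": "paper-1", "title": "Asthma in adults",
         "journal": "Example Journal", "year": 2015}
guideline = {"id": "guide-1", "title": "Asthma guideline",
             "issuing_body": "Example Society"}

for document in (paper, guideline):
    check_document(document)
```

Both documents pass the check even though only `id` and `title` are shared, which is the "flexible schema" idea; a value that is itself a dict is rejected, which mirrors the "no nested fields" rule from the quote above.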
Don’t use a search engine to ...
- First, search engines are designed to return a small set of documents per query, usually 10 to 100.
A search engine should only be used to return small result sets. If you request a huge number of results in one go, the index lookup itself is still fairly fast, but rebuilding that many documents from the index is slow.
- Another use case in which you shouldn’t use a search engine is deep analytic tasks that require access to a large subset of the index.
- Also, there’s no direct support in most search engines for document-level security, at least not in Solr.
Solr does not support document-level security checks.
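In practice, "a small set of documents per query" means paging through results with Solr's `start` and `rows` query parameters instead of asking for everything at once. The sketch below only builds the request URL; the base URL, the core name `guides` and the helper `select_url` are assumptions for illustration.

```python
from urllib.parse import urlencode


def select_url(base_url, query, page=0, page_size=10):
    """Build a Solr /select URL that asks for one small page of results."""
    params = {
        "q": query,
        "start": page * page_size,  # offset of the first result to return
        "rows": page_size,          # number of results on this page
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Third page (page index 2) of ten results each, for a hypothetical core.
url = select_url("http://localhost:8983/solr/guides", "title:asthma", page=2)
```

Keeping `rows` small keeps each request cheap, because Solr only has to rebuild the handful of documents on the requested page.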
What is Solr?
Information retrieval engine
Solr is built on Apache Lucene, a popular, Java-based, open source, information retrieval library.
In a nutshell, Solr uses Lucene to provide the core data structures for indexing documents and executing searches to find documents.
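As you can see, Solr actually relies on Lucene for the core operations of building the index and executing searches.
One key difference between a Lucene query and a database query is that in Lucene results are ranked by their relevance to a query, and database results can only be sorted by one or more of the table columns.
Lucene ranks search results with a fairly complex formula that is influenced by /// factors, whereas a database can only do a simple sort on one or more columns.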
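To get a feel for what "ranked by relevance" means, here is a toy TF-IDF ranking in Python. It is a deliberately simplified stand-in and not Lucene's actual scoring formula; the tokenizer, the weighting and the sample documents are all assumptions for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Very naive analysis step: lowercase and split on whitespace."""
    return text.lower().split()


def rank(query, docs):
    """Return ids of matching docs, most relevant first (toy TF-IDF score)."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    scores = {}
    for doc_id, tokens in tokenized.items():
        if not tokens:
            continue
        term_counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            # Document frequency: how many documents contain the term.
            df = sum(1 for other in tokenized.values() if term in other)
            if df == 0:
                continue
            idf = math.log(1 + n_docs / df)              # rarer terms weigh more
            score += (term_counts[term] / len(tokens)) * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    "a": "solr search engine",
    "b": "solr solr relevance ranking in solr",
    "c": "relational database rows",
}
ranking = rank("solr", docs)
```

Here document `b` comes out ahead of `a` because the query term makes up a larger share of its text, and `c` is not returned at all. A database `ORDER BY` has no equivalent notion of "how well does this row match".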
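MapReduce is a programming model that distributes large-scale data-processing operations across a cluster of commodity servers by formulating an algorithm into two phases: map and reduce.
MapReduce was first proposed by Google, where it was used for indexing and searching massive numbers of web pages. Similarly, Solr provides SolrCloud, which borrows the same divide-and-merge idea for searching large-scale data: the index is split into shards across servers, each shard answers the query, and the partial results are merged. This greatly improves performance and the availability of the service.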
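The two phases are easier to see in code. The following single-process Python sketch builds a tiny inverted index in MapReduce style; it only illustrates the programming model and is not how Solr or SolrCloud is implemented. The function names and sample documents are made up.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit one (term, doc_id) pair per distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def reduce_phase(term, doc_ids):
    """Reduce: merge all doc ids emitted for one term into a posting list."""
    return term, sorted(doc_ids)


def build_index(docs):
    # Shuffle step: group the mapped pairs by term before reducing.
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, mapped_id in map_phase(doc_id, text):
            grouped[term].append(mapped_id)
    return dict(reduce_phase(term, ids) for term, ids in grouped.items())


index = build_index({1: "solr search", 2: "search engine"})
```

On a real cluster each `map_phase` call could run on a different server, since it only looks at one document, and the reduce step for each term is likewise independent of the others.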
With Lucene, you need to write Java code to define fields and how to analyze those fields. Solr adds a simple, declarative way to define the structure of your index and how you want fields to be represented and analyzed: an XML-configuration document named schema.xml. Solr also provides copy and dynamic fields.
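OK, so if Solr is built on Lucene, what is the difference between the two? Lucene is not very user-friendly: if you use it directly, you have to write tedious Java code to define fields, whereas Solr lets you configure fields in a simple XML file. Solr also provides copy and dynamic fields.
A copy field gives you a combined field: a single field name can hold the contents of several source fields.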
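Here is roughly what such a declarative schema looks like, using the `field`, `dynamicField` and `copyField` elements of schema.xml. The fragment is embedded in a Python string only so that it can be parsed and the rules applied to a sample document; the field names are hypothetical, and the two helper functions merely imitate what Solr does at index time.

```python
import fnmatch
import xml.etree.ElementTree as ET

# Hypothetical schema.xml fragment; the field names are made up.
SCHEMA_XML = """
<schema name="example" version="1.5">
  <field name="title" type="text_general" indexed="true" stored="true"/>
  <field name="abstract" type="text_general" indexed="true" stored="true"/>
  <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
  <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
  <copyField source="title" dest="text"/>
  <copyField source="abstract" dest="text"/>
</schema>
"""

root = ET.fromstring(SCHEMA_XML)
fields = {f.get("name") for f in root.iter("field")}
dynamic_patterns = [d.get("name") for d in root.iter("dynamicField")]
copy_rules = [(c.get("source"), c.get("dest")) for c in root.iter("copyField")]


def is_known_field(name):
    """A field is usable if declared explicitly or matched by a dynamic pattern."""
    return name in fields or any(
        fnmatch.fnmatchcase(name, pattern) for pattern in dynamic_patterns
    )


def apply_copy_fields(doc):
    """Imitate copyField: append each source value to its destination field."""
    result = dict(doc)
    for source, dest in copy_rules:
        if source in doc:
            existing = result.get(dest, [])
            if not isinstance(existing, list):
                existing = [existing]
            result[dest] = existing + [doc[source]]
    return result


indexed = apply_copy_fields({"title": "Asthma guideline", "abstract": "Short summary"})
```

After the copy rules are applied, `indexed["text"]` holds both the title and the abstract, so a single query against `text` searches both. The dynamic pattern `*_s` means a field such as `journal_s` is accepted without being declared individually.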