Searching data with the term filter
(1) Insert some test post data
POST /forum/article/_bulk
{ "index": { "_id": 1 }}
{ "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 2 }}
{ "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }
{ "index": { "_id": 3 }}
{ "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 4 }}
{ "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }
GET /forum/_mapping/article
{
"forum": {
"mappings": {
"article": {
"properties": {
"articleID": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"hidden": {
"type": "boolean"
},
"postDate": {
"type": "date"
},
"userID": {
"type": "long"
}
}
}
}
}
}
In ES 5.2, a field of type=text gets two fields by default: the field itself (e.g. articleID), which is analyzed, and a field.keyword sub-field (articleID.keyword), which is not analyzed and keeps at most the first 256 characters.
(2) Search posts by user ID
GET /forum/article/_search
{
"query" : {
"constant_score" : {
"filter" : {
"term" : {
"userID" : 1
}
}
}
}
}
constant_score means we do not care about the relevance score; we are just retrieving data.
term: the search text is not analyzed; whatever you type is matched against the inverted index exactly as entered.
For comparison, if the search text were analyzed, "hello world" --> "hello" and "world", and the two terms would be looked up in the inverted index separately.
With term, "hello world" --> "hello world": the whole string "hello world" is looked up in the inverted index as a single term.
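To see the analyzed case in action, the _analyze API can be used. The sketch below assumes the standard analyzer (the default for text fields); the exact tokens depend on the analyzer configured for the field:
GET /_analyze
{
  "analyzer": "standard",
  "text": "hello world"
}
// returns two tokens: "hello" and "world"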
(3) Search for posts that are not hidden
GET /forum/article/_search
{
"query" : {
"constant_score" : {
"filter" : {
"term" : {
"hidden" : false
}
}
}
}
}
(4) Search posts by post date: find posts whose postDate is 2017-01-01
GET /forum/article/_search
{
"query": {
"constant_score": {
"filter": {
"term": {
"postDate": "2017-01-01"
}
}
}
}
}
(5) Search posts by article ID
GET /forum/article/_search
{
"query" : {
"constant_score" : {
"filter" : {
"term" : {
"articleID" : "XHDK-A-1293-#fJ3"
}
}
}
}
}
// The response is as follows
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
GET /forum/article/_search
{
"query" : {
"constant_score" : {
"filter" : {
"term" : {
"articleID.keyword" : "XHDK-A-1293-#fJ3"
}
}
}
}
}
// The response is as follows
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "forum",
"_type": "article",
"_id": "1",
"_score": 1,
"_source": {
"articleID": "XHDK-A-1293-#fJ3",
"userID": 1,
"hidden": false,
"postDate": "2017-01-01"
}
}
]
}
}
// Inspect how articleID is analyzed
GET /forum/_analyze
{
"field": "articleID",
"text": "XHDK-A-1293-#fJ3"
}
// Response
{
"tokens": [
{
"token": "xhdk",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "a",
"start_offset": 5,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "1293",
"start_offset": 7,
"end_offset": 11,
"type": "<NUM>",
"position": 2
},
{
"token": "fj3",
"start_offset": 13,
"end_offset": 16,
"type": "<ALPHANUM>",
"position": 3
}
]
}
For a text field, which is analyzed by default, every articleID value is tokenized when the inverted index is built; after tokenization the original articleID no longer exists as a whole, and only the individual terms produced by the analyzer are in the inverted index.
term, however, does not analyze the search text: XHDK-A-1293-#fJ3 --> XHDK-A-1293-#fJ3. But when articleID was indexed, XHDK-A-1293-#fJ3 --> xhdk, a, 1293, fj3, so naturally nothing matches.
articleID.keyword is the sub-field that recent ES versions create automatically, and it is not analyzed. So when an articleID comes in, it is indexed twice: once as the field itself, which is analyzed, with the resulting terms put into the inverted index;
and once as articleID.keyword, which is not analyzed (keeping at most 256 characters) and goes into the inverted index as a single string.
So for a term filter on a text field you can match against the built-in field.keyword, but keep in mind it only retains the first 256 characters by default. Whenever possible, define the mapping yourself and mark the field as not_analyzed. In recent ES versions,
not_analyzed is no longer needed; simply set type=keyword.
(6) Rebuild the index
DELETE /forum
PUT /forum
{
"mappings": {
"article": {
"properties": {
"articleID": {
"type": "keyword"
}
}
}
}
}
// Load the data
POST /forum/article/_bulk
{ "index": { "_id": 1 }}
{ "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 2 }}
{ "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }
{ "index": { "_id": 3 }}
{ "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 4 }}
{ "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }
Searching again by article ID (and likewise by post date) now returns data
GET /forum/article/_search
{
"query" : {
"constant_score" : {
"filter" : {
"term" : {
"articleID" : "XHDK-A-1293-#fJ3"
}
}
}
}
}
Summary
(1) term filter: searches on an exact value; numbers, booleans, and dates support this natively
(2) A text field must be indexed as not_analyzed (or as type keyword) before a term query can be used on it
(3) Equivalent to a single WHERE condition in SQL
How filter execution works internally (the bitset and caching mechanisms)
(1) Look up the search term in the inverted index and get the document list
Take a date field as an example:
word          doc1   doc2   doc3
2017-01-01     *      *
2017-02-02            *      *
2017-03-03     *      *      *
filter: 2017-02-02
Looking up 2017-02-02 in the inverted index, its document list is doc2, doc3
(2) For each entry found in the inverted index, build a bitset, e.g. [0, 0, 0, 1, 0, 1]
Using the doc list that was found, build a bitset: a binary array whose elements are 0 or 1, marking whether each doc matches the filter condition, 1 if it matches and 0 if it does not
For example:
[0, 1, 1]
doc1: does not match this filter
doc2 and doc3: match this filter
Using the simplest possible data structure to implement complex functionality saves memory and improves performance
(3) Traverse the bitset of each filter condition, starting from the sparsest, to find the documents that satisfy all conditions
A single search request can actually carry multiple filter conditions; each filter condition corresponds to one bitset, and the bitsets are traversed starting with the sparsest one
[0, 0, 0, 1, 0, 0]: relatively sparse
[0, 1, 0, 1, 0, 1]
Traversing the sparser bitset first filters out as much data as possible up front
After traversing all the bitsets, the docs that match every filter condition are found
Request: filter, postDate=2017-01-01, userID=1
postDate: [0, 0, 1, 1, 0, 0]
userID: [0, 1, 0, 1, 0, 1]
After walking through both bitsets, the doc that matches all conditions is doc4, and that document can be returned to the client
(4) Caching bitsets: ES tracks queries, and for filter conditions that occur more than a certain number of times within the most recent 256 queries, it caches their bitsets. Bitsets are not cached for small segments (fewer than 1,000 docs, or less than 3% of the index).
For example, postDate=2017-01-01 with bitset [0, 0, 1, 1, 0, 0] can be cached in memory, so the next time the same condition arrives there is no need to rescan the inverted index and regenerate the bitset, which can improve performance substantially.
If, within the most recent 256 filters, some filter occurs more than a certain number of times (the threshold is not fixed), its bitset is cached automatically.
Results that a filter obtains against a small segment need not be cached: a segment with fewer than 1,000 records, or smaller than 3% of the total index size.
A small segment holds very little data, so even scanning it is fast; segments are also merged automatically in the background, and small segments quickly get merged with other small segments into a larger one, at which point caching would be pointless anyway because the small segment soon disappears.
A bitset for a small segment might look like [0, 0, 1, 0].
The advantage of filter over query is caching, but what gets cached is not the complete doc list returned by a filter; it is the filter's bitset that is cached, so the inverted index does not have to be scanned again next time.
(5) In most cases filters execute before the query, so that as much data as possible is filtered out first
query: computes each doc's relevance score against the search condition and sorts by that score
filter: simply filters out the data you want, without computing relevance scores and without sorting
(6) If documents are added or modified, the cached bitsets are updated automatically
postDate=2017-01-01, [0, 0, 1, 0]
If a document with id=5 and postDate=2017-01-01 is added, it is automatically reflected in the bitset of the postDate=2017-01-01 filter; the cache updates itself, and the bitset for postDate=2017-01-01 becomes [0, 0, 1, 0, 1]
If document id=1 with postDate=2016-12-30 is modified to postDate=2017-01-01, the bitset is also updated automatically: [1, 0, 1, 0, 1]
(7) From then on, any search with the same filter condition directly reuses the cached bitset for that condition
Combining multiple filter conditions with bool
1. Search for posts whose postDate is 2017-01-01 or whose article ID is XHDK-A-1293-#fJ3, and whose postDate must not be 2017-01-02
GET /forum/article/_search
{
"query": {
"constant_score": {
"filter": {
"bool": {
"should": [
{"term": { "postDate": "2017-01-01" }},
{"term": {"articleID": "XHDK-A-1293-#fJ3"}}
],
"must_not": {
"term": {
"postDate": "2017-01-02"
}
}
}
}
}
}
}
must: must match; should: matching any one of them is enough; must_not: must not match; filter: must match, but without affecting the relevance score (see the sketch below)
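As a quick illustration of the filter clause (a sketch against the same forum data; the must clause is scored while the filter clause only filters):
GET /forum/article/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "userID": 1 } }
      ],
      "filter": [
        { "term": { "hidden": false } }
      ]
    }
  }
}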
2. Search for posts whose article ID is XHDK-A-1293-#fJ3, or whose article ID is JODL-X-1937-#pV7 and whose postDate is 2017-01-01
GET /forum/article/_search
{
"query": {
"constant_score": {
"filter": {
"bool": {
"should": [
{
"term": {
"articleID": "XHDK-A-1293-#fJ3"
}
},
{
"bool": {
"must": [
{
"term":{
"articleID": "JODL-X-1937-#pV7"
}
},
{
"term": {
"postDate": "2017-01-01"
}
}
]
}
}
]
}
}
}
}
}
Summary
(1) bool: must, must_not, should, for combining multiple filter conditions
(2) bool clauses can be nested
(3) Equivalent to combining multiple conditions with AND in SQL: once you have learned the search syntax, you can reproduce much of what the common SQL constructs do
Searching multiple values with terms (similar to IN in SQL) and refining multi-value search results
1. Add a tag field to the post data
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"tag" : ["java", "hadoop"]} }
{ "update": { "_id": "2"} }
{ "doc" : {"tag" : ["java"]} }
{ "update": { "_id": "3"} }
{ "doc" : {"tag" : ["hadoop"]} }
{ "update": { "_id": "4"} }
{ "doc" : {"tag" : ["java", "elasticsearch"]} }
2. (1) Search for posts whose articleID is KDKE-B-9947-#kL5 or QQPX-R-3956-#aD8; (2) search for posts whose tag contains java
GET /forum/article/_search
{
"query": {
"constant_score": {
"filter": {
"terms": {
"articleID.keyword": [
"KDKE-B-9947-#kL5",
"QQPX-R-3956-#aD8"
]
}
}
}
}
}
GET /forum/article/_search
{
"query" : {
"constant_score" : {
"filter" : {
"terms" : {
"tag" : ["java"]
}
}
}
}
}
3. The search results above are not precise enough. Refine them so that only posts tagged with nothing but java are returned.
When adding the tags, also store the number of tags per post
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"tag_cnt" : 2} }
{ "update": { "_id": "2"} }
{ "doc" : {"tag_cnt" : 1} }
{ "update": { "_id": "3"} }
{ "doc" : {"tag_cnt" : 1} }
{ "update": { "_id": "4"} }
{ "doc" : {"tag_cnt" : 2} }
Search again
GET /forum/article/_search
{
"query": {
"constant_score": {
"filter": {
"bool": {
"must": [
{
"term": {
"tag_cnt": 1
}
},
{
"terms": {
"tag": ["java"]
}
}
]
}
}
}
}
}
Summary
(1) terms: multi-value search
(2) Refining the results of a terms multi-value search
(3) Equivalent to the IN clause in SQL
Range filtering with the range filter
1. Add a view count field to the post data
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"view_cnt" : 30} }
{ "update": { "_id": "2"} }
{ "doc" : {"view_cnt" : 50} }
{ "update": { "_id": "3"} }
{ "doc" : {"view_cnt" : 100} }
{ "update": { "_id": "4"} }
{ "doc" : {"view_cnt" : 80} }
2. Search for posts whose view count is between 30 and 60
GET /forum/article/_search
{
"query": {
"constant_score": {
"filter": {
"range": {
"view_cnt": {
"gt": 30,
"lt": 60
}
}
}
}
}
}
gte: greater than or equal to
lte: less than or equal to (the query above uses the exclusive variants gt and lt; an inclusive version is sketched below)
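For comparison, a sketch of the same search with inclusive bounds, which would also return posts with exactly 30 or 60 views:
GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "view_cnt": {
            "gte": 30,
            "lte": 60
          }
        }
      }
    }
  }
}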
3. Search for posts whose postDate falls within the last month
POST /forum/article/_bulk
{ "index": { "_id": 5 }}
{ "articleID" : "DHJK-B-1395-#Ky5", "userID" : 3, "hidden": false, "postDate": "2017-03-01", "tag": ["elasticsearch"], "tag_cnt": 1, "view_cnt": 10 }
// Search with an explicit date plus date math
GET /forum/article/_search
{
"query": {
"constant_score": {
"filter": {
"range": {
"postDate": {
"gt": "2017-03-10||-30d"
}
}
}
}
}
}
// Search relative to the current time using now
GET /forum/article/_search
{
"query": {
"constant_score": {
"filter": {
"range": {
"postDate": {
"gt": "now-30d"
}
}
}
}
}
}
Summary
(1) range is the equivalent of BETWEEN in SQL, or of >= / <= conditions
(2) range performs range filtering
Manually controlling the precision of full-text search results
1. Add a title field to the post data
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"title" : "this is java and elasticsearch blog"} }
{ "update": { "_id": "2"} }
{ "doc" : {"title" : "this is java blog"} }
{ "update": { "_id": "3"} }
{ "doc" : {"title" : "this is elasticsearch blog"} }
{ "update": { "_id": "4"} }
{ "doc" : {"title" : "this is java, elasticsearch, hadoop blog"} }
{ "update": { "_id": "5"} }
{ "doc" : {"title" : "this is spark blog"} }
2. Search for blogs whose title contains java or elasticsearch
This is different from the earlier term query: instead of searching for an exact value, it performs a full-text search.
The match query is what performs full-text search. That said, if the field being searched is not_analyzed (keyword), a match query behaves just like a term query.
GET /forum/article/_search
{
"query": {
"match": {
"title": "java elasticsearch"
}
}
}
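As a side note on the point above, here is a sketch of a match query against the keyword-typed articleID field; because the field is not analyzed, it behaves the same as the earlier term filter:
GET /forum/article/_search
{
  "query": {
    "match": {
      "articleID": "XHDK-A-1293-#fJ3"
    }
  }
}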
3. Search for blogs whose title contains both java and elasticsearch
The first step in controlling result precision: use the and operator when you want every search keyword to match; this achieves something a plain match query cannot.
GET /forum/article/_search
{
"query": {
"match": {
"title": {
"query": "java elasticsearch",
"operator": "and"
}
}
}
}
4. Search for blogs containing at least 3 of the 4 keywords java, elasticsearch, spark, hadoop
The second step in controlling result precision: require that at least a certain number of the keywords match before a document is returned as a result.
GET /forum/article/_search
{
"query": {
"match": {
"title": {
"query": "java elasticsearch spark hadoop",
"minimum_should_match": "75%"
}
}
}
}
5. Combine multiple search conditions on title with bool
GET /forum/article/_search
{
"query": {
"bool": {
"must": { "match": { "title": "java" }},
"must_not": { "match": { "title": "spark" }},
"should": [
{ "match": { "title": "hadoop" }},
{ "match": { "title": "elasticsearch" }}
]
}
}
}
6. How is the relevance score computed when bool combines multiple search conditions?
The scores of the matching must and should clauses are added up and divided by the total number of must and should clauses.
Ranked first: java, plus both of the should keywords, hadoop and elasticsearch
Ranked second: java, plus elasticsearch from the should clauses
Ranked third: java, with none of the should keywords
should clauses can influence the relevance score.
must guarantees that a keyword is present, and the document's relevance score against that must condition is computed as well.
On top of satisfying must, the should conditions do not have to match, but the more of them a document matches, the higher its relevance score.
7. Search for java, hadoop, spark, elasticsearch, requiring at least 3 of them to match
By default, should clauses do not need to match at all; in the search above, "this is java blog" matches none of the should conditions.
There is one exception: if there is no must, at least one should clause must match.
In the search below there are 4 should conditions; by default, matching any one of them is enough for a document to be returned.
But this can be controlled precisely: require that at least a certain number of the 4 should conditions match before a document counts as a result.
GET /forum/article/_search
{
"query": {
"bool": {
"should": [
{ "match": { "title": "java" }},
{ "match": { "title": "elasticsearch" }},
{ "match": { "title": "hadoop" }},
{ "match": { "title": "spark" }}
],
"minimum_should_match": 3
}
}
}
Summary:
1. For full-text search over multiple values there are two approaches: a match query, or multiple should clauses
2. Controlling result precision: the and operator and minimum_should_match
How multi-condition search is implemented with term + bool under the hood
1. How a plain match is rewritten into term + should
{
"match": { "title": "java elasticsearch"}
}
When a match query like the one above searches multiple values, ES internally rewrites it into bool syntax:
a bool should with one term query per search term
{
"bool": {
"should": [
{ "term": { "title": "java" }},
{ "term": { "title": "elasticsearch" }}
]
}
}
2. How a match with operator and is rewritten into term + must
{
"match": {
"title": {
"query": "java elasticsearch",
"operator": "and"
}
}
}
{
"bool": {
"must": [
{ "term": { "title": "java" }},
{ "term": { "title": "elasticsearch" }}
]
}
}
3. How minimum_should_match is rewritten
{
"match": {
"title": {
"query": "java elasticsearch hadoop spark",
"minimum_should_match": "75%"
}
}
}
{
"bool": {
"should": [
{ "term": { "title": "java" }},
{ "term": { "title": "elasticsearch" }},
{ "term": { "title": "hadoop" }},
{ "term": { "title": "spark" }}
],
"minimum_should_match": 3
}
}
Fine-grained weighting of search conditions with boost
Requirement: search for posts whose title contains java; posts whose title also contains hadoop or elasticsearch should be returned first; and given one post containing java hadoop and another containing java elasticsearch, the one containing hadoop should rank ahead of the one containing elasticsearch.
Key point: boost is the weight of a search condition. Raising a condition's weight means that, when relevance scores are computed, documents matching the higher-weighted condition score higher than documents matching another condition, and are therefore returned first.
By default every search condition has the same weight: 1.
GET /forum/article/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"title": "blog"
}
}
],
"should": [
{
"match": {
"title": {
"query": "java"
}
}
},
{
"match": {
"title": {
"query": "hadoop",
"boost": 5
}
}
},
{
"match": {
"title": {
"query": "elasticsearch"
}
}
},
{
"match": {
"title": {
"query": "spark"
}
}
}
]
}
}
}
1. Why are relevance scores inaccurate in a multi-shard setup? Each shard computes TF/IDF locally, over only the documents it holds, so the same term can get a different IDF on different shards and the resulting scores are not directly comparable.
2. How can this be addressed?
(1) In production, with large data volumes, aim for an even distribution of data
With a large amount of data, probabilistically speaking, ES routes documents evenly across shards (routing is based on _id, so it is load-balanced)
For example, with 10 documents whose title contains java and 5 shards, a balanced distribution puts roughly 2 such docs on each shard
If the data is distributed evenly, the problem described above largely goes away
(2) In a test environment, create the index with a single primary shard: number_of_shards=1 in the index settings (see the sketch below)
With only one shard, all documents naturally live on that shard, so the problem does not exist
(3) In a test environment, add the search_type=dfs_query_then_fetch parameter to the search, which pulls in the local IDFs to compute a global IDF (see the sketch below)
When a doc's relevance score is computed, the local IDF of every shard is fetched and a global IDF is computed locally, so the docs of all shards serve as the context for the calculation, which also guarantees accuracy. In production, however, this parameter is not recommended, because its performance is poor.
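A sketch of the two test-environment options above (the index name my_test_index is just a placeholder for illustration):
// (2) create the index with a single primary shard
PUT /my_test_index
{
  "settings": {
    "number_of_shards": 1
  }
}
// (3) compute a global IDF with dfs_query_then_fetch
GET /forum/article/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "match": {
      "title": "java"
    }
  }
}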
Multi-field search with the best fields strategy via dis_max
1. Add a content field to the post data
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"content" : "i like to write best elasticsearch article"} }
{ "update": { "_id": "2"} }
{ "doc" : {"content" : "i think java is the best programming language"} }
{ "update": { "_id": "3"} }
{ "doc" : {"content" : "i am only an elasticsearch beginner"} }
{ "update": { "_id": "4"} }
{ "doc" : {"content" : "elasticsearch and hadoop are all very good solution, i am a beginner"} }
{ "update": { "_id": "5"} }
{ "doc" : {"content" : "spark is best big data solution based on scala ,an programming language similar to java"} }
2. Search for posts whose title or content contains java or solution
The query below is a multi-field search: the same query text is run against multiple fields
GET /forum/article/_search
{
"query": {
"bool": {
"should": [
{ "match": { "title": "java solution" }},
{ "match": { "content": "java solution" }}
]
}
}
}
3. Analyzing the results
We expected doc5, but doc2 and doc4 are ranked ahead of it.
Each document's relevance score is computed as: the scores of the matching queries added up, multiplied by the number of matching queries, divided by the total number of queries.
Work out doc4's score:
{ "match": { "title": "java solution" }}: doc4 gets a score for this
{ "match": { "content": "java solution" }}: doc4 also gets a score for this
So the two scores are added up, say 1.1 + 1.2 = 2.3
Number of matching queries = 2
Total number of queries = 2
2.3 * 2 / 2 = 2.3
Work out doc5's score:
{ "match": { "title": "java solution" }}: doc5 gets no score for this
{ "match": { "content": "java solution" }}: doc5 does get a score for this
So only one query has a score, say 2.3
Number of matching queries = 1
Total number of queries = 2
2.3 * 1 / 2 = 1.15
doc5's score = 1.15 < doc4's score = 2.3
4. The best fields strategy and dis_max
The best fields strategy means that results where a single field matched as many of the keywords as possible should rank first, rather than results where many fields each matched only a few of the keywords.
The dis_max query simply takes, across the multiple queries, the score of the single highest-scoring query.
{ "match": { "title": "java solution" }}: doc4 gets a score, say 1.1
{ "match": { "content": "java solution" }}: doc4 also gets a score, say 1.2
Take the maximum score: 1.2
{ "match": { "title": "java solution" }}: doc5 gets no score
{ "match": { "content": "java solution" }}: doc5 gets a score, say 2.3
Take the maximum score: 2.3
Now doc4's score = 1.2 < doc5's score = 2.3, so doc5 ranks higher, which is what we want.
GET /forum/article/_search
{
"query": {
"dis_max": {
"queries": [
{ "match": { "title": "java solution" }},
{ "match": { "content": "java solution" }}
]
}
}
}
Refining dis_max with tie_breaker
GET /forum/article/_search
{
"query": {
"dis_max": {
"queries": [
{ "match": { "title": "java beginner" }},
{ "match": { "body": "java beginner" }}
]
}
}
}
A situation that can come up in practice:
(1) Post doc1: the title contains java, and the content contains neither of the keywords in java beginner
(2) Post doc2: the content contains beginner, and the title contains neither keyword
(3) Post doc3: the title contains java and the content contains beginner
(4) The final result may rank doc1 and doc2 ahead of doc3, instead of putting doc3 first as we would expect
dis_max only takes the score of the single highest-scoring query; the scores of the other queries are ignored entirely
Solution: use tie_breaker to take the other queries' scores into account as well
tie_breaker multiplies the other queries' scores by its value and then combines them with the score of the highest-scoring query
So besides the maximum score, the scores of the other queries are also considered
tie_breaker is a decimal between 0 and 1
GET /forum/article/_search
{
"query": {
"dis_max": {
"queries": [
{ "match": { "title": "java beginner" }},
{ "match": { "body": "java beginner" }}
],
"tie_breaker": 0.3
}
}
}
The multi_match syntax
GET /forum/article/_search
{
"query": {
"multi_match": {
"query": "java solution",
"type": "best_fields", //默認(rèn)就是best_fields
"fields": [ "title^2", "content" ], //這里是設(shè)置權(quán)重為2
"tie_breaker": 0.3,
"minimum_should_match": "50%"
}
}
}
Switching from the best-fields to the most-fields strategy
The best-fields strategy ranks first the docs in which a single field matched as many of the keywords as possible.
The most-fields strategy ranks first the docs in which as many fields as possible matched the keyword (see the sketch below).
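A sketch of a most_fields query over the title and content fields of this data set (the choice of fields here is only for illustration):
GET /forum/article/_search
{
  "query": {
    "multi_match": {
      "query": "java solution",
      "type": "most_fields",
      "fields": [ "title", "content" ]
    }
  }
}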
Cross-field search with the most_fields strategy, and its drawbacks
A cross-fields search is one where a single identifier spans multiple fields. For example, a person's identifier is their name, and a building's identifier is its address. The name may be spread across several fields, such as first_name and last_name, and an address may be spread across country, province, and city.
Searching for one identifier, such as a person's name or an address, across multiple fields is a cross-fields search.
At first glance most_fields looks like the right tool, because best_fields prefers the single best-matching field, and cross-fields is by definition not a single-field problem.
Load the data and search
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"author_first_name" : "Peter", "author_last_name" : "Smith"} }
{ "update": { "_id": "2"} }
{ "doc" : {"author_first_name" : "Smith", "author_last_name" : "Williams"} }
{ "update": { "_id": "3"} }
{ "doc" : {"author_first_name" : "Jack", "author_last_name" : "Ma"} }
{ "update": { "_id": "4"} }
{ "doc" : {"author_first_name" : "Robbin", "author_last_name" : "Li"} }
{ "update": { "_id": "5"} }
{ "doc" : {"author_first_name" : "Tonny", "author_last_name" : "Peter Smith"} }
GET /forum/article/_search
{
"query": {
"multi_match": {
"query": "Peter Smith",
"type": "most_fields",
"fields": [ "author_first_name", "author_last_name" ]
}
}
}
// Response
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.6931472,
"hits": [
{
"_index": "forum",
"_type": "article",
"_id": "2",
"_score": 0.6931472,
"_source": {
"articleID": "KDKE-B-9947-#kL5",
"userID": 1,
"hidden": false,
"postDate": "2017-01-02",
"tag": [
"java"
],
"tag_cnt": 1,
"view_cnt": 50,
"title": "this is java blog",
"content": "i think java is the best programming language",
"sub_title": "learned a lot of course",
"author_first_name": "Smith",
"author_last_name": "Williams"
}
},
{
"_index": "forum",
"_type": "article",
"_id": "1",
"_score": 0.5753642,
"_source": {
"articleID": "XHDK-A-1293-#fJ3",
"userID": 1,
"hidden": false,
"postDate": "2017-01-01",
"tag": [
"java",
"hadoop"
],
"tag_cnt": 2,
"view_cnt": 30,
"title": "this is java and elasticsearch blog",
"content": "i like to write best elasticsearch article",
"sub_title": "learning more courses",
"author_first_name": "Peter",
"author_last_name": "Smith"
}
},
{
"_index": "forum",
"_type": "article",
"_id": "5",
"_score": 0.51623213,
"_source": {
"articleID": "DHJK-B-1395-#Ky5",
"userID": 3,
"hidden": false,
"postDate": "2017-03-01",
"tag": [
"elasticsearch"
],
"tag_cnt": 1,
"view_cnt": 10,
"title": "this is spark blog",
"content": "spark is best big data solution based on scala ,an programming language similar to java",
"sub_title": "haha, hello world",
"author_first_name": "Tonny",
"author_last_name": "Peter Smith"
}
}
]
}
}
Peter Smith: matching Smith in author_first_name produces a very high score. Why?
Because of the IDF score: a high IDF means the matched term (Smith) is rare across all docs, and in the author_first_name field Smith appears only once.
For the actual person Peter Smith (doc 1), Smith sits in author_last_name, but Smith appears twice in author_last_name across the docs, which lowers doc 1's IDF contribution.
Don't read too much into it; is it always exactly like this? Hard to say, the scoring algorithm is genuinely complex.
Summary: the drawbacks of using most_fields for cross-field search
Problem 1: it only finds docs where as many fields as possible match, not docs where some single field matches completely
Problem 2: with most_fields there is no way to use minimum_should_match to cut off the long tail, i.e. results that match only very little
Problem 3: the TF/IDF algorithm. Take Peter Smith and Smith Williams: when searching for Peter Smith, Smith is rare in first_name, so that term has a low frequency across all documents,
which yields a high score, and Smith Williams may end up ranked ahead of Peter Smith
Using copy_to to combine multiple fields into one, solving all three drawbacks
PUT /forum/_mapping/article
{
"properties": {
"new_author_first_name": {
"type": "string",
"copy_to": "new_author_full_name"
},
"new_author_last_name": {
"type": "string",
"copy_to": "new_author_full_name"
},
"new_author_full_name": {
"type": "string"
}
}
}
With the copy_to syntax, the values of several fields are copied into one field, and an inverted index is built for that combined field
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"new_author_first_name" : "Peter", "new_author_last_name" : "Smith"} } --> Peter Smith
{ "update": { "_id": "2"} }
{ "doc" : {"new_author_first_name" : "Smith", "new_author_last_name" : "Williams"} } --> Smith Williams
{ "update": { "_id": "3"} }
{ "doc" : {"new_author_first_name" : "Jack", "new_author_last_name" : "Ma"} } --> Jack Ma
{ "update": { "_id": "4"} }
{ "doc" : {"new_author_first_name" : "Robbin", "new_author_last_name" : "Li"} } --> Robbin Li
{ "update": { "_id": "5"} }
{ "doc" : {"new_author_first_name" : "Tonny", "new_author_last_name" : "Peter Smith"} } --> Tonny Peter Smith
GET /forum/article/_search
{
"query": {
"match": {
"new_author_full_name": "Peter Smith"
}
}
}
The result here does not fully reproduce the problem scenario, but the principle holds.
Summary
Problem 1: only finds docs where as many fields as possible match, not docs where some single field matches completely --> solved: the best-matching documents are returned first
Problem 2: most_fields cannot use minimum_should_match to cut off the long tail, i.e. results that match only very little --> solved: minimum_should_match can now be used to drop the long tail
Problem 3: the TF/IDF algorithm: with Peter Smith and Smith Williams, searching for Peter Smith gives a high score because first_name rarely contains Smith,
so Smith Williams may rank ahead of Peter Smith --> solved: Smith and Peter now live in the same field, so their frequencies across all documents are even and there are no extreme outliers
Using the native cross_fields type to solve the same drawbacks
GET /forum/article/_search
{
"query": {
"multi_match": {
"query": "Peter Smith",
"type": "cross_fields",
"operator": "and",
"fields": ["author_first_name", "author_last_name"]
}
}
}
Peter must appear in author_first_name or author_last_name
Smith must appear in author_first_name or author_last_name
Problem 1: only finds docs where as many fields as possible match, not docs where some single field matches completely --> solved: every term is now required to appear in one of the fields
Peter Smith may be spread across multiple fields, so requiring each term to appear in some field is what assembles the complete name we are actually looking for
With most_fields, something like Smith Williams could also come back, because most_fields only requires that any field matches, and the more fields match, the higher the score
Problem 2: most_fields cannot use minimum_should_match to cut off the long tail, i.e. results that match only very little --> solved: since every term is required to appear, the long tail is removed automatically (a minimum_should_match variant is sketched below)
java hadoop spark --> each of these 3 terms must appear in one of the fields
A document where, say, only one field contains just java gets dropped; that long-tail result is gone
Problem 3: the TF/IDF algorithm: with Peter Smith and Smith Williams, searching for Peter Smith gives a high score because first_name rarely contains Smith, so Smith Williams may rank ahead of Peter Smith --> solved: when computing IDF, the IDF of each query term is taken in every searched field and the minimum is used, so the extreme maximum value never appears
Smith has a very low frequency in the author_first_name field across all docs, which gives a high IDF; its frequency in the author_last_name field of all docs gives another IDF, and since Smith is normally common as a last name that IDF is ordinary, not too high. For Smith, the smaller of the two IDF scores is used, so an inflated IDF score never shows up.
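A sketch of the long-tail control mentioned above: cross_fields combined with minimum_should_match instead of operator and (the 50% threshold is only an example value):
GET /forum/article/_search
{
  "query": {
    "multi_match": {
      "query": "Peter Smith",
      "type": "cross_fields",
      "minimum_should_match": "50%",
      "fields": ["author_first_name", "author_last_name"]
    }
  }
}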