
Searching data with the term filter

(1) Insert some test post data

POST /forum/article/_bulk
{ "index": { "_id": 1 }}
{ "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 2 }}
{ "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }
{ "index": { "_id": 3 }}
{ "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 4 }}
{ "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }

GET /forum/_mapping/article

{
  "forum": {
    "mappings": {
      "article": {
        "properties": {
          "articleID": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "hidden": {
            "type": "boolean"
          },
          "postDate": {
            "type": "date"
          },
          "userID": {
            "type": "long"
          }
        }
      }
    }
  }
}

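In ES 5.2, a string field is dynamically mapped as type=text and gets two fields by default. The field itself (for example articleID) is analyzed. Its sub-field field.keyword (articleID.keyword) is not analyzed; because of ignore_above: 256, values longer than 256 characters are not indexed in that sub-field at all.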

(2) Search posts by user ID

GET /forum/article/_search
{
    "query" : {
        "constant_score" : { 
            "filter" : {
                "term" : { 
                    "userID" : 1
                }
            }
        }
    }
}

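constant_score ignores relevance scoring and simply filters the data; every matching document gets the same constant score.
term: the search text is not analyzed. It is looked up in the inverted index exactly as typed.
For example, if the search text were analyzed, "hello world" --> "hello" and "world", and the two terms would be looked up in the inverted index separately.
With term, "hello world" --> "hello world", and the whole string "hello world" is looked up in the inverted index.

(3) Search posts that are not hidden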

GET /forum/article/_search
{
    "query" : {
        "constant_score" : { 
            "filter" : {
                "term" : { 
                    "hidden" : false
                }
            }
        }
    }
}

(4) Search posts by post date: find the posts whose postDate is 2017-01-01

GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "postDate": "2017-01-01"
        }
      }
    }
  }
}

(5) Search posts by article ID

GET /forum/article/_search
{
    "query" : {
        "constant_score" : { 
            "filter" : {
                "term" : { 
                    "articleID" : "XHDK-A-1293-#fJ3"
                }
            }
        }
    }
}

// The response is as follows
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

GET /forum/article/_search
{
    "query" : {
        "constant_score" : { 
            "filter" : {
                "term" : { 
                    "articleID.keyword" : "XHDK-A-1293-#fJ3"
                }
            }
        }
    }
}

// The response is as follows

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "1",
        "_score": 1,
        "_source": {
          "articleID": "XHDK-A-1293-#fJ3",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-01"
        }
      }
    ]
  }
}

// Inspect how articleID is analyzed
GET /forum/_analyze
{
  "field": "articleID",
  "text": "XHDK-A-1293-#fJ3"
}
// Response
{
  "tokens": [
    {
      "token": "xhdk",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "a",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "1293",
      "start_offset": 7,
      "end_offset": 11,
      "type": "<NUM>",
      "position": 2
    },
    {
      "token": "fj3",
      "start_offset": 13,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}

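articleID is an analyzed text field by default, so when the inverted index is built every articleID is tokenized. After analysis the original articleID string is not in the index; only the individual tokens are.
A term query, however, does not analyze the search text: XHDK-A-1293-#fJ3 --> XHDK-A-1293-#fJ3. At index time the same value became XHDK-A-1293-#fJ3 --> xhdk, a, 1293, fj3, so the search naturally finds nothing.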

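articleID.keyword is a sub-field that recent ES versions create automatically, and it is not analyzed. So each incoming articleID is indexed twice: once as the field itself, analyzed, with the resulting tokens placed in the inverted index;
and once as articleID.keyword, not analyzed, placed in the inverted index as a single string (values longer than 256 characters are skipped because of ignore_above).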

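So when applying a term filter to a text field, consider matching against the built-in field.keyword sub-field. The catch is the default limit of 256 characters, so it is usually better to define the mapping yourself and make the field not_analyzed. In recent ES versions
you do not need to specify not_analyzed; setting type=keyword is enough.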
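The mismatch between an unanalyzed term and an analyzed field can be reproduced outside ES. The following is a minimal sketch in Python, not ES's actual analyzer: `standard_like_tokens`, `text_index`, `keyword_index`, and `term_lookup` are hypothetical names, and the regex only approximates the standard analyzer for these sample IDs.

```python
import re

def standard_like_tokens(text):
    # Rough stand-in for the standard analyzer: lowercase, split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())

article_ids = {
    1: "XHDK-A-1293-#fJ3",
    2: "KDKE-B-9947-#kL5",
    3: "JODL-X-1937-#pV7",
    4: "QQPX-R-3956-#aD8",
}

text_index = {}     # token -> doc ids (like the analyzed articleID field)
keyword_index = {}  # whole string -> doc ids (like articleID.keyword)
for doc_id, value in article_ids.items():
    for token in standard_like_tokens(value):
        text_index.setdefault(token, set()).add(doc_id)
    keyword_index.setdefault(value, set()).add(doc_id)

def term_lookup(index, value):
    # A term lookup does not analyze its input; it is an exact dictionary lookup.
    return sorted(index.get(value, set()))

print(standard_like_tokens("XHDK-A-1293-#fJ3"))
print(term_lookup(text_index, "XHDK-A-1293-#fJ3"))     # no hit in the analyzed index
print(term_lookup(keyword_index, "XHDK-A-1293-#fJ3"))  # hit in the keyword index
```

The tokens produced for the first ID agree with the _analyze output shown above, which is why only a lookup of an individual token such as xhdk would hit the analyzed field.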
(7) Rebuild the index

DELETE /forum

PUT /forum
{
  "mappings": {
    "article": {
      "properties": {
        "articleID": {
          "type": "keyword"
        }
      }
    }
  }
}
// Index the data
POST /forum/article/_bulk
{ "index": { "_id": 1 }}
{ "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 2 }}
{ "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }
{ "index": { "_id": 3 }}
{ "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 4 }}
{ "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }

Search again by article ID and post date; the data can now be found

GET /forum/article/_search
{
    "query" : {
        "constant_score" : { 
            "filter" : {
                "term" : { 
                    "articleID" : "XHDK-A-1293-#fJ3"
                }
            }
        }
    }
}

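Summary
(1) term filter: searches by exact value; numbers, booleans, and dates support it natively
(2) A text field must be mapped as not_analyzed at index time (type=keyword in recent ES versions) before a term query can match its full value
(3) Equivalent to a single WHERE condition in SQL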

How filter execution works (the bitset and caching mechanisms)

(1) Look up the search term in the inverted index and get the document list

Using dates as an example:
word doc1 doc2 doc3

2017-01-01 * *
2017-02-02 * *
2017-03-03 * * *

filter: 2017-02-02
A lookup in the inverted index finds that the document list for 2017-02-02 is doc2, doc3
(2) Build a bitset for each result found in the inverted index, e.g. [0, 0, 0, 1, 0, 1]
The document list that was found is used to build a bitset: a binary array in which each element is 0 or 1, marking whether a doc matches a given filter condition (1 if it matches, 0 if not)
For example:
[0, 1, 1]
doc1: does not match this filter
doc2 and doc3: match this filter
Using the simplest possible data structure for a complex feature saves memory and improves performance
(3) Iterate over the bitset of each filter condition, starting with the sparsest, to find the documents that satisfy all conditions

A single search request can carry multiple filter conditions. Each condition has its own bitset, and the bitsets are traversed starting from the sparsest one

[0, 0, 0, 1, 0, 0]: relatively sparse
[0, 1, 0, 1, 0, 1]

Traversing the sparser bitset first eliminates as many documents as possible early

After all bitsets have been traversed, the docs that match every filter condition are found

Request: filter, postDate=2017-01-01, userID=1

postDate: [0, 0, 1, 1, 0, 0]
userID: [0, 1, 0, 1, 0, 1]

After both bitsets are traversed, the only doc that matches all conditions is doc4, which can then be returned to the client as the result
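The intersection step can be sketched in a few lines of Python. This is only an illustration of the idea using plain lists, not how Lucene stores bitsets; `intersect_bitsets` is a hypothetical helper, and bit position i is assumed to stand for doc(i+1).

```python
def intersect_bitsets(bitsets):
    """AND the bitsets together, starting from the sparsest one."""
    ordered = sorted(bitsets, key=sum)  # fewest 1s first
    result = list(ordered[0])
    for bits in ordered[1:]:
        result = [a & b for a, b in zip(result, bits)]
        if not any(result):
            break  # nothing left to match, stop early
    return result

post_date = [0, 0, 1, 1, 0, 0]  # postDate=2017-01-01
user_id = [0, 1, 0, 1, 0, 1]    # userID=1

matched = intersect_bitsets([post_date, user_id])
doc_names = [f"doc{i + 1}" for i, bit in enumerate(matched) if bit]
print(matched, doc_names)
```

Starting from the sparsest bitset means the running result has as few 1s as possible from the first step, which is the point made above about eliminating documents early.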

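(4) Caching bitsets: queries are tracked, and for a filter condition used more than a certain number of times within the last 256 queries, the bitset is cached. For small segments (<1000 docs, or <3% of the index), the bitset is not cached.

For example, the bitset [0, 0, 1, 1, 0, 0] for postDate=2017-01-01 can be cached in memory. The next time the same condition arrives there is no need to scan the inverted index and rebuild the bitset, which can improve performance substantially.

If a filter appears more than a certain number of times (the threshold is not fixed) among the last 256 filters, its bitset is cached automatically

Segments (covered in the first half of the course): results that a filter obtains from a small segment need not be cached, that is, when the segment has fewer than 1000 docs or is smaller than 3% of the total index size

A small segment holds little data, so even a scan is fast. Segments are also merged automatically in the background, so a small segment soon merges with other small segments into a larger one; caching is pointless because the segment soon disappears

The bitset for a small segment: [0, 0, 1, 0]

The advantage of a filter over a query is caching. What gets cached is not the complete doc list returned by the filter but the filter's bitset, so the inverted index does not have to be scanned next time.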

(5) In most cases filters run before queries, to eliminate as much data as possible first

query: computes each doc's relevance score for the search condition and sorts by that score
filter: simply filters out the wanted data, without computing a relevance score or sorting

(6) If documents are added or modified, the cached bitset is updated automatically

postDate=2017-01-01, [0, 0, 1, 0]
A new document with id=5 and postDate=2017-01-01 is added to the bitset of the postDate=2017-01-01 filter automatically; the cache updates itself. The bitset for postDate=2017-01-01 becomes [0, 0, 1, 0, 1]
The document with id=1 and postDate=2016-12-30 is changed to postDate=2017-01-01; the bitset is again updated automatically, to [1, 0, 1, 0, 1]

(7) From then on, any request with the same filter condition uses the cached bitset for that condition directly

Combining multiple filter conditions with bool

1. Search for posts whose post date is 2017-01-01 or whose article ID is XHDK-A-1293-#fJ3, and whose post date must not be 2017-01-02

GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "should": [
            {"term": { "postDate": "2017-01-01" }},
            {"term": {"articleID": "XHDK-A-1293-#fJ3"}}
          ],
          "must_not": {
            "term": {
              "postDate": "2017-01-02"
            }
          }
        }
      }
    }
  }
}

must, should, must_not, filter: must match; matching any one of them is enough; must not match; must match without contributing to the score

2. Search for posts whose article ID is XHDK-A-1293-#fJ3, or whose article ID is JODL-X-1937-#pV7 and whose post date is 2017-01-01


GET /forum/article/_search 
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "should": [
            {
              "term": {
                "articleID": "XHDK-A-1293-#fJ3"
              }
            },
            {
              "bool": {
                "must": [
                  {
                    "term":{
                      "articleID": "JODL-X-1937-#pV7"
                    }
                  },
                  {
                    "term": {
                      "postDate": "2017-01-01"
                    }
                  }
                ]
              }
            }
          ]
        }
      }
    }
  }
}

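Summary
(1) bool: must, must_not, should combine multiple filter conditions
(2) bool can be nested
(3) Equivalent to combining multiple conditions with AND, OR, and NOT in SQL: once you know the search syntax well, you can reproduce much of the commonly used SQL functionality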
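To make the nesting concrete, the two bool filters of this section can be modeled as plain Python predicates over the four sample posts. This is a simplified sketch rather than ES's implementation: `term` and `bool_filter` are hypothetical helpers, articleID is compared as a whole string (as with the keyword mapping), and a bool with no must clause is assumed to need at least one matching should clause.

```python
docs = {
    1: {"articleID": "XHDK-A-1293-#fJ3", "postDate": "2017-01-01"},
    2: {"articleID": "KDKE-B-9947-#kL5", "postDate": "2017-01-02"},
    3: {"articleID": "JODL-X-1937-#pV7", "postDate": "2017-01-01"},
    4: {"articleID": "QQPX-R-3956-#aD8", "postDate": "2017-01-02"},
}

def term(field, value):
    # Exact-value match on a single field.
    return lambda doc: doc.get(field) == value

def bool_filter(must=(), should=(), must_not=()):
    def matches(doc):
        if not all(clause(doc) for clause in must):
            return False
        if any(clause(doc) for clause in must_not):
            return False
        # Assumption: with no must clause, at least one should clause has to match.
        if should and not must:
            return any(clause(doc) for clause in should)
        return True
    return matches

# Query 1: (postDate=2017-01-01 OR articleID=XHDK...) AND NOT postDate=2017-01-02
first = bool_filter(
    should=[term("postDate", "2017-01-01"), term("articleID", "XHDK-A-1293-#fJ3")],
    must_not=[term("postDate", "2017-01-02")],
)
# Query 2: articleID=XHDK... OR (articleID=JODL... AND postDate=2017-01-01)
second = bool_filter(
    should=[
        term("articleID", "XHDK-A-1293-#fJ3"),
        bool_filter(must=[term("articleID", "JODL-X-1937-#pV7"),
                          term("postDate", "2017-01-01")]),
    ],
)

first_hits = [doc_id for doc_id, doc in docs.items() if first(doc)]
second_hits = [doc_id for doc_id, doc in docs.items() if second(doc)]
print(first_hits, second_hits)
```

Because a bool_filter is itself a predicate, it can be passed as a clause to another bool_filter, which mirrors how bool nests in the query DSL.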

Searching multiple values with terms (like IN in SQL) and refining multi-value search results

1. Add a tag field to the post data

POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"tag" : ["java", "hadoop"]} }
{ "update": { "_id": "2"} }
{ "doc" : {"tag" : ["java"]} }
{ "update": { "_id": "3"} }
{ "doc" : {"tag" : ["hadoop"]} }
{ "update": { "_id": "4"} }
{ "doc" : {"tag" : ["java", "elasticsearch"]} }

2. (1) Search for posts whose articleID is KDKE-B-9947-#kL5 or QQPX-R-3956-#aD8; (2) search for posts whose tag contains java

GET /forum/article/_search 
{
  "query": {
    "constant_score": {
      "filter": {
        "terms": {
          "articleID": [
            "KDKE-B-9947-#kL5",
            "QQPX-R-3956-#aD8"
          ]
        }
      }
    }
  }
}

GET /forum/article/_search
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "terms" : { 
                    "tag" : ["java"]
                }
            }
        }
    }
}

3. The result above is not precise enough. Refine it to return only posts whose tag contains java and nothing else
When adding tags, also store the number of tags

POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"tag_cnt" : 2} }
{ "update": { "_id": "2"} }
{ "doc" : {"tag_cnt" : 1} }
{ "update": { "_id": "3"} }
{ "doc" : {"tag_cnt" : 1} }
{ "update": { "_id": "4"} }
{ "doc" : {"tag_cnt" : 2} }

Search again

GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "tag_cnt": 1
              }
            },
            {
              "terms": {
                "tag": ["java"]
              }
            }
          ]
        }
      }
    }
  }
}

Summary
(1) terms searches for multiple values
(2) Refining the results of a terms multi-value search
(3) Equivalent to the IN clause in SQL
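The difference between "contains java" and "contains only java" can be checked with a small Python model of the sample data. This is a sketch under the assumption that terms matches when any value of the field is in the given list; the `terms` helper is hypothetical.

```python
docs = {
    1: {"tag": ["java", "hadoop"]},
    2: {"tag": ["java"]},
    3: {"tag": ["hadoop"]},
    4: {"tag": ["java", "elasticsearch"]},
}
# Store the number of tags next to the tags, as in the bulk update above.
for doc in docs.values():
    doc["tag_cnt"] = len(doc["tag"])

def terms(field, values):
    wanted = set(values)
    def matches(doc):
        value = doc.get(field, [])
        candidates = value if isinstance(value, list) else [value]
        # Like SQL IN: one matching value is enough.
        return any(v in wanted for v in candidates)
    return matches

has_java = terms("tag", ["java"])
contains_java = [doc_id for doc_id, doc in docs.items() if has_java(doc)]
only_java = [doc_id for doc_id, doc in docs.items()
             if has_java(doc) and doc["tag_cnt"] == 1]
print(contains_java, only_java)
```

The extra tag_cnt condition is what turns a "contains" match into an "equals exactly" match, at the cost of keeping the count in sync whenever tags change.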

Range filtering with the range filter

1. Add a view-count field to the post data

POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"view_cnt" : 30} }
{ "update": { "_id": "2"} }
{ "doc" : {"view_cnt" : 50} }
{ "update": { "_id": "3"} }
{ "doc" : {"view_cnt" : 100} }
{ "update": { "_id": "4"} }
{ "doc" : {"view_cnt" : 80} }

2. Search for posts with a view count between 30 and 60 (gt and lt exclude the bounds)

GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "view_cnt": {
            "gt": 30,
            "lt": 60
          }
        }
      }
    }
  }
}

gte: greater than or equal to
lte: less than or equal to

3. Search for posts posted within the last month

POST /forum/article/_bulk
{ "index": { "_id": 5 }}
{ "articleID" : "DHJK-B-1395-#Ky5", "userID" : 3, "hidden": false, "postDate": "2017-03-01", "tag": ["elasticsearch"], "tag_cnt": 1, "view_cnt": 10 }

// Search relative to an explicit anchor date
GET /forum/article/_search 
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "postDate": {
            "gt": "2017-03-10||-30d"
          }
        }
      }
    }
  }
}
// Use now to search relative to the current time
GET /forum/article/_search 
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "postDate": {
            "gt": "now-30d"
          }
        }
      }
    }
  }
}

Summary
(1) range: like BETWEEN in SQL, or comparisons such as >= and <=
(2) range performs range filtering
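The date-math expression 2017-03-10||-30d means "the anchor date minus 30 days". A short Python sketch of the same arithmetic shows which sample post passes the gt bound; `minus_days` is a hypothetical helper and only the day-based offset used above is modeled.

```python
from datetime import date, timedelta

def minus_days(anchor: str, days: int) -> date:
    # Equivalent of "<anchor>||-<days>d" for whole days.
    return date.fromisoformat(anchor) - timedelta(days=days)

post_dates = {
    1: "2017-01-01",
    2: "2017-01-02",
    3: "2017-01-01",
    4: "2017-01-02",
    5: "2017-03-01",
}

lower_bound = minus_days("2017-03-10", 30)
hits = [doc_id for doc_id, d in post_dates.items()
        if date.fromisoformat(d) > lower_bound]  # gt: strictly greater
print(lower_bound, hits)
```

With now-30d the anchor is the current time instead of a fixed date, so the same request returns different documents depending on when it is run.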

Manually controlling the precision of full-text search results

1. Add a title field to the post data

POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"title" : "this is java and elasticsearch blog"} }
{ "update": { "_id": "2"} }
{ "doc" : {"title" : "this is java blog"} }
{ "update": { "_id": "3"} }
{ "doc" : {"title" : "this is elasticsearch blog"} }
{ "update": { "_id": "4"} }
{ "doc" : {"title" : "this is java, elasticsearch, hadoop blog"} }
{ "update": { "_id": "5"} }
{ "doc" : {"title" : "this is spark blog"} }

2. Search for blogs whose title contains java or elasticsearch

This is different from the earlier term query. It does not search for an exact value; it performs full-text search.
The match query is responsible for full-text search. If the field being searched is not_analyzed, however, a match query behaves like a term query.

GET /forum/article/_search
{
    "query": {
        "match": {
            "title": "java elasticsearch"
        }
    }
}

3. Search for blogs whose title contains both java and elasticsearch

The first step in controlling result precision is the and operator. If you want every search keyword to match, use and; a plain match query with the default or operator cannot do this

GET /forum/article/_search
{
    "query": {
        "match": {
            "title": {
                "query": "java elasticsearch",
                "operator": "and"
            }
        }
    }
}

4. Search for blogs that contain at least 3 of the 4 keywords java, elasticsearch, spark, hadoop

The second step in controlling result precision is to specify how many of the given keywords must match at minimum before a document is returned

GET /forum/article/_search
{
  "query": {
    "match": {
      "title": {
        "query": "java elasticsearch spark hadoop",
        "minimum_should_match": "75%"
      }
    }
  }
}

5. Use bool to combine multiple search conditions on title

GET /forum/article/_search
{
  "query": {
    "bool": {
      "must":     { "match": { "title": "java" }},
      "must_not": { "match": { "title": "spark"  }},
      "should": [
                  { "match": { "title": "hadoop" }},
                  { "match": { "title": "elasticsearch"   }}
      ]
    }
  }
}

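6. How the relevance score is computed when bool combines multiple search conditions

The scores of the matching must and should clauses are added together and divided by the total number of must and should clauses

Ranked first: contains java and every should keyword, hadoop and elasticsearch
Ranked second: contains java and the should keyword elasticsearch
Ranked third: contains java and none of the should keywords

should clauses can affect the relevance score

must ensures that a document contains the keyword, and the document's relevance score for the condition is computed from the must clause
On top of satisfying must, the should clauses do not have to match, but the more of them match, the higher the document's relevance score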

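7. Search for java, hadoop, spark, elasticsearch, requiring at least 3 of the keywords

By default, should clauses do not have to match at all. In the search above, for example, "this is java blog" matches none of the should clauses
There is one exception: if there is no must clause, at least one should clause has to match
In the search below, should has 4 clauses, and by default a document matching any one of them is returned

You can control precisely how many of the 4 should clauses must match before a document is returned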

GET /forum/article/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "java" }},
        { "match": { "title": "elasticsearch"   }},
        { "match": { "title": "hadoop"   }},
        { "match": { "title": "spark"   }}
      ],
      "minimum_should_match": 3 
    }
  }
}

Summary:
1. There are two ways to search for multiple values in full-text search: a match query, or should clauses
2. To control result precision: the and operator and minimum_should_match
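The effect of minimum_should_match on the five sample titles can be worked through in Python. This is a simplified sketch, not the match query itself: `tokens`, `required_matches`, and `search` are hypothetical helpers, tokenization is a lowercase split on non-alphanumerics, and a percentage is assumed to round down to a whole number of terms.

```python
import re

titles = {
    1: "this is java and elasticsearch blog",
    2: "this is java blog",
    3: "this is elasticsearch blog",
    4: "this is java, elasticsearch, hadoop blog",
    5: "this is spark blog",
}

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def required_matches(term_count, minimum_should_match):
    # "75%" of 4 terms -> 3; a plain integer is used as given.
    if isinstance(minimum_should_match, str) and minimum_should_match.endswith("%"):
        return term_count * int(minimum_should_match[:-1]) // 100
    return int(minimum_should_match)

def search(query, minimum_should_match):
    terms = set(query.lower().split())
    needed = required_matches(len(terms), minimum_should_match)
    return [doc_id for doc_id, title in titles.items()
            if len(tokens(title) & terms) >= needed]

at_least_three = search("java elasticsearch spark hadoop", "75%")
all_terms = search("java elasticsearch", "100%")  # same effect as operator "and"
print(at_least_three, all_terms)
```

Only doc 4 carries three of the four keywords, so it is the only hit at 75%, while requiring both java and elasticsearch keeps docs 1 and 4.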

How multi-condition search is implemented underneath with term + bool

1. How a plain match is converted to term + should

{
    "match": { "title": "java elasticsearch"}
}

When a match query like the one above searches for multiple values, ES automatically converts it into bool syntax underneath:
a bool should with one term query per search term

{
  "bool": {
    "should": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch"   }}
    ]
  }
}

2. How an and match is converted to term + must

{
    "match": {
        "title": {
            "query":    "java elasticsearch",
            "operator": "and"
        }
    }
}

{
  "bool": {
    "must": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch"   }}
    ]
  }
}

3. How minimum_should_match is converted

{
    "match": {
        "title": {
            "query":                "java elasticsearch hadoop spark",
            "minimum_should_match": "75%"
        }
    }
}

{
  "bool": {
    "should": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch"   }},
      { "term": { "title": "hadoop" }},
      { "term": { "title": "spark" }}
    ],
    "minimum_should_match": 3 
  }
}
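The three conversions above follow one pattern, which can be written as a small Python function that builds the equivalent bool body as a dict. This is only an illustration of the rewriting described here, not ES's internal code: `rewrite_match` is a hypothetical name, and the query text is assumed to be analyzed by lowercasing and splitting on whitespace.

```python
def rewrite_match(field, query, operator="or", minimum_should_match=None):
    # Assumption: analysis is approximated by lowercase + whitespace split.
    terms = query.lower().split()
    clauses = [{"term": {field: term}} for term in terms]
    if operator == "and":
        # Every term is required.
        return {"bool": {"must": clauses}}
    body = {"should": clauses}
    if minimum_should_match is not None:
        if isinstance(minimum_should_match, str) and minimum_should_match.endswith("%"):
            # A percentage is turned into a whole number of clauses, rounded down.
            count = len(terms) * int(minimum_should_match[:-1]) // 100
        else:
            count = int(minimum_should_match)
        body["minimum_should_match"] = count
    return {"bool": body}

plain = rewrite_match("title", "java elasticsearch")
strict = rewrite_match("title", "java elasticsearch", operator="and")
partial = rewrite_match("title", "java elasticsearch hadoop spark",
                        minimum_should_match="75%")
print(plain, strict, partial, sep="\n")
```

Printing the three results gives the same bool bodies as the hand-written conversions in this section.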

Fine-grained control of search condition weights with boost

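Requirement: search for posts whose title contains java; posts whose title also contains hadoop or elasticsearch should rank higher; and when one post contains java hadoop and another contains java elasticsearch, the hadoop post should rank ahead of the elasticsearch post

Key point: the weight of a search condition, boost. Raising the boost of one condition means that, when relevance scores are computed, a document matching the higher-weighted condition scores higher than a document matching another condition, and is therefore returned first

By default every search condition has the same weight, 1. In the request below, every title must match blog, and the spark clause carries boost 5, so a title matching spark gets a larger score contribution than one matching java, hadoop, or elasticsearch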

GET /forum/article/_search 
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "blog"
          }
        }
      ],
      "should": [
        {
          "match": {
            "title": {
              "query": "java"
            }
          }
        },
        {
          "match": {
            "title": {
              "query": "hadoop"
            }
          }
        },
        {
          "match": {
            "title": {
              "query": "elasticsearch"
            }
          }
        },
        {
          "match": {
            "title": {
              "query": "spark",
              "boost": 5
            }
          }
        }
      ]
    }
  }
}

Why are relevance scores inaccurate with multiple shards?

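1. Each shard computes IDF from its own documents only (local IDF). When the documents containing a term are unevenly distributed across shards, the same term gets a different IDF on each shard, so scores from different shards are not directly comparable.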

2. How can the problem be solved?

(1) In production, with large data volumes, aim for an even distribution

With a large amount of data, ES generally routes documents evenly across shards in the statistical sense; routing is based on _id, which balances the load
For example, with 10 documents whose title contains java and 5 shards, balanced routing would statistically put 2 docs containing java on each shard
If the data is evenly distributed, the problem described above largely goes away

(2) In a test environment, set the index to 1 primary shard: number_of_shards=1 in the index settings

With only one shard, all documents live in that shard, so the problem does not arise

(3) In a test environment, add the search_type=dfs_query_then_fetch parameter to the search, which collects each shard's local IDF to compute a global IDF

When the relevance score of a doc is computed, the local IDF statistics of all shards are fetched first and a global IDF is computed from them, so the docs of all shards form the context of the calculation, which ensures accuracy. This parameter is not recommended in production, however, because it performs poorly.
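The gap between local and global IDF is easy to see with numbers. The sketch below uses the BM25 form of IDF, ln(1 + (N - n + 0.5) / (n + 0.5)), where N is the number of docs and n the number containing the term; the shard sizes and document frequencies are made-up illustration values, not measurements.

```python
import math

def bm25_idf(doc_count, doc_freq):
    # IDF as defined by BM25: rarer terms get a larger value.
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical shards: (docs in shard, docs in shard containing "java").
shards = {"shard_a": (100, 10), "shard_b": (100, 1)}

local_idf = {name: bm25_idf(n, df) for name, (n, df) in shards.items()}

# What dfs_query_then_fetch achieves: statistics summed over all shards.
total_docs = sum(n for n, _ in shards.values())
total_df = sum(df for _, df in shards.values())
global_idf = bm25_idf(total_docs, total_df)

for name, value in local_idf.items():
    print(f"{name}: local idf = {value:.3f}")
print(f"global idf = {global_idf:.3f}")
```

With these numbers the same term is worth almost twice as much on shard_b as on shard_a, so an otherwise identical document would score higher just because of the shard it landed on; the global value sits between the two.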

Multi-field search with the best fields strategy based on dis_max

1. Add a content field to the post data

POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"content" : "i like to write best elasticsearch article"} }
{ "update": { "_id": "2"} }
{ "doc" : {"content" : "i think java is the best programming language"} }
{ "update": { "_id": "3"} }
{ "doc" : {"content" : "i am only an elasticsearch beginner"} }
{ "update": { "_id": "4"} }
{ "doc" : {"content" : "elasticsearch and hadoop are all very good solution, i am a beginner"} }
{ "update": { "_id": "5"} }
{ "doc" : {"content" : "spark is best big data solution based on scala ,an programming language similar to java"} }

2. Search for posts whose title or content contains java or solution

The following is a multi-field search

GET /forum/article/_search
{
    "query": {
        "bool": {
            "should": [
                { "match": { "title": "java solution" }},
                { "match": { "content":  "java solution" }}
            ]
        }
    }
}

3. Analysis of the results
doc5 was expected first, but doc2 and doc4 ranked ahead of it

The relevance score of each document is computed as: the sum of the scores of the queries, multiplied by the number of matched queries, divided by the total number of queries

Compute the score of doc4

{ "match": { "title": "java solution" }} produces a score for doc4
{ "match": { "content": "java solution" }} also produces a score for doc4

So the two scores are added, say 1.1 + 1.2 = 2.3
number of matched queries = 2
total number of queries = 2

2.3 * 2 / 2 = 2.3

Compute the score of doc5

{ "match": { "title": "java solution" }} produces no score for doc5
{ "match": { "content": "java solution" }} produces a score for doc5

So only one query has a score, say 2.3
number of matched queries = 1
total number of queries = 2

2.3 * 1 / 2 = 1.15

score of doc5 = 1.15 < score of doc4 = 2.3

4. The best fields strategy, dis_max
The best fields strategy means that a document in which a single field matches as many keywords as possible should rank first, rather than a document in which many fields each match only a few keywords

dis_max simply takes the score of the highest-scoring query among the given queries

{ "match": { "title": "java solution" }} produces a score for doc4, 1.1
{ "match": { "content": "java solution" }} also produces a score for doc4, 1.2
Take the maximum, 1.2

{ "match": { "title": "java solution" }} produces no score for doc5
{ "match": { "content": "java solution" }} produces a score for doc5, 2.3
Take the maximum, 2.3

So the score of doc4 = 1.2 < the score of doc5 = 2.3, and doc5 ranks higher, which is what we want

GET /forum/article/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "java solution" }},
                { "match": { "content":  "java solution" }}
            ]
        }
    }
}
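The two scoring rules can be compared side by side with the illustrative per-field scores used above (1.1 and 1.2 for doc4, nothing and 2.3 for doc5). This Python sketch only restates the arithmetic of this section; `bool_should_score` and `dis_max_score` are hypothetical names and the scores are the made-up numbers from the text, not real ES output.

```python
def bool_should_score(scores):
    # Sum of matching scores * matched queries / total queries.
    matched = [s for s in scores if s > 0]
    return sum(matched) * len(matched) / len(scores)

def dis_max_score(scores):
    # Only the best single query counts.
    return max(scores)

# Per-query scores in the order [title, content]; 0.0 means no match.
doc4 = [1.1, 1.2]
doc5 = [0.0, 2.3]

print("bool should:", bool_should_score(doc4), bool_should_score(doc5))
print("dis_max:    ", dis_max_score(doc4), dis_max_score(doc5))
```

Under the bool should rule doc4 comes out ahead of doc5, while under dis_max the order flips and doc5 ranks first.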

Optimizing dis_max with tie_breaker

GET /forum/article/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "java beginner" }},
                { "match": { "content":  "java beginner" }}
            ]
        }
    }
}

A situation that can arise in practice:
(1) A post, doc1, has java in its title, and its content contains neither java nor beginner
(2) A post, doc2, has beginner in its content, and its title contains neither keyword
(3) A post, doc3, has java in its title and beginner in its content
(4) The final search may rank doc1 and doc2 ahead of doc3, rather than ranking doc3 first as we would expect

dis_max only takes the score of the highest-scoring query and completely ignores the scores of the other queries

Solution: use tie_breaker to take the scores of the other queries into account

The tie_breaker parameter multiplies the scores of the other queries by tie_breaker and adds them to the score of the highest-scoring query
So besides the highest score, the scores of the other queries are also considered
tie_breaker is a decimal between 0 and 1

GET /forum/article/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "java beginner" }},
                { "match": { "content":  "java beginner" }}
            ],
            "tie_breaker": 0.3
        }
    }
}
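The doc1/doc2/doc3 scenario can be replayed with made-up scores to see what tie_breaker changes. In this Python sketch the final score is the best query score plus tie_breaker times the sum of the other query scores; the per-field scores are hypothetical values chosen so that doc3 matches both fields but neither as strongly as doc1 or doc2 match their single field.

```python
def dis_max_score(scores, tie_breaker=0.0):
    best = max(scores)
    others = sum(scores) - best
    return best + tie_breaker * others

# Hypothetical per-query scores in the order [title, content].
scores = {
    "doc1": [1.0, 0.0],  # java in title only
    "doc2": [0.0, 1.0],  # beginner in content only
    "doc3": [0.8, 0.8],  # java in title, beginner in content
}

def ranking(tie_breaker):
    return sorted(scores,
                  key=lambda doc: dis_max_score(scores[doc], tie_breaker),
                  reverse=True)

print(ranking(0.0))  # pure dis_max
print(ranking(0.3))  # with tie_breaker
```

Without tie_breaker doc3 is ranked last because only its best field counts; with tie_breaker 0.3 its second field adds enough to move it to the top.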

multi_match syntax

GET /forum/article/_search
{
  "query": {
    "multi_match": {
        "query":                "java solution",
        "type":                 "best_fields", // best_fields is the default
        "fields":               [ "title^2", "content" ], // title^2 gives title a boost of 2
        "tie_breaker":          0.3,
        "minimum_should_match": "50%" 
    }
  } 
}

Switching from the best_fields strategy to most_fields
The best_fields strategy returns first the docs in which a single field matches as many keywords as possible
The most_fields strategy returns first the docs in which as many fields as possible match the keywords

cross-fields search with the most_fields strategy, and its drawbacks

In a cross-fields search, one unique identifier spans multiple fields. For a person the identifier is the name; for a building it is the address. A name can be spread over several fields, such as first_name and last_name, and an address over country, province, and city.
Searching for such an identifier across multiple fields, for example a person's name or an address, is a cross-fields search

At first glance most_fields seems the better fit, because best_fields favors the single best-matching field, while cross-fields is by nature not a single-field problem.

Add the data and search

POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"author_first_name" : "Peter", "author_last_name" : "Smith"} }
{ "update": { "_id": "2"} }
{ "doc" : {"author_first_name" : "Smith", "author_last_name" : "Williams"} }
{ "update": { "_id": "3"} }
{ "doc" : {"author_first_name" : "Jack", "author_last_name" : "Ma"} }
{ "update": { "_id": "4"} }
{ "doc" : {"author_first_name" : "Robbin", "author_last_name" : "Li"} }
{ "update": { "_id": "5"} }
{ "doc" : {"author_first_name" : "Tonny", "author_last_name" : "Peter Smith"} }

GET /forum/article/_search
{
  "query": {
    "multi_match": {
      "query":       "Peter Smith",
      "type":        "most_fields",
      "fields":      [ "author_first_name", "author_last_name" ]
    }
  }
}
// Response

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.6931472,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "2",
        "_score": 0.6931472,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": [
            "java"
          ],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language",
          "sub_title": "learned a lot of course",
          "author_first_name": "Smith",
          "author_last_name": "Williams"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "articleID": "XHDK-A-1293-#fJ3",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-01",
          "tag": [
            "java",
            "hadoop"
          ],
          "tag_cnt": 2,
          "view_cnt": 30,
          "title": "this is java and elasticsearch blog",
          "content": "i like to write best elasticsearch article",
          "sub_title": "learning more courses",
          "author_first_name": "Peter",
          "author_last_name": "Smith"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 0.51623213,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2017-03-01",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith"
        }
      }
    ]
  }
}

Peter Smith matched Smith in author_first_name, and that match got a high score. Why?
Because of its high IDF. For the IDF to be high, the matched term (Smith) must occur rarely across all docs, and in the author_first_name field Smith occurs only once
For the actual Peter Smith, doc 1, Smith is in author_last_name, but Smith occurs twice in author_last_name, so the IDF for doc 1 is lower

This explanation is only approximate; the scoring algorithm is too complex to be certain about it.

Summary of the drawbacks of most_fields for cross-field search
Problem 1: it only finds docs in which as many fields as possible match, not docs in which one field matches completely
Problem 2: most_fields cannot use minimum_should_match to drop the long tail, that is, results that match very little
Problem 3: the TF/IDF algorithm. Take Peter Smith and Smith Williams: when searching for Peter Smith, Smith rarely occurs in first_name, so the term's frequency across all documents is low for that field,
it gets a high score, and Smith Williams may end up ranked ahead of Peter Smith

Using copy_to to combine multiple fields into one field and solve the three drawbacks

PUT /forum/_mapping/article
{
  "properties": {
      "new_author_first_name": {
          "type":     "text",
          "copy_to":  "new_author_full_name" 
      },
      "new_author_last_name": {
          "type":     "text",
          "copy_to":  "new_author_full_name" 
      },
      "new_author_full_name": {
          "type":     "text"
      }
  }
}

With copy_to, the values of multiple fields are copied into a single field, and an inverted index is built on it

POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"new_author_first_name" : "Peter", "new_author_last_name" : "Smith"} }
{ "update": { "_id": "2"} }
{ "doc" : {"new_author_first_name" : "Smith", "new_author_last_name" : "Williams"} }
{ "update": { "_id": "3"} }
{ "doc" : {"new_author_first_name" : "Jack", "new_author_last_name" : "Ma"} }
{ "update": { "_id": "4"} }
{ "doc" : {"new_author_first_name" : "Robbin", "new_author_last_name" : "Li"} }
{ "update": { "_id": "5"} }
{ "doc" : {"new_author_first_name" : "Tonny", "new_author_last_name" : "Peter Smith"} }

The resulting new_author_full_name values are: doc1 --> Peter Smith, doc2 --> Smith Williams, doc3 --> Jack Ma, doc4 --> Robbin Li, doc5 --> Tonny Peter Smith

GET /forum/article/_search
{
  "query": {
    "match": {
      "new_author_full_name":       "Peter Smith"
    }
  }
}

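Although the results here do not reproduce the scenario, the principle holds

Summary
Problem 1: it only finds docs in which as many fields as possible match, not docs in which one field matches completely --> solved: the best-matching document is returned first
Problem 2: most_fields cannot use minimum_should_match to drop the long tail, that is, results that match very little --> solved: minimum_should_match can be used to drop the long tail
Problem 3: the TF/IDF algorithm. Take Peter Smith and Smith Williams: when searching for Peter Smith, Smith rarely occurs in first_name, so the term's frequency across all documents is low and it gets a high score,
and Smith Williams may end up ranked ahead of Peter Smith --> solved: Smith and Peter are now in the same field, so their occurrence counts across all documents are balanced and there is no extreme skew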

Using native cross_fields to solve the drawbacks

GET /forum/article/_search
{
  "query": {
    "multi_match": {
      "query": "Peter Smith",
      "type": "cross_fields", 
      "operator": "and",
      "fields": ["author_first_name", "author_last_name"]
    }
  }
}

Peter must appear in author_first_name or author_last_name
Smith must appear in author_first_name or author_last_name

Problem 1: it only finds docs in which as many fields as possible match, not docs in which one field matches completely --> solved: every term is required to appear in at least one of the fields

Peter Smith may span multiple fields, so every term must appear in some field; only together do they form the identifier we want, the full name

With most_fields, a doc such as Smith Williams could also be returned, because most_fields only requires any one field to match, and the more fields match, the higher the score

Problem 2: most_fields cannot use minimum_should_match to drop the long tail, that is, results that match very little --> solved: since every term is required, the long tail is necessarily removed

java hadoop spark --> all 3 terms must appear in some field

A document in which only one field contains java, for example, is dropped as part of the long tail

Problem 3: the TF/IDF algorithm. Take Peter Smith and Smith Williams: when searching for Peter Smith, Smith rarely occurs in first_name, so the term's frequency across all documents is low and it gets a high score, and Smith Williams may end up ranked ahead of Peter Smith --> when the IDF is computed, the IDF of each term is taken for every field and the minimum is used, so an extreme maximum cannot occur

Smith occurs rarely in the author_first_name field across all docs, which makes its IDF there very high. An IDF is also computed from the frequency of Smith in the author_last_name field across all docs; Smith is generally common as a last name, so that IDF is normal and not too high. For Smith, the smaller of the two IDF values is used, so the IDF cannot become excessively high.
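The matching rule of cross_fields with operator and (every term must appear in at least one of the listed fields) can be checked against the five sample authors. The Python below is a sketch of that rule only and does not model scoring or IDF; `tokens` and `cross_fields_and` are hypothetical helpers with a simple lowercase tokenizer.

```python
import re

authors = {
    1: {"author_first_name": "Peter", "author_last_name": "Smith"},
    2: {"author_first_name": "Smith", "author_last_name": "Williams"},
    3: {"author_first_name": "Jack", "author_last_name": "Ma"},
    4: {"author_first_name": "Robbin", "author_last_name": "Li"},
    5: {"author_first_name": "Tonny", "author_last_name": "Peter Smith"},
}

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def cross_fields_and(query, fields):
    terms = tokens(query)
    hits = []
    for doc_id, doc in authors.items():
        # Treat the listed fields as one combined field.
        combined = set().union(*(tokens(doc[field]) for field in fields))
        if terms <= combined:  # every query term found somewhere
            hits.append(doc_id)
    return hits

hits = cross_fields_and("Peter Smith", ["author_first_name", "author_last_name"])
print(hits)
```

Smith Williams (doc 2) drops out because Peter appears in neither of its fields, which is the long-tail removal described under Problem 2.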


ES usage and principles 3 -- slightly more complex APIs
