Elasticsearch 7.x In Depth [1] Indexing [III]: field datatypes

1. References

Official docs: mapping-types
Percolator query (official docs)
Official blog post on the percolator
IEEE 754 precision
IEEE 754 standard
Jianshu: rank feature
Elasticsearch 7.0 new feature: search as you type
The new elasticsearch datatype, search_as_you_type
The N-gram model in natural language processing (NLP)
A detailed explanation of the N-Gram model in NLP
Elasticsearch at a glance: the difference between the edge_ngram and ngram tokenizers
Elasticsearch - edgeNGram autocomplete
Geek Time: Ruan Yiming's "Elasticsearch Core Technologies and Practice" course

2. Getting Started

This post mainly follows the official documentation through a series of hands-on operations, with a few notes from my own testing along the way. All of the examples are available on the official site, so feel free to head straight to the official docs.

alias

An alias mapping defines an alternate name for a field in the index. The alias can be used in place of the target field in search requests, as well as in selected other APIs such as field capabilities (see the field_caps sketch after the limitations list below).

# Create the index
PUT /trips
{
  "mappings": {
    "properties": {
      "distance": {
        "type": "long"
      },
      "route_length_miles": {
        "type": "alias",
        "path": "distance"
      },
      "transit_mode": {
        "type": "keyword"
      }
    }
  } 
}

# Index documents
PUT /trips/_doc/1
{
  "distance": 1,
  "transit_mode": "walk"
}


PUT /trips/_doc/2
{
  "distance": 40,
  "transit_mode": "bike"
}

PUT /trips/_doc/3
{
  "distance": 120,
  "transit_mode": "train"
}

# Query
GET /trips/_search
{
  "query": {
    "range": {
      "route_length_miles": {
        "gte": 39
      }
    }
  }
}
# Query result

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "trips",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "distance" : 40,
          "transit_mode" : "bike"
        }
      },
      {
        "_index" : "trips",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "distance" : 120,
          "transit_mode" : "train"
        }
      }
    ]
  }
}

Field aliases come with the following restrictions:

  1. The target must be a concrete field, not an object or another field alias.
  2. The target field must exist at the time the alias is created.
  3. If nested objects are defined, a field alias must have the same nested scope as its target.
  4. A field alias can only have a single target.
  5. Writing to a field alias is not supported.
  6. Attempting to use an alias in an index or update request will fail.
  7. An alias cannot be used as the target of copy_to or as a multi-field.
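
Beyond search, the alias also works with the field capabilities API, one of the "other APIs" mentioned above. A minimal sketch (not part of the original walkthrough):

# Resolve the alias through the field capabilities API
GET /trips/_field_caps?fields=route_length_miles

The response describes route_length_miles with the capabilities of its target field, i.e. as a searchable, aggregatable long.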

flattened

This datatype is useful for indexing objects that have a large or unknown number of unique keys. A single field mapping is created for the entire JSON object, which helps prevent a mappings explosion caused by too many distinct field mappings.

PUT bug_reports
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "labels": {
        "type": "flattened"
      }
    }
  }
}

POST bug_reports/_doc/1
{
  "title": "Results are not sorted correctly.",
  "labels": {
    "priority": "urgent",
    "release": ["v1.2.5", "v1.3.0"],
    "timestamp": {
      "created": 1541458026,
      "closed": 1541457010
    }
  }
}
  • This illustrates note 3 below (querying the top-level field):
GET bug_reports/_search
{
  "query": {
    "term": {"labels": "urgent"}
  }
}
  • This illustrates note 2 below (dot notation on a leaf key):
GET bug_reports/_search
{
  "query": {
    "term": {"labels.release": "v1.3.0"}
  }
}

flattened comes with the following restrictions and notes:

  1. During indexing, a token is created for each leaf value in the JSON object. The values are indexed as string keywords, with no analysis or special handling for numbers or dates.
  2. You can query a leaf key with object dot notation (e.g. person.name).
  3. Querying the top-level field searches across all leaf values of the object.
  4. Highlighting is not supported.
  5. The following queries are supported: term, terms, terms_set, prefix, range, match, multi_match, query_string, simple_query_string, exists (see the small example right after this list).
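
To illustrate note 5, a small sketch using the exists query, which matches documents where the labels object was provided at all:

GET bug_reports/_search
{
  "query": {
    "exists": { "field": "labels" }
  }
}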

join

Used to create parent/child documents.

[The following summary and examples are based on Ruan Yiming's course git repository]
It has the following characteristics and limitations:

  • The parent document and the child document are two independent documents.
  • Updating the parent does not require updating its children. Adding, deleting or modifying a child document affects neither the parent nor the other children.
  • Parent and child documents must live on the same shard.
  • When indexing a child document, you must specify the ID of its parent document.

Here's an example.

# Delete the index if it already exists
DELETE /blogs
# Create the index
PUT /blogs
{
  "settings": {
    "number_of_shards": 2
  }, 
  "mappings": {
    "properties": {
      "blog_comments_relation": {
        "type": "join",
        "relations": {
          "blog": "comment"
        }
      },
      "content": {
        "type": "text"
      },
      "title": {
        "type": "keyword"
      },
      "creator": {
        "type": "keyword"
      },
      "commentator": {
        "type": "keyword"
      }
    }
  }
}
  • Let's break this down step by step.

You might ask: how do we tell which properties belong to the parent document and which to the child? For example, blog has creator while comment has commentator. I had exactly this question when I first read about it. The answer is that the properties of both parent and child documents all live together under properties; whichever document needs a given property simply supplies it.

First, what each property means: blog_comments_relation is the join field, and its relations block, "blog": "comment", declares blog as the parent relation and comment as the child. The remaining fields (content, title, creator, commentator) are ordinary fields shared by the mapping.

Now let's see how to index a parent document.

PUT /blogs/_doc/blog1
{
  "title": "測試1",
  "content": "這是第一篇測試博客",
  "creator": "測試1",
  "blog_comments_relation": {
    "name": "blog"
  }
}

PUT /blogs/_doc/blog2
{
  "title": "測試2",
  "content": "這是第二篇測試博客",
  "creator": "測試2",
  "blog_comments_relation": {
    "name": "blog"
  }
}

So how do we index a child document?

PUT /blogs/_doc/comment1?routing=blog1
{
  "title": "測試1的評論-1",
  "content": "我是測試1的評論-1",
  "commentator": "測試1的評論-1",
  "blog_comments_relation": {
    "name": "comment",
    "parent": "blog1"
  }
}


PUT /blogs/_doc/comment2?routing=blog1
{
  "title": "測試1的評論-2",
  "content": "我是測試1的評論-2",
  "commentator": "測試1的評論-2",
  "blog_comments_relation": {
    "name": "comment",
    "parent": "blog1"
  }
}

PUT /blogs/_doc/comment3?routing=blog1
{
  "title": "測試1的評論-3",
  "content": "我是測試1的評論-3",
  "commentator": "測試1的評論-3",
  "blog_comments_relation": {
    "name": "comment",
    "parent": "blog1"
  }
}
  • Parent and child documents must be on the same shard, which keeps join queries performant.
  • When indexing a child document you must specify the ID of its parent; the routing parameter routes the child to the same shard.

Query all documents

GET /blogs/_search

Get a parent document by its ID

GET /blogs/_doc/blog1

Get a child document by its ID (passing the parent ID as routing)

GET /blogs/_doc/comment1?routing=blog1

Find the child documents belonging to a given parent

GET /blogs/_search
{
  "query": {
    "parent_id": {
      "type": "comment",
      "id": "blog1"
    }
  }
}

Return parents based on matching child documents

GET /blogs/_search
{
  "query": {
    "has_child": {
      "type": "comment",
      "query": {
        "term": {
          "title": {
            "value": "測試1的評論-3"
          }
        }
      }
    }
  }
}

Return child documents based on matching parents

GET /blogs/_search
{
  "query": {
    "has_parent": {
      "parent_type": "blog",
      "query": {
        "term": {
          "title": "測試1"
        }
      }
    }
  }
}

Update a child document

PUT /blogs/_doc/comment1?routing=blog1
{
  "title": "測試1的評論-1-update",
  "content": "我是測試1的評論-1-update",
  "commentator": "測試1的評論-1-update",
  "blog_comments_relation": {
    "name": "comment",
    "parent": "blog1"
  }
}
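
One thing the walkthrough above doesn't show: the join field also works with the children aggregation, which lets you aggregate over child documents from the parent side. A minimal sketch against the blogs index defined earlier (the bucket contents depend on the comments you have indexed):

# Aggregate comments (child documents) and bucket them by commentator
GET /blogs/_search
{
  "size": 0,
  "aggs": {
    "comments": {
      "children": { "type": "comment" },
      "aggs": {
        "commentators": {
          "terms": { "field": "commentator" }
        }
      }
    }
  }
}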

nested

Nested objects.

  • Allows each object inside an array of objects to be indexed and queried independently [a plain single object doesn't need this, since it can be queried with dot notation].

Why would we need the nested type at all? Here's an example.

# Create the index
PUT /my_movie_sample
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "actors": {
        "properties": {
          "first_name": {
            "type": "keyword"
          },
          "last_name": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

# Index a document
PUT /my_movie_sample/_doc/1
{
  "title": "測試電影1",
  "actors": [
    {
      "first_name": "caiser",
      "last_name": "hot"
    },
    {
      "first_name": "ga",
      "last_name": "el"
    }
    ]
}
  • Let's search:
GET /my_movie_sample/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "actors.first_name": "caiser"
          }
        },
        {
          "match": {
            "actors.last_name": "el"
          }
        }
      ]
    }
  }
}
  • Search result
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.723315,
    "hits" : [
      {
        "_index" : "my_movie_sample",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.723315,
        "_source" : {
          "title" : "測試電影1",
          "actors" : [
            {
              "first_name" : "caiser",
              "last_name" : "hot"
            },
            {
              "first_name" : "ga",
              "last_name" : "el"
            }
          ]
        }
      }
    ]
  }
}

This is rather odd: the document only contains two actors, [hot caiser] and [el ga], and nobody else, yet searching for [el caiser] returns a hit. That clearly isn't reasonable.

So how do we use the nested type? Another example:

# Create the index
PUT /my_movie
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "actors": {
        "type": "nested",
        "properties": {
          "first_name": {
            "type": "keyword"
          },
          "last_name": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

# Index a document
PUT /my_movie/_doc/1
{
  "title": "測試電影1",
  "actors": [
    {
      "first_name": "caiser",
      "last_name": "hot"
    },
    {
      "first_name": "ga",
      "last_name": "el"
    }
    ]
}

First, let's try the ordinary dot-notation search again.

GET /my_movie/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "actors.first_name": "caiser"
          }
        },
        {
          "match": {
            "actors.last_name": "hot"
          }
        }
      ]
    }
  }
}
  • Nothing comes back, and that is expected: a nested field cannot be matched with a plain dot-notation query like this.

A nested field is searched with the corresponding nested query, for example:

GET /my_movie/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "actors",
            "query": {
              "bool": {
                "must": [
                  {
                    "match": {
                      "actors.first_name": "caiser"
                    }
                  },
                  {
                    "match": {
                      "actors.last_name": "hot"
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}
  • Search result
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.3862944,
    "hits" : [
      {
        "_index" : "my_movie",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.3862944,
        "_source" : {
          "title" : "測試電影1",
          "actors" : [
            {
              "first_name" : "caiser",
              "last_name" : "hot"
            },
            {
              "first_name" : "ga",
              "last_name" : "el"
            }
          ]
        }
      }
    ]
  }
}

What if we run the same nested query for [el caiser]?

# Search request
GET /my_movie/_search
{
  "query": {
    "nested": {
      "path": "actors",
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "actors.first_name": "caiser"
              }
            },
            {
              "match": {
                "actors.last_name": "el"
              }
            }
          ]
        }
      }
    }
  }
}
  • Result
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
  • Nothing is returned this time, because no single actor is called [el caiser].
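
A nested query can also tell you exactly which nested object matched by adding inner_hits. A minimal sketch (not part of the original example set):

GET /my_movie/_search
{
  "query": {
    "nested": {
      "path": "actors",
      "query": {
        "match": { "actors.first_name": "caiser" }
      },
      "inner_hits": {}
    }
  }
}

The response then carries an inner_hits section listing the matching actors objects alongside the parent hit.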

What about aggregations? The syntax is similar.

# Aggregate by first_name
GET /my_movie/_search
{
  "size": 0, 
  "aggs": {
    "actors": {
      "nested": {
        "path": "actors"
      },
      "aggs": {
        "first_name_term": {
          "terms": {
            "field": "actors.first_name",
            "size": 10
          }
        }
      }
    }
  }
}
  • The result: inside the actors nested aggregation, the first_name_term buckets contain one entry per first_name value — here caiser and ga, each with a doc_count of 1.

A comparison: nested vs join

             nested                                        join
Pros         Objects are stored in the same document,      Parent and child documents can be
             so read performance is high                   updated independently
Cons         Updating a nested object means reindexing     Extra memory is needed to maintain the
             the whole document                            relation; read performance is relatively worse
Use case     Child data rarely updated, query-heavy        Child documents updated frequently

percolator

I didn't really understand this one at first. After reading a few blog posts (the helpful ones are pasted up in the references), the most reasonable explanation goes like this:

The percolator allows you to register queries against an index, then send percolate requests containing documents and get back, from the set of registered queries, the ones that match each document.
Think of it as the reverse of what Elasticsearch normally does: instead of sending documents, indexing them, and then running queries, you send queries, register them, and then send documents to find out which queries match them.

In my own words: an ordinary search takes query conditions as input and returns matching documents, whereas a percolate query takes documents as input and returns matching queries. The input and output are simply swapped (ordinary query: input = query conditions, output = matching documents; percolate query: input = documents, output = matching queries).

Next, let's work through an example: users subscribe to particular topics, and when new articles arrive we find the users who are interested.

# 1. Create the topic-subscription index
PUT topic_subscription
{
  "mappings": {
    "properties": {
      "query": {
          "type": "percolator"
      },
      "topic": { // 主題
          "type": "text"
      },
      "userId": { // 用戶ID
        "type": "keyword"
      }
    }
  }
}

# 2. Add user subscriptions
# (users 1, 2 and 3 subscribe to the "新聞" (news) topic)
PUT /topic_subscription/_doc/1
{
  "userId": [1, 2, 3],
  "query": {
    "match": {
      "topic": "新聞"
    }
  }
}
# (users 1 and 2 subscribe to the "軍事" (military) topic)
PUT /topic_subscription/_doc/2
{
  "userId": [1, 2],
  "query": {
    "match": {
      "topic": "軍事"
    }
  }
}
# (user 2 subscribes to the "計算機" (computing) topic)
PUT /topic_subscription/_doc/3
{
  "userId": [2],
  "query": {
    "match": {
      "topic": "計算機"
    }
  }
}

# 3. When a few new articles arrive, find out which users are subscribed to which of their topics
GET /topic_subscription/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "documents": [
        {
          "topicId": 10001,
          "topic": "這是一篇有關軍事的文章"
        },
        {
          "topicId": 10002,
          "topic": "這是一篇有關新聞的文章"
        },
        {
          "topicId": 10003,
          "topic": "計算機是一個好東西-_-"
        }]
    }
  }
}
  • Query result
  • _percolator_document_slot refers to the position of the matching input document, counted from 0.
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.4120297,
    "hits" : [
      {
        "_index" : "topic_subscription",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.4120297,
        "_source" : {
          "userId" : [
            2
          ],
          "query" : {
            "match" : {
              "topic" : "計算機"
            }
          }
        },
        "fields" : {
          "_percolator_document_slot" : [
            2
          ]
        }
      },
      {
        "_index" : "topic_subscription",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.8687345,
        "_source" : {
          "query" : {
            "match" : {
              "topic" : "新聞"
            }
          },
          "userId" : [
            1,
            2,
            3,
            4,
            5
          ]
        },
        "fields" : {
          "_percolator_document_slot" : [
            1
          ]
        }
      },
      {
        "_index" : "topic_subscription",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.8687345,
        "_source" : {
          "userId" : [
            1,
            2
          ],
          "query" : {
            "match" : {
              "topic" : "軍事"
            }
          }
        },
        "fields" : {
          "_percolator_document_slot" : [
            0
          ]
        }
      }
    ]
  }
}
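
The percolate query can also reference a document that is already stored in another index instead of embedding it in the request. A minimal sketch, assuming a hypothetical articles index that contains the article with ID 10001:

GET /topic_subscription/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "index": "articles",
      "id": "10001"
    }
  }
}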

range

OK, honestly, I used to think range only existed inside queries; I hadn't realised it is also a field type.

  • range comes in the following variants (see the official docs):

Type            Description
integer_range   Signed 32-bit integers, between -2^31 and 2^31 - 1
float_range     IEEE 754 single-precision floating point values
long_range      Signed 64-bit integers, between -2^63 and 2^63 - 1
double_range    IEEE 754 double-precision floating point values
date_range      Date values, represented internally as unsigned 64-bit integers (milliseconds)
ip_range        A set of IP values, supporting IPv4 or IPv6 (or mixed) addresses
# Create an index with two range fields
PUT range_index
{
  "mappings": {
    "properties": {
      "expected_attendees": {
        "type": "integer_range"
      },
      "time_frame": {
        "type": "date_range", 
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      }
    }
  }
}

# Add a document
PUT range_index/_doc/1
{
  "expected_attendees" : { 
    "gte" : 10,
    "lte" : 20
  },
  "time_frame" : { 
    "gte" : "2020-04-06 23:00:00", 
    "lte" : "2020-04-07"
  }
}

# Integer range query
GET /range_index/_search
{
  "query": {
    "term": {
      "expected_attendees": {
        "value": 12
      }
    }
  }
}

# Date range query
GET /range_index/_search
{
  "query": {
    "range": {
      "time_frame": {
        "gte": "2020-04-07",
        "lte": "2020-04-07 12:00:00"
      }
    }
  }
}
  • When querying you can also specify relation, which takes one of three values:

Relation     Description                                                     Default
WITHIN       The document's range must lie entirely within the query range
CONTAINS     The document's range must entirely contain the query range
INTERSECTS   Any overlap between the two ranges is a match                   Yes
  • Picture it like this: [figure illustrating the within, contains and intersects relations]
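
For example (a small sketch): with relation set to within, a document only matches if its stored range falls entirely inside the query range. The document indexed above (2020-04-06 23:00:00 to 2020-04-07) matches the query below, but would not match if relation were set to contains.

GET /range_index/_search
{
  "query": {
    "range": {
      "time_frame": {
        "gte": "2020-04-01",
        "lte": "2020-04-30",
        "relation": "within"
      }
    }
  }
}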

rank_feature

Literally, a "ranking feature" field.
Let's look at its use cases and restrictions [all of the following is translated from the official docs]:

  • rank_feature fields only support single-valued fields and strictly positive values. Multi-valued fields and negative values will be rejected.
  • rank_feature fields do not support querying, sorting or aggregating. They may only be used within rank_feature queries.
  • rank_feature fields only preserve 9 significant bits of precision, which translates to a relative error of about 0.4%.

Straight to an example [taken from the official docs]:

# Create an index with two rank_feature fields
PUT my_index_rank
{
  "mappings": {
    "properties": {
      "page_rank": {
        "type": "rank_feature"
      },
      "url_length": {
        "type": "rank_feature",
        "positive_score_impact": false
      },
      "title": {
        "type": "text"
      }
    }
  }
}

Rank features that correlate negatively with the score should set positive_score_impact to false (it defaults to true). The rank_feature query will use this to modify the scoring formula so that the score decreases as the feature value grows, rather than increasing. For example, in web search, URL length is a commonly used feature that correlates negatively with the score.

# Add a document
PUT my_index_rank/_doc/1
{
  "page_rank": 12,
  "url_length": 20,
  "title": "I am proud to be a Chinese"
}

# OK, let's search
GET /my_index_rank/_search
{
  "query": {
    "rank_feature": {
      "field": "page_rank"
    }
  }
}
  • Result:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
Wait. Where am I, who am I, what am I even doing?
  • Er, maybe there are just too few documents??? That can't be it, but let's add a couple more anyway.
PUT my_index_rank/_doc/2
{
  "page_rank": 12,
  "url_length": 20,
  "title": "News of the outbreak"
}

PUT my_index_rank/_doc/3
{
  "page_rank": 10,
  "url_length": 50,
  "title": "The Chinese people have overcome the epidemic"
}

POST /my_index_rank/_refresh
  • And query again:
GET /my_index_rank/_search
{
  "query": {
    "rank_feature": {
      "field": "page_rank"
    }
  }
}
  • Result
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
  • Eh... right.
  • So I rushed over to Jianshu to figure out what rank feature is actually for. The following comes from the article listed in the references; its author answered my question in the comments section, so I'm noting it here — thanks again.

It can be used to weight tags or categories.

  • That answer is still a bit vague... so let's try a concrete query.
GET my_index_rank/_search?explain=true
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "chinese"
          }
        }
      ], 
      "should": [
        {
          "rank_feature": {
            "field": "page_rank",
            "boost": 2
          }
        },
        {
          "rank_feature": {
            "field": "url_length",
            "boost": 0.1
          }
        }
      ]
    }
  }
}
  • Now let's look at the result [I turned explain on, so it's rather long].
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0127629,
    "hits" : [
      {
        "_shard" : "[my_index_rank][0]",
        "_node" : "7VYDXI3wSdSZLz6AIPQXEw",
        "_index" : "my_index_rank",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0127629,
        "_source" : {
          "page_rank" : 12,
          "url_length" : 20,
          "title" : "I am proud to be a Chinese"
        },
        "_explanation" : {
          "value" : 1.0127629,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 0.44000342,
              "description" : "weight(title:chinese in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.44000342,
                  "description" : "score(freq=1.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.47000363,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 2,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 3,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.42553192,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "freq, occurrences of term within document",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 7.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 6.0,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value" : 0.5147453,
              "description" : "Saturation function on the _feature field for the page_rank feature, computed as w * S / (S + k) from:",
              "details" : [
                {
                  "value" : 1.0,
                  "description" : "w, weight of this function",
                  "details" : [ ]
                },
                {
                  "value" : 11.3125,
                  "description" : "k, pivot feature value that would give a score contribution equal to w/2",
                  "details" : [ ]
                },
                {
                  "value" : 12.0,
                  "description" : "S, feature value",
                  "details" : [ ]
                }
              ]
            },
            {
              "value" : 0.058014184,
              "description" : "Saturation function on the _feature field for the url_length feature, computed as w * S / (S + k) from:",
              "details" : [
                {
                  "value" : 0.1,
                  "description" : "w, weight of this function",
                  "details" : [ ]
                },
                {
                  "value" : 0.036132812,
                  "description" : "k, pivot feature value that would give a score contribution equal to w/2",
                  "details" : [ ]
                },
                {
                  "value" : 0.049926758,
                  "description" : "S, feature value",
                  "details" : [ ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard" : "[my_index_rank][0]",
        "_node" : "7VYDXI3wSdSZLz6AIPQXEw",
        "_index" : "my_index_rank",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.9447938,
        "_source" : {
          "page_rank" : 10,
          "url_length" : 50,
          "title" : "The Chinese people have overcome the epidemic"
        },
        "_explanation" : {
          "value" : 0.9447938,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 0.44000342,
              "description" : "weight(title:chinese in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.44000342,
                  "description" : "score(freq=1.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.47000363,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 2,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 3,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.42553192,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "freq, occurrences of term within document",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 7.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 6.0,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value" : 0.46920824,
              "description" : "Saturation function on the _feature field for the page_rank feature, computed as w * S / (S + k) from:",
              "details" : [
                {
                  "value" : 1.0,
                  "description" : "w, weight of this function",
                  "details" : [ ]
                },
                {
                  "value" : 11.3125,
                  "description" : "k, pivot feature value that would give a score contribution equal to w/2",
                  "details" : [ ]
                },
                {
                  "value" : 10.0,
                  "description" : "S, feature value",
                  "details" : [ ]
                }
              ]
            },
            {
              "value" : 0.035582155,
              "description" : "Saturation function on the _feature field for the url_length feature, computed as w * S / (S + k) from:",
              "details" : [
                {
                  "value" : 0.1,
                  "description" : "w, weight of this function",
                  "details" : [ ]
                },
                {
                  "value" : 0.036132812,
                  "description" : "k, pivot feature value that would give a score contribution equal to w/2",
                  "details" : [ ]
                },
                {
                  "value" : 0.019958496,
                  "description" : "S, feature value",
                  "details" : [ ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
  • That's a lot of output. Never mind, let's just take one hit — the first returned document — as an example; it's much easier to read that way.

Looking at that first document, its score is made up of three parts: (1) the weight of the match on title:chinese, (2) the contribution of page_rank, and (3) the contribution of url_length; the three parts are then summed. A detailed explanation of explain output will come in a later post. // TODO
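
One more knob worth knowing about (a sketch, not from the original post): the rank_feature query lets you tune the scoring function, for instance the pivot of the default saturation function, which is the feature value that yields half of the configured weight (the pivot value 8 here is arbitrary).

GET /my_index_rank/_search
{
  "query": {
    "rank_feature": {
      "field": "page_rank",
      "saturation": {
        "pivot": 8
      }
    }
  }
}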

search-as-you-type

To understand this type you first need to understand the N-gram model; I've put some material on it in the references, so have a look there. A minimal usage sketch follows.
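
The sketch below (the index name my_sayt_index is my own, not from the original post) shows the typical pattern: a search_as_you_type field automatically creates ._2gram, ._3gram and ._index_prefix subfields, and a multi_match query of type bool_prefix searches the field and its shingle subfields together.

# Create an index with a search_as_you_type field
PUT /my_sayt_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "search_as_you_type"
      }
    }
  }
}

# Index a document
PUT /my_sayt_index/_doc/1
{
  "title": "quick brown fox jump lazy dog"
}

# Query as the user types, matching terms and a trailing prefix in order
GET /my_sayt_index/_search
{
  "query": {
    "multi_match": {
      "query": "brown f",
      "type": "bool_prefix",
      "fields": [
        "title",
        "title._2gram",
        "title._3gram"
      ]
    }
  }
}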

token_count

Translating the official explanation directly: a field of type token_count is really an integer field that accepts string values, analyzes them, and then indexes the number of tokens in the string.

  • The way I see it, it simply counts how many tokens a sentence is split into, which depends on the analyzer used.
# Create the index
PUT /my_index_token_count_chinese_city
{
  "mappings": {
    "properties": {
      "city": {
        "type": "text",
        "fields": {
          "length": {
            "type": "token_count",
            "analyzer": "ik_smart"
          }
        }
      }
    }
  }
}

# Add documents
PUT /my_index_token_count_chinese_city/_doc/1
{
  "city":"大連這座城市"
}

PUT /my_index_token_count_chinese_city/_doc/2
{
  "city":"沈陽這座城市"
}

PUT /my_index_token_count_chinese_city/_doc/3
{
  "city":"北京這座城市"
}

PUT /my_index_token_count_chinese_city/_doc/4
{
  "city":"青島這座城市"
}
# Using 青島 (Qingdao) as an example, check how the text is tokenized
GET /_analyze
{
  "tokenizer": "ik_smart",
  "text": "青島這座城市"
}
  • Tokenization result
{
  "tokens" : [
    {
      "token" : "青島",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "這座",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "城市",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}
  • Run a query
GET /my_index_token_count_chinese_city/_search
{
  "query": {
    "term": {
      "city.length": {
        "value": 3
      }
    }
  }
}
  • And the query result
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_index_token_count_chinese_city",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "city" : "大連這座城市"
        }
      },
      {
        "_index" : "my_index_token_count_chinese_city",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "city" : "沈陽這座城市"
        }
      },
      {
        "_index" : "my_index_token_count_chinese_city",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "city" : "北京這座城市"
        }
      },
      {
        "_index" : "my_index_token_count_chinese_city",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 1.0,
        "_source" : {
          "city" : "青島這座城市"
        }
      }
    ]
  }
}

constant_keyword

constant_keyword is a specialization of the keyword field for the case where all documents in the index have the same value.
It has the following restrictions:

  • If no value is provided in the mapping, the field configures itself automatically based on the value contained in the first indexed document. While this behaviour can be convenient, note that it means a single poisoned document with a wrong value can cause all other documents to be rejected.
  • The value of the field cannot be changed after it has been set.
  • It is not permitted to provide a value that differs from the one configured in the mapping.

However, I tried this on ES 7.2.0 and the type doesn't exist there (it was added in a later 7.x release). The following is the example from the official docs.

PUT logs-debug
{
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "message": {
        "type": "text"
      },
      "level": {
        "type": "constant_keyword",
        "value": "debug"
      }
    }
  }
}

POST logs-debug/_doc
{
  "date": "2019-12-12",
  "message": "Starting up Elasticsearch",
  "level": "debug"
}

POST logs-debug/_doc
{
  "date": "2019-12-12",
  "message": "Starting up Elasticsearch"
}

wildcard

Translating the official docs directly:
A wildcard field stores values optimized for wildcard-/grep-like queries. Wildcard queries can be used on other field types, but with some limitations:

  • text fields limit the matching of any wildcard expression to an individual token, rather than the whole original value held in the field
  • keyword fields are untokenized, but are slow at running wildcard queries (especially patterns with a leading wildcard).

Internally, the wildcard field indexes the whole field value using ngrams and stores the full string. The index is used as a rough filter to cut down the number of candidate values, which are then checked by retrieving and inspecting the full values. The field is particularly well suited to running grep-like queries on log lines. Storage costs are typically lower than for keyword fields, but searches for exact matches are slower.

I also failed to create this type on ES 7.2.0 (it too arrived in a later release). The following is the official example.

PUT my_index
{
  "mappings": {
    "properties": {
      "my_wildcard": {
        "type": "wildcard"
      }
    }
  }
}

PUT my_index/_doc/1
{
  "my_wildcard" : "This string can be quite lengthy"
}

GET my_index/_search
{
  "query": {
    "wildcard": {
      "my_wildcard": {
        "value": "*quite*lengthy"
      }
    }
  }
}

3. Wrapping Up

Well, this one took quite a while to write, but I now have a reasonably complete picture of the field datatypes.
As always: the palest ink beats the best memory.

Thanks, One Piece.