Rank Feature gives Elasticsearch (ES) support for machine-learning scenarios and is ES's first step toward handling feature computation.
1. Introduction
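rank_feature is a specialized query introduced in ES 7.0. It works only on fields of type rank_feature or rank_features (both data types were added in ES 7.0). It is usually placed in the should clause of a bool query to boost document scores. Note that it performs better than function_score, because it can efficiently skip non-competitive hits when track_total_hits is not set to true.
The following example illustrates it: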
PUT test
{
  "mappings": {
    "properties": {
      "pagerank": {
        "type": "rank_feature"
      },
      "url_length": {
        "type": "rank_feature",
        "positive_score_impact": false
      },
      "topics": {
        "type": "rank_features"
      }
    }
  }
}
PUT test/_doc/1
{
  "url": "http://en.wikipedia.org/wiki/2016_Summer_Olympics",
  "content": "Rio 2016",
  "pagerank": 50.3,
  "url_length": 42,
  "topics": {
    "sports": 50,
    "brazil": 30
  }
}
PUT test/_doc/2
{
  "url": "http://en.wikipedia.org/wiki/2016_Brazilian_Grand_Prix",
  "content": "Formula One motor race held on 13 November 2016 at the Autódromo José Carlos Pace in São Paulo, Brazil",
  "pagerank": 50.3,
  "url_length": 47,
  "topics": {
    "sports": 35,
    "formula one": 65,
    "brazil": 20
  }
}
PUT test/_doc/3
{
  "url": "http://en.wikipedia.org/wiki/Deadpool_(film)",
  "content": "Deadpool is a 2016 American superhero film",
  "pagerank": 50.3,
  "url_length": 37,
  "topics": {
    "movies": 60,
    "super hero": 65
  }
}
POST test/_refresh
GET test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "2016"
          }
        }
      ],
      "should": [
        {
          "rank_feature": {
            "field": "pagerank"
          }
        },
        {
          "rank_feature": {
            "field": "url_length",
            "boost": 0.1
          }
        },
        {
          "rank_feature": {
            "field": "topics.sports",
            "boost": 0.4
          }
        }
      ]
    }
  }
}
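When the list of features and boosts changes often, it can help to build the request body in code rather than by hand. The sketch below only assembles the same JSON as the search above and prints it; the helper names rank_feature_clause and build_search are hypothetical, and it does not contact a cluster.

```python
import json


def rank_feature_clause(field, boost=None):
    """Return one rank_feature clause; boost is left out when not given."""
    clause = {"field": field}
    if boost is not None:
        clause["boost"] = boost
    return {"rank_feature": clause}


def build_search(text, features):
    """Build a bool query: a match on `content` in must, rank features in should.

    `features` is a list of (field, boost) pairs; boost may be None.
    """
    return {
        "query": {
            "bool": {
                "must": [{"match": {"content": text}}],
                "should": [rank_feature_clause(f, b) for f, b in features],
            }
        }
    }


# Same fields and boosts as the console request above.
body = build_search(
    "2016",
    [("pagerank", None), ("url_length", 0.1), ("topics.sports", 0.4)],
)
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after GET test/_search in the console, or passed to whichever HTTP client you already use.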
2. Usage
The rank_feature query supports three functions that influence scoring: saturation (the default), logarithm, and sigmoid.
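- saturation
Scores fall in the interval (0, 1). The formula is S / (S + pivot), where S is the value of the rank_feature or rank_features field and pivot is the midpoint of the score: when S is greater than pivot the score is above 0.5, and when S is less than pivot the score is below 0.5.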
GET test/_search
{
  "query": {
    "rank_feature": {
      "field": "pagerank",
      "saturation": {
        "pivot": 8
      }
    }
  }
}
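If pivot is omitted, Elasticsearch computes a default from the values indexed for the field, approximately their geometric mean. If you do not know how to choose a pivot, the official recommendation is to leave it unset.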
GET test/_search
{
  "query": {
    "rank_feature": {
      "field": "pagerank",
      "saturation": {}
    }
  }
}
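To see how pivot shapes the score, the saturation formula can be evaluated by hand. This is a minimal sketch of the formula only, not of Elasticsearch's internal implementation: it ignores boost and float precision. The function names are hypothetical, and the exact geometric mean stands in for the approximate default pivot that the Elasticsearch documentation describes.

```python
import math


def saturation_score(value, pivot):
    """S / (S + pivot); stays in (0, 1) for positive inputs."""
    if value <= 0 or pivot <= 0:
        raise ValueError("value and pivot must be positive")
    return value / (value + pivot)


def geometric_mean(values):
    """Exact geometric mean, used here as a stand-in for the default pivot."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# pagerank values of the three example documents
pageranks = [50.3, 50.3, 50.3]

# Explicit pivot of 8, as in the first saturation query.
print(saturation_score(50.3, 8))

# A value equal to the pivot scores exactly 0.5.
print(saturation_score(8, 8))

# With the pivot omitted, a geometric-mean pivot puts a typical value near 0.5.
default_pivot = geometric_mean(pageranks)
print(saturation_score(50.3, default_pivot))
```

Because all three documents share the same pagerank, the geometric-mean pivot equals that value and every document scores about 0.5 on this feature.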
- logarithm
Scores are unbounded. The formula is log(scaling_factor + S), where S is the value of the rank_feature or rank_features field and scaling_factor is the configured scaling factor.
GET test/_search
{
  "query": {
    "rank_feature": {
      "field": "pagerank",
      "log": {
        "scaling_factor": 4
      }
    }
  }
}
Note that this function supports only rank features with a positive score impact, that is, fields that do not set positive_score_impact to false.
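The logarithm formula can be sketched the same way. The code below assumes the natural logarithm and ignores boost; the function name log_score and its input checks are choices made for this sketch, not Elasticsearch's own validation.

```python
import math


def log_score(value, scaling_factor):
    """log(scaling_factor + S), assuming the natural logarithm."""
    if value <= 0 or scaling_factor <= 0:
        raise ValueError("value and scaling_factor must be positive")
    return math.log(scaling_factor + value)


# pagerank 50.3 with scaling_factor 4, as in the query above
print(log_score(50.3, 4))

# The score keeps growing with the feature value (no upper bound),
# but each tenfold increase adds less and less in relative terms.
for s in (1, 10, 100, 1000):
    print(s, log_score(s, 4))
```

Unlike saturation, the result is not confined to (0, 1), so a log-scored feature can outweigh the other clauses of a bool query unless its boost is chosen with that in mind.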
- sigmoid
Scores fall in the interval (0, 1). This function extends saturation; the formula is S^exp / (S^exp + pivot^exp), which adds an exponent parameter, exp. The exponent must be positive and is typically chosen in [0.5, 1]. If you do not yet know how to pick a good exponent, the official recommendation is to start with the saturation function.
GET test/_search
{
  "query": {
    "rank_feature": {
      "field": "pagerank",
      "sigmoid": {
        "pivot": 7,
        "exponent": 0.6
      }
    }
  }
}
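A small sketch of the sigmoid formula shows how it relates to saturation. As before, this only evaluates the formula with plain floats and ignores boost; the name sigmoid_score is hypothetical.

```python
import math


def sigmoid_score(value, pivot, exponent):
    """S^exp / (S^exp + pivot^exp); stays in (0, 1) for positive inputs."""
    if value <= 0 or pivot <= 0 or exponent <= 0:
        raise ValueError("value, pivot and exponent must be positive")
    powered = value ** exponent
    return powered / (powered + pivot ** exponent)


# pagerank 50.3 with pivot 7 and exponent 0.6, as in the query above
print(sigmoid_score(50.3, 7, 0.6))

# A value equal to the pivot scores 0.5 regardless of the exponent.
print(sigmoid_score(7, 7, 0.6))

# With exponent 1 the formula reduces to saturation: S / (S + pivot).
print(math.isclose(sigmoid_score(50.3, 7, 1), 50.3 / (50.3 + 7)))
```

An exponent below 1 flattens the curve, so values far above the pivot gain less than they would under saturation.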