New in Elasticsearch 7.0: the Rank Feature query

Rank Feature gives Elasticsearch (ES) support for machine-learning use cases, and it is the first step toward ES handling feature computation.

1. Introduction

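rank_feature is a special query introduced in ES 7.0. It only works on fields of type rank_feature or rank_features (both data types are also new in ES 7.0). It is usually placed in the should clause of a bool query to boost document scores. Note that it is generally more efficient than function_score, because it can skip non-competitive hits when track_total_hits is not set to true.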

The following example shows how it works:

PUT test
{
  "mappings": {
    "properties": {
      "pagerank": {
        "type": "rank_feature"
      },
      "url_length": {
        "type": "rank_feature",
        "positive_score_impact": false
      },
      "topics": {
        "type": "rank_features"
      }
    }
  }
}

PUT test/_doc/1
{
  "url": "http://en.wikipedia.org/wiki/2016_Summer_Olympics",
  "content": "Rio 2016",
  "pagerank": 50.3,
  "url_length": 42,
  "topics": {
    "sports": 50,
    "brazil": 30
  }
}

PUT test/_doc/2
{
  "url": "http://en.wikipedia.org/wiki/2016_Brazilian_Grand_Prix",
  "content": "Formula One motor race held on 13 November 2016 at the Autódromo José Carlos Pace in S?o Paulo, Brazil",
  "pagerank": 50.3,
  "url_length": 47,
  "topics": {
    "sports": 35,
    "formula one": 65,
    "brazil": 20
  }
}

PUT test/_doc/3
{
  "url": "http://en.wikipedia.org/wiki/Deadpool_(film)",
  "content": "Deadpool is a 2016 American superhero film",
  "pagerank": 50.3,
  "url_length": 37,
  "topics": {
    "movies": 60,
    "super hero": 65
  }
}

POST test/_refresh

GET test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "2016"
          }
        }
      ],
      "should": [
        {
          "rank_feature": {
            "field": "pagerank"
          }
        },
        {
          "rank_feature": {
            "field": "url_length",
            "boost": 0.1
          }
        },
        {
          "rank_feature": {
            "field": "topics.sports",
            "boost": 0.4
          }
        }
      ]
    }
  }
}

2. Usage

The rank_feature query supports three functions that influence scoring: saturation (the default), logarithm, and sigmoid.

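  • saturation
    The score lies in the interval (0, 1). The formula is S / (S + pivot), where S is the value of the rank feature (or of one entry in a rank features field) and pivot is the value at which the score equals 0.5: when S is greater than pivot, score > 0.5; when S is less than pivot, score < 0.5.
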
GET test/_search
{
  "query": {
    "rank_feature": {
      "field": "pagerank",
      "saturation": {
        "pivot": 8
      }
    }
  }
}
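
To make the formula concrete, here is a minimal Python sketch of the saturation function. The function name saturation is an illustrative choice and not part of any Elasticsearch client; it reproduces only S / (S + pivot) and ignores the boost and any other clause of a real query.

```python
def saturation(value: float, pivot: float) -> float:
    """Saturation score S / (S + pivot) for a positive feature value."""
    if value <= 0 or pivot <= 0:
        raise ValueError("value and pivot must be positive")
    return value / (value + pivot)


# pivot = 8 as in the query above; 50.3 is the pagerank of the sample documents.
for s in (2.0, 8.0, 50.3):
    print(f"S={s}: score={saturation(s, 8):.4f}")
```

A value below the pivot scores under 0.5, a value equal to the pivot scores exactly 0.5, and larger values approach 1 without reaching it.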

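If pivot is not specified, Elasticsearch derives a default from the values indexed for the field, namely an approximation of their geometric mean. If you do not know how to choose a pivot, the official recommendation is to leave it unset.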

GET test/_search
{
  "query": {
    "rank_feature": {
      "field": "pagerank",
      "saturation": {}
    }
  }
}
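
The Elasticsearch reference describes the default pivot as an approximate geometric mean of the indexed feature values. The sketch below computes an exact geometric mean in Python to show what that number looks like. It is an illustration only: Elasticsearch derives its approximation from index statistics, so the real default can differ, and the helper name geometric_mean is hypothetical.

```python
import math


def geometric_mean(values):
    """Exact geometric mean of positive numbers, computed in log space."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("need at least one value, all positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# topics.sports values of documents 1 and 2 from the example above.
sports = [50, 35]
pivot = geometric_mean(sports)
print(f"illustrative default pivot: {pivot:.2f}")
```

With such a pivot, roughly half of the documents (on a log scale) score above 0.5 and half below, which is why leaving pivot unset is a reasonable starting point.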

  • Logarithm
    The score is unbounded. The formula is log(scaling_factor + S), where S is the value of the rank feature (or of one entry in a rank features field) and scaling_factor is the configured scaling factor.

GET test/_search
{
  "query": {
    "rank_feature": {
      "field": "pagerank",
      "log": {
        "scaling_factor": 4
      }
    }
  }
}

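Note that this function only supports rank features with a positive score impact, that is, fields whose mapping does not set positive_score_impact to false. It therefore cannot be used on url_length from the mapping above.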
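
As a rough illustration of the logarithm function, the following Python sketch evaluates log(scaling_factor + S). It assumes the natural logarithm and requires positive inputs; both are assumptions of this sketch, and the function name log_score is hypothetical.

```python
import math


def log_score(value: float, scaling_factor: float) -> float:
    """Logarithm score log(scaling_factor + S); natural log is assumed here."""
    if value <= 0 or scaling_factor <= 0:
        raise ValueError("value and scaling_factor must be positive")
    return math.log(scaling_factor + value)


# scaling_factor = 4 as in the query above.
for s in (1.0, 50.3, 5000.0):
    print(f"S={s}: score={log_score(s, 4):.4f}")
```

Unlike saturation, the score keeps growing as S grows, only more and more slowly.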

  • Sigmoid
    The score lies in the interval (0, 1). This function extends saturation; the formula is S^exp / (S^exp + pivot^exp), which adds an exponent parameter. The exponent must be positive, and the recommended range is [0.5, 1]. If you do not yet know what a good exponent is, the official recommendation is to start with the saturation function.

GET test/_search
{
  "query": {
    "rank_feature": {
      "field": "pagerank",
      "sigmoid": {
        "pivot": 7,
        "exponent": 0.6
      }
    }
  }
}
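
The relationship between sigmoid and saturation can be checked with a short Python sketch. The function name sigmoid is illustrative; the code implements only S^exp / (S^exp + pivot^exp) and leaves out boost and the rest of the query.

```python
def sigmoid(value: float, pivot: float, exponent: float) -> float:
    """Sigmoid score S^exp / (S^exp + pivot^exp) for positive inputs."""
    if value <= 0 or pivot <= 0 or exponent <= 0:
        raise ValueError("value, pivot and exponent must be positive")
    powered = value ** exponent
    return powered / (powered + pivot ** exponent)


# pivot = 7 and exponent = 0.6 as in the query above.
print(f"exponent 0.6: {sigmoid(50.3, 7, 0.6):.4f}")
# With exponent = 1 the formula reduces to saturation: S / (S + pivot).
print(f"exponent 1.0: {sigmoid(50.3, 7, 1.0):.4f}")
```

The score is still 0.5 when S equals pivot, and a smaller exponent flattens the curve around the pivot so that large values are rewarded less strongly.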