[[practical-scoring-function]]
=== Lucene's Practical Scoring Function
For multiterm queries, Lucene takes((("relevance", "controlling", "Lucene's practical scoring function", id="ix_relcontPCF", range="startofrange")))((("Boolean Model"))) the <<boolean-model,Boolean model>>,
<<tfidf,TF/IDF>>, and the <<vector-space-model,vector space model>> and
combines ((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm")))((("Vector Space Model"))) them in a single efficient package that collects matching
documents and scores them as it goes.
A multiterm query like
[source,json]
------------------------------
GET /my_index/doc/_search
{
    "query": {
        "match": {
            "text": "quick fox"
        }
    }
}
------------------------------
is rewritten internally to look like this:
[source,json]
------------------------------
GET /my_index/doc/_search
{
    "query": {
        "bool": {
            "should": [
                {"term": { "text": "quick" }},
                {"term": { "text": "fox"   }}
            ]
        }
    }
}
------------------------------
The `bool` query implements the Boolean model and, in this example, will
include only documents that contain either the term `quick` or the term `fox` or
both.
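
The rewrite can be pictured with a short Python sketch. It is only an illustration: the helper name `rewrite_match` is hypothetical, and the lowercase-and-split step is a stand-in for the real analysis chain, which this section does not describe.

```python
def rewrite_match(field, text):
    """Build a bool query with one should/term clause per query term."""
    # Assumption: a toy analyzer that lowercases and splits on whitespace.
    terms = text.lower().split()
    return {
        "bool": {
            "should": [{"term": {field: term}} for term in terms]
        }
    }


query = rewrite_match("text", "quick fox")
print(query)
```

Each term in the query string becomes its own `term` clause, which is why a document needs to match only one of them to be included.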
As soon as a document matches a query, Lucene calculates its score for that
query, combining the scores of each matching term. The formula used for
scoring is called the _practical scoring function_.((("practical scoring function"))) It looks intimidating, but
don't be put off: you already know most of the components. It introduces only
a few new elements, which we discuss next.
................................
score(q,d)  =  <1>
            queryNorm(q)  <2>
          · coord(q,d)    <3>
          · ∑ (           <4>
                tf(t in d)   <5>
              · idf(t)²      <6>
              · t.getBoost() <7>
              · norm(t,d)    <8>
            ) (t in q)    <4>
................................
<1> `score(q,d)` is the relevance score of document `d` for query `q`.
<2> `queryNorm(q)` is the <<query-norm,_query normalization_ factor>> (new).
<3> `coord(q,d)` is the <<coord,_coordination_ factor>> (new).
<4> The sum of the weights for each term `t` in the query `q` for document `d`.
<5> `tf(t in d)` is the <<tf,term frequency>> for term `t` in document `d`.
<6> `idf(t)` is the <<idf,inverse document frequency>> for term `t`.
<7> `t.getBoost()` is the <<query-time-boosting,_boost_>> that has been
applied to the query (new).
<8> `norm(t,d)` is the <<field-norm,field-length norm>>, combined with the
<<index-boost,index-time field-level boost>>, if any (new).
You should recognize `score`, `tf`, and `idf`. The `queryNorm`, `coord`,
`t.getBoost`, and `norm` are new.
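
To see how the pieces multiply together, here is a minimal Python sketch of the formula. It assumes that every factor has already been computed elsewhere; the `TermStats` class, the `practical_score` function, and the sample numbers are invented for illustration and are not Lucene APIs.

```python
from dataclasses import dataclass


@dataclass
class TermStats:
    """Precomputed factors for one query term that matched the document."""
    tf: float     # tf(t in d)
    idf: float    # idf(t)
    boost: float  # t.getBoost()
    norm: float   # norm(t,d)


def practical_score(query_norm, coord, matching_terms):
    # Sum the per-term weights: tf * idf squared * boost * norm.
    total = sum(t.tf * t.idf ** 2 * t.boost * t.norm for t in matching_terms)
    # Scale the sum by the two query-level factors.
    return query_norm * coord * total


# Made-up values for a document that matches both terms of a two-term query.
terms = [
    TermStats(tf=1.0, idf=2.0, boost=1.0, norm=0.5),
    TermStats(tf=2.0, idf=1.0, boost=1.0, norm=0.5),
]
score = practical_score(query_norm=0.5, coord=1.0, matching_terms=terms)
print(score)
```

Note that `idf` is squared inside the sum, while `queryNorm` and `coord` are applied once to the whole sum.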
We will talk more about <<query-time-boosting,query-time boosting>> later in
this chapter, but first let's get query normalization, coordination, and
index-time field-level boosting out of the way.
[[query-norm]]
==== Query Normalization Factor
The _query normalization factor_ (`queryNorm`) is ((("practical scoring function", "query normalization factor")))((("query normalization factor")))((("normalization", "query normalization factor")))an attempt to _normalize_ a
query so that the results from one query may be compared with the results of
another.
[TIP]
==================================================
Even though the intent of the query norm is to make results from different
queries comparable, it doesn't work very well. The only purpose of
the relevance `_score` is to sort the results of the current query in the
correct order. You should not try to compare the relevance scores from
different queries.
==================================================
This factor is calculated at the beginning of the query. The actual
calculation depends on the queries involved, but a typical implementation is as follows:
..........................
queryNorm = 1 / √sumOfSquaredWeights <1>
..........................
<1> The `sumOfSquaredWeights` is calculated by adding together the IDF of each
term in the query, after squaring each one.
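
As a rough sketch of this typical implementation, the calculation can be written in a few lines of Python. The function name `query_norm` and the sample IDF values are made up for the example.

```python
import math


def query_norm(idfs):
    """Return 1 / sqrt(sumOfSquaredWeights) for the IDFs of the query terms."""
    sum_of_squared_weights = sum(idf ** 2 for idf in idfs)
    return 1.0 / math.sqrt(sum_of_squared_weights)


# Two query terms with IDFs of 3.0 and 4.0: 1 / sqrt(9 + 16)
print(query_norm([3.0, 4.0]))  # 0.2
```

The result depends only on the terms in the query, never on any individual document.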
TIP: The same query normalization factor is applied to every document, and you
have no way of changing it. For all intents and purposes, it can be ignored.
[[coord]]
==== Query Coordination
The _coordination factor_ (`coord`) is used to((("coordination factor (coord)")))((("query coordination")))((("practical scoring function", "coordination factor"))) reward documents that contain a
higher percentage of the query terms. The more query terms that appear in
the document, the greater the chances that the document is a good match for
the query.
Imagine that we have a query for `quick brown fox`, and that the
weight for each term is 1.5. Without the coordination factor, the score would
just be the sum of the weights of the terms in a document. For instance:
* Document with `fox` -> score: 1.5
* Document with `quick fox` -> score: 3.0
* Document with `quick brown fox` -> score: 4.5
The coordination factor multiplies the score by the number of matching terms
in the document, and divides it by the total number of terms in the query.
With the coordination factor, the scores would be as follows:
* Document with `fox` -> score: `1.5 * 1 / 3` = 0.5
* Document with `quick fox` -> score: `3.0 * 2 / 3` = 2.0
* Document with `quick brown fox` -> score: `4.5 * 3 / 3` = 4.5
With the coordination factor, the document that contains all three terms
scores much higher than the document that contains just two of them.
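
The arithmetic above can be reproduced with a small Python sketch. It keeps the simplifying assumption from the example that every term has a weight of 1.5; the names `coord`, `score`, and `WEIGHT` are illustrative only.

```python
WEIGHT = 1.5  # assumed weight of every term, as in the example above
QUERY_TERMS = ["quick", "brown", "fox"]


def coord(matching_terms, total_terms):
    """Fraction of the query terms that appear in the document."""
    return matching_terms / total_terms


def score(doc_terms):
    matched = [t for t in QUERY_TERMS if t in doc_terms]
    raw = WEIGHT * len(matched)  # score without coordination
    return raw * coord(len(matched), len(QUERY_TERMS))


for doc in (["fox"], ["quick", "fox"], ["quick", "brown", "fox"]):
    print(" ".join(doc), "->", round(score(doc), 2))
```

Because the raw score and the coordination factor both grow with the number of matching terms, the gap between partial and complete matches widens.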
Remember that the query for `quick brown fox` is rewritten into a `bool` query
like this:
[source,json]
-------------------------------
GET /_search
{
    "query": {
        "bool": {
            "should": [
                { "term": { "text": "quick" }},
                { "term": { "text": "brown" }},
                { "term": { "text": "fox"   }}
            ]
        }
    }
}
-------------------------------
The `bool` query uses query coordination by default for all `should` clauses,
but it does allow you to disable coordination. Why might you want to do this?
Well, usually the answer is, you don't. Query coordination is usually a good
thing. When you use a `bool` query to wrap several high-level queries like
the `match` query, it also makes sense to leave coordination enabled. The more
clauses that match, the higher the degree of overlap between your search
request and the documents that are returned.
However, in some advanced use cases, it might make sense to disable
coordination. Imagine that you are looking for the synonyms `jump`, `leap`, and
`hop`. You don't care how many of these synonyms are present, as they all
represent the same concept. In fact, only one of the synonyms is likely to be
present. This would be a good case for disabling the coordination factor:
[source,json]
-------------------------------
GET /_search
{
    "query": {
        "bool": {
            "disable_coord": true,
            "should": [
                { "term": { "text": "jump" }},
                { "term": { "text": "hop"  }},
                { "term": { "text": "leap" }}
            ]
        }
    }
}
-------------------------------
When you use synonyms (see <<synonyms>>), this is exactly what
happens internally: the rewritten query disables coordination for the
synonyms. ((("synonyms", "query coordination and"))) Most use cases for disabling coordination are handled
automatically; you don't need to worry about it.
[[index-boost]]
==== Index-Time Field-Level Boosting
We will talk about _boosting_ a field--making it ((("indexing", "field-level index time boosts")))((("boosting", "index time field-level boosting")))((("practical scoring function", "index time field-level boosting")))more important than other
fields--at query time in <<query-time-boosting>>. It is also possible
to apply a boost to a field at index time. Actually, this boost is applied to
every term in the field, rather than to the field itself.
To store this boost value in the index without using more space
than necessary, this field-level index-time boost is combined with the ((("field-length norm")))field-length norm (see <<field-norm>>) and stored in the index as a single byte.
This is the value returned by `norm(t,d)` in the preceding formula.
[WARNING]
=========================================
We strongly recommend against using field-level index-time boosts for a few
reasons:
* Combining the boost with the field-length norm and storing it in a single
byte means that the field-length norm loses precision. The result is that
Elasticsearch is unable to distinguish between a field containing three words
and a field containing five words.
* To change an index-time boost, you have to reindex all your documents.
A query-time boost, on the other hand, can be changed with every query.
* If a field with an index-time boost has multiple values, the boost is
multiplied by itself for every value, dramatically increasing
the weight for that field.
<<query-time-boosting,Query-time boosting>> is a much simpler, cleaner, more
flexible option.
=========================================
With query normalization, coordination, and index-time boosting out of the way,
we can now move on to the most useful tool for influencing the relevance
calculation: query-time boosting.((("relevance", "controlling", "Lucene's practical scoring function", range="endofrange", startref="ix_relcontPCF")))