Common Terms Query · Elasticsearch 5.4 Chinese Documentation

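# Common Terms Query

Original: [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-common-terms-query.html](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-common-terms-query.html)

Translation: [http://www.le.wiki/pages/viewpage.action?pageId=4883341](http://www.le.wiki/pages/viewpage.action?pageId=4883341)

Contributor: @羊兩頭

## Common Terms Query

The `common` terms query is a modern alternative to stopwords. It improves the precision and recall of search results by taking stopwords into account, without sacrificing performance.

### The problem

Every term in a query has a cost. A search for "The brown fox" requires three term queries, one each for "the", "brown" and "fox", and all of them are executed against every document in the index. The query for "the" is likely to match many documents, so it has a much smaller impact on relevance than the other two terms.

Previously, the solution to this problem was to ignore high-frequency terms. By treating "the" as a stopword, we reduce the index size and the number of term queries that need to be executed.

The problem with this approach is that, although stopwords have a small impact on relevance, they are still important. If we remove stopwords, we lose precision (for example, we can no longer distinguish between "happy" and "not happy") and we lose recall (for example, text like "The The" or "To be or not to be" would simply not exist in the index).

### The solution

The `common` terms query divides the query terms into two groups: more important (low-frequency terms) and less important (high-frequency terms that would previously have been stopwords).

First it searches for documents that match the more important terms. These are the terms that appear in fewer documents and have a greater impact on relevance.

Then it executes a second query for the less important terms, those that appear frequently and have a low impact on relevance. Instead of calculating the relevance score for all matching documents, however, it only calculates the score for documents already matched by the first query. In this way the high-frequency terms can improve the relevance calculation without paying the cost of poor performance.

If a query consists only of high-frequency terms, a single query is executed as an `AND` (conjunction) query; in other words, all terms are required. Even though each individual term will match many documents, the combination of terms narrows the result set down to only the most relevant documents. The single query can also be executed as an `OR` with a specific `minimum_should_match`; in that case a high enough value should probably be used.

Terms are allocated to the high-frequency or low-frequency group based on `cutoff_frequency`, which can be specified as an absolute frequency (`>= 1`) or as a relative frequency (`0.0 .. 1.0`). Keep in mind that document frequencies are computed per shard, as explained in the blog post "Relevance is broken".

Perhaps the most interesting property of this query is that it adapts to domain-specific stopwords automatically. For example, on a video hosting site, common terms like "clip" or "video" will automatically behave as stopwords without the need to maintain a manual list.

### Examples

In this example, words that have a document frequency greater than 0.1% (such as "this" and "is") will be treated as common terms.

```
GET /_search
{
    "query": {
        "common": {
            "body": {
                "query": "this is bonsai cool",
                "cutoff_frequency": 0.001
            }
        }
    }
}
```

The number of terms that should match can be controlled with the `minimum_should_match` (`high_freq`, `low_freq`), `low_freq_operator` (default `"or"`) and `high_freq_operator` (default `"or"`) parameters.

For low-frequency terms, set `low_freq_operator` to `"and"` to make all terms required:

```
GET /_search
{
    "query": {
        "common": {
            "body": {
                "query": "nelly the elephant as a cartoon",
                "cutoff_frequency": 0.001,
                "low_freq_operator": "and"
            }
        }
    }
}
```

which is roughly equivalent to:

```
GET /_search
{
    "query": {
        "bool": {
            "must": [
                { "term": { "body": "nelly"}},
                { "term": { "body": "elephant"}},
                { "term": { "body": "cartoon"}}
            ],
            "should": [
                { "term": { "body": "the"}},
                { "term": { "body": "as"}},
                { "term": { "body": "a"}}
            ]
        }
    }
}
```

Alternatively, use `minimum_should_match` to specify a minimum number or percentage of low-frequency terms that must be present, for instance:

```
GET /_search
{
    "query": {
        "common": {
            "body": {
                "query": "nelly the elephant as a cartoon",
                "cutoff_frequency": 0.001,
                "minimum_should_match": 2
            }
        }
    }
}
```

which is roughly equivalent to:

```
GET /_search
{
    "query": {
        "bool": {
            "must": {
                "bool": {
                    "should": [
                        { "term": { "body": "nelly"}},
                        { "term": { "body": "elephant"}},
                        { "term": { "body": "cartoon"}}
                    ],
                    "minimum_should_match": 2
                }
            },
            "should": [
                { "term": { "body": "the"}},
                { "term": { "body": "as"}},
                { "term": { "body": "a"}}
            ]
        }
    }
}
```

A different `minimum_should_match` can be applied to low-frequency and high-frequency terms with the additional `low_freq` and `high_freq` parameters. Here is an example that provides them (note the change in structure):

```
GET /_search
{
    "query": {
        "common": {
            "body": {
                "query": "nelly the elephant not as a cartoon",
                "cutoff_frequency": 0.001,
                "minimum_should_match": {
                    "low_freq" : 2,
                    "high_freq" : 3
                }
            }
        }
    }
}
```

which is roughly equivalent to:

```
GET /_search
{
    "query": {
        "bool": {
            "must": {
                "bool": {
                    "should": [
                        { "term": { "body": "nelly"}},
                        { "term": { "body": "elephant"}},
                        { "term": { "body": "cartoon"}}
                    ],
                    "minimum_should_match": 2
                }
            },
            "should": {
                "bool": {
                    "should": [
                        { "term": { "body": "the"}},
                        { "term": { "body": "not"}},
                        { "term": { "body": "as"}},
                        { "term": { "body": "a"}}
                    ],
                    "minimum_should_match": 3
                }
            }
        }
    }
}
```

In this case it means that the high-frequency terms only have an impact on relevance when at least three of them are present. The most interesting use of `minimum_should_match` for high-frequency terms, however, is when the query contains only high-frequency terms:

```
GET /_search
{
    "query": {
        "common": {
            "body": {
                "query": "how not to be",
                "cutoff_frequency": 0.001,
                "minimum_should_match": {
                    "low_freq" : 2,
                    "high_freq" : 3
                }
            }
        }
    }
}
```

which is roughly equivalent to:

```
GET /_search
{
    "query": {
        "bool": {
            "should": [
                { "term": { "body": "how"}},
                { "term": { "body": "not"}},
                { "term": { "body": "to"}},
                { "term": { "body": "be"}}
            ],
            "minimum_should_match": "3<50%"
        }
    }
}
```

The query generated for the high-frequency terms is then slightly less restrictive than with an `AND`.

The `common` terms query also supports `boost`, `analyzer` and `disable_coord` as parameters.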
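
The grouping by `cutoff_frequency` and the rewrite into a `bool` query can be illustrated with a small Python sketch. This is a simplified model for intuition only, not Elasticsearch's or Lucene's actual implementation: the helper names (`split_terms`, `build_bool_query`), the document frequencies and the index size are all hypothetical, and the exact rounding of the relative threshold is an assumption.

```python
def split_terms(terms, doc_freq, num_docs, cutoff_frequency):
    """Partition terms into (low_freq, high_freq) lists.

    A cutoff >= 1 is read as an absolute document count, a cutoff
    below 1 as a fraction of num_docs (assumed rounding: none).
    """
    if cutoff_frequency >= 1:
        max_df = cutoff_frequency
    else:
        max_df = cutoff_frequency * num_docs
    low, high = [], []
    for term in terms:
        # Terms missing from doc_freq are treated as frequency 0 (low).
        if doc_freq.get(term, 0) > max_df:
            high.append(term)
        else:
            low.append(term)
    return low, high


def term_clauses(field, terms):
    """Build one term clause per term for the given field."""
    return [{"term": {field: t}} for t in terms]


def build_bool_query(field, low, high, low_freq_operator="or"):
    """Build a bool query shaped like the 'roughly equivalent' examples."""
    if not low:
        # Only high-frequency terms: every term is required (AND).
        return {"bool": {"must": term_clauses(field, high)}}
    if low_freq_operator == "and":
        must = term_clauses(field, low)
    else:
        # "or": at least one low-frequency term has to match.
        must = {"bool": {"should": term_clauses(field, low),
                         "minimum_should_match": 1}}
    query = {"bool": {"must": must}}
    if high:
        # High-frequency terms only contribute to scoring.
        query["bool"]["should"] = term_clauses(field, high)
    return query


# Hypothetical per-shard statistics for a 100,000-document index.
doc_freq = {"nelly": 12, "the": 95000, "elephant": 40,
            "as": 60000, "a": 90000, "cartoon": 75}
terms = "nelly the elephant as a cartoon".split()
low, high = split_terms(terms, doc_freq, num_docs=100000,
                        cutoff_frequency=0.001)
print(low, high)
print(build_bool_query("body", low, high, low_freq_operator="and"))
```

With these made-up statistics, the relative cutoff of `0.001` corresponds to 100 documents, so "nelly", "elephant" and "cartoon" land in the low-frequency group and become required clauses, while "the", "as" and "a" become optional scoring clauses.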