Match All Query · my-elasticsearch-cn

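# Match All Query

## Match All Query

The simplest query: it matches every document and gives each one a `_score` of 1.0. It is the equivalent of `select * from table` in a relational database.

~~~
curl -XGET 'localhost:9200/_search?pretty' -d'
{
  "query": {
    "match_all": {}
  }
}'
~~~

To change the weight that a query clause contributes to `_score`, use the `boost` parameter:

~~~
curl -XGET 'localhost:9200/_search?pretty' -d'
{
  "query": {
    "match_all": { "boost" : 1.2 }
  }
}'
~~~

## Match None Query

The inverse of `match_all`: `match_none` matches no documents.

~~~
curl -XGET 'localhost:9200/_search?pretty' -d'
{
  "query": {
    "match_none": {}
  }
}'
~~~

# Full-text search

The two most important aspects of full-text search are:

* Relevance: the ability to rank the matching documents by how relevant they are to the query. The relevance score may be derived from TF/IDF, geographic proximity (geolocation), fuzzy similarity, or some other algorithm.
* Analysis: the process of converting a block of text into distinct, normalized tokens, in order to (a) build the inverted index and (b) query the inverted index.

Whenever we talk about relevance or analysis, we are in query context, not filter context.

# Full text queries

The high-level full-text queries are normally used to search the full content of text fields. Before searching, the query string is analyzed with the analyzer of each field being queried.

* If the query targets a date or integer field, the query string is treated as a date or an integer respectively.
* If the query targets an exact-value (`not_analyzed`) string field, the whole query string is treated as a single term.
* If the query targets a full-text (`analyzed`) field, the query string is passed through the appropriate analyzer, which produces the list of terms to search for.

Once the query has been turned into a list of terms, a low-level query is run for each term, and the results are combined into a final relevance score for each document.

**Note**: before running an exact query against a `not_analyzed` field, consider whether you really want a query or a filter. Single-term lookups can usually be phrased as yes/no questions, so they are better expressed as filters, which can also make effective use of the filter cache.

The full-text queries are described in detail below.

## Match Query

The `match` query accepts text, numeric, and date values, analyzes them, and builds a query. A simple example:

~~~
curl -XGET 'localhost:9200/_search?pretty' -d'
{
  "query": {
    "match" : {
      "message" : "this is a test"
    }
  }
}'
~~~

Here `message` is the field name; replace it as needed. The query above first analyzes `this is a test` into tokens, matches each term, and merges the results.

### match

`match` is a query of type boolean: it analyzes the supplied text and builds a boolean query from it.

* `operator`: can be set to `and` or `or`, and controls how the boolean query is constructed.
* `minimum_should_match`: when there are several optional `should` clauses, sets the minimum number of them that must match.
* `analyzer`: selects the analyzer used on the query text.
* `lenient`: defaults to `false`; when set to `true`, exceptions caused by data type mismatches are ignored.

### Fuzziness

`fuzziness` turns on fuzzy matching. The parameter sets how large a difference is tolerated when matching. The value should be between 0 and 2, and the larger it is, the longer the computation takes. In the example below the name contains one extra `a`, yet fuzzy matching still finds it:

![Fuzzy matching example](http://img.blog.csdn.net/20161126183122968)

Reference: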
[https://www.elastic.co/blog/found-fuzzy-search](https://www.elastic.co/blog/found-fuzzy-search)

### zero terms query

### cutoff frequency

Specifies the document frequency.

## match phrase

Phrase matching analyzes the query string, records the positions of the resulting tokens, and then builds a phrase query against the target field. The example below finds documents that contain the phrase `this is a test` with the tokens in that same order.

~~~
curl -XGET 'localhost:9200/_search?pretty' -d'
{
  "query": {
    "match_phrase" : {
      "message" : "this is a test"
    }
  }
}'
~~~

The analyzer applied to the query string can be specified explicitly:

~~~
curl -XGET 'localhost:9200/_search?pretty' -d'
{
  "query": {
    "match_phrase" : {
      "message" : {
        "query" : "this is a test",
        "analyzer" : "my_analyzer"
      }
    }
  }
}'
~~~

When strict ordering is not wanted, set `slop`: the number of times the tokens of the query string may be moved in order to line up with the document. With a large enough `slop`, the search behaves as if order were ignored. For example:

![Slop example](http://img.blog.csdn.net/20161126183253344)

* Document content: `quick brown fox`
* Query string: `fox quick`
* Moves:
  * move `quick` from pos2 to pos1
  * move `fox` from pos1 to pos2
  * move `fox` from pos2 to pos3

## Match Phrase Prefix Query

Similar to `match_phrase`, except that the last token is matched as a prefix. `max_expansions` limits the number of terms that prefix may expand to:

~~~
curl -XGET 'localhost:9200/_search?pretty' -d'
{
  "query": {
    "match_phrase_prefix" : {
      "message" : {
        "query" : "quick brown f",
        "max_expansions" : 10
      }
    }
  }
}'
~~~

## multi match query

Searches several fields at once:

~~~
curl -XGET 'localhost:9200/_search?pretty' -d'
{
  "query": {
    "multi_match" : {
      "query": "this is a test",
      "fields": [ "subject", "message" ]
    }
  }
}'
~~~

Fields can be given different weights. In the example below, `subject` counts three times as much as `message`:

~~~
curl -XGET 'localhost:9200/_search?pretty' -d'
{
  "query": {
    "multi_match" : {
      "query" : "this is a test",
      "fields" : [ "subject^3", "message" ]
    }
  }
}'
~~~

The `multi_match` query has the following types.

### best_fields

Wraps one `match` query per field in a `dis_max` query, so that a document that matches well in a single field scores higher:

~~~
curl -XGET 'localhost:9200/_search?pretty' -d'
{
  "query": {
    "multi_match" : {
      "query": "brown fox",
      "type": "best_fields",
      "fields": [ "subject", "message" ],
      "tie_breaker": 0.3
    }
  }
}'
~~~

This is equivalent to:

~~~
curl -XGET 'localhost:9200/_search?pretty' -d'
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "subject": "brown fox" }},
        { "match": { "message": "brown fox" }}
      ],
      "tie_breaker": 0.3
    }
  }
}'
~~~
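The effect of `tie_breaker` in the two equivalent requests above can be illustrated with a little arithmetic. The sketch below is plain Python, not Elasticsearch code: the function name `dis_max_score` and the sample scores are made up for illustration, and it assumes the usual `dis_max` rule that the final score is the best sub-query score plus `tie_breaker` times the scores of the other matching sub-queries.

```python
def dis_max_score(sub_scores, tie_breaker=0.0):
    """Combine per-field scores as dis_max is assumed to do:
    best score + tie_breaker * (sum of the other matching scores)."""
    # Sub-queries that did not match (score 0) contribute nothing.
    matching = [s for s in sub_scores if s > 0]
    if not matching:
        return 0.0
    best = max(matching)
    return best + tie_breaker * (sum(matching) - best)


# Hypothetical scores for one document: subject scored 2.0, message scored 1.0.
for tb in (0.0, 0.3, 1.0):
    print(tb, dis_max_score([2.0, 1.0], tie_breaker=tb))
```

With `tie_breaker` at 0 only the best field counts, and with 1.0 the result is the plain sum of all field scores; values in between let the other fields break ties without dominating.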
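`tie_breaker`: this parameter is only used when the `use_dis_max` parameter is set to `true`. It sets the balance between the lower-scoring fields and the highest-scoring field: it is the weight given to the scores of all sub-queries other than the best-scoring one.

### most_fields

Searches every field and combines the scores: the more fields match, the higher the score.

~~~
curl -XGET 'localhost:9200/_search?pretty' -d'
{
  "query": {
    "multi_match" : {
      "query": "quick brown fox",
      "type": "most_fields",
      "fields": [ "title", "title.original", "title.shingles" ]
    }
  }
}'
~~~

This is the same as:

~~~
curl -XGET 'localhost:9200/_search?pretty' -d'
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "quick brown fox" }},
        { "match": { "title.original": "quick brown fox" }},
        { "match": { "title.shingles": "quick brown fox" }}
      ]
    }
  }
}'
~~~

The scores of all `match` clauses are added together and divided by the number of `match` clauses.

### phrase and phrase_prefix

These behave like `best_fields`, except that each field is queried with `match_phrase` or `match_phrase_prefix` instead of `match`:

~~~
curl -XGET 'localhost:9200/_search?pretty' -d'
{
  "query": {
    "multi_match" : {
      "query": "quick brown f",
      "type": "phrase_prefix",
      "fields": [ "subject", "message" ]
    }
  }
}'
~~~

This does the same as:

~~~
curl -XGET 'localhost:9200/_search?pretty' -d'
{
  "query": {
    "dis_max": {
      "queries": [
        { "match_phrase_prefix": { "subject": "quick brown f" }},
        { "match_phrase_prefix": { "message": "quick brown f" }}
      ]
    }
  }
}'
~~~

### cross_fields

Treats all the fields as one big field and searches that.

## Common Terms Query

### Problem

When a query string contains several tokens, each token becomes a `term` query. Some tokens, such as `the` and `a`, are so common that they should not influence the document score, and treating them as stopwords reduces the number of `term` queries. Removing them outright, however, costs precision: we can no longer tell `happy` from `not happy`.

### Solution

The common terms query runs in two steps:

* Search for the more important terms (those that occur in fewer documents) and compute the score.
* Within the documents found in the first step, search for the less important, high-frequency tokens and compute the score.

The boundary between the two groups is set with `cutoff_frequency`: a value of 1 or more is an absolute frequency, a value below 1 is a relative frequency. In the example below, tokens that occur in more than 0.1% of documents are treated as common terms, and the low-frequency tokens are combined with the `and` operator:

~~~
curl -XGET 'localhost:9200/_search?pretty' -d'
{
  "query": {
    "common": {
      "body": {
        "query": "nelly the elephant as a cartoon",
        "cutoff_frequency": 0.001,
        "low_freq_operator": "and"
      }
    }
  }
}'
~~~

The query above is roughly equivalent to:

~~~
curl -XGET 'localhost:9200/_search?pretty' -d'
{
  "query": {
    "bool": {
      "must": [
        { "term": { "body": "nelly"}},
        { "term": { "body": "elephant"}},
        { "term": { "body": "cartoon"}}
      ],
      "should": [
        { "term": { "body": "the"}},
        { "term": { "body": "as"}},
        { "term": { "body": "a"}}
      ]
    }
  }
}'
~~~

High-frequency and low-frequency tokens can be constrained separately:

~~~
curl -XGET 'localhost:9200/_search?pretty' -d'
{
  "query": {
    "common": {
      "body": {
        "query": "nelly the elephant not as a cartoon",
        "cutoff_frequency": 0.001,
        "minimum_should_match": {
          "low_freq" : 2,
          "high_freq" : 3
        }
      }
    }
  }
}'
~~~

# Multi-word matching

`match` is a boolean query that analyzes the text it is given. The operator defaults to `or` and can be set to `or` or `and` as needed. For example, to require every term to match, set it to `and`:

~~~
curl -XGET 'localhost:9200/_search?pretty' -d'
{
  "query": {
    "match" : {
      "message" : {
        "query": "this is a test",
        "operator": "and"
      }
    }
  }
}'
~~~

## Multi-field search

Index the test documents:

~~~
PUT /my_index/my_type/1
{
  "title": "Quick brown rabbits",
  "body": "Brown rabbits are commonly seen."
}

PUT /my_index/my_type/2
{
  "title": "Keeping pets healthy",
  "body": "My quick brown fox eats rabbits on a regular basis."
}
~~~

Multi-field query: when several fields are queried, results are by default ranked according to the rules below.

~~~
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "Brown fox" }},
        { "match": { "body": "Brown fox" }}
      ]
    }
  }
}
~~~

The `bool` query:

* runs both queries in the `should` clause,
* adds their scores together,
* multiplies the sum by the number of matching clauses,
* and divides by the total number of clauses (here: 2).

With this plain multi-field query, document 1 contains `brown` in both fields, so both `match` clauses match and it scores higher than document 2. Yet document 2 is clearly the better match for `brown fox`. To favor the document with the single best-matching field, use `dis_max`:

## dis_max query (disjunction max)

~~~
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "title": "Quick pets" }},
        { "match": { "body": "Quick pets" }}
      ]
    }
  }
}
~~~

## tie_breaker

## Controlling precision

~~~
curl -XGET 'localhost:9200/_search?pretty' -d'
{
  "query": {
    "match" : {
      "message" : {
        "query": "this is a test",
        "minimum_should_match": "75%"
      }
    }
  }
}'
~~~

A minimum match percentage is the usual way to control how many terms must match. For a query that analyzes to three terms, 75% cannot be met exactly, so it is rounded down to 66.6%, meaning at least two terms must match. (Assuming the default standard analyzer, which removes no stopwords, `this is a test` yields four terms, so 75% there means three.) The value can also be negative, and negative values have a slightly different meaning. With 4 terms, -25% and 75% mean the same thing: at least three terms must match. With 5 terms, -25% means at least four must match, whereas 75% means at least three.

## How match Uses bool

By now it should be clear how to search for several words: all we need to do is put several clauses into a `bool` query. Because the default operator is `or`, each `term` query is treated as a `should` clause, so at least one clause must match. The following two queries are equivalent:

~~~
{
  "match": { "title": "brown fox"}
}
~~~

and

~~~
{
  "bool": {
    "should": [
      { "term": { "title": "brown" }},
      { "term": { "title": "fox" }}
    ]
  }
}
~~~

With the `and` operator, the following two queries are equivalent as well:

~~~
{
  "match": {
    "title": {
      "query": "brown fox",
      "operator": "and"
    }
  }
}
~~~

and

~~~
{
  "bool": {
    "must": [
      { "term": { "title": "brown" }},
      { "term": { "title": "fox" }}
    ]
  }
}
~~~

And if `minimum_should_match` is given as below, these two queries are also equivalent:

~~~
{
  "match": {
    "title": {
      "query": "quick brown fox",
      "minimum_should_match": "75%"
    }
  }
}
~~~

and

~~~
{
  "bool": {
    "should": [
      { "term": { "title": "brown" }},
      { "term": { "title": "fox" }},
      { "term": { "title": "quick" }}
    ],
    "minimum_should_match": 2
  }
}
~~~

Normally we write these queries as `match` queries. Understanding how `match` works internally, however, lets us take control of the query process when we need to. Sometimes a single `match` query is not enough, for example when some query conditions should carry more weight. The next part covers such an example.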
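The percentage rounding for `minimum_should_match` discussed in the precision section can be checked with a few lines of Python. This is only a sketch of the arithmetic, not Elasticsearch code: the helper name `required_matches` is made up, and it assumes that a positive percentage is rounded down, while a negative percentage is the share of terms that may be missing, rounded down before being subtracted from the total.

```python
def required_matches(num_terms, percent):
    """Minimum number of terms that must match for a percentage
    minimum_should_match (e.g. 75 for "75%", -25 for "-25%")."""
    if percent >= 0:
        # Positive: this share of the terms is required, rounded down.
        return (num_terms * percent) // 100
    # Negative: this share of the terms may be missing, rounded down,
    # then subtracted from the total.
    return num_terms - (num_terms * -percent) // 100


for terms, pct in [(3, 75), (4, 75), (4, -25), (5, 75), (5, -25)]:
    print(f"{terms} terms, {pct}%: at least {required_matches(terms, pct)}")
```

The five cases give 2, 3, 3, 3, and 4 required terms, which matches the text: 75% and -25% agree for four terms but differ for five.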