Operations · TUNA-daily

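[TOC]

## 1. Query string search

> * These are structured queries: the search is passed in the `query` parameter of the request body, and an empty clause `{}` matches everything. The examples below are query DSL requests built from JSON.

### 1.1 match_all

> `match_all` returns all documents; it is the default when no query condition is given.

~~~
GET _search
{
  "query": {
    "match_all": {}
  }
}
~~~

### 1.2 match

1. Match documents whose age is 32

~~~
GET _search
{
  "query": {
    "match": { "age": 32 }
  }
}
~~~

### 1.3 match_phrase

> Phrase matching: works like the `match` query, but is used to match exact phrases or word-proximity matches.
> The `match_phrase` query analyzes the search text and builds a phrase query from the analyzed terms, so every term must appear in the field, in the same order and next to each other.
> For example, `hello world` only matches text that contains the phrase "hello world".

* Return documents containing the phrase "java spark"

~~~
GET /forum/article/_search
{
  "query": {
    "match_phrase": {
      "content": "java spark"
    }
  }
}
~~~

* The `slop` parameter (proximity match) controls how far apart the terms of the phrase may be, which gives an approximate phrase match

~~~
GET /forum/article/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "java spark",
        "slop": 10    # maximum distance between java and spark, counted in terms
      }
    }
  }
}
~~~

### 1.4 match_phrase_prefix

> `match_phrase_prefix` is the same as `match_phrase`, except that it allows prefix matching on the last term of the text.
> It accepts the same parameters as the phrase type. In addition, it accepts a `max_expansions` parameter (default 50) that controls how many terms the last term may be expanded to. Setting it to an acceptable value is strongly recommended in order to bound the execution time of the query. For example:

~~~
GET /_search
{
  "query": {
    "match_phrase_prefix": {
      "message": {
        "query": "quick brown f",
        "max_expansions": 50
      }
    }
  }
}
~~~

> * This builds a phrase query for "quick brown" followed by a prefix query on `f`. `"max_expansions": 50` means that the query looks at the sorted term dictionary for the first 50 terms that begin with `f` and adds them to the phrase query. A word that begins with `f` but sorts after the first 50 terms may therefore not be found at first; it will be found once the user types more characters after the `f`.
> Both full-text search and exact-value queries are mostly built on `match`.

### 1.4 multi_match

> 1. Builds on `match` and matches against several fields

~~~
GET _search
{
  "query": {
    "multi_match": {
      "query": "like",                      # query text
      "fields": ["interests", "about"]      # fields to search
    }
  }
}
~~~

> 2. Fields can be specified with wildcards

~~~
GET _search
{
  "query": {
    "multi_match": {
      "query": "like",
      "fields": ["interests", "*t"]    # matches fields whose name ends in t
    }
  }
}
~~~

> 3. Use `^n` to boost the weight of a field

~~~
GET _search
{
  "query": {
    "multi_match": {
      "query": "music",
      "fields": ["interests", "about^3"]
    }
  }
}
~~~

> * This boosts the `about` field: a document whose `about` field contains "music" ranks higher in the results.

> 4. Types of `multi_match` query

> 4.1 best_fields

`best_fields` (the default `multi_match` type) finds documents that match any field, but uses the `_score` from the best field. It is most useful when you are searching for multiple words that are best found in the same field. For instance, "brown fox" in a single field is more meaningful than "brown" in one field and "fox" in another.

The `best_fields` type generates a `match` query for each field and wraps them in a `dis_max` query to find the single best matching field. For example, this query:

~~~
GET /_search
{
  "query": {
    "multi_match": {
      "query": "brown fox",
      "type": "best_fields",
      "fields": ["subject", "message"],
      "tie_breaker": 0.3
    }
  }
}
~~~

is converted into a `dis_max` query that contains two `match` queries:

~~~
GET /_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "subject": "brown fox" }},
        { "match": { "message": "brown fox" }}
      ],
      "tie_breaker": 0.3    # weight given to the other, less precise matches
    }
  }
}
~~~

Normally the `best_fields` type uses the score of the single best matching field, but if `tie_breaker` is specified, the score is calculated as follows:

1. the score from the best matching field
2. plus `tie_breaker * _score` for all other matching fields

* The `best_fields` and `most_fields` types are field-centric: they generate a `match` query per field. This means that the `operator` and `minimum_should_match` parameters are applied to each field individually, which is probably not what you want.

~~~
GET /_search
{
  "query": {
    "multi_match": {
      "query": "Will Smith",
      "type": "best_fields",
      "fields": ["first_name", "last_name"],
      "operator": "and"
    }
  }
}
~~~

This generates one query per field and applies the parameters to each field separately. "Will Smith" is split into terms and combined with `and`, so the query looks for documents in which `first_name` contains both `will` and `smith`, or `last_name` contains both:

~~~
(+first_name:will +first_name:smith)    # uppercase converted to lowercase
| (+last_name:will +last_name:smith)    # uppercase converted to lowercase
~~~

Changing `and` to `or`:

~~~
GET /_search
{
  "query": {
    "multi_match": {
      "query": "Will smith",
      "type": "best_fields",
      "fields": ["first_name", "last_name"],
      "operator": "or"
    }
  }
}
~~~

is converted into:

~~~
(first_name:will or first_name:smith)    # uppercase converted to lowercase
| (last_name:will or last_name:smith)    # uppercase converted to lowercase
~~~

With the `and` operator, by contrast, all terms must be present in a single field for a document to match.

> 4.2 most_fields

The `most_fields` type is most useful when querying multiple fields that contain the same text analyzed in different ways. For instance, the main field may contain synonyms, stemming and terms without diacritics. A second field may contain the original terms, and a third field might contain shingles. By combining the scores from all three fields, we can match as many documents as possible with the main field, and use the second and third fields to push the most similar results to the top of the list.

~~~
GET /_search
{
  "query": {
    "multi_match": {
      "query": "quick brown fox",
      "type": "most_fields",
      "fields": ["title", "title.original", "title.shingles"]
    }
  }
}
~~~

This is executed as:

~~~
GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "quick brown fox" }},
        { "match": { "title.original": "quick brown fox" }},
        { "match": { "title.shingles": "quick brown fox" }}
      ]
    }
  }
}
~~~

The scores from the `match` clauses are added together and then divided by the number of `match` clauses.

> 4.3 cross_fields

The `cross_fields` type is particularly useful for structured documents in which several fields should match. For example, when querying the `first_name` and `last_name` fields for "Will Smith", the best match is likely to have "Will" in one field and "Smith" in the other.

~~~
GET /_search
{
  "query": {
    "multi_match": {
      "query": "Will Smith",
      "type": "cross_fields",
      "fields": ["first_name", "last_name"],
      "operator": "and"
    }
  }
}
~~~

This is converted as shown below. Unlike `best_fields`, each term of "Will Smith" is looked up across both fields:

~~~
+(first_name:will last_name:will) +(first_name:smith last_name:smith)
~~~

### 1.5 bool

> The `bool` query is similar to the `bool` filter. The difference is that a `bool` filter only answers whether a document matches, whereas a `bool` query also computes the `_score` (relevance score) of every query clause.
> The following query finds documents whose `title` field contains "how to make millions" and whose `tag` field is not marked as spam. Documents that are tagged "starred" or dated on or after 2014-01-01 rank higher than otherwise similar documents:

~~~
GET _search
{
  "query": {
    "bool": {
      "must":     { "match": { "title": "how to make millions" }},
      "must_not": { "match": { "tag": "spam" }},
      "should": [
        { "match": { "tag": "starred" }},
        { "range": { "date": { "gte": "2014-01-01" }}}
      ]
    }
  }
}
~~~

* Find documents whose `about` field contains "basketball" and whose age is between 35 and 40

~~~
GET /megacorp/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "about": "basketball" }},
        { "range": { "age": { "gte": 35, "lte": 40 }}}
      ]
    }
  }
}
~~~

### 1.6 Common Terms query

> * Every term in a query has a performance cost. A search for "The brown fox" needs three term queries, one each for "the", "brown" and "fox", and all of them are executed against every document in the index. The query for "the" is likely to match many documents, so it has much less influence on relevance than the other two.
> * The older approach was to remove stop words such as "the". This causes real problems: for example, we can no longer distinguish "happy" from "not happy".
> * The common terms query divides the query terms into two groups: important terms (low-frequency terms that are strongly related to the relevance of a document) and less important terms (high-frequency terms such as stop words).
> 1. It first searches for the important terms. These appear in fewer documents (which is efficient) and have a greater effect on relevance.
> 2. It then runs a second query for the less important terms. Instead of scoring every matching document, it only scores the documents already matched in the first step. In this way the high-frequency terms can improve the relevance calculation without paying the cost of poor performance.
> 3. If a query consists only of high-frequency terms, a single query is executed as an AND (conjunction) query; in other words, all terms are required. Even though each individual term matches many documents, the combination of terms narrows the result set down to the most relevant. The single query can also be executed as an OR with a specific `minimum_should_match`; in that case a high enough value should be used.

* In this example, words with a document frequency greater than 0.1% (for example "this" and "is") are treated as common terms.

~~~
GET /_search
{
  "query": {
    "common": {
      "body": {
        "query": "this is bonsai cool",
        "cutoff_frequency": 0.001
      }
    }
  }
}
~~~

The number of terms that should match can be controlled with the `minimum_should_match` (`high_freq`, `low_freq`), `low_freq_operator` (default `or`) and `high_freq_operator` (default `or`) parameters. For low-frequency terms, set `low_freq_operator` to `and` to make all of them required:

~~~
GET /_search
{
  "query": {
    "common": {
      "body": {
        "query": "nelly the elephant as a cartoon",
        "cutoff_frequency": 0.001,
        "low_freq_operator": "and"
      }
    }
  }
}
~~~

which is roughly equivalent to:

~~~
GET /_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "body": "nelly" }},
        { "term": { "body": "elephant" }},
        { "term": { "body": "cartoon" }}
      ],
      "should": [
        { "term": { "body": "the" }},
        { "term": { "body": "as" }},
        { "term": { "body": "a" }}
      ]
    }
  }
}
~~~

### 1.7 Query String query

A query that uses a query parser to parse its content.

~~~
GET /_search
{
  "query": {
    "query_string": {
      "default_field": "content",
      "query": "this AND that OR thus"
    }
  }
}
~~~

### 1.8 Batch queries

1、Why batch queries are useful

Fetching documents one at a time means that 100 documents cost 100 network requests, which is a significant overhead. With a batch query, the same 100 documents need only one network request, so the number of round trips drops by a factor of 100.

2、mget syntax

（1）Fetching one document at a time

~~~
GET /test_index/test_type/1
GET /test_index/test_type/2
~~~

（2）Batch query with mget (request followed by its response)

~~~
GET /_mget
{
  "docs": [
    { "_index": "test_index", "_type": "test_type", "_id": 1 },
    { "_index": "test_index", "_type": "test_type", "_id": 2 }
  ]
}

{
  "docs": [
    {
      "_index": "test_index",
      "_type": "test_type",
      "_id": "1",
      "_version": 2,
      "found": true,
      "_source": {
        "test_field1": "test field1",
        "test_field2": "test field2"
      }
    },
    {
      "_index": "test_index",
      "_type": "test_type",
      "_id": "2",
      "_version": 1,
      "found": true,
      "_source": {
        "test_content": "my test"
      }
    }
  ]
}
~~~

（3）If the documents are in the same index but may be in different types

~~~
GET /test_index/_mget
{
  "docs": [
    { "_type": "test_type", "_id": 1 },
    { "_type": "test_type", "_id": 2 }
  ]
}
~~~

（4）If all documents are in the same type of the same index, the request is at its simplest

~~~
GET /test_index/test_type/_mget
{
  "ids": [1, 2]
}
~~~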
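
The two mget request shapes above (a `docs` array for mixed locations, an `ids` array for a single index and type) can be generated from application data. The following is a minimal sketch that only builds the request path and JSON body; it does not talk to a cluster, and the helper names `mget_by_ids` and `mget_by_docs` are hypothetical, not part of any Elasticsearch client.

```python
import json


def mget_by_ids(index, doc_type, ids):
    """Build the request for documents that share one index and one type.

    Hypothetical helper: returns (path, body) for GET /{index}/{type}/_mget.
    """
    return f"/{index}/{doc_type}/_mget", {"ids": list(ids)}


def mget_by_docs(docs):
    """Build the request for documents spread over several indices or types.

    Hypothetical helper: `docs` is an iterable of (index, type, id) tuples.
    """
    body = {
        "docs": [
            {"_index": index, "_type": doc_type, "_id": doc_id}
            for index, doc_type, doc_id in docs
        ]
    }
    return "/_mget", body


# One request instead of two single-document GETs.
path, body = mget_by_ids("test_index", "test_type", [1, 2])
print("GET", path)
print(json.dumps(body))

path, body = mget_by_docs([
    ("test_index", "test_type", 1),
    ("test_index", "test_type", 2),
])
print("GET", path)
print(json.dumps(body, indent=2))
```

However many ids are passed in, the result is a single request body, which is where the saving in network round trips comes from.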
3、Why mget matters

mget is important. As a rule, whenever a query needs several documents at once, use the batch API to keep the number of network round trips as low as possible. This can improve performance several times over, sometimes by a factor of ten or more.

### 1.9 prefix query

* Find documents whose `title` starts with C3. A `prefix` query does not compute relevance scores (every hit gets the constant score 1, as in the result below), and instead of looking up a single term it has to scan the terms of the inverted index for the prefix, so its performance is relatively poor.

~~~
GET my_index/_search
{
  "query": {
    "prefix": {
      "title": { "value": "C3" }
    }
  }
}
~~~

Result:

~~~
"hits": {
  "total": 2,
  "max_score": 1,
  "hits": [
    { "_index": "my_index", "_type": "my_type", "_id": "2", "_score": 1, "_source": { "title": "C3-K5-DFG65" } },
    { "_index": "my_index", "_type": "my_type", "_id": "1", "_score": 1, "_source": { "title": "C3-D0-KD345" } }
  ]
}
~~~

### 1.10 wildcard (pattern search)

* Find documents whose `title` starts with C and ends with 5

~~~
GET my_index/_search
{
  "query": {
    "wildcard": {
      "title": { "value": "C*5" }
    }
  }
}
~~~

Result:

~~~
"hits": {
  "total": 3,
  "max_score": 1,
  "hits": [
    { "_index": "my_index", "_type": "my_type", "_id": "2", "_score": 1, "_source": { "title": "C3-K5-DFG65" } },
    { "_index": "my_index", "_type": "my_type", "_id": "1", "_score": 1, "_source": { "title": "C3-D0-KD345" } },
    { "_index": "my_index", "_type": "my_type", "_id": "3", "_score": 1, "_source": { "title": "C4-I8-UI365" } }
  ]
~~~

### 1.11 Search suggestions

1、How ngram and index-time search suggestions work

What is an ngram? For the word "quick", the ngrams of the five possible lengths are:

~~~
ngram length=1: q u i c k
ngram length=2: qu ui ic ck
ngram length=3: qui uic ick
ngram length=4: quic uick
ngram length=5: quick
~~~

What is an edge ngram? For "quick", the ngrams are anchored at the first letter:

~~~
q
qu
qui
quic
quick
~~~

Edge ngrams split every word into further tokens at index time, and these tokens are then used to implement prefix-based search suggestions:

~~~
hello world
hello we

h
he
hel
hell
hello        doc1,doc2

w            doc1,doc2
wo
wor
worl
world
e            doc2

helloworld

min ngram = 1
max ngram = 3

h
he
hel

hello w

hello --> hello, doc1
w --> w, doc1
~~~

> doc1 contains both `hello` and `w`, and their positions match as well, so doc1 ("hello world") is returned.
> At search time there is no need to take a prefix and scan the whole inverted index. The prefix is simply looked up in the inverted index as a term, and if it is there, the document matches. This is an ordinary `match` full-text search.

2、Trying out ngram

1. Define a custom analyzer

~~~
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,     # minimum token length
          "max_gram": 20     # maximum token length
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",      # standard tokenizer
          "filter": [
            "lowercase",                # convert to lowercase
            "autocomplete_filter"       # edge ngrams for search suggestions
          ]
        }
      }
    }
  }
}
~~~

Test the custom analyzer:

~~~
GET /my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "quick brown"
}
~~~

Result: "quick brown" is split into suggestion tokens q, qu, qui, ...

~~~
  "tokens": [
    { "token": "q",     "start_offset": 0, "end_offset": 5,  "type": "<ALPHANUM>", "position": 0 },
    { "token": "qu",    "start_offset": 0, "end_offset": 5,  "type": "<ALPHANUM>", "position": 0 },
    { "token": "qui",   "start_offset": 0, "end_offset": 5,  "type": "<ALPHANUM>", "position": 0 },
    { "token": "quic",  "start_offset": 0, "end_offset": 5,  "type": "<ALPHANUM>", "position": 0 },
    { "token": "quick", "start_offset": 0, "end_offset": 5,  "type": "<ALPHANUM>", "position": 0 },
    { "token": "b",     "start_offset": 6, "end_offset": 11, "type": "<ALPHANUM>", "position": 1 },
    { "token": "br",    "start_offset": 6, "end_offset": 11, "type": "<ALPHANUM>", "position": 1 },
    { "token": "bro",   "start_offset": 6, "end_offset": 11, "type": "<ALPHANUM>", "position": 1 },
    { "token": "brow",  "start_offset": 6, "end_offset": 11, "type": "<ALPHANUM>", "position": 1 },
    { "token": "brown", "start_offset": 6, "end_offset": 11, "type": "<ALPHANUM>", "position": 1 }
  ]
}
~~~

2. Set the mapping

~~~
PUT /my_index/_mapping/my_type
{
  "properties": {
    "title": {
      "type": "string",
      "analyzer": "autocomplete",       # index time: use the custom analyzer
      "search_analyzer": "standard"     # search time: use normal analysis
    }
  }
}
~~~

3. Insert test data

~~~
PUT my_index/my_type/1
{ "title": "hello world" }

PUT my_index/my_type/2
{ "title": "hello win" }

PUT my_index/my_type/3
{ "title": "hello dog" }
~~~

4. Test

~~~
GET my_index/_search
{
  "query": {
    "match_phrase": { "title": "hello w" }
  }
}
~~~

5. This returns every document that could start with "hello w", which can then be suggested to the user. Note the effect of the gram limits: with `"min_gram": 1` (minimum token length) and `"max_gram": 4` (maximum token length), "hello" would be split into h, he, hel, hell.

~~~
hello world

h
he
hel
hell
hello

w
wo
wor
worl
world

hello w

h
he
hel
hell
hello

w

hello w --> hello --> w

GET /my_index/my_type/_search
{
  "query": {
    "match_phrase": { "title": "hello w" }
  }
}
~~~

With `match`, documents that contain only "hello" are also returned, because it is a full-text search; they just score lower. `match_phrase` is recommended here: it requires every term to be present and the positions to be exactly one apart, which is the behaviour we want.

### 1.12 Fuzzy (error-correcting) queries

Data:

~~~
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "text": "Surprise me!"}
{ "index": { "_id": 2 }}
{ "text": "That was surprising."}
{ "index": { "_id": 3 }}
{ "text": "I wasn't surprised."}
~~~

Query:

~~~
GET /my_index/my_type/_search
{
  "query": {
    "fuzzy": {
      "text": {
        "value": "surprize",
        "fuzziness": 2      # maximum number of edits allowed when correcting
      }
    }
  }
}
~~~

Result:

~~~
    "hits": [
      { "_index": "my_index", "_type": "my_type", "_id": "1", "_score": 0.22585157, "_source": { "text": "Surprise me!" } },
      { "_index": "my_index", "_type": "my_type", "_id": "3", "_score": 0.1898702, "_source": { "text": "I wasn't surprised." } }
    ]
  }
}
~~~

* Automatic correction

~~~
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "text": {                    # field
        "query": "SURPIZE ME",
        "fuzziness": "AUTO",
        "operator": "and"
      }
    }
  }
}
~~~

Result:

~~~
  "hits": {
    "total": 1,
    "max_score": 0.44248468,
    "hits": [
      { "_index": "my_index", "_type": "my_type", "_id": "1", "_score": 0.44248468, "_source": { "text": "Surprise me!" } }
    ]
  }
}
~~~

### 1.13 histogram: grouping by numeric interval

> `histogram` groups the values of a numeric field into buckets of a fixed width. `"interval": 2000` means a bucket width of 2000.

1. Group by price in steps of 2000

~~~
GET /tvs/_search
{
  "size": 0,
  "aggs": {
    "group_by_price": {
      "histogram": {
        "field": "price",
        "interval": 2000
      }
    }
  }
}
~~~

Result:

~~~
      "buckets": [
        { "key": 0,    "doc_count": 3 },    # 3 documents in 0-2000
        { "key": 2000, "doc_count": 4 },    # 4 documents in 2000-4000
        { "key": 4000, "doc_count": 0 },
        { "key": 6000, "doc_count": 0 },
        { "key": 8000, "doc_count": 1 }
      ]
    }
  }
}
~~~

### 1.14 date_histogram: aggregating by time interval

1. Total sales per month

> * `"min_doc_count": 0`: a date interval such as 2017-01-01 to 2017-01-31 is returned even if it contains no documents at all; by default such empty buckets are filtered out.
> * `"interval": "month"`: aggregate per month.
> * `"extended_bounds": { "min": "2016-01-01", "max": "2017-12-12" }`: sets the time boundaries.

~~~
GET tvs/_search
{
  "size": 0,
  "aggs": {
    "sales": {
      "date_histogram": {
        "field": "sold_date",
        "interval": "month",
        "format": "yyyy-MM-dd",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "2016-01-01",
          "max": "2017-12-12"
        }
      },
      "aggs": {
        "sum_price_month": {
          "sum": { "field": "price" }
        }
      }
    }
  }
}
~~~

### 1.15 Scoped versus global aggregation

1. Compare the average selling price of Changhong (長虹) TVs with the average selling price of all TVs

~~~
GET tvs/_search
{
  "size": 0,
  "query": {
    "term": {
      "brand": { "value": "長虹" }
    }
  },
  "aggs": {                        # aggregates over the query results
    "single_avg_price": {
      "avg": { "field": "price" }
    },
    "globle": {                    # aggregation name
      "global": {},                # builds a bucket over all documents
      "aggs": {
        "globle_avg_price": {
          "avg": { "field": "price" }
        }
      }
    }
  }
}
~~~

Result:

~~~
  },
  "aggregations": {
    "globle": {
      "doc_count": 8,
      "globle_avg_price": { "value": 2650 }
    },
    "single_avg_price": { "value": 1666.6666666666667 }
  }
}
~~~

### 1.16 Distinct counts in aggregations

1. For each quarter, count how many distinct brands had sales

~~~
GET /tvs/_search
{
  "size": 0,
  "aggs": {
    "groupby_mounth": {
      "date_histogram": {
        "field": "sold_date",
        "format": "yyyy-MM-dd",
        "interval": "quarter",
        "extended_bounds": {
          "min": "2016-01-01",
          "max": "2017-08-08"
        }
      },
      "aggs": {
        "distinct_brand": {
          "cardinality": {
            "field": "brand"       # de-duplicate brands inside each bucket
          }
        }
      }
    }
  }
}
~~~

Result:

~~~
    "hits": []
  },
  "aggregations": {
    "groupby_mounth": {
      "buckets": [
        { "key_as_string": "2016-01-01", "key": 1451606400000, "doc_count": 0, "distinct_brand": { "value": 0 } },
        { "key_as_string": "2016-04-01", "key": 1459468800000, "doc_count": 1, "distinct_brand": { "value": 1 } },
        { "key_as_string": "2016-07-01", "key": 1467331200000, "doc_count": 2, "distinct_brand": { "value": 1 } },
        { "key_as_string": "2016-10-01", "key": 1475280000000, "doc_count": 3, "distinct_brand": { "value": 1 } },
        { "key_as_string": "2017-01-01", "key": 1483228800000, "doc_count": 2, "distinct_brand": { "value": 2 }
~~~

2. Controlling the accuracy of the distinct count

> 1. `cardinality` is the equivalent of `count(distinct)`. It has an error rate of about 5% and typically runs in around 100 ms.
> 2. `"precision_threshold": 100` means that if `brand` has no more than 100 unique values (Xiaomi, Changhong, Samsung, TCL, HTL, ...), the distinct count is almost guaranteed to be 100% accurate.
> 3. To keep the count accurate, increase `precision_threshold` as needed.
> 4. A small drawback:
> the cardinality algorithm uses about `precision_threshold * 8` bytes of memory. 100 * 8 = 800 bytes is very little. If the number of unique values really is within the threshold, the result is essentially 100% accurate; with a threshold of 100 and millions of unique values, the error rate stays within 5%.

~~~
GET /tvs/_search
{
  "size": 0,
  "aggs": {
    "groupby_mounth": {
      "date_histogram": {
        "field": "sold_date",
        "format": "yyyy-MM-dd",
        "interval": "quarter",
        "extended_bounds": {
          "min": "2016-01-01",
          "max": "2017-08-08"
        }
      },
      "aggs": {
        "distinct_brand": {
          "cardinality": {
            "field": "brand",
            "precision_threshold": 100    # accurate as long as there are at most 100 distinct values
          }
        }
      }
    }
  }
}
~~~

## 2. Filters

### 2.1 bool filter

> * Use filters for exact values. The `bool` filter combines the results of several filter conditions with boolean logic and has the following operators:
> 1. must: all of the conditions must match, equivalent to AND
> 2. must_not: none of the conditions may match, equivalent to NOT
> 3. should: at least one of the conditions must match, equivalent to OR

~~~
{
  "bool": {
    "must":     { "term": { "folder": "inbox" }},
    "must_not": { "term": { "tag": "spam" }},
    "should": [
      { "term": { "starred": true }},
      { "term": { "unread": true }}
    ]
  }
}
~~~

Controlling the number of terms that must match (common terms query):

~~~
GET /_search
{
  "query": {
    "common": {
      "body": {
        "query": "nelly the elephant not as a cartoon",
        "cutoff_frequency": 0.001,
        "minimum_should_match": {
          "low_freq": 2,
          "high_freq": 3
        }
      }
    }
  }
}
~~~

### 2.2 term filter

Mainly used to match exact values such as numbers, dates, booleans, or `not_analyzed` strings (text that has not been analyzed).

### 2.3 terms filter

> * Similar to `term`, but several values can be given; a document matches if the field contains any one of them. For example, to find products priced at 20 or 30:

~~~
GET /my_store/products/_search
{
  "query": {
    "bool": {
      "filter": {
        "terms": { "price": [20, 30] }
      }
    }
  }
}
~~~

### 2.4 range filter

> Filters data within a range
> 1. gt: greater than
> 2. gte: greater than or equal to
> 3. lt: less than
> 4. lte: less than or equal to

> * Find products priced from 10 to 20 inclusive

~~~
GET /my_store/_search
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "price": { "gte": 10, "lte": 20 }
        }
      }
    }
  }
}
~~~

> Find documents from the last hour

~~~
"range": {
  "timestamp": { "gt": "now-1h" }
}
~~~

### 2.5 exists filter

Returns documents that contain a given field.

> * Find documents whose `tags` field has a value

~~~
GET /my_index/_search
{
  "query": {
    "bool": {
      "filter": {
        "exists": { "field": "tags" }
      }
    }
  }
}
~~~

### 2.6 missing filter

The opposite of `exists`: returns documents that have no value for the given field. Note that the `missing` query was removed in Elasticsearch 5.0; on later versions the same result is obtained with an `exists` query inside `must_not`.

~~~
GET /my_index/_search
{
  "query": {
    "bool": {
      "filter": {
        "missing": { "field": "tags" }
      }
    }
  }
}
~~~

## 3. Compound queries

> Apart from pure full-text search, a query clause usually needs the help of filter clauses. A query can contain filter clauses and vice versa. The search API only accepts a `query` element, so a `bool` query is used to hold both the query and the filter clauses.

Find people whose last name is Smith. The first request restricts age to at most 25 with `range`; the second uses `term` to require an age of exactly 25:

~~~
GET /_search
{
  "query": {
    "bool": {
      "must":   [ { "match": { "last_name": "Smith" }} ],
      "filter": [ { "range": { "age": { "lte": 25 }}} ]
    }
  }
}

# using term
GET /_search
{
  "query": {
    "bool": {
      "must":   [ { "match": { "last_name": "Smith" }} ],
      "filter": [ { "term": { "age": 25 }} ]
    }
  }
}
~~~

## 4. Index management

#### 4.1 Creating a custom index

~~~
PUT /my_index
{
  "settings": { ... any settings ... },
  "mappings": {
    "type_one": { ... any mappings ... },
    "type_two": { ... any mappings ... },
    ...
  }
}
~~~

#### 4.2 Deleting indices

~~~
DELETE /index_one,index_two
DELETE /index_*
~~~

Create an index with a single shard and no replicas:

~~~
PUT /my_temp_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}
~~~

Change the number of replicas dynamically:

~~~
PUT /my_temp_index/_settings
{
  "number_of_replicas": 1
}
~~~

#### 4.3 Updating

> * If the field exists it is updated; otherwise it is created

~~~
POST my_store/products/1/_update
{
  "doc": {
    "bookname": "elasticsearch"
  }
}
~~~

## 5. Inserting documents

### 5.1 _bulk

Each operation needs two JSON strings, with the following syntax:

~~~
{"action": {"metadata"}}
{"data"}
~~~

> Which operations can be executed?
> （1）delete: deletes a document; only one JSON line is needed
> （2）create: equivalent to `PUT /index/type/id/_create`, forces creation
> （3）index: an ordinary put; it either creates the document or replaces it in full
> （4）update: performs a partial update

Insert data:

~~~
POST /my_store/products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10, "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20, "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30, "productID" : "QQPX-R-3956-#aD8" }
~~~

Query the data:

~~~
GET /my_store/products/_search
{
  "query": {
    "bool": {
      "filter": { "term": { "price": "20" }}
    }
  }
}
~~~

### 5.2 Bulk-importing data into elasticsearch

* The data file uses the same format as in 5.1

Import command, run from the directory that contains the data file:

~~~
curl -H 'Content-Type: application/x-ndjson' -XPUT 'http://192.168.2.88:9200/bank/account/_bulk?pretty' --data-binary @accounts.json
~~~

## 6. mapping

> * A mapping can be thought of as the model (schema) of a document: it defines the properties of every field
> * A field can be given an `index` parameter that specifies how a string is indexed
> * `index` must not be confused with `_index`: `_index` is the index a document belongs to, whereas `index` describes a field and states how that field is indexed

The `index` parameter takes one of the following values:

1. analyzed: index the field as full text; it is analyzed, tokenized and added to the inverted index (full-text indexing)
2. not_analyzed: index the field so that it is searchable, but store the value exactly as given, without analysis (exact-value matching)
3. no: do not index the field; it cannot be searched

## 7. Aggregations

### Query aggregation (nested grouping)

1. Group by country
2. Within each country, group by join_date
3. Compute the average age within the innermost groups: country (join_date (avg))

* Metadata

~~~
PUT /company
{
  "mappings": {
    "employee": {
      "properties": {
        "age": { "type": "long" },
        "country": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          },
          "fielddata": true    # enable fielddata so the analyzed text field can be used for grouping; not needed when aggregating on country.keyword
        },
        "join_date": {
          "type": "date"       # date fields are never analyzed, nothing extra to specify
        },
        "name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",    # "type": "keyword" means not analyzed
              "ignore_above": 256
            }
          }
        },
        "position": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        },
        "salary": { "type": "long" }
      }
    }
  }
}
~~~

* Query

~~~
GET company/_search
{
  "size": 0,
  "aggs": {
    "group_by_country": {            # 1. first grouping, by country
      "terms": { "field": "country" },
      "aggs": {
        "group_by_date": {           # 2. second grouping, by join_date
          "terms": { "field": "join_date" },
          "aggs": {
            "avg_age": {
              "avg": { "field": "age" }
            }
          }
        }
      }
    }
  }
}
~~~

* Result

~~~
"group_by_country": {
  "doc_count_error_upper_bound": 0,
  "sum_other_doc_count": 0,
  "buckets": [                       # country groups
    {
      "key": "china",
      "doc_count": 3,
      "group_by_date": {             # group_by_date groups; there are two, 2016-01-01 and 2017-01-01
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": 1483228800000,
            "key_as_string": "2017-01-01T00:00:00.000Z",
            "doc_count": 2,
            "avg_age": { "value": 31 }
          },
          {
            "key": 1451606400000,
            "key_as_string": "2016-01-01T00:00:00.000Z",
            "doc_count": 1,
            "avg_age": { "value": 32 }
          }
        ]
~~~

> Group by the `state` field and count the members of each group, much like GROUP BY in MySQL. `terms` groups by the given field and reports the number of documents per group:
> `SELECT state, COUNT(*) FROM bank GROUP BY state ORDER BY COUNT(*) DESC`

~~~
GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": { "field": "state.keyword" }
    }
  }
}
~~~

> `"size": 0` hides the `hits` part of the response so that only the aggregation result is shown; `terms` is the aggregation that buckets documents by field value.

* Building on the grouping above, compute the average balance of each group

~~~
GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": { "field": "state.keyword" },
      "aggs": {
        "average_balance": {
          "avg": { "field": "balance" }
        }
      }
    }
  }
}
~~~

This returns buckets within buckets (the average per group):

~~~
"aggregations": {
  "group_by_state": {                # name of the first-level aggregation
    "doc_count_error_upper_bound": 20,
    "sum_other_doc_count": 770,
    "buckets": [
      {
        "key": "ID",
        "doc_count": 27,
        "average_balance": { "value": 24368.777777777777 }
      },
      {
        "key": "TX",
        "doc_count": 27,
        "average_balance": {         # average value
          "value": 27462.925925925927
        }
      },
      {
~~~

* Sorting an aggregation, aggs{group, avg}
* Sort by the average computed in `average_balance`

~~~
GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "order": { "average_balance": "desc" }
      },
      "aggs": {
        "average_balance": {
          "avg": { "field": "balance" }
        }
      }
    }
  }
}
~~~

* Group by price range

~~~
GET /ecommerce/product/_search
{
  "size": 0,
  "aggs": {
    "group_by_price": {
      "range": {
        "field": "price",
        "ranges": [
          { "from": 0,  "to": 20 },
          { "from": 20, "to": 40 },
          { "from": 40, "to": 50 }
        ]
      },
      "aggs": {
        "group_by_tags": {
          "terms": { "field": "tags" },
          "aggs": {
            "average_price": {
              "avg": { "field": "price" }
            }
          }
        }
      }
    }
  }
}
~~~

## 8. geo query: searching by location

> The following geo queries are available
> 1. geo_shape query
> Finds documents with geo-shapes that intersect, are contained by, or do not intersect with the specified geo-shape.
> 2. geo_bounding_box query
> Finds documents with geo-points that fall into the specified rectangle.
> 3. geo_distance query
> Finds documents with geo-points within the specified distance of a central point.
> 4. geo_distance_range query
> Like the geo_distance query, but the range starts at a specified distance from the central point.
> 5. geo_polygon query
> Finds documents with geo-points within the specified polygon.

### 8.1 Geo-points

A geo-point is a single point on the surface of the earth described by latitude and longitude. Geo-points can be used to calculate the distance between two coordinates, to check whether a coordinate lies within an area, and in aggregations. Geo-points cannot be detected automatically by dynamic mapping; the field type must be declared explicitly as `geo_point`:

~~~
PUT /attractions
{
  "mappings": {                  # mapping
    "restaurant": {              # type
      "properties": {
        "name": {                # field
          "type": "string"
        },
        "location": {
          "type": "geo_point"
        }
      }
    }
  }
}
~~~

## 9. _termvectors: term statistics

Terms can be filtered. Commonly used filter parameters are:

* max_num_terms: maximum number of terms
* min_term_freq: minimum term frequency, for example to ignore terms that occur in the field fewer than a given number of times
* max_term_freq: maximum term frequency
* min_doc_freq: minimum document frequency, for example to ignore terms that occur in fewer than a given number of documents
* max_doc_freq: maximum document frequency
* min_word_length: minimum word length
* max_word_length: maximum word length

1. Term frequency statistics for the `content` field

~~~
GET news/new/1/_termvectors
{
  "fields": ["content"]
}
~~~

Result:

~~~
"terms": {
  "30": {
    "term_freq": 1,
    "tokens": [
    ...
  "與": {
    "term_freq": 1,          # number of times the term occurs
    "tokens": [
      { "position": 1, "start_offset": 2, "end_offset": 3 }
    ]
  ...
~~~

2. Filtering the terms

~~~
GET /news/new/9/_termvectors
{
  "fields": ["content"],
  "filter": {
    "min_word_length": 2,    # word length greater than 1, so single-character terms are dropped
    "min_term_freq": 2       # the term must occur at least twice
  }
}
~~~

3. Java API

~~~
// Assumed imports: java.io.IOException, java.util.*,
// org.apache.lucene.index.Fields, Terms, TermsEnum,
// org.elasticsearch.action.termvectors.TermVectorsRequest, TermVectorsResponse
public List<Map<String, Object>> termVectors(String index, String type, String id, String field) throws IOException {
    TermVectorsRequest.FilterSettings filterSettings = new TermVectorsRequest.FilterSettings();
    filterSettings.minWordLength = 2;
    filterSettings.maxNumTerms = 10000; // maximum number of terms returned

    TermVectorsResponse resp = client.prepareTermVectors(index, type, id)
            .setFilterSettings(filterSettings)
            .setSelectedFields(field)
            .execute().actionGet();

    // fetch the fields of the response
    Fields fields = resp.getFields();
    Iterator<String> iterator = fields.iterator();
    List<Map<String, Object>> result = new ArrayList<Map<String, Object>>();
    Map<String, Object> temp = null;
    while (iterator.hasNext()) {
        String dfield = iterator.next();
        Terms terms = fields.terms(dfield);      // terms of this field
        TermsEnum termsEnum = terms.iterator();  // termsEnum carries the term statistics
        while (termsEnum.next() != null) {
            String word = termsEnum.term().utf8ToString();
            int freq = termsEnum.postings(null, 120).freq(); // 120 is the value of PostingsEnum.ALL
            temp = new HashMap<String, Object>();
            temp.put("word", word);
            temp.put("freq", freq);
            result.add(temp);
        }
    }
    return result;
}
~~~

## 10. Highlighting

~~~
GET /ecommerce/product/_search
{
  "query": {
    "match": { "producer": "producer" }
  },
  "highlight": {
    "fields": {
      "producer": {}
    }
  }
}
~~~

Result:

~~~
  "_index": "ecommerce",
  "_type": "product",
  "_id": "1",
  "_score": 0.25811607,
  "_source": {
    "name": "gaolujie yagao",
    "desc": "gaoxiao meibai",
    "price": 30,
    "producer": "gaolujie producer",
    "tags": [ "meibai", "fangzhu" ]
  },
  "highlight": {
    "producer": [
      "gaolujie <em>producer</em>"      # highlight markup
    ]
  }
},
~~~

## 11. Inserting data

### 11.1 Specifying the document id manually

（1）Whether a manually specified document id is appropriate depends on the application:

> This approach is typically used when data is imported into es from another system: the unique identifier the data already has in that system becomes the id of the es document. For example, suppose we are building search for an e-commerce site, or employee lookup for an OA system. The data first exists in the database of the website or of the internal IT system, so it already has a database primary key (auto-increment, UUID or a business number). When that data is imported into es, reusing the existing primary key is a good fit.
> If, on the other hand, es is the system's only primary data store, so that data has no id when it is produced and goes straight into es, then specifying the document id manually is probably not a good fit, because you do not know what the id should be. In that case let es generate the id automatically, as described below.

（2）`put /index/type/id`: PUT with an explicit id

~~~
PUT /test_index/test_type/2
{
  "test_content": "my test"
}
~~~

### 11.2 Generating the document id automatically

（1）`post /index/type` (request followed by its response)

~~~
POST /test_index/test_type
{
  "test_content": "my test"
}

{
  "_index": "test_index",
  "_type": "test_type",
  "_id": "AVp4RN0bhjxldOOnBxaE",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "created": true
}
~~~

（2）An automatically generated id is 20 characters long, URL-safe, base64-encoded and a GUID, designed so that ids generated in parallel across a distributed system do not collide.

## 12. Modifying documents

### 12.1 Full replacement (put)

> 1. Full replacement
> If the document does not exist it is created with version 1. If it exists, the newly PUT data replaces the old document and the version is incremented by 1. The old document is not physically deleted right away, but it can no longer be accessed: it is only marked as deleted, and es removes it in the background as the data grows.
> 2. Forcing creation of a document
> Creating a document and replacing it in full use the same syntax. Sometimes we only want to create a new document and never replace an existing one. How do we force a create?
> `PUT /index/type/id?op_type=create` or `PUT /index/type/id/_create`

## 13. partial update

1、What is a partial update?

> `PUT /index/type/id` creates a document and replaces a document with the same syntax.
> In an application, the flow usually looks like this:
> （1）The application issues a get request, fetches the document and shows it in the front end for the user to view and edit
> （2）The user edits the data in the front end and sends it to the back end
> （3）The back-end code applies the user's changes in memory and assembles the complete, modified document
> （4）It then sends a PUT request to es for a full replacement
> （5）es marks the old document as deleted and creates a new one

partial update

~~~
post /index/type/id/_update
{
  "doc": {
    "only the few fields to change; the full document is not needed"
  }
}
~~~

This looks more convenient: each request only carries the few fields that changed, and the full document does not have to be sent. It is a convenient operation on the surface; how it works internally and what its advantages are is a separate question.

2、Hands-on: partial update

~~~
PUT /test_index/test_type/10
{
  "test_field1": "test1",
  "test_field2": "test2"
}

POST /test_index/test_type/10/_update
{
  "doc": {
    "test_field2": "updated test2"
  }
}
~~~

## 14. Controlling query precision

> Full-text search:
> 1. There are two ways to do full-text search: the `match` query and `should` clauses
> 2. Search precision is controlled with the `and` operator and with `minimum_should_match` (the minimum number of clauses that must match)

Prepare the data:

~~~
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"title" : "this is java and elasticsearch blog"} }
{ "update": { "_id": "2"} }
{ "doc" : {"title" : "this is java blog"} }
{ "update": { "_id": "3"} }
{ "doc" : {"title" : "this is elasticsearch blog"} }
{ "update": { "_id": "4"} }
{ "doc" : {"title" : "this is java, elasticsearch, hadoop blog"} }
{ "update": { "_id": "5"} }
{ "doc" : {"title" : "this is spark blog"} }
~~~

### 1. match query

1. Searching for documents that contain java or elasticsearch returns four documents

~~~
"title": "this is java, elasticsearch, hadoop blog"    # elasticsearch java
"title": "this is java and elasticsearch blog"         # elasticsearch java
"title": "this is elasticsearch blog"                  # elasticsearch
"title": "this is java blog"                           # java
~~~

2. To be more precise, use the `and` operator. There is also an `or` operator, but writing it out is pointless because `match` already behaves as `or` by default, which gives the same result as step 1. Find documents that contain both java and elasticsearch:

~~~
GET /forum/_search
{
  "query": {
    "match": {
      "title": {
        "query": "java elasticsearch",
        "operator": "and"
      }
    }
  }
}
~~~

This returns two documents; two of the previous results have been filtered out:

~~~
"title": "this is java, elasticsearch, hadoop blog"
"title": "this is java and elasticsearch blog"
~~~

3. Minimum match: with `"minimum_should_match": "75%"`, at least 3 (75%) of the four keywords java, elasticsearch, spark and hadoop must occur

~~~
GET forum/_search
{
  "query": {
    "match": {
      "title": {
        "query": "java elasticsearch spark hadoop",
        "minimum_should_match": "75%"    # at least 75% of the terms above must match
      }
    }
  }
}
~~~

Result:

~~~
"title": "this is java, elasticsearch, hadoop blog"
~~~

* At least 2 (50%) of the four keywords java, elasticsearch, spark and hadoop must occur

~~~
GET forum/_search
{
  "query": {
    "match": {
      "title": {
        "query": "java elasticsearch spark hadoop",
        "minimum_should_match": "50%"
      }
    }
  }
}
~~~

The condition is looser, so one more document is returned:

~~~
"title": "this is java, elasticsearch, hadoop blog"
"title": "this is java and elasticsearch blog"
~~~

### 2. should

~~~
GET /forum/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "java" }},
        { "match": { "title": "elasticsearch" }},
        { "match": { "title": "spark" }},
        { "match": { "title": "hadoop" }}
      ],
      "minimum_should_match": 3
    }
  }
}
~~~

This has the same effect as step 3 above (`"minimum_should_match": "75%"`).

> By default, `should` clauses are optional when the `bool` query also has a `must` clause: a document can be returned without matching any of them. In the search above, by contrast, `"minimum_should_match": 3` excludes "this is java blog", which matches only one of the `should` clauses.
> There is one exception: if there is no `must` clause, at least one of the `should` clauses has to match.
> For example, in a search with four `should` conditions and no `must` clause, a document is returned by default as soon as it satisfies any one of the conditions.
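
To see how the required number of matching clauses changes the result set, the following minimal sketch replays the five sample titles in plain Python. It is a simplified model, not Elasticsearch itself: it lowercases and splits on non-word characters instead of running a real analyzer, it ignores scoring, and it assumes percentages are rounded down. The function names are hypothetical.

```python
import re

# The five sample titles from the bulk request above, keyed by document id.
TITLES = {
    1: "this is java and elasticsearch blog",
    2: "this is java blog",
    3: "this is elasticsearch blog",
    4: "this is java, elasticsearch, hadoop blog",
    5: "this is spark blog",
}
TERMS = ["java", "elasticsearch", "spark", "hadoop"]


def required_matches(minimum_should_match, clause_count):
    """Turn an integer or a percentage string such as "75%" into a clause count.

    Assumption of this sketch: percentages are rounded down; negative values
    and combined expressions are not handled.
    """
    if isinstance(minimum_should_match, str) and minimum_should_match.endswith("%"):
        percent = int(minimum_should_match[:-1])
        return clause_count * percent // 100
    return int(minimum_should_match)


def matching_ids(minimum_should_match):
    """Return the ids of titles containing at least the required number of TERMS."""
    needed = required_matches(minimum_should_match, len(TERMS))
    result = []
    for doc_id, title in TITLES.items():
        tokens = set(re.findall(r"\w+", title.lower()))
        hits = sum(1 for term in TERMS if term in tokens)
        if hits >= needed:
            result.append(doc_id)
    return result


print(matching_ids(1))      # one matching should clause is enough
print(matching_ids("50%"))  # at least 2 of 4 terms
print(matching_ids("75%"))  # at least 3 of 4 terms
```

Raising the threshold from 1 to 2 to 3 shrinks the result from all five documents, to documents 1 and 4, to document 4 alone, which mirrors the `50%` and `75%` results shown earlier.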