Elasticsearch Basics 2 · TUNA-daily

[TOC]

## 1. How a match query is rewritten internally

> See "14. Controlling search precision" in the [Operations] notes.

~~~
{
  "match": { "title": "java elasticsearch" }
}
~~~

1. When a match query like the one above searches for several terms, Elasticsearch rewrites it internally into a bool query: each search term becomes a term query, and the term queries are combined under `should`.

~~~
{
  "bool": {
    "should": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch" }}
    ]
  }
}
~~~

2. How a match query with `"operator": "and"` is rewritten into term + must

~~~
{
  "match": {
    "title": {
      "query": "java elasticsearch",
      "operator": "and"
    }
  }
}
~~~

Internally this becomes:

~~~
{
  "bool": {
    "must": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch" }}
    ]
  }
}
~~~

3. How minimum_should_match is rewritten

~~~
{
  "match": {
    "title": {
      "query": "java elasticsearch hadoop spark",
      "minimum_should_match": "75%"
    }
  }
}
~~~

Internally this becomes (75% of 4 terms = 3):

~~~
{
  "bool": {
    "should": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch" }},
      { "term": { "title": "hadoop" }},
      { "term": { "title": "spark" }}
    ],
    "minimum_should_match": 3
  }
}
~~~

## 2. boost: controlling search weight

> Requirement:
> Search for posts whose title contains java. Posts whose title also contains hadoop or elasticsearch should rank higher. In addition, if one post contains java hadoop and another contains java elasticsearch, the post containing hadoop should rank above the one containing elasticsearch.

~~~
GET /forum/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"title": "java"}}
      ],
      "should": [
        {"match": {"title": {"query": "elasticsearch", "boost": 2}}},
        {"match": {"title": {"query": "hadoop", "boost": 3}}}
      ]
    }
  }
}
~~~

## 3. dis_max: take the best field (highest relevance score) in a multi-field query

1. 
Find documents whose title or content field contains "java solution".

~~~
GET /forum/_search
{
  "query": {
    "bool": {
      "should": [
        {"match": {"title": "java solution"}},
        {"match": {"content": "java solution"}}
      ]
    }
  }
}
~~~

Result:

~~~
"hits": [
  {
    "_index": "forum",
    "_type": "article",
    "_id": "2",
    "_score": 0.8849759,
    "_source": {
      "title": "this is java blog",
      "content": "i think java is the best programming language"
    },
    "highlight": {
      "title": [ "this is java blog" ],
      "content": [ "i think java is the best programming language" ]
    }
  },
  {
    "_index": "forum",
    "_type": "article",
    "_id": "4",
    "_score": 0.7120095,
    "_source": {
      "title": "this is java, elasticsearch, hadoop blog",
      "content": "elasticsearch and hadoop are all very good solution, i am a beginner"
    },
    "highlight": {
      "title": [ "this is java, elasticsearch, hadoop blog" ],
      "content": [ "elasticsearch and hadoop are all very good solution, i am a beginner" ]
    }
  },
  {
    "_index": "forum",
    "_type": "article",
    "_id": "5",
    "_score": 0.56008905,
    "_source": {
      "title": "this is spark blog",
      "content": "spark is best big data solution based on scala ,an programming language similar to java"
    },
    "highlight": {
      "content": [ "spark is best big data solution based on scala ,an programming language similar to java" ]
    }
  },
  {
    "_index": "forum",
    "_type": "article",
    "_id": "1",
    "_score": 0.26742277,
    "_source": {
      "title": "this is java and elasticsearch blog",
      "content": "i like to write best elasticsearch article"
    },
    "highlight": {
      "title": [ "this is java and elasticsearch blog" ]
~~~

* The content field of the document with id=5 clearly contains both java and solution, yet its relevance score is not the highest. That is not the result we want. The score is computed roughly as follows:

~~~
Relevance score of each document:
add up the scores of the queries that matched, multiply by the number of matched queries, divide by the total number of queries

Score of doc4:
{ "match": { "title": "java solution" }} gives doc4 a score
{ "match": { "content": "java solution" }} also gives doc4 a score
so the two scores are added, for example 1.1 + 1.2 = 2.3
matched queries = 2
total queries = 2
2.3 * 2 / 2 = 2.3

Score of doc5 (only one query scores):
{ "match": { "title": "java solution" }} gives doc5 no score
{ "match": { "content": "java solution" }} gives doc5 a score
so only one query contributes, for example 2.3
matched queries = 1
total queries = 2
2.3 * 1 / 2 = 1.15

doc5 score = 1.15 < doc4 score = 2.3
~~~

2. Enter the dis_max query

* It takes the highest relevance score among the queries instead of averaging them. This is the best fields strategy: a document in which one field matches as many of the keywords as possible should be ranked first, rather than a document in which many fields each match only a few keywords.

~~~
GET forum/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {"match": {"title": "java solution"}},
        {"match": {"content": "java solution"}}
      ]
    }
  }
}
~~~

* Now the document with id=5 ranks ahead of id=4:

~~~
"hits": {
  "total": 4,
  "max_score": 0.68640786,
  "hits": [
    {
      "_index": "forum",
      "_type": "article",
      "_id": "2",
      "_score": 0.68640786,
      "_source": {
        "articleID": "KDKE-B-9947-#kL5",
        "userID": 1,
        "hidden": false,
        "postDate": "2017-01-02",
        "title": "this is java blog",
        "content": "i think java is the best programming language"
      }
    },
    {
      "_index": "forum",
      "_type": "article",
      "_id": "5",
      "_score": 0.56008905,
      "_source": {
        "articleID": "hjPX-R-hhh-#aDn",
        "userID": 3,
        "hidden": true,
        "postDate": "2017-01-04",
        "title": "this is spark blog",
        "content": "spark is best big data solution based on scala ,an programming language similar to java"
      }
    },
    {
      "_index": "forum",
      "_type": "article",
      "_id": "4",
      "_score": 0.5565415,
      "_source": {
        "articleID": "QQPX-R-3956-#aD8",
        "userID": 2,
        "hidden": true,
        "postDate": "2017-01-02",
        "title": "this is java, elasticsearch, hadoop blog",
        "content": "elasticsearch and hadoop are all very good solution, i am a beginner"
      }
    },
    {
      "_index": "forum",
      "_type": "article",
      "_id": "1",
      "_score": 0.26742277,
      "_source": {
        "articleID": "XHDK-A-1293-#fJ3",
        "userID": 1,
        "hidden": false,
        "postDate": "2017-01-01",
        "title": "this is java and elasticsearch blog",
        "content": "i like to write best elasticsearch article"
      }
    }
  ]
}
}
~~~

3. dis_max considers only the highest-scoring query, so it has a drawback: the other queries are ignored entirely. Adding tie_breaker improves on this.

~~~
GET forum/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {"match": {"title": "java solution"}},
        {"match": {"content": "java solution"}}
      ],
      "tie_breaker": 0.3
    }
  }
}
~~~

tie_breaker (a value between 0 and 1) is multiplied by every score other than the highest one, and the products are added to the highest score to give the final score. This way the results of the other queries are taken into account as well.

4. 
Implementing dis_max with multi_match

~~~
GET forum/_search
{
  "query": {
    "multi_match": {
      "query": "java solution",
      "fields": ["title^2", "content"],
      "type": "best_fields",
      "minimum_should_match": "50%"
    }
  }
}
~~~

## 4. Field strategies

best_fields strategy (the default): documents in which a single field matches as many of the keywords as possible are returned first.

most_fields strategy: documents in which more fields match a keyword are returned first.

~~~
GET forum/_search
{
  "query": {
    "multi_match": {
      "query": "java solution",
      "fields": ["title^2", "content"],
      "type": "best_fields",
      "minimum_should_match": "50%"
    }
  }
}
~~~

~~~
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"sub_title" : "learning more courses"} }
{ "update": { "_id": "2"} }
{ "doc" : {"sub_title" : "learned a lot of course"} }
{ "update": { "_id": "3"} }
{ "doc" : {"sub_title" : "we have a lot of fun"} }
{ "update": { "_id": "4"} }
{ "doc" : {"sub_title" : "both of them are good"} }
{ "update": { "_id": "5"} }
{ "doc" : {"sub_title" : "haha, hello world"} }
~~~

### 4.1 match search

1. Search sub_title with match. sub_title uses the english analyzer, which reduces plurals, gerunds, and past-tense forms to their stems. The query string "learning courses" is analyzed with the same analyzer as the field it targets, so it is split into stemmed tokens too.

~~~
GET /forum/article/_search
{
  "query": {
    "match": {
      "sub_title": "learning courses"
    }
  }
}
~~~

* How the query string is analyzed:

~~~
GET _analyze
{
  "analyzer": "english",
  "text": "learning courses"
}
~~~

Result:

~~~
{
  "tokens": [
    {
      "token": "learn",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "cours",
      "start_offset": 9,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 1  # term position; used in proximity matching (match_phrase) to express the distance between two terms
    }
  ]
}
~~~

The search on sub_title returns the following. Because of the english stemming, "learning more courses" is ranked last!

~~~
"hits": [
  ...
  "sub_title": "learned a lot of course"
  ...
  "sub_title": "learning more courses"
      }
    }
  ]
}
}
~~~

Now we search the sub-field of sub_title that uses the standard analyzer:

~~~
GET /forum/article/_search
{
  "query": {
    "match": {
      "sub_title.std": "learning courses"
    }
  }
}
~~~

The result matches our expectation:

~~~
"hits": [
  {
    "_index": "forum",
    "_type": "article",
    "_id": "1",
    "_score": 0.5063205,
    "_source": {
      "articleID": "XHDK-A-1293-#fJ3",
      "userID": 1,
      "hidden": false,
      "postDate": "2017-01-01",
      "title": "this is java and elasticsearch blog",
      "content": "i like to write best elasticsearch article",
      "sub_title": "learning more courses"
    }
  }
]
}
}
~~~

2. A multi_match query over several fields brings the field strategies into play.

> 1. Default best_fields query

~~~
GET /forum/_search
{
  "query": {
    "multi_match": {
      "query": "learning courses",
      "fields": ["sub_title", "sub_title.std"]
    }
  }
}
~~~

In the result, "learned a lot of course" is ranked first:

~~~
"hits": {
  "total": 2,
  "max_score": 1.219939,
  "hits": [
    {
      "_index": "forum",
      "_type": "article",
      "_id": "2",
      "_score": 1.219939,
      "_source": {
        "title": "this is java blog",
        "content": "i think java is the best programming language",
        "sub_title": "learned a lot of course"
      }
    },
    {
      "_index": "forum",
      "_type": "article",
      "_id": "1",
      "_score": 0.5063205,
      "_source": {
        "title": "this is java and elasticsearch blog",
        "content": "i like to write best elasticsearch article",
        "sub_title": "learning more courses"
      }
    }
  ]
}
}
~~~

> 2. most_fields strategy (the same query with `"type": "most_fields"`)

"learned a lot of course" still comes first, but its score is essentially unchanged (1.219939), whereas the score of "learning more courses" rises from 0.5063205 to 1.012641. This shows that most_fields takes the queries on all fields into account.

~~~
"hits": {
  "total": 2,
  "max_score": 1.219939,
  "hits": [
    {
      "_index": "forum",
      "_type": "article",
      "_id": "2",
      "_score": 1.219939,
      "_source": {
        "articleID": "KDKE-B-9947-#kL5",
        "userID": 1,
        "hidden": false,
        "postDate": "2017-01-02",
        "title": "this is java blog",
        "content": "i think java is the best programming language",
        "sub_title": "learned a lot of course"
      }
    },
    {
      "_index": "forum",
      "_type": "article",
      "_id": "1",
      "_score": 1.012641,
      "_source": {
        "articleID": "XHDK-A-1293-#fJ3",
        "userID": 1,
        "hidden": false,
        "postDate": "2017-01-01",
        "title": "this is java and elasticsearch blog",
        "content": "i like to write best elasticsearch article",
        "sub_title": "learning more courses"
      }
    }
  ]
}
}
~~~

> Differences between best_fields and most_fields:
> (1) best_fields searches several fields and takes the score of the field that matches best; when several queries tie on the highest score, it also takes the scores of the other queries into account to some extent. Put simply, you search several fields and want the documents in which one field contains as many of the keywords as possible.
> Pros: with the best_fields strategy, plus some consideration of the other fields and support for minimum_should_match, the most precise matches can be pushed to the very top.
> Cons: apart from those precise matches, the remaining results of similar quality are not ranked evenly and are hard to tell apart.
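> Real-world example: search engines such as Baidu put the best match first, but the rest of the results are barely differentiated.
> (2) most_fields searches several fields together and lets the queries on as many fields as possible contribute to the total score. The result is a mixed bag, like the very first result in the best_fields example, and it is not necessarily precise: one document may have a field that contains more of the keywords, yet another document ranks first because more of its fields matched. That is why a field like sub_title.std is needed, so that one field matches the query string as exactly as possible, contributes a higher score, and pushes the more precise matches to the front.
> Pros: results that match as many fields as possible are pushed to the top, and the overall ranking is fairly even.
> Cons: the most precise matches may not make it to the top.
> Real-world example: a wiki clearly follows a most_fields strategy. The results are fairly even, but you do have to page through several pages to find the best match.

3. cross_fields suits searching for one entity whose information spans several fields, such as a person's name or a place name.

~~~
GET /forum/article/_search
{
  "query": {
    "multi_match": {
      "query": "Peter Smith",
      "type": "cross_fields",
      "operator": "and",
      "fields": ["author_first_name", "author_last_name"]
    }
  }
}
~~~

> Problem 1: most_fields only finds documents in which as many fields as possible match, not documents in which a field matches completely. --> Solution: require every term to appear in at least one of the fields.
> Peter, Smith
> Peter must appear in author_first_name or author_last_name.
> Smith must appear in author_first_name or author_last_name.
> Peter Smith may be spread across several fields, so every term must appear in some field; only together do the terms form the identifier we want, the full name.
> With most_fields, a result such as Smith Williams could also show up, because most_fields only requires any one field to match, and the more fields match, the higher the score.
> Problem 2: with most_fields, minimum_should_match cannot be used to drop the long tail, that is, results that match very little. --> Solution: since every term is required to appear, the long tail is necessarily removed.
> java hadoop spark --> all three terms must appear in some field.
> A document in which only one field contains a single java, for example, is dropped as part of the long tail.
> Problem 3: the TF/IDF algorithm. Take Peter Smith and Smith Williams: when searching for Peter Smith, Smith is rare in first_name, so its frequency across all documents in that field is very low and it gets a very high score, and Smith Williams may end up ranking above Peter Smith. --> When computing IDF, take the IDF of each term in each field and use the smallest value, so that an extreme high value cannot occur.
> Peter Smith
> Peter
> Smith
> In the author_first_name field, Smith occurs very rarely across all documents, which gives a very high IDF score. An IDF score is also computed from the frequency of Smith in the author_last_name field of all documents; because Smith is generally frequent in last_name, that IDF score is normal and not too high. For Smith, the smaller of the two IDF scores is then used, so the IDF score never becomes excessively high.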
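
The two cross_fields ideas above (every term must appear in at least one field, and the smallest per-field IDF is used) can be illustrated with a small Python sketch. This is only a toy model, not Elasticsearch's actual implementation: the sample documents, the helper names, and the whitespace tokenization are all hypothetical, and the IDF formula is the classic Lucene TF/IDF one, assumed here purely for illustration.

```python
import math

def tokens(doc, field):
    # Naive analyzer stand-in: lowercase and split on whitespace (assumption).
    return str(doc.get(field, "")).lower().split()

def matches_cross_fields_and(doc, terms, fields):
    """True if every term occurs in at least one of the given fields."""
    return all(
        any(term.lower() in tokens(doc, f) for f in fields)
        for term in terms
    )

def classic_idf(doc_freq, num_docs):
    # Classic Lucene TF/IDF formula, used here only as an illustration.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def smallest_field_idf(term, docs, fields):
    """Compute the term's IDF in each field and keep the smallest value."""
    idfs = []
    for f in fields:
        doc_freq = sum(term.lower() in tokens(d, f) for d in docs)
        idfs.append(classic_idf(doc_freq, len(docs)))
    return min(idfs)

# Hypothetical sample data; these documents are not from the forum index.
docs = [
    {"author_first_name": "Peter", "author_last_name": "Smith"},
    {"author_first_name": "Smith", "author_last_name": "Williams"},
    {"author_first_name": "Tom", "author_last_name": "Smith"},
    {"author_first_name": "Jack", "author_last_name": "Ma"},
]
fields = ["author_first_name", "author_last_name"]

# Only documents containing both Peter and Smith (in any field) survive.
matched = [d for d in docs if matches_cross_fields_and(d, ["Peter", "Smith"], fields)]

# Smith is rare as a first name (high IDF) but common as a last name (lower IDF);
# the smaller value is the one that is kept.
smith_idf = smallest_field_idf("Smith", docs, fields)
```

With this sample data, "Smith Williams" is filtered out because Peter appears in none of its fields, and the IDF kept for Smith is the one from author_last_name, where the term is more common.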