[TOC]

## 1. How a match query is rewritten under the hood

> See item 14, "Controlling search precision", in 【操作】 (Operations).

~~~
{
  "match": { "title": "java elasticsearch" }
}
~~~

1. When a match query like the one above searches for several terms, Elasticsearch automatically rewrites it into bool syntax: a bool should clause that lists the search terms, each as a term query.

~~~
{
  "bool": {
    "should": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch" }}
    ]
  }
}
~~~

2. How a match with operator and is rewritten to term + must

~~~
{
  "match": {
    "title": {
      "query": "java elasticsearch",
      "operator": "and"
    }
  }
}
~~~

Under the hood this becomes

~~~
{
  "bool": {
    "must": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch" }}
    ]
  }
}
~~~

3. How minimum_should_match is rewritten

~~~
{
  "match": {
    "title": {
      "query": "java elasticsearch hadoop spark",
      "minimum_should_match": "75%"
    }
  }
}
~~~

Under the hood this becomes

~~~
{
  "bool": {
    "should": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch" }},
      { "term": { "title": "hadoop" }},
      { "term": { "title": "spark" }}
    ],
    "minimum_should_match": 3
  }
}
~~~

## 2. boost: controlling search weight

> Requirement:
> Search for posts whose title contains java. Posts whose title also contains hadoop or elasticsearch should be returned first. In addition, given one post containing java hadoop and another containing java elasticsearch, the post containing hadoop should rank ahead of the one containing elasticsearch.

~~~
GET /forum/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "java" }}
      ],
      "should": [
        { "match": { "title": { "query": "elasticsearch", "boost": 2 }}},
        { "match": { "title": { "query": "hadoop", "boost": 3 }}}
      ]
    }
  }
}
~~~

## 3. dis_max: take the best score across multiple fields (highest relevance score)

1. Find documents whose title or content field contains java solution

~~~
GET /forum/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "java solution" }},
        { "match": { "content": "java solution" }}
      ]
    }
  }
}
~~~

Result

~~~
"hits": [
  {
    "_index": "forum",
    "_type": "article",
    "_id": "2",
    "_score": 0.8849759,
    "_source": {
      "title": "this is java blog",
      "content": "i think java is the best programming language"
    },
    "highlight": {
      "title": [ "this is <em>java</em> blog" ],
      "content": [ "i think <em>java</em> is the best programming language" ]
    }
  },
  {
    "_index": "forum",
    "_type": "article",
    "_id": "4",
    "_score": 0.7120095,
    "_source": {
      "title": "this is java, elasticsearch, hadoop blog",
      "content": "elasticsearch and hadoop are all very good solution, i am a beginner"
    },
    "highlight": {
      "title": [ "this is <em>java</em>, elasticsearch, hadoop blog" ],
      "content": [ "elasticsearch and hadoop are all very good <em>solution</em>, i am a beginner" ]
    }
  },
  {
    "_index": "forum",
    "_type": "article",
    "_id": "5",
    "_score": 0.56008905,
    "_source": {
      "title": "this is spark blog",
      "content": "spark is best big data solution based on scala ,an programming language similar to java"
    },
    "highlight": {
      "content": [ "spark is best big data <em>solution</em> based on scala ,an programming language similar to <em>java</em>" ]
    }
  },
  {
    "_index": "forum",
    "_type": "article",
    "_id": "1",
    "_score": 0.26742277,
    "_source": {
      "title": "this is java and elasticsearch blog",
      "content": "i like to write best elasticsearch article"
    },
    "highlight": {
      "title": [ "this is <em>java</em> and elasticsearch blog" ]
    }
  }
]
~~~

* The content field of the document with id=5 clearly contains both java and solution, yet its relevance score is not the highest. That is not the result we want.

The score is computed roughly as follows

~~~
Relevance score of each document: add up the scores of the queries,
multiply by the number of matched queries, divide by the total number of queries.

Score for doc4:
{ "match": { "title": "java solution" }}    has a score for doc4
{ "match": { "content": "java solution" }}  also has a score for doc4
So the two scores are added together, for example 1.1 + 1.2 = 2.3
matched queries = 2
total queries   = 2
2.3 * 2 / 2 = 2.3

Score for doc5, where only one query scores:
{ "match": { "title": "java solution" }}    has no score for doc5
{ "match": { "content": "java solution" }}  has a score for doc5
So only one query contributes a score, for example 2.3
matched queries = 1
total queries   = 2
2.3 * 1 / 2 = 1.15

doc5 score = 1.15 < doc4 score = 2.3
~~~

2. Enter the dis_max query

* It takes the highest relevance score among the queries instead of averaging them. This is the best fields strategy: a result in which a single field matches as many of the keywords as possible ranks first, rather than a result in which as many fields as possible each match only a few keywords.

~~~
GET forum/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "title": "java solution" }},
        { "match": { "content": "java solution" }}
      ]
    }
  }
}
~~~

* Now the document with id=5 moves up, ahead of id=4

~~~
"hits": {
  "total": 4,
  "max_score": 0.68640786,
  "hits": [
    {
      "_index": "forum",
      "_type": "article",
      "_id": "2",
      "_score": 0.68640786,
      "_source": {
        "articleID": "KDKE-B-9947-#kL5",
        "userID": 1,
        "hidden": false,
        "postDate": "2017-01-02",
        "title": "this is java blog",
        "content": "i think java is the best programming language"
      }
    },
    {
      "_index": "forum",
      "_type": "article",
      "_id": "5",
      "_score": 0.56008905,
      "_source": {
        "articleID": "hjPX-R-hhh-#aDn",
        "userID": 3,
        "hidden": true,
        "postDate": "2017-01-04",
        "title": "this is spark blog",
        "content": "spark is best big data solution based on scala ,an programming language similar to java"
      }
    },
    {
      "_index": "forum",
      "_type": "article",
      "_id": "4",
      "_score": 0.5565415,
      "_source": {
        "articleID": "QQPX-R-3956-#aD8",
        "userID": 2,
        "hidden": true,
        "postDate": "2017-01-02",
        "title": "this is java, elasticsearch, hadoop blog",
        "content": "elasticsearch and hadoop are all very good solution, i am a beginner"
      }
    },
    {
      "_index": "forum",
      "_type": "article",
      "_id": "1",
      "_score": 0.26742277,
      "_source": {
        "articleID": "XHDK-A-1293-#fJ3",
        "userID": 1,
        "hidden": false,
        "postDate": "2017-01-01",
        "title": "this is java and elasticsearch blog",
        "content": "i like to write best elasticsearch article"
      }
    }
  ]
}
~~~

3. dis_max only considers the highest-scoring query, which is a limitation. Adding tie_breaker improves on plain dis_max.

~~~
GET forum/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "title": "java solution" }},
        { "match": { "content": "java solution" }}
      ],
      "tie_breaker": 0.3
    }
  }
}
~~~

tie_breaker (a value between 0 and 1) is multiplied by every score other than the highest one, and the products are added to the highest score to give the final score. This way the results of the other queries are taken into account as well.

4. Implementing dis_max with multi_match

~~~
GET forum/_search
{
  "query": {
    "multi_match": {
      "query": "java solution",
      "fields": ["title^2", "content"],
      "type": "best_fields",
      "minimum_should_match": "50%"
    }
  }
}
~~~

## 4. Field strategies

The best_fields strategy (the default) returns first the docs in which a single field matches as many keywords as possible.

The most_fields strategy returns first the docs in which as many fields as possible match a keyword.

~~~
GET forum/_search
{
  "query": {
    "multi_match": {
      "query": "java solution",
      "fields": ["title^2", "content"],
      "type": "best_fields",
      "minimum_should_match": "50%"
    }
  }
}
~~~

~~~
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"sub_title" : "learning more courses"} }
{ "update": { "_id": "2"} }
{ "doc" : {"sub_title" : "learned a lot of course"} }
{ "update": { "_id": "3"} }
{ "doc" : {"sub_title" : "we have a lot of fun"} }
{ "update": { "_id": "4"} }
{ "doc" : {"sub_title" : "both of them are good"} }
{ "update": { "_id": "5"} }
{ "doc" : {"sub_title" : "haha, hello world"} }
~~~

### 4.1 match search

1. Search sub_title with match. sub_title uses the english analyzer, which reduces plurals, gerunds, and past-tense forms to their stems. The query string learning courses is analyzed with the same analyzer as the field it targets, so it is split into learn and cours.

~~~
GET /forum/article/_search
{
  "query": {
    "match": {
      "sub_title": "learning courses"
    }
  }
}
~~~

* How the query string is analyzed

~~~
GET _analyze
{
  "analyzer": "english",
  "text": "learning courses"
}
~~~

Result

~~~
{
  "tokens": [
    {
      "token": "learn",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "cours",
      "start_offset": 9,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
~~~

position is the term position. It is used in proximity matching (match_phrase) to express the distance between two terms.

In the search result, learning more courses ends up ranked last because of the english analyzer.

~~~
"hits": [
  ...
  "sub_title": "learned a lot of course"
  ...
  "sub_title": "learning more courses"
  } } ] } }
~~~

Now query the sub-field of sub_title, which uses the standard analyzer:

~~~
GET /forum/article/_search
{
  "query": {
    "match": {
      "sub_title.std": "learning courses"
    }
  }
}
~~~

The result matches our expectation

~~~
"hits": [
  {
    "_index": "forum",
    "_type": "article",
    "_id": "1",
    "_score": 0.5063205,
    "_source": {
      "articleID": "XHDK-A-1293-#fJ3",
      "userID": 1,
      "hidden": false,
      "postDate": "2017-01-01",
      "title": "this is java and elasticsearch blog",
      "content": "i like to write best elasticsearch article",
      "sub_title": "learning more courses"
    }
  }
]
~~~

2. A multi_match query over several fields brings the field strategies into play

> 1. Default best_fields query

~~~
GET /forum/_search
{
  "query": {
    "multi_match": {
      "query": "learning courses",
      "fields": ["sub_title", "sub_title.std"]
    }
  }
}
~~~

In the result, learned a lot of course ranks first

~~~
"hits": {
  "total": 2,
  "max_score": 1.219939,
  "hits": [
    {
      "_index": "forum",
      "_type": "article",
      "_id": "2",
      "_score": 1.219939,
      "_source": {
        "title": "this is java blog",
        "content": "i think java is the best programming language",
        "sub_title": "learned a lot of course"
      }
    },
    {
      "_index": "forum",
      "_type": "article",
      "_id": "1",
      "_score": 0.5063205,
      "_source": {
        "title": "this is java and elasticsearch blog",
        "content": "i like to write best elasticsearch article",
        "sub_title": "learning more courses"
      }
    }
  ]
}
~~~

> 2. most_fields strategy

learned a lot of course still ranks first, but its score is essentially unchanged (1.219939), while the score of learning more courses rises from 0.5063205 to 1.012641. This shows that most_fields takes the matches from every field into account.

~~~
"hits": {
  "total": 2,
  "max_score": 1.219939,
  "hits": [
    {
      "_index": "forum",
      "_type": "article",
      "_id": "2",
      "_score": 1.219939,
      "_source": {
        "articleID": "KDKE-B-9947-#kL5",
        "userID": 1,
        "hidden": false,
        "postDate": "2017-01-02",
        "title": "this is java blog",
        "content": "i think java is the best programming language",
        "sub_title": "learned a lot of course"
      }
    },
    {
      "_index": "forum",
      "_type": "article",
      "_id": "1",
      "_score": 1.012641,
      "_source": {
        "articleID": "XHDK-A-1293-#fJ3",
        "userID": 1,
        "hidden": false,
        "postDate": "2017-01-01",
        "title": "this is java and elasticsearch blog",
        "content": "i like to write best elasticsearch article",
        "sub_title": "learning more courses"
      }
    }
  ]
}
~~~

> Differences between best_fields and most_fields:
> (1) best_fields searches several fields and picks the score of the field that matches best. When the top scores of several queries are equal, it also takes the other queries' scores into account to some degree. Put simply, you search several fields because you want the data in which one single field contains as many of the keywords as possible.
> Pros: with the best_fields strategy, some weight given to the other fields, and minimum_should_match support, the most precisely matching results are pushed to the top as accurately as possible.
> Cons: apart from those precise matches, the remaining results score about the same, so their ordering is uneven and hard to tell apart.
> Real-world example: search engines such as Baidu. The best match comes first, but the rest are hard to distinguish.
> (2) most_fields searches several fields together and lets the query on as many fields as possible contribute to the total score. The result is a mix, similar to the very first result in the best_fields example, and not necessarily precise: one document may have a single field containing more of the keywords, yet another document ranks ahead of it because more of its fields matched. That is why a field such as sub_title.std is needed: it lets one field match the query string as exactly as possible and contribute a higher score, so the more precise matches rise to the top.
> Pros: results that match as many fields as possible are pushed to the top, and the overall ordering is fairly even.
> Cons: the most precise matches may not make it to the top.
> Real-world example: wiki sites are a clear case of most_fields. The results are fairly even, but you may have to page through several pages to find the best match.

> 3. cross_fields is suited to searching for a single entity that spans several fields, such as a person's name or a place name.

~~~
GET /forum/article/_search
{
  "query": {
    "multi_match": {
      "query": "Peter Smith",
      "type": "cross_fields",
      "operator": "and",
      "fields": ["author_first_name", "author_last_name"]
    }
  }
}
~~~

> Problem 1: most_fields only finds docs that match in as many fields as possible, not docs in which the fields together match completely. Fix: require every term to appear in at least one of the fields.
> Peter, Smith
> Peter must appear in author_first_name or author_last_name.
> Smith must appear in author_first_name or author_last_name.
> Peter Smith may be spread across several fields, so every term must appear in some field. Only the combination forms the identifier we are looking for, the complete name.
> With most_fields, a doc such as Smith Williams could also show up, because most_fields only requires that any one field matches, and the more fields match, the higher the score.
> Problem 2: most_fields cannot use minimum_should_match to drop long-tail data, that is, results that match very little. Fix: since every term is required to appear, the long tail is removed automatically.
> java hadoop spark: all three terms must appear in some field.
> A document in which only one field contains only java is filtered out as long tail.
> Problem 3: the TF/IDF algorithm. Take Peter Smith and Smith Williams. When searching for Peter Smith, Smith rarely occurs in first_name, so the term's frequency across all documents in that field is very low and its score is very high, and Smith Williams may end up ranked ahead of Peter Smith. Fix: when computing IDF, take the IDF of each term in each field and use the smallest value, so that the extreme value of the rare case no longer appears.
> Peter Smith
> Peter
> Smith
> Smith occurs very rarely in the author_first_name field across all docs, which gives a very high IDF score. A second IDF score is computed from the frequency of Smith in the author_last_name field across all docs. Because Smith is generally frequent in last_name, that IDF score is normal and not too high. For Smith, the smaller of the two IDF scores is then used, so the IDF score never becomes excessive.
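The term-centric rule behind cross_fields with operator and (every term must appear in at least one of the listed fields) can be sketched in a few lines of Python. This only illustrates the matching rule described above and is not how Elasticsearch implements it. The helper name `matches_cross_fields`, the lowercase whitespace tokenization, and the two sample authors are assumptions made for the example.

```python
def matches_cross_fields(doc, query, fields):
    """Return True if every query term occurs in at least one of the fields.

    Hypothetical helper: tokenization is a lowercase whitespace split,
    which only approximates a real analyzer.
    """
    terms = query.lower().split()
    # One token set per field, so each term can be checked against every field.
    field_tokens = [set(str(doc.get(f, "")).lower().split()) for f in fields]
    return all(any(term in tokens for tokens in field_tokens) for term in terms)


fields = ["author_first_name", "author_last_name"]
peter_smith = {"author_first_name": "Peter", "author_last_name": "Smith"}
smith_williams = {"author_first_name": "Smith", "author_last_name": "Williams"}

print(matches_cross_fields(peter_smith, "Peter Smith", fields))
print(matches_cross_fields(smith_williams, "Peter Smith", fields))
```

Under this rule the Peter Smith document matches, while Smith Williams is rejected because Peter appears in neither field. The same rule is what removes long-tail documents that contain only one of several terms.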
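To make the arithmetic in section 3 concrete, here is a minimal Python sketch that compares the three ways of combining per-query scores discussed there: the simplified bool/should formula, plain dis_max, and dis_max with tie_breaker. The function names are hypothetical, and the scores (1.1, 1.2, 2.3) are the illustrative numbers from the worked calculation, not values returned by Elasticsearch.

```python
def bool_should_score(scores):
    """Simplified formula from section 3: sum * matched queries / total queries.

    A score of 0.0 means the query did not match the document.
    """
    matched = [s for s in scores if s > 0]
    if not matched:
        return 0.0
    return sum(matched) * len(matched) / len(scores)


def dis_max_score(scores, tie_breaker=0.0):
    """Best score plus tie_breaker times the sum of all the other scores."""
    best = max(scores)
    others = sum(scores) - best
    return best + tie_breaker * others


# Per-query scores in the order [title query, content query].
doc4 = [1.1, 1.2]
doc5 = [0.0, 2.3]

for name, scores in (("doc4", doc4), ("doc5", doc5)):
    print(
        name,
        round(bool_should_score(scores), 2),
        round(dis_max_score(scores), 2),
        round(dis_max_score(scores, tie_breaker=0.3), 2),
    )
```

With these numbers doc4 outranks doc5 under the bool/should formula, while both dis_max variants put doc5 first. The tie_breaker only narrows the gap by crediting doc4 for its second matching field.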
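The conversion of "minimum_should_match": "75%" into the count 3 in section 1 follows from rounding the percentage of clauses down. The sketch below shows that step for positive percentages only. `required_matches` and `satisfies` are hypothetical helpers, and the whitespace tokenization and sample titles are assumptions for illustration.

```python
def required_matches(num_clauses, percent):
    """Number of should clauses that must match, rounded down.

    Covers positive percentages only, e.g. 75 for "75%".
    """
    return (num_clauses * percent) // 100


def satisfies(title, query, percent):
    """Check whether enough query terms occur in the title."""
    terms = query.lower().split()
    tokens = set(title.lower().split())
    matched = sum(1 for term in terms if term in tokens)
    return matched >= required_matches(len(terms), percent)


query = "java elasticsearch hadoop spark"
print(required_matches(4, 75))
print(satisfies("java elasticsearch hadoop guide", query, 75))
print(satisfies("java hadoop blog", query, 75))
```

Four terms at 75% require three matches, so a title containing three of the terms passes and a title containing only two does not. With three terms, 75% gives 2.25, which rounds down to 2.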