Highlighting · Elasticsearch 5.4 Chinese Documentation

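# Highlighting

Original link: [https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-highlighting.html](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-highlighting.html)

Translation link: [http://www.apache.wiki/pages/editpage.action?pageId=4883096](http://www.apache.wiki/pages/editpage.action?pageId=488308)

Contributor: [ping](/display/~wangyangting)

Highlighting lets you highlight search results on one or more fields. The implementation uses the Lucene plain highlighter, the fast vector highlighter (`fvh`), or the postings highlighter. The following is an example of a search request body:

```
GET /_search
{
  "query" : {
    "match": { "user": "kimchy" }
  },
  "highlight" : {
    "fields" : {
      "content" : {}
    }
  }
}
```

In the above case, the `content` field will be highlighted for each search hit (each hit will contain another element, called `highlight`, which includes the highlighted fields and the highlighted fragments).

Note: In order to perform highlighting, the actual content of the field is required. If the field in question is stored (`store` is set to `true` in the mapping), the stored value is used; otherwise the actual `_source` is loaded and the relevant field is extracted from it. The `_all` field cannot be extracted from `_source`, so it can only be used for highlighting if it is mapped with `store` set to `true`.

The field name supports wildcard notation. For example, using `comment_*` causes all text and keyword fields (and `string` fields from versions before 5.0) that match the expression to be highlighted. Note that all other fields will not be highlighted. If you use a custom mapper and want to highlight on one of its fields anyway, you must provide the field name explicitly.

### Plain highlighter

The default choice of highlighter is the `plain` type, which uses the Lucene highlighter. It tries hard to reflect the query matching logic, in terms of understanding word importance and any word positioning criteria in phrase queries.

Warning: If you want to highlight a lot of fields in a lot of documents with complex queries, this highlighter will not be fast. In its effort to accurately reflect query logic, it creates a tiny in-memory index and re-runs the original query criteria through Lucene's query execution planner to get access to low-level match information for the current document. This is repeated for every field and every document that needs highlighting. If this causes a performance problem in your system, consider using an alternative highlighter.

### Postings highlighter

If `index_options` is set to `offsets` in the mapping, the postings highlighter is used instead of the plain highlighter. The postings highlighter:

* Is faster, since it does not need to reanalyze the text to be highlighted: the larger the documents, the better the performance gain
* Requires less disk space than the `term_vectors` needed by the fast vector highlighter
* Breaks the text into sentences and highlights them. It works well for natural language, less well for fields containing, for instance, HTML markup
* Treats the document as the whole corpus and scores individual sentences with the BM25 algorithm, as if they were documents in that corpus

Here is an example of setting the `content` field in the index mapping so that it can be highlighted with the postings highlighter:

```
{
  "type_name" : {
    "content" : {"index_options" : "offsets"}
  }
}
```

Note: The postings highlighter performs simple query term highlighting, regardless of the terms' positions. This means that, when used with a phrase query, it highlights all the terms the query is composed of, whether or not they are actually part of a query match, effectively ignoring their positions.

Warning: The postings highlighter does not support highlighting some complex queries, such as a `match` query with `type` set to `match_phrase_prefix`. In that case no highlighted snippets are returned.

### Fast vector highlighter

If `term_vector` information is provided by setting `term_vector` to `with_positions_offsets` in the mapping, the fast vector highlighter is used instead of the plain highlighter. The fast vector highlighter:

* Is faster, especially for large fields (> 1MB)
* Can be customized with `boundary_chars`, `boundary_max_scan`, and `fragment_offset` (see below)
*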
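  Requires setting `term_vector` to `with_positions_offsets`, which increases the size of the index
* Can combine matches from multiple fields into one result. See `matched_fields`
* Can assign different weights to matches at different positions, so that phrase matches can be sorted above term matches when highlighting a Boosting Query that boosts phrase matches over term matches

Here is an example of setting the `content` field so that it can be highlighted with the fast vector highlighter (this makes the index bigger):

```
{
  "type_name" : {
    "content" : {"term_vector" : "with_positions_offsets"}
  }
}
```

### Force highlighter type

The `type` field forces a specific highlighter type. This is useful, for instance, when you need to use the plain highlighter on a field that has `term_vectors` enabled. The allowed values are `plain`, `postings`, and `fvh`. The following example forces the use of the plain highlighter:

```
GET /_search
{
  "query" : {
    "match": { "user": "kimchy" }
  },
  "highlight" : {
    "fields" : {
      "content" : {"type" : "plain"}
    }
  }
}
```

### Force highlighting on source

Forces highlighting to be performed on the source, even if the fields are stored separately. Defaults to `false`.

```
GET /_search
{
  "query" : {
    "match": { "user": "kimchy" }
  },
  "highlight" : {
    "fields" : {
      "content" : {"force_source" : true}
    }
  }
}
```

### Highlighting Tags

By default, highlighted text is wrapped in `<em>` and `</em>`. This can be changed by setting `pre_tags` and `post_tags`, for example:

```
GET /_search
{
  "query" : {
    "match": { "user": "kimchy" }
  },
  "highlight" : {
    "pre_tags" : ["<tag1>"],
    "post_tags" : ["</tag1>"],
    "fields" : {
      "_all" : {}
    }
  }
}
```

With the fast vector highlighter, more tags can be specified, and they are ordered by "importance".

```
GET /_search
{
  "query" : {
    "match": { "user": "kimchy" }
  },
  "highlight" : {
    "pre_tags" : ["<tag1>", "<tag2>"],
    "post_tags" : ["</tag1>", "</tag2>"],
    "fields" : {
      "_all" : {}
    }
  }
}
```

There are also built-in tag schemas. Currently there is a single schema, called `styled`, with the following `pre_tags`:

```
<em class="hlt1">, <em class="hlt2">, <em class="hlt3">,
<em class="hlt4">, <em class="hlt5">, <em class="hlt6">,
<em class="hlt7">, <em class="hlt8">, <em class="hlt9">,
<em class="hlt10">
```

and `</em>` as `post_tags`. If you can think of better built-in tag schemas, just send an email to the mailing list or open an issue. The following is an example of switching tag schemas:

```
GET /_search
{
  "query" : {
    "match": { "user": "kimchy" }
  },
  "highlight" : {
    "tags_schema" : "styled",
    "fields" : {
      "content" : {}
    }
  }
}
```

### Encoder

The `encoder` parameter defines how the highlighted text is encoded. It can be either `default` (no encoding) or `html` (escapes HTML in the text, for use with HTML highlighting tags).

### Highlighted Fragments

Each highlighted field can control the size of the highlighted fragments in characters (defaults to `100`) and the maximum number of fragments to return (defaults to `5`). For example:

```
GET /_search
{
  "query" : {
    "match": { "user": "kimchy" }
  },
  "highlight" : {
    "fields" : {
      "content" : {"fragment_size" : 150, "number_of_fragments" : 3}
    }
  }
}
```

When the postings highlighter is used, `fragment_size` is ignored, because it outputs sentences regardless of their length.

On top of this, it is possible to specify that the highlighted fragments be sorted by score:

```
GET /_search
{
  "query" : {
    "match": { "user": "kimchy" }
  },
  "highlight" : {
    "order" : "score",
    "fields" : {
      "content" : {"fragment_size"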
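      : 150, "number_of_fragments" : 3}
    }
  }
}
```

If `number_of_fragments` is set to `0`, no fragments are produced; instead the whole content of the field is returned, with highlighting applied. This is handy when short texts (such as a document title or an address) need to be highlighted but no fragmentation is required. Note that `fragment_size` is ignored in this case.

```
GET /_search
{
  "query" : {
    "match": { "user": "kimchy" }
  },
  "highlight" : {
    "fields" : {
      "_all" : {},
      "bio.title" : {"number_of_fragments" : 0}
    }
  }
}
```

When using `fvh`, the `fragment_offset` parameter controls the margin from which highlighting starts.

When there is no matching fragment to highlight, the default is to return nothing. Instead, a snippet of text from the beginning of the field can be returned by setting `no_match_size` (default `0`) to the length of the text you want returned. The actual length may be shorter than specified, because it tries to break on a word boundary. When the postings highlighter is used, the actual size of the snippet cannot be controlled, so the first sentence is returned whenever `no_match_size` is greater than `0`.

```
GET /_search
{
  "query" : {
    "match": { "user": "kimchy" }
  },
  "highlight" : {
    "fields" : {
      "content" : {
        "fragment_size" : 150,
        "number_of_fragments" : 3,
        "no_match_size": 150
      }
    }
  }
}
```

### Highlight query

It is also possible to highlight against a query other than the search query by setting `highlight_query`. This is especially useful if you use a rescore query, because those are not taken into account by highlighting by default. Elasticsearch does not validate in any way that `highlight_query` contains the search query, so it is possible to define it such that legitimate query results are not highlighted at all. Generally it is better to include the search query in the `highlight_query`. Below is an example that includes both the search query and the rescore query in `highlight_query`.

```
GET /_search
{
  "stored_fields": [ "_id" ],
  "query" : {
    "match": {
      "content": { "query": "foo bar" }
    }
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query" : {
        "match_phrase": {
          "content": { "query": "foo bar", "slop": 1 }
        }
      },
      "rescore_query_weight" : 10
    }
  },
  "highlight" : {
    "order" : "score",
    "fields" : {
      "content" : {
        "fragment_size" : 150,
        "number_of_fragments" : 3,
        "highlight_query": {
          "bool": {
            "must": {
              "match": {
                "content": { "query": "foo bar" }
              }
            },
            "should": {
              "match_phrase": {
                "content": { "query": "foo bar", "slop": 1, "boost": 10.0 }
              }
            },
            "minimum_should_match": 0
          }
        }
      }
    }
  }
}
```

Note that in this case the score of a text fragment is calculated by the Lucene highlighting framework. For implementation details, see the `ScoreOrderFragmentsBuilder.java` class. When the postings highlighter is used, on the other hand, fragments are scored with the **BM25** algorithm, as mentioned above.

### Global Settings

Highlighting settings can be set at the global level and then overridden at the field level.

```
GET /_search
{
  "query" : {
    "match": { "user": "kimchy" }
  },
  "highlight" : {
    "number_of_fragments" : 3,
    "fragment_size" : 150,
    "fields" : {
      "_all" : { "pre_tags" : ["<em>"], "post_tags" : ["</em>"] },
      "bio.title" : {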
        "number_of_fragments" : 0 },
      "bio.author" : { "number_of_fragments" : 0 },
      "bio.content" : { "number_of_fragments" : 5, "order" : "score" }
    }
  }
}
```

### Require Field Match

`require_field_match` can be set to `false`, which causes any field to be highlighted regardless of whether the query matched it specifically. The default is `true`, meaning that only fields containing a query match are highlighted.

```
GET /_search
{
  "query" : {
    "match": { "user": "kimchy" }
  },
  "highlight" : {
    "require_field_match": false,
    "fields": {
      "_all" : { "pre_tags" : ["<em>"], "post_tags" : ["</em>"] }
    }
  }
}
```

### Boundary Characters

When highlighting a field with the fast vector highlighter, `boundary_chars` can be configured to define what constitutes a boundary for highlighting. It is a single string in which each boundary character is listed. It defaults to `.,!? \t\n`.

`boundary_max_scan` controls how far to look for boundary characters and defaults to `20`.

### Matched Fields

The fast vector highlighter can combine matches on multiple fields to highlight a single field by using `matched_fields`. This is most intuitive for multi-fields that analyze the same string in different ways. All `matched_fields` must have `term_vector` set to `with_positions_offsets`, but only the field into which the matches are combined is loaded, so only that field benefits from having `store` set to `yes`.

In the following examples, `content` is analyzed by the `english` analyzer and `content.plain` is analyzed by the `standard` analyzer.

```
GET /_search
{
  "query": {
    "query_string": {
      "query": "content.plain:running scissors",
      "fields": ["content"]
    }
  },
  "highlight": {
    "order": "score",
    "fields": {
      "content": {
        "matched_fields": ["content", "content.plain"],
        "type" : "fvh"
      }
    }
  }
}
```

The above matches both "run with scissors" and "running with scissors" and highlights "running" and "scissors" but not "run". If both phrases appear in a large document, "running with scissors" is sorted above "run with scissors" in the fragment list because that fragment contains more matches.

```
GET /_search
{
  "query": {
    "query_string": {
      "query": "running scissors",
      "fields": ["content", "content.plain^10"]
    }
  },
  "highlight": {
    "order": "score",
    "fields": {
      "content": {
        "matched_fields": ["content", "content.plain"],
        "type" : "fvh"
      }
    }
  }
}
```

The above highlights "run" as well as "running" and "scissors" but still sorts "running with scissors" above "run with scissors" because the plain match ("running") is boosted.
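```
GET /_search
{
  "query": {
    "query_string": {
      "query": "running scissors",
      "fields": ["content", "content.plain^10"]
    }
  },
  "highlight": {
    "order": "score",
    "fields": {
      "content": {
        "matched_fields": ["content.plain"],
        "type" : "fvh"
      }
    }
  }
}
```

The above query does not highlight "run" or "scissor", but it shows that it is fine not to list the field into which the matches are combined (`content`) in the matched fields.

Note: Technically it is also fine to add fields to `matched_fields` that do not share the same underlying string as the field into which the matches are combined. The results might not make much sense, and if one of the matches falls past the end of the text, the whole query will fail.

Note: There is a small amount of overhead involved in setting `matched_fields` to a non-empty array, so always prefer

```
"highlight": {
  "fields": {
    "content": {}
  }
}
```

to

```
"highlight": {
  "fields": {
    "content": {
      "matched_fields": ["content"],
      "type" : "fvh"
    }
  }
}
```

### Phrase Limit

The fast vector highlighter has a `phrase_limit` parameter that prevents it from analyzing too many phrases and consuming large amounts of memory. It defaults to `256`, so only the first 256 matching phrases in the document are considered. You can raise the limit with the `phrase_limit` parameter, but keep in mind that scoring more phrases consumes more time and memory.

If you use `matched_fields`, keep in mind that `phrase_limit` phrases are considered per matched field.

### Field Highlight Order

Elasticsearch highlights fields in the order in which they are sent. Per the JSON spec, objects are unordered, but if you need to be explicit about the order in which fields are highlighted, you can use an array for `fields`, like this:

```
"highlight": {
  "fields": [
    {"title":{ /*params*/ }},
    {"text":{ /*params*/ }}
  ]
}
```

None of the highlighters built into Elasticsearch care about the order in which fields are highlighted, but a plugin may.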
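
The array form of `fields` can also be produced programmatically. The following is a minimal Python sketch, not an official client API: `build_highlight` is a hypothetical helper introduced only for illustration, and the per-field parameters are example values borrowed from earlier sections. It only builds and prints the request body; sending it to `GET /_search` is left to whatever HTTP client you use.

```python
import json


def build_highlight(fields, **global_settings):
    """Build a highlight clause with an explicit field order.

    `fields` is a list of (field_name, params) pairs. This helper is
    hypothetical and exists only for this example.
    """
    clause = dict(global_settings)
    # One single-key object per field, wrapped in an array, so the order
    # does not depend on how JSON objects are serialized or parsed.
    clause["fields"] = [{name: dict(params)} for name, params in fields]
    return clause


body = {
    "query": {"match": {"user": "kimchy"}},
    "highlight": build_highlight(
        [
            ("title", {"number_of_fragments": 0}),
            ("text", {"fragment_size": 150, "number_of_fragments": 3}),
        ],
        order="score",  # global setting, applies to every field
    ),
}

print(json.dumps(body, indent=2))
```

Passing the fields as a list of pairs keeps the ordering decision in one place; global settings such as `order` are supplied as keyword arguments and end up next to `fields` in the clause.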