Term Vectors · Elasticsearch 5.4 Chinese Documentation

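# Term Vectors

Returns information and statistics on the terms in the fields of a particular document. The document can either be stored in the index or supplied artificially by the user. Term vectors are [real time](https://aqlu.gitbooks.io/elasticsearch-reference/content/Document_APIS/Get_API.html#realtime) by default, not near real time. This can be changed by setting the `realtime` parameter to `false`.

```
GET /twitter/tweet/1/_termvectors
```

Optionally, you can specify the fields for which information is retrieved with a parameter in the URL:

```
GET /twitter/tweet/1/_termvectors?fields=message
```

or by adding the requested fields to the request body (see the examples below). Fields can also be specified with wildcards, in a similar way to the [multi match query](https://aqlu.gitbooks.io/elasticsearch-reference/content/Query_DSL/Full_text_queries/Multi_Match_Query.html).

> Warning
>
> Note that the `/_termvector` endpoint has been deprecated since 2.0; use `_termvectors` instead.

## Return values

Three types of values can be requested: term information, term statistics and field statistics. By default, all term information and field statistics are returned, but no term statistics.

### Term information

* term frequency in the field (always returned)
* term positions (`positions` : `true`)
* start and end offsets (`offsets` : `true`)
* term payloads (`payloads` : `true`), as base64-encoded bytes

If the requested information is not stored in the index, it is computed on the fly where possible. Term vectors can also be computed for documents that do not exist in the index at all but are supplied by the user.

> Warning
>
> Start and end offsets assume UTF-16 encoding. If you want to use these offsets to extract the original text that produced a term, make sure the string you take the substring from is also encoded as UTF-16.

### Term statistics

Setting `term_statistics` to `true` (default is `false`) returns:

* total term frequency (how often a term occurs in all documents)
* document frequency (the number of documents containing the term)

These values are not returned by default because term statistics can have a serious performance impact.

### Field statistics

Setting `field_statistics` to `false` (default is `true`) omits:

* document count (how many documents contain this field)
* sum of document frequencies (the sum of document frequencies for all terms in this field)
* sum of total term frequencies (the sum of total term frequencies of each term in this field)

### Terms filtering

With the `filter` parameter, the returned terms can also be filtered by their `tf-idf` score. This is useful for finding a good characteristic vector of a document. The feature works in a similar way to the [second phase](https://aqlu.gitbooks.io/elasticsearch-reference/content/Query_DSL/Specialized_queries/More_Like_This_Query.html#mlt-query-term-selection) of the [More Like This Query](https://aqlu.gitbooks.io/elasticsearch-reference/content/Query_DSL/Specialized_queries/More_Like_This_Query.html). See [example 5](https://aqlu.gitbooks.io/elasticsearch-reference/content/Document_APIS/Term_Vectors.html#docs-termvectors-terms-filtering) for usage.

The following sub-parameters are supported:

| Parameter | Description |
| --- | --- |
| `max_num_terms` | Maximum number of terms returned per field. Defaults to `25`. |
| `min_term_freq` | Ignore words with less than this frequency in the source document. Defaults to `1`. |
| `max_term_freq` | Ignore words with more than this frequency in the source document. Defaults to unbounded. |
| `min_doc_freq` | Ignore terms whose document frequency is below this value. Defaults to `1`. |
| `max_doc_freq` | Ignore terms whose document frequency is above this value. Defaults to unbounded. |
| `min_word_length` | Words shorter than this length are ignored. Defaults to `0`. |
| `max_word_length` | Words longer than this length are ignored. Defaults to unbounded (`0`). |
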
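
The sub-parameters above can be read as a filter-then-rank step. The following Python sketch illustrates that reading only; it is not Elasticsearch's implementation. The function name `filter_terms`, the input shape, the sample numbers in `sample`, and the textbook `tf * log(N / df)` score are all assumptions made for illustration, and `None` stands in for "unbounded".

```python
import math


def filter_terms(terms, doc_count, max_num_terms=25,
                 min_term_freq=1, max_term_freq=None,
                 min_doc_freq=1, max_doc_freq=None,
                 min_word_length=0, max_word_length=None):
    """Apply the threshold sub-parameters, then keep the best-scoring terms.

    `terms` maps each term to {"term_freq": int, "doc_freq": int}.
    The score is a textbook tf-idf (an assumption, not Elasticsearch's formula).
    """
    kept = []
    for term, stats in terms.items():
        tf, df = stats["term_freq"], stats["doc_freq"]
        if df <= 0:
            continue  # avoid dividing by zero for terms unseen in the index
        if tf < min_term_freq or (max_term_freq is not None and tf > max_term_freq):
            continue
        if df < min_doc_freq or (max_doc_freq is not None and df > max_doc_freq):
            continue
        if len(term) < min_word_length:
            continue
        if max_word_length is not None and len(term) > max_word_length:
            continue
        kept.append((term, tf * math.log(doc_count / df)))
    # Highest score first, truncated to max_num_terms.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_num_terms]


# Hypothetical statistics for four terms of one document.
sample = {
    "armored": {"term_freq": 1, "doc_freq": 27},
    "industrialist": {"term_freq": 1, "doc_freq": 88},
    "stark": {"term_freq": 1, "doc_freq": 44},
    "to": {"term_freq": 2, "doc_freq": 150000},
}

for term, score in filter_terms(sample, doc_count=176214, max_num_terms=3):
    print(term, round(score, 3))
```

Under this scoring, a very common word such as "to" ranks last even though it occurs twice in the document, which is why a small `max_num_terms` tends to drop stop words.
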
## Behaviour

The term and field statistics are not accurate. Deleted documents are not taken into account. The information is only retrieved for the shard in which the requested document resides. The term and field statistics are therefore only useful as relative measures; the absolute numbers have no meaning in this context. By default, when requesting term vectors of artificial documents, the shard to get the statistics from is selected at random. Use `routing` only to hit a particular shard.

### Example: Returning stored term vectors

First, we create an index that stores term vectors, payloads and so on:

```
PUT /twitter/
{
  "mappings": {
    "tweet": {
      "properties": {
        "text": {
          "type": "text",
          "term_vector": "with_positions_offsets_payloads",
          "store": true,
          "analyzer": "fulltext_analyzer"
        },
        "fullname": {
          "type": "text",
          "term_vector": "with_positions_offsets_payloads",
          "analyzer": "fulltext_analyzer"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}
```

Then we add some documents:

```
PUT /twitter/tweet/1
{
  "fullname" : "John Doe",
  "text" : "twitter test test test "
}

PUT /twitter/tweet/2
{
  "fullname" : "Jane Doe",
  "text" : "Another twitter test ..."
}
```

The following request returns all information and statistics for the field `text` in document `1` (John Doe):

```
GET /twitter/tweet/1/_termvectors
{
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}
```

Response:

```
{
  "_id": "1",
  "_index": "twitter",
  "_type": "tweet",
  "_version": 1,
  "found": true,
  "took": 6,
  "term_vectors": {
    "text": {
      "field_statistics": {
        "doc_count": 2,
        "sum_doc_freq": 6,
        "sum_ttf": 8
      },
      "terms": {
        "test": {
          "doc_freq": 2,
          "term_freq": 3,
          "tokens": [
            {
              "end_offset": 12,
              "payload": "d29yZA==",
              "position": 1,
              "start_offset": 8
            },
            {
              "end_offset": 17,
              "payload": "d29yZA==",
              "position": 2,
              "start_offset": 13
            },
            {
              "end_offset": 22,
              "payload": "d29yZA==",
              "position": 3,
              "start_offset": 18
            }
          ],
          "ttf": 4
        },
        "twitter": {
          "doc_freq": 2,
          "term_freq": 1,
          "tokens": [
            {
              "end_offset": 7,
              "payload": "d29yZA==",
              "position": 0,
              "start_offset": 0
            }
          ],
          "ttf": 2
        }
      }
    }
  }
}
```

### Example: Generating term vectors on the fly

Term vectors that are not explicitly stored in the index are computed automatically on the fly. The following request returns all information and statistics for the fields in document `1`, even though the terms have not been explicitly stored in the index. Note that for the field `text`, the terms are not re-generated.

```
GET /twitter/tweet/1/_termvectors
{
  "fields" : ["text", "some_field_without_term_vectors"],
  "offsets" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}
```

### Example: Artificial documents

Term vectors can also be generated for artificial documents, that is, for documents that are not present in the index. For example, the following request returns the same information as in example 1. The mapping used is determined by the index and type. If dynamic mapping is turned on (the default), document fields that are not in the original mapping are created dynamically.

```
GET /twitter/tweet/_termvectors
{
  "doc" : {
    "fullname" : "John Doe",
    "text" : "twitter test test test"
  }
}
```

#### Per-field analyzer

Additionally, an analyzer different from the one configured on the field can be supplied with the `per_field_analyzer` parameter. This is useful for generating term vectors in any fashion, especially when using artificial documents. When an analyzer is supplied for a field that already stores term vectors, the term vectors are re-generated.

```
GET /twitter/tweet/_termvectors
{
  "doc" : {
    "fullname" : "John Doe",
    "text" : "twitter test test test"
  },
  "fields": ["fullname"],
  "per_field_analyzer" : {
    "fullname": "keyword"
  }
}
```

Response:

```
{
  "_index": "twitter",
  "_type": "tweet",
  "_version": 0,
  "found": true,
  "took": 6,
  "term_vectors": {
    "fullname": {
      "field_statistics": {
        "sum_doc_freq": 4,
        "doc_count": 2,
        "sum_ttf": 4
      },
      "terms": {
        "John Doe": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 8
            }
          ]
        }
      }
    }
  }
}
```

### Example: Terms filtering

Finally, the returned terms can be filtered by their `tf-idf` score. In the example below we obtain the three most "interesting" keywords from an artificial document with the given "plot" field value. Notice that the keyword "Tony" and the stop words are not part of the response, presumably because their `tf-idf` is too low.

```
GET /imdb/movies/_termvectors
{
  "doc": {
    "plot": "When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil."
  },
  "term_statistics" : true,
  "field_statistics" : true,
  "positions": false,
  "offsets": false,
  "filter" : {
    "max_num_terms" : 3,
    "min_term_freq" : 1,
    "min_doc_freq" : 1
  }
}
```

Response:

```
{
  "_index": "imdb",
  "_type": "movies",
  "_version": 0,
  "found": true,
  "term_vectors": {
    "plot": {
      "field_statistics": {
        "sum_doc_freq": 3384269,
        "doc_count": 176214,
        "sum_ttf": 3753460
      },
      "terms": {
        "armored": {
          "doc_freq": 27,
          "ttf": 27,
          "term_freq": 1,
          "score": 9.74725
        },
        "industrialist": {
          "doc_freq": 88,
          "ttf": 88,
          "term_freq": 1,
          "score": 8.590818
        },
        "stark": {
          "doc_freq": 44,
          "ttf": 47,
          "term_freq": 1,
          "score": 9.272792
        }
      }
    }
  }
}
```