Reindex API · Elasticsearch 5.4 documentation (Chinese edition)

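# Reindex API

> Important
>
> Reindex does not attempt to set up the destination index. It does not copy the settings of the source index. You should set up the destination index before running a `_reindex` action, including mappings, shard counts, replicas, and so on.

The most basic form of `_reindex` just copies documents from one index to another. This will copy documents from the `twitter` index into the `new_twitter` index:

```
POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  }
}
```

That will return something like this:

```
{
  "took" : 147,
  "timed_out": false,
  "created": 120,
  "updated": 0,
  "deleted": 0,
  "batches": 1,
  "version_conflicts": 0,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1.0,
  "throttled_until_millis": 0,
  "total": 120,
  "failures" : [ ]
}
```

Just like [_update_by_query](https://aqlu.gitbooks.io/elasticsearch-reference/content/Document_APIS/Update_By_Query_API.html), `_reindex` gets a snapshot of the source index, but its target must be a different index, so version conflicts are unlikely. The `dest` element can be configured like the Index API to control optimistic concurrency control. Leaving `version_type` out (as above) or setting it to `internal` causes Elasticsearch to blindly dump documents into the target, overwriting any that happen to have the same type and ID:

```
POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "internal"
  }
}
```

Setting `version_type` to `external` causes Elasticsearch to preserve the version from the source, create any documents that are missing, and update any documents that have an older version in the destination index than they do in the source index:

```
POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "external"
  }
}
```

Setting `op_type` to `create` causes `_reindex` to only create missing documents in the target index. All existing documents will cause a version conflict:

```
POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "op_type": "create"
  }
}
```

By default version conflicts abort the `_reindex` process, but you can count them instead by setting `"conflicts": "proceed"` in the request body:

```
POST _reindex
{
  "conflicts": "proceed",
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "op_type": "create"
  }
}
```

You can limit the documents by adding a `type` to the `source` or by adding a `query`. This will only copy `tweet` documents made by `kimchy` into `new_twitter`:

```
POST _reindex
{
  "source": {
    "index": "twitter",
    "type": "tweet",
    "query": {
      "term": {
        "user": "kimchy"
      }
    }
  },
  "dest": {
    "index": "new_twitter"
  }
}
```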
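As an illustration of how the options above combine, here is a minimal Python sketch that assembles a `_reindex` request body as a plain dictionary. The helper name `build_reindex_body` and its keyword arguments are hypothetical conveniences, not part of Elasticsearch or any client library; only the JSON keys (`source`, `dest`, `conflicts`, `op_type`, `version_type`, `type`, `query`) are taken from the requests shown above.

```python
import json


def build_reindex_body(source_index, dest_index, *, doc_type=None, query=None,
                       op_type=None, version_type=None,
                       proceed_on_conflicts=False):
    """Assemble a _reindex request body (hypothetical helper, not a client API)."""
    source = {"index": source_index}
    if doc_type is not None:
        source["type"] = doc_type            # restrict to one document type
    if query is not None:
        source["query"] = query              # restrict to matching documents

    dest = {"index": dest_index}
    if op_type is not None:
        dest["op_type"] = op_type            # e.g. "create": only add missing documents
    if version_type is not None:
        dest["version_type"] = version_type  # e.g. "internal" or "external"

    body = {"source": source, "dest": dest}
    if proceed_on_conflicts:
        # Count version conflicts instead of aborting on the first one.
        body["conflicts"] = "proceed"
    return body


body = build_reindex_body(
    "twitter", "new_twitter",
    doc_type="tweet",
    query={"term": {"user": "kimchy"}},
    op_type="create",
    proceed_on_conflicts=True,
)
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after a `POST _reindex` line in the console. The sketch deliberately performs no validation of `op_type` or `version_type`, so a typo in either would only be caught by Elasticsearch itself.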
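`index` and `type` in `source` can both be lists, allowing you to copy from many sources in one request. The request below copies documents from the `tweet` and `post` types in the `twitter` and `blog` indices. It also includes the `post` type in the `twitter` index and the `tweet` type in the `blog` index. If you want to be more specific you will need to use the `query`. It also makes no effort to handle ID collisions. The target index will remain valid, but it is not easy to predict which document will survive because the iteration order is not well defined.

```
POST _reindex
{
  "source": {
    "index": ["twitter", "blog"],
    "type": ["tweet", "post"]
  },
  "dest": {
    "index": "all_together"
  }
}
```

It is also possible to limit the number of processed documents by setting `size`. This will only copy a single document from `twitter` to `new_twitter`:

```
POST _reindex
{
  "size": 1,
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  }
}
```

If you want a particular set of documents from the `twitter` index you will need to sort. Sorting makes the scroll less efficient, but in some contexts it is worth it. If possible, prefer a more selective query to `size` and `sort`. This will copy `10000` documents from `twitter` into `new_twitter`:

```
POST _reindex
{
  "size": 10000,
  "source": {
    "index": "twitter",
    "sort": { "date": "desc" }
  },
  "dest": {
    "index": "new_twitter"
  }
}
```

The `source` section supports all the elements that are supported in a [search request](https://aqlu.gitbooks.io/elasticsearch-reference/content/Search_APIs/Request_Body_Search.html). For instance, only a subset of the fields from the original documents can be reindexed using source filtering as follows:

```
POST _reindex
{
  "source": {
    "index": "twitter",
    "_source": ["user", "tweet"]
  },
  "dest": {
    "index": "new_twitter"
  }
}
```

Like `_update_by_query`, `_reindex` supports a script that modifies the document. Unlike `_update_by_query`, the script is allowed to modify the document's metadata. This example bumps the version of the source document:

```
POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "external"
  },
  "script": {
    "inline": "if (ctx._source.foo == 'bar') {ctx._version++; ctx._source.remove('foo')}",
    "lang": "painless"
  }
}
```

Just as in `_update_by_query`, you can set `ctx.op` to change the operation that is executed on the destination index:

* `noop`: Set `ctx.op = "noop"` if your script decides that the document does not need to be indexed. This causes `_reindex` to omit that document from the destination index. The skipped operation is reported in the `noops` counter of the [response body](https://aqlu.gitbooks.io/elasticsearch-reference/content/Document_APIS/Update_By_Query_API.html#response-body).
* `delete`: Set `ctx.op = "delete"` if your script decides that the document must be deleted from the destination index. The deletion is reported in the `deleted` counter of the [response body](https://aqlu.gitbooks.io/elasticsearch-reference/content/Document_APIS/Reindex_API.html#response-body).

Setting `ctx.op` to anything else is an error. Setting any other field in `ctx` is an error.

Think of the possibilities! Just be careful, because with great power comes the ability to change:

* `_id`
* `_type`
* `_index`
* `_version`
* `_routing`
* `_parent`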
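To make the effect of the version-bumping script shown above concrete, the following Python sketch mirrors its logic on a plain dictionary standing in for `ctx`. This is only a model for reasoning about the script: the function name `bump_version_and_drop_foo` and the sample documents are made up for illustration, and Elasticsearch runs the real script in Painless, not Python.

```python
import copy


def bump_version_and_drop_foo(ctx):
    """Python mirror of the Painless script:
    if (ctx._source.foo == 'bar') {ctx._version++; ctx._source.remove('foo')}
    """
    ctx = copy.deepcopy(ctx)                # leave the caller's document untouched
    if ctx["_source"].get("foo") == "bar":  # a missing 'foo' simply does not match
        ctx["_version"] += 1                # metadata change
        ctx["_source"].pop("foo")           # document body change
    return ctx


# Made-up sample documents, for illustration only.
matching = {"_id": "1", "_version": 3, "_source": {"foo": "bar", "user": "kimchy"}}
other = {"_id": "2", "_version": 7, "_source": {"foo": "baz"}}

changed = bump_version_and_drop_foo(matching)
unchanged = bump_version_and_drop_foo(other)
print(changed)
print(unchanged)
```

Only the document whose `foo` field equals `bar` has its version incremented and the field removed; every other document passes through unmodified.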
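Setting `_version` to `null` or clearing it from the `ctx` map is just like not sending the version in an indexing request. It causes the document to be overwritten in the target index regardless of the version on the target or the version type used in the `_reindex` request.

By default, if `_reindex` sees a document with routing then the routing is preserved unless it is changed by the script. You can set `routing` on the `dest` request to change this:

* `keep`: Sets the routing on the bulk request sent for each match to the routing on the match. This is the default.
* `discard`: Sets the routing on the bulk request sent for each match to null.
* `=<some text>`: Sets the routing on the bulk request sent for each match to all the text after the `=`.

For example, you can use the following request to copy all documents from the `source` index with the company name `cat` into the `dest` index with routing set to `cat`:

```
POST _reindex
{
  "source": {
    "index": "source",
    "query": {
      "match": {
        "company": "cat"
      }
    }
  },
  "dest": {
    "index": "dest",
    "routing": "=cat"
  }
}
```

By default `_reindex` uses scroll batches of `1000`. You can change the batch size with the `size` field in the `source` element:

```
POST _reindex
{
  "source": {
    "index": "source",
    "size": 100
  },
  "dest": {
    "index": "dest",
    "routing": "=cat"
  }
}
```

Reindex can also use the Ingest Node feature by specifying a `pipeline` like this:

```
POST _reindex
{
  "source": {
    "index": "source"
  },
  "dest": {
    "index": "dest",
    "pipeline": "some_ingest_pipeline"
  }
}
```

## Reindex from Remote

Reindex supports reindexing from a remote Elasticsearch cluster:

```
POST _reindex
{
  "source": {
    "remote": {
      "host": "http://otherhost:9200",
      "username": "user",
      "password": "pass"
    },
    "index": "source",
    "query": {
      "match": {
        "test": "data"
      }
    }
  },
  "dest": {
    "index": "dest"
  }
}
```

The `host` parameter must contain a scheme, host, and port (for example `https://otherhost:9200`). The `username` and `password` parameters are optional; when they are present, reindex connects to the remote Elasticsearch node using basic authentication. Be sure to use `https` when using basic authentication, or the password will be sent in plain text.

Remote hosts have to be explicitly whitelisted in `elasticsearch.yml` using the `reindex.remote.whitelist` property. It can be set to a comma-delimited list of allowed remote `host` and `port` combinations (for example `otherhost:9200,another:9200,127.0.10.*:9200,localhost:*`). The scheme is ignored by the whitelist; only the host and port are used.

This feature should work with remote clusters of any version of Elasticsearch you are likely to find. This should allow you to upgrade from any version of Elasticsearch to the current version by reindexing from a cluster of the old version.

To enable queries sent to older versions of Elasticsearch, the `query` parameter is sent directly to the remote host without validation or modification.

Reindexing from a remote server uses an on-heap buffer that defaults to a maximum size of `100mb`. If the remote index includes very large documents you will need to use a smaller batch size. The example below sets the batch size to `10`, which is very, very small.

```
POST _reindex
{
  "source": {
    "remote": {
      "host": "http://otherhost:9200"
    },
    "index": "source",
    "size": 10,
    "query": {
      "match": {
        "test": "data"
      }
    }
  },
  "dest": {
    "index": "dest"
  }
}
```

It is also possible to set the socket read timeout on the remote connection with the `socket_timeout` field and the connection timeout with the `connect_timeout` field. Both default to thirty seconds. This example sets the socket read timeout to one minute and the connection timeout to ten seconds:

```
POST _reindex
{
  "source": {
    "remote": {
      "host": "http://otherhost:9200",
      "socket_timeout": "1m",
      "connect_timeout": "10s"
    },
    "index": "source",
    "query": {
      "match": {
        "test": "data"
      }
    }
  },
  "dest": {
    "index": "dest"
  }
}
```

## URL Parameters

In addition to the standard parameters like `pretty`, the Reindex API also supports `refresh`, `wait_for_completion`, `wait_for_active_shards`, `timeout`, and `requests_per_second`.

Sending the `refresh` URL parameter causes all indexes to which the request wrote to be refreshed. This is different from the Index API's `refresh` parameter, which causes just the shard that received the new data to be refreshed.

If the request contains `wait_for_completion=false` then Elasticsearch performs some preflight checks, launches the request, and then returns a task which can be used with the [Tasks API](https://aqlu.gitbooks.io/elasticsearch-reference/content/Document_APIS/Reindex_API.html#docs-delete-by-query-task-api) to cancel the task or get its status. Elasticsearch also creates a record of this task as a document at `.tasks/task/${taskId}`. This is yours to keep or remove as you see fit. When you are done with it, delete it so Elasticsearch can reclaim the space it uses.

`wait_for_active_shards` controls how many copies of a shard must be active before proceeding with the request. See [here](https://aqlu.gitbooks.io/elasticsearch-reference/content/Document_APIS/Index_API.html#index-wait-for-active-shards) for details. `timeout` controls how long each write request waits for unavailable shards to become available. Both work exactly as they do in the [Bulk API](https://aqlu.gitbooks.io/elasticsearch-reference/content/Document_APIS/Bulk_API.html).

`requests_per_second` can be set to any positive decimal number (`1.4`, `6`, `1000`, and so on) to throttle the number of requests per second that the reindex issues, or it can be set to `-1` to disable throttling. The throttling is done by waiting between bulk batches so that it can manipulate the scroll timeout. The wait time is the difference between the target time for the batch, `requests_in_the_batch / requests_per_second`, and the time the batch actually took to complete. Since the batch is not broken into multiple bulk requests, large batch sizes cause Elasticsearch to create many requests and then wait for a while before starting the next set. This is "bursty" instead of "smooth". The default is `-1`.

## Response body

The JSON response looks like this:

```
{
  "took" : 639,
  "updated": 0,
  "created": 123,
  "batches": 1,
  "version_conflicts": 2,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "failures" : [ ]
}
```

* `took`: The number of milliseconds from start to end of the whole operation.
* `updated`: The number of documents that were successfully updated.
* `created`: The number of documents that were successfully created.
* `batches`: The number of scroll responses pulled back by the reindex.
* `version_conflicts`: The number of version conflicts that the reindex hit.
* `retries`: The number of retries attempted by the reindex. `bulk` is the number of bulk actions retried and `search` is the number of search actions retried.
* `throttled_millis`: The number of milliseconds the request slept to conform to `requests_per_second`.
* `failures`: An array of all indexing failures. If this is non-empty then the request aborted because of those failures. See `conflicts` for how to prevent version conflicts from aborting the operation.

## Using with the Task API

You can use the [Task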
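API](https://aqlu.gitbooks.io/elasticsearch-reference/content/Cluster_APIs/Task_Management_API.html) to fetch the status of all running reindex requests:

```
GET _tasks?detailed=true&actions=*reindex
```

The response looks like this:

```
{
  "nodes" : {
    "r1A2WoRbTwKZ516z6NEs5A" : {
      "name" : "r1A2WoR",
      "transport_address" : "127.0.0.1:9300",
      "host" : "127.0.0.1",
      "ip" : "127.0.0.1:9300",
      "attributes" : {
        "testattr" : "test",
        "portsfile" : "true"
      },
      "tasks" : {
        "r1A2WoRbTwKZ516z6NEs5A:36619" : {
          "node" : "r1A2WoRbTwKZ516z6NEs5A",
          "id" : 36619,
          "type" : "transport",
          "action" : "indices:data/write/reindex",
          "status" : { // ①
            "total" : 6154,
            "updated" : 3500,
            "created" : 0,
            "deleted" : 0,
            "batches" : 4,
            "version_conflicts" : 0,
            "noops" : 0,
            "retries": {
              "bulk": 0,
              "search": 0
            },
            "throttled_millis": 0
          },
          "description" : ""
        }
      }
    }
  }
}
```

① This object contains the actual status. It is just like the response JSON, with the important addition of the `total` field. `total` is the total number of operations that the reindex expects to perform. You can estimate the progress by adding the `updated`, `created`, and `deleted` fields. The request will finish when their sum is equal to the `total` field.

With the task ID you can look up the task directly:

```
GET /_tasks/taskId:1
```

The advantage of this API is that it integrates with `wait_for_completion=false` to transparently return the status of completed tasks. If the task is completed and `wait_for_completion=false` was set on it, then it comes back with a `results` or an `error` field. The cost of this feature is the document that `wait_for_completion=false` creates at `.tasks/task/${taskId}`. It is up to you to delete that document.

## Using with the Cancel Task API

Any reindex can be canceled using the [Task Cancel API](https://aqlu.gitbooks.io/elasticsearch-reference/content/Cluster_APIs/Task_Management_API.html):

```
POST _tasks/task_id:1/_cancel
```

The `task_id` can be found using the Task API above. Cancellation should happen quickly but might take a few seconds. The task status API above will continue to list the task until it wakes up to cancel itself.

## Rethrottling

The value of `requests_per_second` can be changed on a running reindex using the `_rethrottle` API:

```
POST _reindex/task_id:1/_rethrottle?requests_per_second=-1
```

The `task_id` can be found using the Task API above. Just like when setting it on the `_reindex` API, `requests_per_second` can be either `-1` to disable throttling or any decimal number, like `1.7` or `12`, to throttle to that level. Rethrottling that speeds up the query takes effect immediately, but rethrottling that slows down the query takes effect only after the current batch completes. This prevents scroll timeouts.

## Reindex to change the name of a field

`_reindex` can be used to build a copy of an index with renamed fields. Say you create an index containing documents that look like this:

```
POST test/test/1?refresh
{
  "text": "words words",
  "flag": "foo"
}
```

But you don't like the name `flag` and want to replace it with `tag`. `_reindex` can create the other index for you:

```
POST _reindex
{
  "source": {
    "index": "test"
  },
  "dest": {
    "index": "test2"
  },
  "script": {
    "inline": "ctx._source.tag = ctx._source.remove(\"flag\")"
  }
}
```

Now you can get the new document:

```
GET test2/test/1
```

and it will look like:

```
{
  "found": true,
  "_id": "1",
  "_index": "test2",
  "_type": "test",
  "_version": 1,
  "_source": {
    "text": "words words",
    "tag": "foo"
  }
}
```

Or you can search by `tag` or whatever you want.

## Manual slicing

Reindex supports [Sliced Scroll](https://aqlu.gitbooks.io/elasticsearch-reference/content/Search_APIs/Request_Body_Search/Scroll.html#sliced-scroll), allowing you to manually parallelize the process relatively easily:

```
POST _reindex
{
  "source": {
    "index": "twitter",
    "slice": {
      "id": 0,
      "max": 2
    }
  },
  "dest": {
    "index": "new_twitter"
  }
}
POST _reindex
{
  "source": {
    "index": "twitter",
    "slice": {
      "id": 1,
      "max": 2
    }
  },
  "dest": {
    "index": "new_twitter"
  }
}
```

You can verify that it works with:

```
GET _refresh
POST new_twitter/_search?size=0&filter_path=hits.total
```

which results in a sensible `total` like this one:

```
{
  "hits": {
    "total": 120
  }
}
```

## Automatic slicing

You can also let reindex automatically parallelize using [Sliced Scroll](https://aqlu.gitbooks.io/elasticsearch-reference/content/Search_APIs/Request_Body_Search/Scroll.html#sliced-scroll) to slice on `_uid`:

```
POST _reindex?slices=5&refresh
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  }
}
```

You can verify that it works with:

```
POST new_twitter/_search?size=0&filter_path=hits.total
```

which results in a sensible `total` like this one:

```
{
  "hits": {
    "total": 120
  }
}
```

Adding `slices` to `_reindex` just automates the manual process used in the section above by creating sub-requests, which means it has some quirks:

* You can see these requests in the [Task API](https://aqlu.gitbooks.io/elasticsearch-reference/content/Document_APIS/Reindex_API.html#docs-delete-by-query-task-api). These sub-requests are "child" tasks of the task for the request with `slices`.
* Fetching the status of the task for the request with `slices` only contains the status of completed slices.
* These sub-requests are individually addressable for things like cancellation and rethrottling.
* Rethrottling the request with `slices` rethrottles the unfinished sub-requests proportionally.
* Canceling the request with `slices` cancels each sub-request.
* Due to the nature of `slices`, each sub-request will not get a perfectly even portion of the documents. All documents will be addressed, but some slices may be larger than others. Expect larger slices to have a more even distribution.
* Parameters like `requests_per_second` and `size` on a request with `slices` are distributed proportionally to each sub-request. Combined with the point above about uneven distribution, you should conclude that using `size` with `slices` might not result in exactly `size` documents being reindexed.
* Each sub-request gets a slightly different snapshot of the source index, though these are all taken at approximately the same time.

## Picking the number of slices

At this point we have a few recommendations around the number of `slices` to use (the `max` parameter in the slice API if manually parallelizing):

* Don't use large numbers. `500` creates fairly massive CPU thrash.
* From a query performance standpoint, it is more efficient to use some multiple of the number of shards in the source index.
* Using exactly as many slices as there are shards in the source index is the most efficient from a query performance standpoint.
* Indexing performance should scale linearly across available resources with the number of `slices`.
* Whether indexing or query performance dominates the process depends on many factors, such as the documents being reindexed and the cluster doing the reindexing.

## Reindex daily indices

You can use `_reindex` in combination with [Painless](https://aqlu.gitbooks.io/elasticsearch-reference/content/Modules/Scripting/Painless_Scripting_Language.html) to reindex daily indices so that a new template is applied to the existing documents. Assume you have indices consisting of documents like the following:

```
PUT metricbeat-2016.05.30/beat/1?refresh
{"system.cpu.idle.pct": 0.908}
PUT metricbeat-2016.05.31/beat/1?refresh
{"system.cpu.idle.pct": 0.105}
```

The new template for the `metricbeat-*` indices is already loaded into Elasticsearch, but it applies only to newly created indices. Painless can be used to reindex the existing documents and apply the new template.

The script below extracts the date from the index name and creates a new index with `-1` appended. All data from `metricbeat-2016.05.31` will be reindexed into `metricbeat-2016.05.31-1`.

```
POST _reindex
{
  "source": {
    "index": "metricbeat-*"
  },
  "dest": {
    "index": "metricbeat"
  },
  "script": {
    "lang": "painless",
    "inline": "ctx._index = 'metricbeat-' + (ctx._index.substring('metricbeat-'.length(), ctx._index.length())) + '-1'"
  }
}
```

All documents from the previous metricbeat indices can now be found in the `*-1` indices.

```
GET metricbeat-2016.05.30-1/beat/1
GET metricbeat-2016.05.31-1/beat/1
```

The previous method can also be used in combination with [changing the name of a field](https://aqlu.gitbooks.io/elasticsearch-reference/content/Document_APIS/Reindex_API.html#docs-reindex-change-name) to not only load the existing data into the new index but also rename fields if needed.

## Extracting a random subset of an index

Reindex can be used to extract a random subset of an index for testing:

```
POST _reindex
{
  "size": 10,
  "source": {
    "index": "twitter",
    "query": {
      "function_score" : {
        "query" : { "match_all": {} },
        "random_score" : {}
      }
    },
    "sort": "_score" // ①
  },
  "dest": {
    "index": "random_twitter"
  }
}
```

① Reindex defaults to sorting by `_doc`, so `random_score` will not have any effect unless you override the sort to `_score`.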
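The reason the sort override matters can be seen in a small toy model. The Python sketch below is not how Elasticsearch is implemented; it assumes a made-up list of document IDs in index order, gives each a random score (standing in for `random_score`), and then takes the first `size` documents under each sort. The function name `sample_docs` and its `seed` parameter are hypothetical.

```python
import random


def sample_docs(doc_ids, size, sort, seed=None):
    """Toy model of a size-limited reindex scan (illustration only)."""
    rng = random.Random(seed)
    # random_score: every document receives an independent random score.
    scored = [(doc_id, rng.random()) for doc_id in doc_ids]
    if sort == "_doc":
        # Index order: the scores exist but are never consulted.
        ordered = scored
    elif sort == "_score":
        # Highest score first, so the random scores decide who is picked.
        ordered = sorted(scored, key=lambda pair: pair[1], reverse=True)
    else:
        raise ValueError("this sketch only models '_doc' and '_score'")
    return [doc_id for doc_id, _ in ordered[:size]]


by_doc = sample_docs(range(100), 10, "_doc", seed=1)
by_score = sample_docs(range(100), 10, "_score", seed=1)
print(by_doc)    # always the first ten documents, whatever the seed
print(by_score)  # ten distinct documents chosen by their random scores
```

Under `_doc` ordering the same leading documents are copied on every run, which is why the request above sets `"sort": "_score"`.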