Bulk · Elasticsearch: The Definitive Guide (Chinese Edition)

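## Bulk Operations for Updates

Just as `mget` lets us retrieve multiple documents at once, the `bulk` API lets us perform multiple `create`, `index`, `update`, or `delete` operations in a single request. This is especially useful for indexing a data stream such as log events, which can be indexed in order, in batches of hundreds or thousands of documents.

The `bulk` request body has the following, slightly unusual, format:

```Javascript
{ action: { metadata }}\n
{ request body        }\n
{ action: { metadata }}\n
{ request body        }\n
...
```

This format resembles a **stream** of one-line JSON documents joined together by `"\n"` characters. Two important points to note:

* Every line must end with a `"\n"` character, **including the last line**. These newlines are the markers that separate one line from the next.
* A line cannot contain unescaped newline characters, because they would interfere with parsing. This means the JSON must not be pretty-printed.

> Tip:
> The "Bulk Format" chapter explains why the `bulk` API uses this format.

The **action/metadata** line specifies **what action** to perform on **which document**.

The **action** must be one of the following:

| Action   | Description                                                                               |
| -------- | ----------------------------------------------------------------------------------------- |
| `create` | Create a document only if it does not already exist. See "Creating a Document".          |
| `index`  | Create a new document or replace an existing one. See "Indexing a Document" and "Updating a Document". |
| `update` | Apply a partial update to a document. See "Partial Updates".                              |
| `delete` | Delete a document. See "Deleting a Document".                                             |

The **metadata** specifies the `_index`, `_type`, and `_id` of the document to be indexed, created, updated, or deleted. For example, a delete request looks like this:

```Javascript
{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }}
```

The **request body** line consists of the document `_source` itself: the fields and values that the document contains. It is required for `index` and `create` operations, which makes sense: you must supply the document to be indexed. It is also required for `update` operations, where it should contain the same body you would pass to the `update` API (`doc`, `upsert`, `script`, and so on). A `delete` operation takes no **request body**.

```Javascript
{ "create":  { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "My first blog post" }
```

If no `_id` is specified, an ID is generated automatically:

```Javascript
{ "index": { "_index": "website", "_type": "blog" }}
{ "title":    "My second blog post" }
```

Putting it all together, a complete `bulk` request looks like this:

```Javascript
POST /_bulk
{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }} <1>
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "My first blog post" }
{ "index":  { "_index": "website", "_type": "blog" }}
{ "title":    "My second blog post" }
{ "update": { "_index": "website", "_type": "blog", "_id": "123", "_retry_on_conflict" : 3} }
{ "doc" : {"title" : "My updated blog post"} } <2>
```

- <1> Note that the `delete` **action** has no request body; it is followed immediately by another **action**.
- <2> Remember the final newline character.

The Elasticsearch response contains an `items` array that lists the result of each request, in the same order in which we sent them: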
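```Javascript
{
   "took": 4,
   "errors": false, <1>
   "items": [
      {  "delete": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 2,
            "status":   200,
            "found":    true
      }},
      {  "create": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 3,
            "status":   201
      }},
      {  "create": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "EiwfApScQiiy7TIKFxRCTw",
            "_version": 1,
            "status":   201
      }},
      {  "update": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 4,
            "status":   200
      }}
   ]
}
```

- <1> All subrequests completed successfully.

Each subrequest is executed independently, so the failure of one subrequest does not affect the others. If any subrequest fails, the top-level `errors` flag is set to `true` and the error details are reported under the corresponding item:

```Javascript
POST /_bulk
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "Cannot create - it already exists" }
{ "index":  { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "But we can update it" }
```

In the response, we can see that the `create` of document `123` failed because the document already exists, but the subsequent `index` request on `123` succeeded:

```Javascript
{
   "took": 3,
   "errors": true, <1>
   "items": [
      {  "create": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "status":   409, <2>
            "error":    "DocumentAlreadyExistsException <3>
                        [[website][4] [blog][123]: document already exists]"
      }},
      {  "index": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 5,
            "status":   200 <4>
      }}
   ]
}
```

- <1> One or more requests failed.
- <2> The HTTP status code for this request is reported as `409 CONFLICT`.
- <3> The error message explains why the request failed.
- <4> The second request succeeded, with HTTP status code `200 OK`.

This also shows that `bulk` requests are not atomic: they cannot be used to implement transactions. Each request is processed separately, so the success or failure of one request does not interfere with the others.

## Don't Repeat Yourself

You may be bulk-indexing log data into the same `index` and the same `type`. Specifying the same metadata for every document is redundant. Just as with the `mget` API, a `bulk` request accepts a default `/_index` or `/_index/_type` in the URL:

```Javascript
POST /website/_bulk
{ "index": { "_type": "log" }}
{ "event": "User logged in" }
```

You can still override the `_index` and `_type` in the metadata line; where they are not overridden, the values in the URL are used as defaults:

```Javascript
POST /website/log/_bulk
{ "index": {}}
{ "event": "User logged in" }
{ "index": { "_type": "blog" }}
{ "title": "Overriding the default type" }
```

## How Big Is Too Big?

The entire bulk request has to be loaded into memory on the node that receives it, so the bigger the request, the less memory is left for other requests. There is an optimal size for a `bulk` request. Above that size, performance no longer improves and may even drop.

The optimal size is, of course, not a fixed number. It depends entirely on your hardware, the size and complexity of your documents, and your indexing and search load. Fortunately, this **sweet spot** is easy to find: try indexing typical documents in batches of increasing size. When performance starts to drop off, your batch size is too big. A good place to start is with batches of 1,000 to 5,000 documents or, if your documents are very large, with smaller batches.

It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1 kB documents is very different from one thousand 1 MB documents. A good bulk size to aim for is about 5-15 MB.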
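
The two formatting rules (one JSON document per line, every line ending in `"\n"`) and the advice to watch the physical size of each batch can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the helper names `to_bulk_lines` and `batch_by_size` and the 5 MB default are hypothetical choices, not part of Elasticsearch or any client library, and the sketch only builds request bodies without sending them.

```python
import json


def to_bulk_lines(action, metadata, body=None):
    """Serialize one bulk operation as newline-terminated JSON lines.

    Hypothetical helper: json.dumps emits single-line JSON and escapes
    newlines inside strings, so each line is safe for the bulk format.
    """
    lines = [json.dumps({action: metadata})]
    if body is not None:  # a delete action has no request body line
        lines.append(json.dumps(body))
    return "".join(line + "\n" for line in lines)


def batch_by_size(operations, max_bytes=5 * 1024 * 1024):
    """Group serialized operations into request bodies of at most max_bytes.

    The 5 MB default is an assumed starting point; a single operation
    larger than max_bytes is still emitted, alone in its own batch.
    """
    batch, size = [], 0
    for op in operations:
        op_size = len(op.encode("utf-8"))
        if batch and size + op_size > max_bytes:
            yield "".join(batch)
            batch, size = [], 0
        batch.append(op)
        size += op_size
    if batch:
        yield "".join(batch)


meta = {"_index": "website", "_type": "blog", "_id": "123"}
ops = [
    to_bulk_lines("delete", meta),
    to_bulk_lines("create", meta, {"title": "My first blog post"}),
    to_bulk_lines("index", {"_index": "website", "_type": "blog"},
                  {"title": "My second blog post"}),
]

# A tiny limit is used here only to force more than one batch.
bodies = list(batch_by_size(ops, max_bytes=150))
for bulk_body in bodies:
    print(len(bulk_body.encode("utf-8")), "bytes")
```

Each string in `bodies` could then be sent as the body of a separate `POST /_bulk` request. Measuring the encoded byte length rather than the document count keeps batches comparable when document sizes vary widely.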