Grok Processor · Elasticsearch 5.4 Chinese Documentation

# Grok Processor

Original: [https://www.elastic.co/guide/en/elasticsearch/reference/5.3/grok-processor.html](https://www.elastic.co/guide/en/elasticsearch/reference/5.3/grok-processor.html)

Translation: [http://www.apache.wiki/pages/viewpage.action?pageId=10027802](http://www.apache.wiki/pages/viewpage.action?pageId=10027802)

Contributors: [那伊抹微笑](/display/~wangyangting), [ApacheCN](/display/~apachecn), [Apache中文網](/display/~apachechina)

Extracts structured fields from a single text field within a document. You choose which field to extract from and which grok pattern you expect to match. A grok pattern is like a regular expression that also supports reusable aliased expressions.

This tool works well for syslog logs, Apache and other webserver logs, MySQL logs, and in general any log format written for humans rather than for machine consumption. The processor ships with over [120 reusable patterns](https://github.com/elastic/elasticsearch/tree/master/modules/ingest-common/src/main/resources/patterns).

If you need help building patterns to match your logs, the [http://grokdebug.herokuapp.com](http://grokdebug.herokuapp.com/) and [http://grokconstructor.appspot.com/](http://grokconstructor.appspot.com/) applications are quite useful.

### Grok Basics

Grok is built on regular expressions, so any regular expression is also valid in grok. The regular expression library is Oniguruma; the full supported regexp syntax is documented [on the Oniguruma site](https://github.com/kkos/oniguruma/blob/master/doc/RE).

Grok uses this regular expression language to let you name existing patterns and combine them into more complex patterns that match your fields. The syntax for reusing a grok pattern comes in three forms: `%{SYNTAX:SEMANTIC}`, `%{SYNTAX}`, and `%{SYNTAX:SEMANTIC:TYPE}`.

**SYNTAX** is the name of the pattern that will match your text. For example, `3.44` is matched by the `NUMBER` pattern and `55.3.244.1` is matched by the `IP` pattern. The syntax is how you match. `NUMBER` and `IP` are both patterns provided in the default pattern set.

**SEMANTIC** is the identifier you give to the piece of text being matched. For example, `3.44` could be the duration of an event, so you could simply call it `duration`. Likewise, the string `55.3.244.1` might identify the `client` making a request.

**TYPE** is the type you want the named field converted to. `int` and `float` are currently the only types supported for coercion.

For example, you might want to match the following text:

```
3.44 55.3.244.1
```

You may know that the message in this example is a number followed by an IP address. You can match this text with the following grok expression:

```
%{NUMBER:duration} %{IP:client}
```

### Using the Grok Processor in a Pipeline

#### Table 20. Grok Options

| Name | Required | Default | Description |
| --- | --- | --- | --- |
| `field` | yes | - | The field to use for grok expression parsing. |
| `patterns` | yes | - | An ordered list of grok expressions to match and extract named captures with. Returns on the first expression in the list that matches. |
| `pattern_definitions` | no | - | A map of pattern-name and pattern tuples defining custom patterns to be used by the current processor. Patterns matching existing names will override the pre-existing definition. |
| `trace_match` | no | false | When true, `_ingest._grok_match_index` will be inserted into your matched document's metadata with the index into the pattern found in `patterns` that matched. |
| `ignore_missing` | no | false | If `true` and `field` does not exist or is `null`, the processor quietly exits without modifying the document. |

Here is an example of using the provided patterns to extract and name structured fields from a string field in a document:

```
{
  "message": "55.3.244.1 GET /index.html 15824 0.043"
}
```

The pattern for this could be:

```
%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}
```

Here is an example pipeline that processes the above document with grok:

```
{
  "description" : "...",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}"]
      }
    }
  ]
}
```

This pipeline inserts the named captures as new fields in the document, as shown here:
```
{
  "message": "55.3.244.1 GET /index.html 15824 0.043",
  "client": "55.3.244.1",
  "method": "GET",
  "request": "/index.html",
  "bytes": "15824",
  "duration": "0.043"
}
```

Both `bytes` and `duration` are captured as strings because the pattern gives neither of them a TYPE.

### Custom Patterns and Pattern Files

The Grok processor comes pre-packaged with a base set of patterns. These patterns may not always have what you are looking for. Patterns have a very basic format: each entry has a name and the pattern itself.

You can also add your own patterns to a processor definition under the `pattern_definitions` option. Here is an example of a pipeline that specifies custom pattern definitions:

```
{
  "description" : "...",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["my %{FAVORITE_DOG:dog} is colored %{RGB:color}"],
        "pattern_definitions" : {
          "FAVORITE_DOG" : "beagle",
          "RGB" : "RED|GREEN|BLUE"
        }
      }
    }
  ]
}
```

### Providing Multiple Match Patterns

Sometimes one pattern is not enough to capture the potential structure of a field. Suppose we want to match all messages that contain your favorite pet breed of either cat or dog. One way to accomplish this is to provide two distinct patterns, instead of one really complicated expression capturing the same `or` behavior.

Here is an example of such a configuration executed against the simulate API:

```
curl -XPOST 'localhost:9200/_ingest/pipeline/_simulate?pretty' -H 'Content-Type: application/json' -d'
{
  "pipeline": {
    "description" : "parse multiple patterns",
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": ["%{FAVORITE_DOG:pet}", "%{FAVORITE_CAT:pet}"],
          "pattern_definitions" : {
            "FAVORITE_DOG" : "beagle",
            "FAVORITE_CAT" : "burmese"
          }
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "I love burmese cats!"
      }
    }
  ]
}
'
```

The response:

```
{
  "docs": [
    {
      "doc": {
        "_type": "_type",
        "_index": "_index",
        "_id": "_id",
        "_source": {
          "message": "I love burmese cats!",
          "pet": "burmese"
        },
        "_ingest": {
          "timestamp": "2016-11-08T19:43:03.850+0000"
        }
      }
    }
  ]
}
```

Either pattern sets the field `pet` with the appropriate match. But what if we want to trace which pattern matched and populated the field? We can do this with the `trace_match` parameter. Here is the output of the same pipeline, but with `"trace_match": true` configured:
```
{
  "docs": [
    {
      "doc": {
        "_type": "_type",
        "_index": "_index",
        "_id": "_id",
        "_source": {
          "message": "I love burmese cats!",
          "pet": "burmese"
        },
        "_ingest": {
          "_grok_match_index": "1",
          "timestamp": "2016-11-08T19:43:03.850+0000"
        }
      }
    }
  ]
}
```

In the above response, you can see that the index of the pattern that matched was `"1"`. That is, it was the second pattern in `patterns` (the index starts at zero) that matched.

This trace metadata makes it possible to debug which of the patterns matched. The information is stored in the ingest metadata and is not indexed.
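The mechanics described on this page (expanding `%{SYNTAX:SEMANTIC:TYPE}` references, overriding names through `pattern_definitions`, returning on the first matching expression, and reporting its index when `trace_match` is on) can be illustrated with a small Python sketch. This is not how Elasticsearch implements grok. It uses Python's `re` module instead of Oniguruma, the three entries in `BASE_PATTERNS` are simplified stand-ins for the real default patterns, and the names `compile_grok` and `grok` are hypothetical helpers invented for this illustration.

```python
import re

# Assumption: simplified stand-ins for a few default patterns,
# NOT the definitions that ship with Elasticsearch.
BASE_PATTERNS = {
    "NUMBER": r"[+-]?\d+(?:\.\d+)?",
    "IP": r"(?:\d{1,3}\.){3}\d{1,3}",
    "WORD": r"\w+",
}

# Matches %{SYNTAX}, %{SYNTAX:SEMANTIC} and %{SYNTAX:SEMANTIC:TYPE}.
GROK_REF = re.compile(r"%\{(\w+)(?::(\w+))?(?::(int|float))?\}")


def compile_grok(expression, definitions):
    """Expand grok references into a Python regex with named groups."""
    types = {}

    def expand(match):
        syntax, semantic, type_name = match.groups()
        # A definition may itself reference other patterns, so recurse.
        body = GROK_REF.sub(expand, definitions[syntax])
        if semantic is None:
            return f"(?:{body})"
        if type_name:
            types[semantic] = int if type_name == "int" else float
        return f"(?P<{semantic}>{body})"

    return re.compile(GROK_REF.sub(expand, expression)), types


def grok(text, patterns, pattern_definitions=None, trace_match=False):
    """Try each expression in order and return on the first match."""
    # Custom definitions override default patterns of the same name.
    definitions = {**BASE_PATTERNS, **(pattern_definitions or {})}
    for index, expression in enumerate(patterns):
        regex, types = compile_grok(expression, definitions)
        found = regex.search(text)
        if found is None:
            continue
        # Captures stay strings unless a TYPE was given.
        fields = {
            name: types.get(name, str)(value)
            for name, value in found.groupdict().items()
            if value is not None
        }
        meta = {"_grok_match_index": str(index)} if trace_match else {}
        return fields, meta
    raise ValueError("no grok expression matched the field value")


# The expression from Grok Basics, with a TYPE added to show coercion.
print(grok("3.44 55.3.244.1", ["%{NUMBER:duration:float} %{IP:client}"]))

# The two-pattern example, with match tracing enabled.
print(grok(
    "I love burmese cats!",
    ["%{FAVORITE_DOG:pet}", "%{FAVORITE_CAT:pet}"],
    pattern_definitions={"FAVORITE_DOG": "beagle", "FAVORITE_CAT": "burmese"},
    trace_match=True,
))
```

In the second call the first expression fails and the second succeeds, so the sketch reports index `"1"`, mirroring the `_grok_match_index` value in the simulate response above.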