Histogram Aggregation · Elasticsearch 5.4 Chinese Documentation

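# Histogram Aggregation

Original link: [https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-histogram-aggregation.html](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-histogram-aggregation.html)

Translation link: [http://www.apache.wiki/display/Elasticsearch](http://www.apache.wiki/display/Elasticsearch)

Contributors: @于永超, [ApacheCN](/display/~apachecn), [Apache中文網](/display/~apachechina)

## Histogram Aggregation

A multi-bucket, values-source-based aggregation that can be applied to numeric values extracted from documents. It dynamically builds fixed-size (a.k.a. interval) buckets over the values. For example, if the documents have a numeric field holding a price, we can configure this aggregation to dynamically build buckets with an interval of 5 (for prices, this might represent $5). When the aggregation executes, the price field of every document is evaluated and rounded down to its closest bucket. If the price is 32 and the bucket size is 5, the rounding yields 30, so the document "falls" into the bucket associated with the key 30. More formally, the following rounding function is used:

```
bucket_key = Math.floor((value - offset) / interval) * interval + offset
```

The interval must be a positive number, while the offset must be a decimal in the half-open range `[0, interval)`, that is, greater than or equal to 0 and strictly less than the interval.

The following snippet "buckets" the products based on their price, using an interval of 50:

```
POST /sales/_search?size=0
{
    "aggs" : {
        "prices" : {
            "histogram" : {
                "field" : "price",
                "interval" : 50
            }
        }
    }
}
```

The response may look like this:

```
{
    ...
    "aggregations": {
        "prices" : {
            "buckets": [
                { "key": 0.0, "doc_count": 1 },
                { "key": 50.0, "doc_count": 1 },
                { "key": 100.0, "doc_count": 0 },
                { "key": 150.0, "doc_count": 2 },
                { "key": 200.0, "doc_count": 3 }
            ]
        }
    }
}
```

### Minimum document count

The response above shows that no documents have a price in the range [100 - 150). By default, the response fills such gaps in the histogram with empty buckets. This can be changed, and buckets with a higher minimum count can be requested, through the min_doc_count setting:

```
POST /sales/_search?size=0
{
    "aggs" : {
        "prices" : {
            "histogram" : {
                "field" : "price",
                "interval" : 50,
                "min_doc_count" : 1
            }
        }
    }
}
```

Response:

```
{
    ...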
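    "aggregations": {
        "prices" : {
            "buckets": [
                { "key": 0.0, "doc_count": 1 },
                { "key": 50.0, "doc_count": 1 },
                { "key": 150.0, "doc_count": 2 },
                { "key": 200.0, "doc_count": 3 }
            ]
        }
    }
}
```

By default, the histogram returns all buckets within the range of the data itself. That is, the documents with the smallest values (in the field the histogram runs on) determine the min bucket (the bucket with the smallest key), and the documents with the highest values determine the max bucket (the bucket with the highest key). When empty buckets are requested, this often causes confusion, especially when the data is also filtered.

To see why, consider an example. Suppose you filter your request to get all documents with values between 0 and 500, and you also want to slice the data with a histogram whose interval is 50. You specify `"min_doc_count" : 0` because you want all buckets, even the empty ones. If it happens that every product (document) has a price higher than 100, the first bucket you get back will be the one with 100 as its key. This is confusing, because in many cases you also want the buckets between 0 and 100.

With the extended_bounds setting, you can "force" the histogram aggregation to start building buckets at a specific min value and to keep building buckets up to a max value, even if there are no documents left. Using extended_bounds only makes sense when min_doc_count is 0, since empty buckets are never returned when min_doc_count is greater than 0.

Note that, as the name suggests, extended_bounds does not filter buckets. If extended_bounds.min is higher than the values extracted from the documents, the documents still dictate what the first bucket will be (the same goes for extended_bounds.max and the last bucket). To filter buckets, nest the histogram aggregation under a range filter aggregation with the appropriate from/to settings.

Example:

```
POST /sales/_search?size=0
{
    "query" : {
        "constant_score" : { "filter": { "range" : { "price" : { "to" : "500" } } } }
    },
    "aggs" : {
        "prices" : {
            "histogram" : {
                "field" : "price",
                "interval" : 50,
                "extended_bounds" : {
                    "min" : 0,
                    "max" : 500
                }
            }
        }
    }
}
```

### Order

By default, the returned buckets are sorted by their key in ascending order, but the ordering can be controlled with the order setting.

Ordering the buckets by their key, descending:

```
POST /sales/_search?size=0
{
    "aggs" : {
        "prices" : {
            "histogram" : {
                "field" : "price",
                "interval" : 50,
                "order" : { "_key" : "desc" }
            }
        }
    }
}
```

Ordering the buckets by their doc_count, ascending:

```
POST /sales/_search?size=0
{
    "aggs" : {
        "prices" : {
            "histogram" : {
                "field" : "price",
                "interval" : 50,
                "order" : { "_count" : "asc" }
            }
        }
    }
}
```

If the histogram aggregation has a direct metrics sub-aggregation, the latter can determine the order of the buckets:

```
POST /sales/_search?size=0
{
    "aggs" : {
        "prices" : {
            "histogram" : {
                "field" : "price",
                "interval" : 50,
                "order" : { "price_stats.min" : "asc" } #1
            },
            "aggs" : {
                "price_stats" : { "stats" : {"field" : "price"} }
            }
        }
    }
}
```

#1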
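`{ "price_stats.min" : "asc" }` sorts the buckets by the min value of their price_stats sub-aggregation.

It is also possible to order the buckets by a "deeper" aggregation in the hierarchy. This is supported as long as the aggregations in the path are of a single-bucket type; the last aggregation in the path may be either a single-bucket one or a metrics one. If it is a single-bucket type, the order is defined by the number of documents in the bucket (i.e. doc_count). If it is a metrics one, the same rules as above apply: for a multi-value metrics aggregation the path must name the metric to sort by, and for a single-value metrics aggregation the sort is applied to that value.

The path must be defined in the following form:

```
AGG_SEPARATOR       =  '>' ;
METRIC_SEPARATOR    =  '.' ;
AGG_NAME            =  <the name of the aggregation> ;
METRIC              =  <the name of the metric (in case of multi-value metrics aggregation)> ;
PATH                =  <AGG_NAME> [ <AGG_SEPARATOR>, <AGG_NAME> ]* [ <METRIC_SEPARATOR>, <METRIC> ] ;
```

```
POST /sales/_search?size=0
{
    "aggs" : {
        "prices" : {
            "histogram" : {
                "field" : "price",
                "interval" : 50,
                "order" : { "promoted_products>rating_stats.avg" : "desc" }
            },
            "aggs" : {
                "promoted_products" : {
                    "filter" : { "term" : { "promoted" : true }},
                    "aggs" : {
                        "rating_stats" : { "stats" : { "field" : "rating" }}
                    }
                }
            }
        }
    }
}
```

The above sorts the buckets by the average rating among the promoted products.

### Offset

By default, the bucket keys start at 0 and continue in evenly spaced steps of interval. For example, if the interval is 10, the first buckets (assuming there is data in them) are [0 - 10), [10 - 20), [20 - 30). The bucket boundaries can be shifted with the offset option.

This is best illustrated with an example. If there are 10 documents with values ranging from 5 to 14, an interval of 10 results in two buckets with 5 documents each. If an additional offset of 5 is used, there is only one bucket, [5 - 15), containing all 10 documents.

### Response Format

By default, the buckets are returned as an ordered array. It is also possible to request the response as a hash keyed by the bucket keys:

```
POST /sales/_search?size=0
{
    "aggs" : {
        "prices" : {
            "histogram" : {
                "field" : "price",
                "interval" : 50,
                "keyed" : true
            }
        }
    }
}
```

Response:

```
{
    ...
    "aggregations": {
        "prices": {
            "buckets": {
                "0.0": { "key": 0.0, "doc_count": 1 },
                "50.0": { "key": 50.0, "doc_count": 1 },
                "100.0": { "key": 100.0, "doc_count": 0 },
                "150.0": { "key": 150.0, "doc_count": 2 },
                "200.0": { "key": 200.0, "doc_count": 3 }
            }
        }
    }
}
```

### Missing value

The missing parameter defines how documents that are missing a value should be treated. By default they are ignored, but it is also possible to treat them as if they had a value.

```
POST /sales/_search?size=0
{
    "aggs" : {
        "quantity" : {
            "histogram" : {
                "field" : "quantity",
                "interval": 10,
                "missing": 0 #1
            }
        }
    }
}
```

#1 Documents without a value in the quantity field fall into the same bucket as documents that have the value 0.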
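
To make the rounding formula, the offset behaviour, and the missing parameter more concrete, here is a minimal Python sketch that mimics the bucketing on the client side. It is only an illustration under stated assumptions, not how Elasticsearch implements the aggregation: `bucket_key` and `histogram` are hypothetical helper names, and a missing field value is modelled as `None`.

```python
import math
from collections import Counter


def bucket_key(value, interval, offset=0):
    """Round value down to its bucket key using the documented formula."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    if not 0 <= offset < interval:
        raise ValueError("offset must be in [0, interval)")
    return math.floor((value - offset) / interval) * interval + offset


def histogram(values, interval, offset=0, missing=None):
    """Count values per bucket key (hypothetical helper).

    None stands for a document without a value: it is skipped unless
    a substitute is supplied through the missing argument.
    """
    counts = Counter()
    for value in values:
        if value is None:
            if missing is None:
                continue  # default behaviour: ignore the document
            value = missing
        counts[bucket_key(value, interval, offset)] += 1
    return dict(sorted(counts.items()))


# A price of 32 with interval 5 falls into the bucket keyed 30.
print(bucket_key(32, 5))
# Ten values from 5 to 14, interval 10: two buckets of five documents.
print(histogram(range(5, 15), 10))
# Same values with offset 5: a single bucket holding all ten documents.
print(histogram(range(5, 15), 10, offset=5))
# A document without a value is counted as 0 when missing=0.
print(histogram([3, None, 12], 10, missing=0))
```

The sketch only returns buckets that contain at least one value, which roughly corresponds to setting min_doc_count to 1; filling gaps with empty buckets is left out to keep it short.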