=== Preventing Combinatorial Explosions
The `terms` bucket dynamically builds buckets based on your data; it doesn't
know up front how many buckets will be generated. ((("combinatorial explosions, preventing")))((("aggregations", "preventing combinatorial explosions"))) While this is fine with a
single aggregation, think about what can happen when one aggregation contains
another aggregation, which contains another aggregation, and so forth. The combination of
unique values in each of these aggregations can lead to an explosion in the
number of buckets generated.
Imagine we have a modest dataset that represents movies. Each document lists
the actors in that movie:
[source,js]
----
{
  "actors" : [
    "Fred Jones",
    "Mary Jane",
    "Elizabeth Worthing"
  ]
}
----
If we want to determine the top 10 actors and their top costars, that's trivial
with an aggregation:
[source,js]
----
{
  "aggs" : {
    "actors" : {
      "terms" : {
        "field" : "actors",
        "size" : 10
      },
      "aggs" : {
        "costars" : {
          "terms" : {
            "field" : "actors",
            "size" : 5
          }
        }
      }
    }
  }
}
----
This will return a list of the top 10 actors and, for each actor, a list of their
top five costars. This seems like a very modest aggregation; only 50 costar
buckets (nested under 10 actor buckets) will be returned!
However, this seemingly ((("aggregations", "fielddata", "datastructure overview")))innocuous query can easily consume a vast amount of
memory. You can visualize a `terms` aggregation as building a tree in memory.
The `actors` aggregation will build the first level of the tree, with a bucket
for every actor. Then, nested under each node in the first level, the
`costars` aggregation will build a second level, with a bucket for every costar, as seen in <<depth-first-1>>. That means that a single movie will generate n^2^ buckets!
[[depth-first-1]]
.Build full depth tree
image::images/300_120_depth_first_1.svg["Build full depth tree"]
To use some real numbers, imagine each movie has 10 actors on average. Each movie
will then generate 10^2^ = 100 buckets. If you have 20,000 movies, that's
roughly 2,000,000 generated buckets.
Now, remember, our aggregation is simply asking for the top 10 actors and their
costars, totaling about 50 values. To get the final results, we have to generate
that tree of 2,000,000 buckets, sort it, and finally prune it such that only the
top 10 actors are left. This is illustrated in <<depth-first-2>> and <<depth-first-3>>.
[[depth-first-2]]
.Sort tree
image::images/300_120_depth_first_2.svg["Sort tree"]
[[depth-first-3]]
.Prune tree
image::images/300_120_depth_first_3.svg["Prune tree"]
At this point you should be quite distraught. Twenty thousand documents is paltry,
and the aggregation is pretty tame. What if you had 200 million documents, wanted
the top 100 actors and their top 20 costars, as well as the costars' costars?
You can appreciate how quickly combinatorial expansion can grow, making this
strategy untenable. There is not enough memory in the world to support uncontrolled
combinatorial explosions.
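To make the numbers concrete, here is a sketch of what that three-level
aggregation might look like (the `costars_of_costars` name is illustrative; the
field and sizes follow the earlier examples). With an average of 10 actors per
movie, each movie now generates 10^3^ = 1,000 buckets under depth-first
collection, so 200 million movies would produce on the order of 200 billion
buckets before any pruning could happen:

[source,js]
----
{
  "aggs" : {
    "actors" : {
      "terms" : {
        "field" : "actors",
        "size" : 100
      },
      "aggs" : {
        "costars" : {
          "terms" : {
            "field" : "actors",
            "size" : 20
          },
          "aggs" : {
            "costars_of_costars" : {
              "terms" : {
                "field" : "actors",
                "size" : 20
              }
            }
          }
        }
      }
    }
  }
}
----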
==== Depth-First Versus Breadth-First
Elasticsearch allows you to change the _collection mode_ of an aggregation, for
exactly this situation. ((("collection mode"))) ((("aggregations", "preventing combinatorial explosions", "depth-first versus breadth-first")))The strategy we outlined previously--building the tree fully
and then pruning--is called _depth-first_ and it is the default. ((("depth-first collection strategy"))) Depth-first
works well for the majority of aggregations, but can fall apart in situations
like our actors and costars example.
For these special cases, you should use an alternative collection strategy called
_breadth-first_. ((("breadth-first collection strategy")))This strategy works a little differently. It executes the first
layer of aggregations, and _then_ performs a pruning phase before continuing, as illustrated in <<breadth-first-1>> through <<breadth-first-3>>.
In our example, the `actors` aggregation would be executed first. At this
point, we have a single layer in the tree, but we already know who the top 10
actors are! There is no need to keep the other actors since they won't be in
the top 10 anyway.
[[breadth-first-1]]
.Build first level
image::images/300_120_breadth_first_1.svg["Build first level"]
[[breadth-first-2]]
.Sort first level
image::images/300_120_breadth_first_2.svg["Sort first level"]
[[breadth-first-3]]
.Prune first level
image::images/300_120_breadth_first_3.svg["Prune first level"]
Since we already know the top ten actors, we can safely prune away the rest of the
long tail. After pruning, the next layer is populated based on _its_ execution mode,
and the process repeats until the aggregation is done, as illustrated in <<breadth-first-4>>. This prevents the
combinatorial explosion of buckets and drastically reduces memory requirements
for classes of queries that are amenable to breadth-first.
[[breadth-first-4]]
.Populate full depth for remaining nodes
image::images/300_120_breadth_first_4.svg["Step 4: populate full depth for remaining nodes"]
To use breadth-first, simply ((("collect_mode parameter, enabling breadth-first")))enable it via the `collect_mode` parameter:
[source,js]
----
{
  "aggs" : {
    "actors" : {
      "terms" : {
        "field" : "actors",
        "size" : 10,
        "collect_mode" : "breadth_first" <1>
      },
      "aggs" : {
        "costars" : {
          "terms" : {
            "field" : "actors",
            "size" : 5
          }
        }
      }
    }
  }
}
----
<1> Enable `breadth_first` on a per-aggregation basis.
Breadth-first should be used only when you expect more buckets to be generated
than documents landing in the buckets. Breadth-first works by caching
document data at the bucket level, and then replaying those documents to child
aggregations after the pruning phase.
The memory requirement of a breadth-first aggregation is linear in the number
of documents in each bucket prior to pruning. For many aggregations, the
number of documents in each bucket is very large. Think of a histogram with
monthly intervals: you might have thousands or hundreds of thousands of
documents per bucket. This makes breadth-first a bad choice, and is why
depth-first is the default.
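As a concrete sketch of that bad case (the `sold` and `product` field names are
hypothetical), a monthly histogram with a nested `terms` aggregation like the
one below would force breadth-first to cache and replay every document in every
monthly bucket, so the default depth-first mode is the better choice here:

[source,js]
----
{
  "aggs" : {
    "sales_per_month" : {
      "date_histogram" : {
        "field" : "sold",
        "interval" : "month"
      },
      "aggs" : {
        "top_products" : {
          "terms" : {
            "field" : "product",
            "size" : 5
          }
        }
      }
    }
  }
}
----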
But for the actor example--which generates a large number of
buckets, but each bucket has relatively few documents--breadth-first is much
more memory efficient, and allows you to build aggregations that would
otherwise fail.