[Monitoring] Calling aggs queries with max_bucket_size can throw too_many_buckets_exception #59983
Comments
Pinging @elastic/stack-monitoring (Team:Monitoring)
Good stuff @igoristic!! I found this super helpful comment by @polyfractal about how |
Before getting into details, there is discussion ongoing about deprecating or changing the behavior of
++, I think that will help. But note that it is also not sufficient to "protect" the aggregation from tripping the threshold. E.g. you could have two aggs each collect Theoretically the user would need to configure the agg so that the sum of all generated buckets is under the threshold. Sorta doable with
Yes, we should probably be doing that as an extra safety measure. I don't think there's a reason we don't do that... just an oversight when |
Great to know! I didn't know about elastic/elasticsearch#51731 but we'll follow along now. I think, for now, we can open a PR that adds |
Side note: in case the |
Thank you @polyfractal for the very insightful information 🙇
++ @chrisronline I was thinking the same thing after I read the comment you mentioned. Though, I also think we should try to pre-calculate size where possible (this can be a separate PR/issue), since in most cases we just need it to be the same as the count of the items we expect to get back (which should be significantly smaller than |
Where do you see us being able to do this? I'm honestly blanking on a single place where we can leverage this. When we set the |
In the body of the post I gave a little example on how we can do this with indices for example. From our But, I think this should be a separate "improvement" debt, and that |
@chrisronline I've played around with For all we know this might be happening because they have their collection set at a really high rate, or perhaps they increased the monitoring retention period to something beyond what our queries are optimized for. I noticed we have |
I think we might be on different pages here unfortunately.
When thinking this through, I wasn't really concerned about the memory overhead on the Elasticsearch nodes. I'm assuming
If you remove |
@chrisronline Sorry if I miscommunicated; I think we're on the same page. My point was: what do we do with more complex queries that might still trigger the max buckets error due to calculations/buffering, or if we use
GET .monitoring-es-*/_search
{
  "size": 0,
  "sort": {
    "timestamp": {
      "order": "desc"
    }
  },
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "type": "shards"
          }
        }
      ]
    }
  },
  "aggs": {
    "indices": {
      "terms": {
        "field": "shard.index",
        "size": 5,
        "shard_size": 5
      },
      "aggs": {
        "by_date": {
          "date_histogram": {
            "field": "timestamp",
            "min_doc_count": 1,
            "fixed_interval": "30s"
          }
        }
      }
    }
  }
}
Also, I think we should be conscious of the memory/CPU footprint here as well, since I did eventually get my ES to crash with a |
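To make the bucket arithmetic for a query shaped like the one above concrete, here is a minimal TypeScript sketch. `estimateNestedBuckets` is a hypothetical helper, not part of Kibana or Elasticsearch. It assumes every bucket at every level counts toward the limit, and it assumes the data spans a known time range (the query above has no range filter, so the one-hour window below is purely illustrative).

```typescript
// Hypothetical helper (not a Kibana or Elasticsearch API).
// Rough upper bound on the buckets produced by a terms aggregation
// that nests a fixed-interval date_histogram under each term.
function estimateNestedBuckets(
  termsSize: number,
  timeRangeMs: number,
  intervalMs: number
): number {
  if (termsSize < 0 || timeRangeMs < 0 || intervalMs <= 0) {
    throw new RangeError('termsSize and timeRangeMs must be >= 0, intervalMs must be > 0');
  }
  // A range of length R can touch at most floor(R / I) + 1 interval buckets,
  // because the range does not have to start on an interval boundary.
  const histogramBucketsPerTerm = Math.floor(timeRangeMs / intervalMs) + 1;
  // One bucket per term, plus the histogram buckets nested under each term.
  return termsSize + termsSize * histogramBucketsPerTerm;
}

// Assumed inputs: "size": 5 and "fixed_interval": "30s" from the query above,
// over an illustrative one-hour window of monitoring data.
const oneHourMs = 60 * 60 * 1000;
const thirtySecondsMs = 30 * 1000;
const estimatedBuckets = estimateNestedBuckets(5, oneHourMs, thirtySecondsMs);
console.log(estimatedBuckets); // 5 + 5 * 121 = 610
```

Because `min_doc_count` is 1, empty histogram buckets are dropped, so the real count can be lower than this bound. The point is only that the nested histogram, not the outer `size`, dominates the total as the time range grows.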
Yup, good point @igoristic I'm not sure what to do here. I don't know if there is a more appropriate Maybe we wait until there is resolution on elastic/elasticsearch#51731. WDYT?
I don't know if there is a quick/short-term solution. I was thinking for now maybe we could implement the "safest" approach where we decrease the Also, I don't know if it's common sense, but maybe our docs should also convey that if monitoring collection rate and/or monitoring retention is increased so should the |
@igoristic can you schedule a 30 minute call for the 3 of us to go over this?
As an update from our side, someone on the analytics team is currently working on elastic/elasticsearch#51731. Still no planned release (just got started), but we're hoping to resolve it sooner rather than later.
Would greatly appreciate your take on this @chrisronline
Most queries in the Stack Monitoring backend use config.get('monitoring.ui.max_bucket_size') to set the .size property in the aggs. If this property is not set in the config, it defaults to 10,000, which is also the cluster's search.max_buckets. This will cause a too_many_buckets_exception error if there is enough data to trigger it. This approach works fine when the aggregation is expected to yield far fewer data points (with a relatively small shard-count-per-index ratio) than the implied aggs.size, but it breaks once it's the other way around.
I think we should avoid defaulting max_bucket_size to search.max_buckets for aggregation size where possible. In some cases we can even calculate the size if we know approximately how many items we expect to get back. One example is how we query shard stats for indices, e.g.:
Notice how the "field": "shard.index" aggregation has 10000 as its size. Running this on a cluster that has a lot of indices (and shards) will result in the too many buckets error. But changing the aggs size to something like:
(item_count * 1.5 + 10)
while making sure it respects shard_size should be able to return the results with the same accuracy without triggering the error (if I understand this correctly).
To test the case above you will need to simulate a pretty large cluster with lots of big indices (similar to this). You might also want to lower your cluster's search.max_buckets setting (which should also be changed here).
Then you need to obtain the state_uuid, which is taken from our cluster status query:
Be sure to change the timestamp and cluster_uuid to values relevant to your environment. We then collect the list of indices from the following query:
As you can see, we can get the count from this query and then do the aggs.size calculation mentioned above.
We can use a similar approach in other places we use max_bucket_size, and if we truly do need "everything" (and we don't know the count upfront) then we can start looking into composite queries again (revisiting #36358).
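The size calculation proposed above could be sketched roughly as follows. `estimateTermsSize` is a hypothetical name, not an existing Kibana function, and capping the result at the configured `max_bucket_size` is an assumption added here so the computed size never exceeds what the query would have used before. The sketch does not handle `shard_size`.

```typescript
// Hypothetical helper (not existing Kibana code).
// Derives a terms-aggregation size from the number of items we expect back,
// using the (item_count * 1.5 + 10) padding suggested in this issue.
function estimateTermsSize(itemCount: number, maxBucketSize: number): number {
  if (!Number.isInteger(itemCount) || itemCount < 0) {
    throw new RangeError('itemCount must be a non-negative integer');
  }
  if (!Number.isInteger(maxBucketSize) || maxBucketSize <= 0) {
    throw new RangeError('maxBucketSize must be a positive integer');
  }
  // Pad the expected count, rounding up so the size stays an integer.
  const padded = Math.ceil(itemCount * 1.5 + 10);
  // Assumption: never ask for more than the configured max_bucket_size.
  return Math.min(padded, maxBucketSize);
}

// Example: 200 indices reported by the index-list query, default max of 10,000.
const termsSize = estimateTermsSize(200, 10000);
console.log(termsSize); // 200 * 1.5 + 10 = 310
```

With this shape, the `itemCount` would come from the index-list query described above, and a cluster with more items than the cap simply falls back to the current behavior.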