APIs like /_cluster/state Break for Large Clusters due to Response Size Limitations #79560
Labels: >bug, :Distributed Coordination/Cluster Coordination, Team:Clients, Team:Distributed (Obsolete)
ES currently cannot return REST responses larger than 2GB (the max int value, in bytes) because of the way we serialize the messages into `BytesReference` instances. This causes APIs like `/_cluster/state` to stop working eventually (in this case we're talking about ~15k indices with Auditbeat templates when using `?human&pretty`, and an almost 1GB response without those parameters).
Even before outright breaking at the 2GB size limit, requesting a response of this size can destabilize smaller master nodes. This has already been observed for smaller states when concurrent requests come into the mix.
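To make the size limit concrete, here is a minimal sketch of why an int-indexed byte container tops out at `Integer.MAX_VALUE` bytes. It does not use Elasticsearch's actual `BytesReference` code; the class and method names below are hypothetical and only illustrate the arithmetic.

```java
public final class ResponseSizeLimit {
    // Largest length an int-indexed byte container can describe: 2^31 - 1 bytes.
    public static final long MAX_INT_INDEXED_BYTES = Integer.MAX_VALUE;

    /** Returns true if a response of this size can be described by an int length. */
    public static boolean fitsInIntIndexedBuffer(long responseBytes) {
        return responseBytes >= 0 && responseBytes <= MAX_INT_INDEXED_BYTES;
    }

    /** Converts to an int length; throws ArithmeticException on overflow instead of wrapping. */
    public static int toBufferLength(long responseBytes) {
        return Math.toIntExact(responseBytes);
    }

    public static void main(String[] args) {
        long oneGiB = 1L << 30;   // roughly the response size without ?human&pretty
        long threeGiB = 3L << 30; // hypothetical size past the limit

        System.out.println(fitsInIntIndexedBuffer(oneGiB));   // true
        System.out.println(fitsInIntIndexedBuffer(threeGiB)); // false

        // An unchecked cast silently wraps to a negative length.
        System.out.println((int) threeGiB); // -1073741824
    }
}
```

Any length past the limit either has to be rejected (as `Math.toIntExact` does) or it wraps to a nonsensical negative value, which is why the response cannot simply grow beyond this point.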
In practice this is not all that important for most users, because these massive responses have limited usefulness in most cases, but:
- The support diagnostics tool breaks, and/or running it might destabilize the master/cluster.
- Orchestration tooling might hit endpoints like the cluster state endpoint and destabilize/break clusters that way (already observed in the real world).
It is definitely a bug to have endpoints that eventually become unusable or, worse yet, allow a node to be brought down when called.
A likely solution is to remove these endpoints rather than make them work at larger scale, and to force users/tooling to use more specific endpoints for the problem at hand instead.
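As an illustration of what a "more specific" request can already look like for tooling, the sketch below builds a narrowed cluster state URL using the API's metric and index path segments plus the common `filter_path` parameter. This is not the fix proposed here, just an example of requesting less data. The helper class, the base URL, the index name, and the chosen filter are all assumptions for illustration.

```java
import java.net.URI;
import java.util.List;

public final class NarrowClusterStateRequest {
    /**
     * Builds a URI of the form /_cluster/state/{metrics}/{indices}?filter_path=...
     * so a caller fetches only the slice of cluster state it needs.
     */
    public static URI build(String baseUrl, List<String> metrics,
                            List<String> indices, String filterPath) {
        StringBuilder sb = new StringBuilder(baseUrl)
            .append("/_cluster/state/")
            .append(String.join(",", metrics));
        if (!indices.isEmpty()) {
            sb.append('/').append(String.join(",", indices));
        }
        if (filterPath != null && !filterPath.isEmpty()) {
            sb.append("?filter_path=").append(filterPath);
        }
        return URI.create(sb.toString());
    }

    public static void main(String[] args) {
        // Hypothetical host and index name; only metadata for one index is requested.
        URI uri = build("http://localhost:9200",
                        List.of("metadata"),
                        List.of("my-index"),
                        "metadata.indices.*.state");
        System.out.println(uri);
    }
}
```

Limiting both the metric and the target index keeps the response size tied to what the caller actually inspects, rather than to the total number of indices in the cluster.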
relates #77466