APIs like /_cluster/state Break for Large Clusters due to Response Size Limitations #79560

Closed
Tracked by #77466
original-brownbear opened this issue Oct 20, 2021 · 5 comments
Assignees
Labels
>bug :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. Team:Clients Meta label for clients team Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.

Comments

@original-brownbear
Member

original-brownbear commented Oct 20, 2021

ES is currently unable to return REST responses larger than 2GB (the max int value, in bytes) because of the way we serialize the messages into BytesReference instances.
This causes APIs like /_cluster/state to stop working eventually (in this case we're talking about ~15k indices with Auditbeat templates when using ?human&pretty, and an almost 1GB response without those parameters).
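
To illustrate the limit described above, here is a minimal sketch, assuming the cap comes from buffer lengths being held in a Java `int`. The class and method names are hypothetical and not taken from the Elasticsearch code base; the sketch only shows the arithmetic, not the actual serialization path.

```java
public class ResponseSizeLimit {
    // Largest length an int-indexed byte container can report: 2^31 - 1 bytes.
    static final long MAX_INT_INDEXED_BYTES = Integer.MAX_VALUE;

    // Hypothetical helper: true if a response of this size can be described by an int length.
    static boolean fitsInIntIndexedBuffer(long responseBytes) {
        return responseBytes >= 0 && responseBytes <= MAX_INT_INDEXED_BYTES;
    }

    // Narrowing a long length to int keeps only the low 32 bits, so large values wrap.
    static int narrowToInt(long responseBytes) {
        return (int) responseBytes;
    }

    public static void main(String[] args) {
        long oneGiB = 1L << 30;
        System.out.println(fitsInIntIndexedBuffer(oneGiB));      // true
        System.out.println(fitsInIntIndexedBuffer(3 * oneGiB));  // false
        System.out.println(narrowToInt(3 * oneGiB));             // -1073741824
    }
}
```

A length that no longer fits cannot be recovered by a cast; it has to be avoided by never materializing the whole response in one buffer.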

Even before outright breaking due to the 2G size limit, requesting a response of this size can destabilize smaller master nodes. This has already been observed for smaller states when concurrent requests come into the mix.

In practice this is not that important an issue for most users, because these massive responses are rarely useful, but:
One implication is that the support diagnostics tool breaks and/or that running it might destabilize the master/cluster.
Another is orchestration tooling that hits endpoints like the cluster state endpoint and destabilizes/breaks clusters that way (already observed in the real world).

It is definitely a bug to have endpoints that eventually become unusable or, worse yet, can bring down a node when called.
A likely solution is to remove these endpoints rather than make them work at larger scale, forcing users/tooling to use more specific endpoints for the problem at hand.

relates #77466

@original-brownbear original-brownbear added >bug :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. team-discuss labels Oct 20, 2021
@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Oct 20, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@original-brownbear
Member Author

We discussed this and decided to provide these APIs in a way that will scale to larger cluster states by implementing them as a chunked HTTP response.

I'll try to see if I can find a quick solution here.
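
As a rough illustration of the chunked approach mentioned above, the following sketch writes a body using HTTP/1.1 chunked transfer framing (hex size, CRLF, data, CRLF, then a zero-length terminating chunk). It is not the Elasticsearch implementation; `ChunkedWriterSketch` and `writeChunked` are hypothetical names, and the parts would in reality come from serializing the cluster state piece by piece.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.List;

public class ChunkedWriterSketch {
    private static final byte[] CRLF = {'\r', '\n'};

    // Writes each part as one chunk, so only a single part is held in memory at a time.
    static void writeChunked(Iterator<String> parts, OutputStream out) {
        try {
            while (parts.hasNext()) {
                byte[] data = parts.next().getBytes(StandardCharsets.UTF_8);
                if (data.length == 0) {
                    continue; // a zero-length chunk would terminate the body early
                }
                out.write(Integer.toHexString(data.length).getBytes(StandardCharsets.US_ASCII));
                out.write(CRLF);
                out.write(data);
                out.write(CRLF);
            }
            // The last chunk has size zero and is followed by an empty trailer section.
            out.write("0\r\n\r\n".getBytes(StandardCharsets.US_ASCII));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeChunked(List.of("{\"a\":", "1}").iterator(), out);
        System.out.print(new String(out.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

Because the total size never has to be known up front, no single buffer needs to hold the full response, which is what sidesteps the int-length limit.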

@original-brownbear original-brownbear self-assigned this Oct 20, 2021
@sethmlarson sethmlarson added the Team:Clients Meta label for clients team label Oct 28, 2021
@elasticmachine
Collaborator

Pinging @elastic/clients-team (Team:Clients)

@consulthys
Contributor

consulthys commented Nov 17, 2021

How about providing a HEAD /_cluster/state route that would just return the size of the cluster state? That would allow clients to get that number and, based on it, decide whether it's a good idea to download the cluster state or not.

That would also make it possible to chart how the cluster state size evolves over time.

Is that what PR 78816 seeks to provide?

@DaveCTurner
Contributor

I think #89838 subsumes this issue so I'm closing this.


5 participants