APIs like /_cluster/state Break for Large Clusters due to Response Size Limitations #79560

original-brownbear · 2021-10-20T10:18:54Z

ES is currently unable to return REST responses larger than 2GB (the max int value) because of the way we serialize messages into BytesReference instances.
This causes APIs like /_cluster/state to eventually stop working as clusters grow (in this case, at ~15k indices with Auditbeat templates when using ?human&pretty; the response is almost 1GB without those parameters).

Even before breaking outright at the 2GB limit, requesting a response of this size can destabilize smaller master nodes. This has already been observed for smaller states when concurrent requests come into the mix.
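
To make the 2GB ceiling concrete, here is a minimal sketch of the arithmetic only. It assumes the limit comes from lengths being held in a signed 32-bit int, as the "max int value" remark suggests; the helper names are hypothetical, and it does not model how ES actually fails when the limit is hit.

```python
INT_MAX = 2**31 - 1  # value of Java's Integer.MAX_VALUE


def fits_in_int_length(response_bytes: int) -> bool:
    """True if a length of this many bytes is representable as a signed 32-bit int."""
    return 0 <= response_bytes <= INT_MAX


def as_signed_32(n: int) -> int:
    """Wrap n the way a signed 32-bit int overflows (two's complement)."""
    return (n + 2**31) % 2**32 - 2**31


one_gib = 1024**3
print(fits_in_int_length(one_gib))      # a ~1G response still fits
print(fits_in_int_length(2 * one_gib))  # exactly 2 GiB is one byte too many
print(as_signed_32(2 * one_gib))        # the length wraps to a negative number
```

Anything at or above 2^31 bytes cannot be described by a single int-sized length, which is why a response that grows past that point stops being servable from one buffer.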

In practice this is not a pressing issue for most users, because these massive responses are rarely useful, but:
One implication is that the support diagnostics tool breaks and/or that running it might destabilize the master or the whole cluster.
Another is that orchestration tooling might hit endpoints like the cluster state endpoint and destabilize or break clusters that way (already observed in the real world).

It is definitely a bug to have endpoints that eventually become unusable or, worse yet, can bring down a node when called.
A likely solution is to remove these endpoints rather than make them work at larger scale, forcing users and tooling to use more specific endpoints for the problem at hand.

relates #77466

elasticmachine · 2021-10-20T10:18:56Z

Pinging @elastic/es-distributed (Team:Distributed)

original-brownbear · 2021-10-20T14:59:09Z

We discussed this and decided to provide these APIs in a way that will scale to larger cluster states by implementing them as a chunked HTTP response.

I'll try to see if I can find a quick solution here.
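
As a rough illustration of the chunked-response idea, the sketch below serializes a state object one index at a time and wraps each piece in HTTP/1.1 chunked transfer-encoding framing, so no single buffer ever holds the whole body. This is not the Elasticsearch implementation; the function names, the `indices` layout, and the sample index names and fields are all hypothetical.

```python
import json
from typing import Dict, Iterator


def iter_state_json(indices: Dict[str, dict]) -> Iterator[bytes]:
    """Yield a JSON object piece by piece, one index entry at a time."""
    yield b'{"indices":{'
    first = True
    for name, meta in indices.items():
        prefix = b"" if first else b","
        first = False
        yield prefix + json.dumps(name).encode() + b":" + json.dumps(meta).encode()
    yield b"}}"


def frame_chunked(parts: Iterator[bytes]) -> Iterator[bytes]:
    """Wrap each piece in HTTP/1.1 chunked transfer-encoding framing."""
    for part in parts:
        if part:  # a zero-length chunk would end the body prematurely
            yield b"%x\r\n%s\r\n" % (len(part), part)
    yield b"0\r\n\r\n"  # last chunk marks the end of the body


def decode_chunked(body: bytes) -> bytes:
    """Reassemble a chunked body (no trailers or extensions); used for checking."""
    out, pos = bytearray(), 0
    while True:
        eol = body.index(b"\r\n", pos)
        size = int(body[pos:eol], 16)
        if size == 0:
            return bytes(out)
        start = eol + 2
        out += body[start:start + size]
        pos = start + size + 2  # skip the chunk data and its trailing CRLF


# Hypothetical sample data standing in for a real cluster state.
sample = {"auditbeat-000001": {"shards": 1}, "auditbeat-000002": {"shards": 2}}
wire = b"".join(frame_chunked(iter_state_json(sample)))
print(json.loads(decode_chunked(wire)) == {"indices": sample})
```

The point of the design is that memory use is bounded by the largest single piece rather than by the total response size, so the total can exceed what fits in one int-indexed buffer.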

elasticmachine · 2021-10-28T21:59:27Z

Pinging @elastic/clients-team (Team:Clients)

consulthys · 2021-11-17T16:08:58Z

How about providing a HEAD /_cluster/state route that would just return the size of the cluster state? That would allow clients to get that number and decide, based on it, whether it's a good idea to download the cluster state.

That would also make it possible to chart how the cluster state size evolves over time.

Is that what PR 78816 seeks to provide?
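
A minimal sketch of the client-side check this proposal would enable. It assumes the hypothetical HEAD /_cluster/state route reports the size in a Content-Length header, which nothing in this thread confirms; the helper names and the threshold are illustrative only.

```python
from typing import Mapping, Optional


def reported_size(headers: Mapping[str, str]) -> Optional[int]:
    """Return the Content-Length value in bytes, or None if absent or invalid."""
    for name, value in headers.items():
        if name.lower() == "content-length":  # header names are case-insensitive
            try:
                size = int(value)
            except ValueError:
                return None
            return size if size >= 0 else None
    return None


def should_download(headers: Mapping[str, str], max_bytes: int) -> bool:
    """Download only when the size is known and within the caller's budget."""
    size = reported_size(headers)
    return size is not None and size <= max_bytes


# Hypothetical headers as they might come back from the proposed HEAD request.
print(should_download({"Content-Length": "1048576"}, max_bytes=64 * 1024 * 1024))
print(should_download({"content-length": "1900000000"}, max_bytes=64 * 1024 * 1024))
```

Treating an unknown size as "do not download" is a conservative choice here; a client could equally fall back to a more specific endpoint in that case.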

DaveCTurner · 2022-11-21T13:21:37Z

I think #89838 subsumes this issue so I'm closing this.

original-brownbear added the >bug, :Distributed Coordination/Cluster Coordination, and team-discuss labels Oct 20, 2021

elasticmachine added the Team:Distributed (Obsolete) label Oct 20, 2021

This was referenced Oct 20, 2021

Fix Large Shard Count Scalability Issues #77466

Open

Avoid materializing uncompressed HTTP response before compression #73719

Open

original-brownbear removed the team-discuss label Oct 20, 2021

original-brownbear self-assigned this Oct 20, 2021

sethmlarson added the Team:Clients label Oct 28, 2021

original-brownbear mentioned this issue Jan 5, 2022

Sending Large Transport Messages Should be Optimized #82245

Closed

DaveCTurner closed this as completed Nov 21, 2022
