
A Node Joining a Cluster with a Large State Receives the Full Uncompressed State in a ValidateJoinRequest #83204

Closed
original-brownbear opened this issue Jan 27, 2022 · 3 comments · Fixed by #85380
Assignees: DaveCTurner
Labels: >bug, :Distributed Coordination/Cluster Coordination, Team:Distributed (Obsolete)

Comments

original-brownbear (Member) commented Jan 27, 2022

The ValidateJoinRequest contains the cluster state uncompressed.
This causes problems once the cluster state reaches a certain size. For one, it requires a massive amount of memory even after #82608, but it is also too slow: the full state is read outright on the transport thread (unlike the publication handler, which deserialises on GENERIC).

For a 40k-index cluster with Beats mappings and an admittedly large number of data streams, this is what happens:

[2022-01-27T11:35:37,960][WARN ][o.e.t.InboundHandler     ] [elasticsearch-2] handling request [InboundMessage{Header{554386564}{8.1.0}{1239565}{true}{false}{false}{false}{internal:cluster/coordination/join/validate}}] took [7208ms] which is above the warn threshold of [5000ms]

We receive and deserialise a 500M+ message on the transport thread.

This becomes troublesome due to the heap required just to buffer the message on a fresh master node that might otherwise be capable of handling a cluster state of this size (the state is smaller on heap thanks to setting and mapping deduplication).

The slowness on the transport thread can mostly be blamed on the time it takes to read index settings.


This relates to #80493 and setting deduplication in general. Ideally we should find a way of deduplicating the settings better to make the message smaller. Until then, a reasonable solution might be to simply compress the state in the message, read it as plain bytes, and then deserialise it on GENERIC like we do for the publication handler.
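Until that better deduplication exists, the compress-then-defer idea could look roughly like the sketch below. This is a minimal illustration in plain Java, not the actual Elasticsearch transport code: CompressedState, ValidateJoinHandlerSketch and handleValidateJoin are made-up names, and a cached thread pool merely stands in for the GENERIC executor.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

final class CompressedState {
    final byte[] compressedBytes;

    CompressedState(byte[] uncompressedState) throws IOException {
        // Sender side: compress once before putting the state into the request.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream deflater = new DeflaterOutputStream(out)) {
            deflater.write(uncompressedState);
        }
        this.compressedBytes = out.toByteArray();
    }

    byte[] decompress() throws IOException {
        // Receiver side: inflate lazily, ideally never on the transport thread.
        try (InflaterInputStream in =
                 new InflaterInputStream(new ByteArrayInputStream(compressedBytes))) {
            return in.readAllBytes();
        }
    }
}

final class ValidateJoinHandlerSketch {
    // Stand-in for the GENERIC thread pool: the transport thread only hands the
    // compressed bytes over and returns immediately.
    private final ExecutorService generic = Executors.newCachedThreadPool();

    void handleValidateJoin(CompressedState state) {
        generic.execute(() -> {
            try {
                byte[] raw = state.decompress();
                // ... deserialise the cluster state from 'raw' and validate the join here ...
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });
    }
}
```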

An additional issue is that the master/sending node has to serialize this message in full, which potentially puts a problematic amount of strain on it.

@original-brownbear added the >bug and :Distributed Coordination/Cluster Coordination labels Jan 27, 2022
@elasticmachine added the Team:Distributed (Obsolete) label Jan 27, 2022
elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner (Contributor) commented Jan 27, 2022

It gets worse if a load of nodes all try to join at the same time, for instance after a network partition heals, because then the master ends up needing to allocate O(100s of MBs) of buffers for each node. The duplicated serialization work is not totally critical, although it would be nice to solve that too. Sharing the serialised state across concurrent validation requests, like we do with publications, would be good I think.
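A minimal sketch of that sharing idea, assuming the serialised (and compressed) bytes are keyed by cluster state version; SerializedStateCache and its methods are hypothetical, not the Elasticsearch implementation:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class SerializedStateCache {
    private final ConcurrentHashMap<Long, byte[]> byVersion = new ConcurrentHashMap<>();

    /** Serialises at most once per cluster state version, however many nodes are joining. */
    byte[] serializedStateFor(long version, Function<Long, byte[]> serializer) {
        // computeIfAbsent guarantees the expensive serialization runs only once even if
        // many nodes try to join concurrently, e.g. after a partition heals.
        return byVersion.computeIfAbsent(version, serializer);
    }

    /** Drop cached copies of stale versions to bound memory usage. */
    void evictOlderThan(long currentVersion) {
        byVersion.keySet().removeIf(version -> version < currentVersion);
    }
}
```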

A few months back I also took a look at ways to bound the memory usage to something more sensible in both validation and publication. I think we could buffer the serialised representation of the cluster state on disk, send it in bounded-size chunks, accumulate the chunks on disk on the receiver, and then stream the state straight from disk. I'm not sure how valuable this would be: if we can't even hold O(1) copy of the serialised cluster state in memory then we're probably close to breaking point in other ways too. I see other advantages to chunking without the disk buffering too, e.g. on some other failure it would let us bail out part-way through rather than continuing to push 100s of MBs of unnecessary data over the wire.
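A rough illustration of the bounded-chunk idea, again in plain Java rather than the real transport layer: the sender streams fixed-size chunks from a spool file, and the receiver appends each chunk to its own spool file instead of accumulating one huge heap buffer. ChunkedStateTransfer, the chunk size and the OutputStream standing in for the wire are all placeholders.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ChunkedStateTransfer {
    static final int CHUNK_SIZE = 1 << 20; // 1 MiB per chunk keeps buffers bounded

    /** Sender: read the spooled serialised state and push it one bounded chunk at a time. */
    static void sendInChunks(Path serializedState, OutputStream wire) throws IOException {
        byte[] buffer = new byte[CHUNK_SIZE];
        try (InputStream in = Files.newInputStream(serializedState)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                wire.write(buffer, 0, read);
            }
        }
    }

    /** Receiver: append each incoming chunk to a spool file instead of buffering it on heap. */
    static void receiveChunk(Path spoolFile, byte[] chunk) throws IOException {
        Files.write(spoolFile, chunk, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```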

original-brownbear (Member, Author) commented:

> if we can't even hold O(1) copy of the serialised cluster state in memory then we're probably close to breaking point in other ways too.

Currently the serialized, uncompressed state is actually significantly larger than the on-heap state, because the (index-)setting deduplication only helps on heap. The on-heap state in the example cluster I used here is about 1/10 (50M) of those 500M.

The real problem here is the serialisation. The state barely grows as more indices are added, thanks to the deduplication we do at this point. But the wire format only takes advantage of that for mappings, not for settings, which hurt more than mappings to begin with because we store mappings serialized and compressed.

I think ideally we'd do all of the above to fix this: compress the message (I didn't try it out, but I think that gets us about 80% heap savings in this example) and add some form of chunked sending. This relates to the discussion I'd like to have on #82245, I think. If we were able to serialize and flush in chunks, even without adjusting the wire format, and also read in chunks, this would be a much smaller issue.
If we also had a way to serialize smarter and not send all those settings over and over in the first place, that would be even better, because it would remove the issue of burning 7s of CPU time reading a bunch of duplicate strings from the wire (even with a better format, something like 50M in one go isn't great, so I think we need both: better networking and a nicer serialisation format).
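For the "serialize smarter" direction, a dictionary-style encoding is one possibility: write each distinct settings object once up front and have every index refer to it by ordinal, much like the existing mapping deduplication. The sketch below is purely illustrative; a Map<String, String> stands in for Settings and SettingsDictionaryWriter is not a real class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SettingsDictionaryWriter {
    private final Map<Map<String, String>, Integer> ordinals = new HashMap<>();
    private final List<Map<String, String>> dictionary = new ArrayList<>();

    /** Returns the ordinal to write for this index; unseen settings are appended to the dictionary. */
    int ordinalFor(Map<String, String> indexSettings) {
        return ordinals.computeIfAbsent(indexSettings, settings -> {
            dictionary.add(settings);
            return dictionary.size() - 1;
        });
    }

    /** Written once at the start of the message; each index then carries only a small integer. */
    List<Map<String, String>> dictionary() {
        return dictionary;
    }
}
```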

@DaveCTurner DaveCTurner self-assigned this Mar 28, 2022
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Mar 28, 2022
Fixes a few scalability issues around join validation:

- compresses the cluster state sent over the wire
- shares the serialized cluster state across multiple nodes
- forks the decompression/deserialization work off the transport thread

Relates elastic#77466
Closes elastic#83204
DaveCTurner added a commit that referenced this issue Apr 26, 2022