
A Node Joining a Cluster with a Large State Receives the Full Uncompressed State in a ValidateJoinRequest #83204

Closed
original-brownbear opened this issue Jan 27, 2022 · 3 comments · Fixed by #85380
Assignees: DaveCTurner
Labels: >bug, :Distributed Coordination/Cluster Coordination, Team:Distributed (Obsolete)

Comments

original-brownbear (Member) commented Jan 27, 2022

The ValidateJoinRequest contains the cluster state uncompressed.
This causes problems once the cluster state reaches a certain size. For one, it requires a massive amount of memory even after #82608, but it is also too slow: the full state is read outright on the transport thread (unlike the publication handler, which deserialises on GENERIC).

For a 40k-index cluster with Beats mappings and an admittedly large number of data streams, this is what happens:

[2022-01-27T11:35:37,960][WARN ][o.e.t.InboundHandler     ] [elasticsearch-2] handling request [InboundMessage{Header{554386564}{8.1.0}{1239565}{true}{false}{false}{false}{internal:cluster/coordination/join/validate}}] took [7208ms] which is above the warn threshold of [5000ms]

We receive and deserialise a 500M+ message on the transport thread.

This becomes troublesome due to the heap required just to buffer the message on a fresh master node that might otherwise be capable of handling a cluster state of this size (the state is smaller on heap thanks to setting and mapping deduplication).

The slowness on the transport thread can mostly be blamed on the time it takes to read index settings.


This relates to #80493 and setting deduplication in general. Ideally we should find a way of deduplicating the settings better to make the message smaller. Until then, a reasonable solution might be to simply compress the state in the message, read it as plain bytes, and then deserialise it on GENERIC like we do for the publication handler.
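Until that better deduplication exists, the compress-then-defer idea could look roughly like the sketch below. This is a minimal illustration in plain Java, not the actual Elasticsearch transport code: CompressedState, ValidateJoinHandlerSketch and handleValidateJoin are made-up names, and a cached thread pool merely stands in for the GENERIC executor.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

final class CompressedState {
    final byte[] compressedBytes;

    CompressedState(byte[] uncompressedState) throws IOException {
        // Sender side: compress once before putting the state into the request.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream deflater = new DeflaterOutputStream(out)) {
            deflater.write(uncompressedState);
        }
        this.compressedBytes = out.toByteArray();
    }

    byte[] decompress() throws IOException {
        // Receiver side: inflate lazily, ideally never on the transport thread.
        try (InflaterInputStream in =
                 new InflaterInputStream(new ByteArrayInputStream(compressedBytes))) {
            return in.readAllBytes();
        }
    }
}

final class ValidateJoinHandlerSketch {
    // Stand-in for the GENERIC thread pool: the transport thread only hands the
    // compressed bytes over and returns immediately.
    private final ExecutorService generic = Executors.newCachedThreadPool();

    void handleValidateJoin(CompressedState state) {
        generic.execute(() -> {
            try {
                byte[] raw = state.decompress();
                // ... deserialise the cluster state from 'raw' and validate the join here ...
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });
    }
}
```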

An additional issue is that the master/sending node has to serialize this message in full, which potentially puts a problematic amount of strain on it.

@original-brownbear added the >bug and :Distributed Coordination/Cluster Coordination labels Jan 27, 2022
@elasticmachine added the Team:Distributed (Obsolete) label Jan 27, 2022
elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner (Contributor) commented Jan 27, 2022

It gets worse if a load of nodes all try to join at the same time, for instance after a network partition heals, because then the master ends up needing to allocate O(100s of MBs) of buffers for each node. The duplicated serialization work is not totally critical, although it would be nice to solve that too. Sharing the serialised state across concurrent validation requests, like we do with publications, would be good I think.
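A minimal sketch of that sharing idea, assuming the serialised (and compressed) bytes are keyed by cluster state version; SerializedStateCache and its methods are hypothetical, not the Elasticsearch implementation:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class SerializedStateCache {
    private final ConcurrentHashMap<Long, byte[]> byVersion = new ConcurrentHashMap<>();

    /** Serialises at most once per cluster state version, however many nodes are joining. */
    byte[] serializedStateFor(long version, Function<Long, byte[]> serializer) {
        // computeIfAbsent guarantees the expensive serialization runs only once even if
        // many nodes try to join concurrently, e.g. after a partition heals.
        return byVersion.computeIfAbsent(version, serializer);
    }

    /** Drop cached copies of stale versions to bound memory usage. */
    void evictOlderThan(long currentVersion) {
        byVersion.keySet().removeIf(version -> version < currentVersion);
    }
}
```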

A few months back I also took a look at ways to bound the memory usage to something more sensible in both validation and publication. I think we could buffer the serialised representation of the cluster state on disk, send it in bounded-size chunks, accumulate the chunks on disk on the receiver, and then stream the state straight from disk. I'm not sure how valuable this would be: if we can't even hold O(1) copy of the serialised cluster state in memory then we're probably close to breaking point in other ways too. I see other advantages to chunking without the disk buffering too, e.g. on some other failure it would let us bail out part-way through rather than continuing to push 100s of MBs of unnecessary data over the wire.
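A rough illustration of the bounded-chunk idea, again in plain Java rather than the real transport layer: the sender streams fixed-size chunks from a spool file, and the receiver appends each chunk to its own spool file instead of accumulating one huge heap buffer. ChunkedStateTransfer, the chunk size and the OutputStream standing in for the wire are all placeholders.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ChunkedStateTransfer {
    static final int CHUNK_SIZE = 1 << 20; // 1 MiB per chunk keeps buffers bounded

    /** Sender: read the spooled serialised state and push it one bounded chunk at a time. */
    static void sendInChunks(Path serializedState, OutputStream wire) throws IOException {
        byte[] buffer = new byte[CHUNK_SIZE];
        try (InputStream in = Files.newInputStream(serializedState)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                wire.write(buffer, 0, read);
            }
        }
    }

    /** Receiver: append each incoming chunk to a spool file instead of buffering it on heap. */
    static void receiveChunk(Path spoolFile, byte[] chunk) throws IOException {
        Files.write(spoolFile, chunk, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```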

original-brownbear (Member, Author) commented:

> if we can't even hold O(1) copy of the serialised cluster state in memory then we're probably close to breaking point in other ways too.

Currently the serialized, uncompressed state is actually significantly larger than the on-heap state, because the (index-)setting deduplication only helps on heap. The on-heap state in the example cluster I used here is about 1/10 (50M) of those 500M.

The real problem here is the serialisation. The state barely grows as more indices are added, thanks to the deduplication we do at this point. But the wire format only takes advantage of that for mappings, not for settings, which hurt more than mappings to begin with because we store mappings serialized and compressed.

I think ideally we'd do all of the above to fix this: compress the message (I didn't try it out, but I think that gets us about 80% heap savings in this example) and add some form of chunked sending. This relates to the discussion I'd like to have on #82245, I think. If we were able to serialize and flush in chunks, even without adjusting the wire format, and also read in chunks, this would be a much smaller issue.
If we also had a way to serialize smarter and not send all those settings over and over in the first place, that would be even better, because it would remove the issue of burning 7s of CPU time reading a bunch of duplicate strings from the wire (even with a better format, something like 50M in one go isn't great, so I think we need both: better networking and a nicer serialisation format).
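For the "serialize smarter" direction, a dictionary-style encoding is one possibility: write each distinct settings object once up front and have every index refer to it by ordinal, much like the existing mapping deduplication. The sketch below is purely illustrative; a Map<String, String> stands in for Settings and SettingsDictionaryWriter is not a real class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SettingsDictionaryWriter {
    private final Map<Map<String, String>, Integer> ordinals = new HashMap<>();
    private final List<Map<String, String>> dictionary = new ArrayList<>();

    /** Returns the ordinal to write for this index; unseen settings are appended to the dictionary. */
    int ordinalFor(Map<String, String> indexSettings) {
        return ordinals.computeIfAbsent(indexSettings, settings -> {
            dictionary.add(settings);
            return dictionary.size() - 1;
        });
    }

    /** Written once at the start of the message; each index then carries only a small integer. */
    List<Map<String, String>> dictionary() {
        return dictionary;
    }
}
```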

@DaveCTurner DaveCTurner self-assigned this Mar 28, 2022
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Mar 28, 2022
Fixes a few scalability issues around join validation:

- compresses the cluster state sent over the wire
- shares the serialized cluster state across multiple nodes
- forks the decompression/deserialization work off the transport thread

Relates elastic#77466
Closes elastic#83204
DaveCTurner added a commit that referenced this issue Apr 26, 2022