Recycle pages used by outgoing publications #77317

Conversation

DaveCTurner
Contributor

Today `PublicationTransportHandler.PublicationContext` allocates a bunch
of memory for serialized cluster states and diffs, but it uses a plain
`BytesStreamOutput` which means that the backing pages are allocated by
the `BigArrays#NON_RECYCLING_INSTANCE`. With this commit we pass in a
proper `BigArrays` so that the pages being used can be recycled.
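
For illustration, a minimal sketch (not the actual `PublicationContext` code) of the allocation difference described above; the package names follow the 7.x codebase, and the surrounding wiring (`bigArrays`, `clusterState`) is assumed to be in scope:

```java
// Sketch only: contrasts the non-recycling and recycling serialization paths.
import java.io.IOException;

import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.common.io.stream.BytesStreamOutput;
import org.elasticsearch.common.io.stream.ReleasableBytesStreamOutput;
import org.elasticsearch.common.util.BigArrays;

class PublicationSerializationSketch {

    // Before: a plain BytesStreamOutput falls back to BigArrays.NON_RECYCLING_INSTANCE,
    // so its backing pages are never returned to a pool, only garbage-collected.
    static void serializeWithoutRecycling(ClusterState clusterState) throws IOException {
        BytesStreamOutput out = new BytesStreamOutput();
        clusterState.writeTo(out);
    }

    // After: passing a real BigArrays means the pages come from the shared recycler
    // and are handed back when the output is closed.
    static void serializeWithRecycling(ClusterState clusterState, BigArrays bigArrays) throws IOException {
        try (ReleasableBytesStreamOutput out = new ReleasableBytesStreamOutput(bigArrays)) {
            clusterState.writeTo(out);
        }
    }
}
```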
@DaveCTurner DaveCTurner added >enhancement :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.0.0 v7.16.0 labels Sep 6, 2021
@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Sep 6, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner DaveCTurner removed the request for review from original-brownbear September 6, 2021 16:43
@DaveCTurner DaveCTurner marked this pull request as draft September 6, 2021 16:43
@DaveCTurner
Contributor Author

Hmm, the failure is a genuine problem: we simulate node reboots in `CoordinatorTests` and we block any further activity on the rebooted instance, but we don't release any memory it allocated.

@original-brownbear
Member

Thanks David, I always wanted this :)

> Hmm, the failure is a genuine problem: we simulate node reboots in `CoordinatorTests` and we block any further activity on the rebooted instance, but we don't release any memory it allocated.

I remember us running into this when you tried this previously. Maybe it's OK to leave this as a TODO and just not use pooled buffers in tests that block all activity on a node like that? It's not the nicest solution, but realistically, if we want to block all activity on a node then that means no releasing of memory, and we do have all kinds of tests for all kinds of failure scenarios/disconnects/you-name-it that provide coverage for this.

I suppose we could invest more effort and create another version of `BigArrays` that gives us more fine-grained control over this and selectively releases all memory for a blocked node, but that seems quite cumbersome?

@DaveCTurner
Contributor Author

It's actually not that bad, it seems; see DaveCTurner@f5ce155#diff-0681cc9adce47fddd18bdfb39428f871bb4b592a40829fd6b58eeea7e1495358R1533-R1537. That's not all, though: we also have to clean up messages that went to a blackhole (waiting a day works but makes the tests super-slow), and there's another blackhole for messages that get sent to a node that has rebooted. I'm not sure what else I'll find yet; mostly I'm just working through failures where the cluster gets left in a broken state by the test.

- Track blackholed requests for timely completion
- Deliver blackholed requests at end of runRandomly
- Release memory from rebooted nodes
- Complete messages delivered to rebooted nodes
- Heal cluster from disruptions before end of tests
@DaveCTurner DaveCTurner force-pushed the 2021-09-06-recycle-cluster-state-publication-memory branch from f5ce155 to 7b4961d on September 7, 2021 08:16
@DaveCTurner
Contributor Author

The failure is #58946. @elasticmachine please run elasticsearch-ci/part-1

@DaveCTurner DaveCTurner marked this pull request as ready for review September 7, 2021 12:29
@DaveCTurner
Contributor Author

I've run the `CoordinatorTests` over 100 times forcing the use of `MockBigArrays`, without seeing any further leaks or other failures, so I believe this is ready for review now.
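
For context, `MockBigArrays` is the test-framework `BigArrays` that records every page it hands out and fails the test if any page is left unreleased. A minimal sketch of how such a leak-tracking instance is typically constructed, assuming the 7.x test classes (the exact wiring inside `CoordinatorTests` may differ):

```java
// Sketch only: builds a leak-tracking BigArrays the way the Elasticsearch test
// framework usually does; not the actual CoordinatorTests wiring.
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.util.BigArrays;
import org.elasticsearch.common.util.MockBigArrays;
import org.elasticsearch.common.util.MockPageCacheRecycler;
import org.elasticsearch.indices.breaker.NoneCircuitBreakerService;

class LeakTrackingBigArraysSketch {

    // Every page handed out by this instance is tracked, so the framework's leak
    // checks fail the test if any page is still unreleased when the test finishes.
    static BigArrays leakTrackingBigArrays() {
        return new MockBigArrays(new MockPageCacheRecycler(Settings.EMPTY), new NoneCircuitBreakerService());
    }
}
```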

Member

@original-brownbear original-brownbear left a comment


I spent quite some time thinking this one through and couldn't find a leak, I think. This looks really nice, thanks David!

Just one quick question on the concurrency handling (mainly to make sure I'm not missing any detail :)).

    if (bytes == null) {
        try {
            bytes = serializeFullClusterState(newState, destination.getVersion());
-           serializedStates.put(destination.getVersion(), bytes);
+           final ReleasableBytesReference existingBytes = serializedStates.putIfAbsent(destination.getVersion(), bytes);
Member


I wonder, can't we just do `computeIfAbsent` here instead and block on concurrent access? This is all on the generic pool anyway if there's contention, isn't it? In any case, no matter which thread this runs on, blocking on the locked CHM key seems cheaper on the system overall than serializing twice. (It's an edge case either way, but using `computeIfAbsent` is less code here, I think? :)

Contributor Author


Yeah, that's a good point. I wasn't sure whether `computeIfAbsent` sometimes ran the supplier twice, but you made me go and read the docs, and they say it won't, so it should be fine. It's definitely a corner case of a corner case to be blocking here, but I suppose we may as well not burn CPU while we wait, given that we'll be waiting about as long either way. There's no risk of deadlock either.
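
To make the trade-off concrete, here is a small illustrative sketch (the names stand in for the PR's `serializedStates` map, not its actual types): `ConcurrentHashMap#computeIfAbsent` applies the mapping function at most once per key and simply blocks concurrent callers for that key, whereas the `putIfAbsent` pattern lets both racing threads serialize and forces the loser to discard the copy it just built.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class SerializedStateCacheSketch<K, V> {

    private final ConcurrentHashMap<K, V> serializedStates = new ConcurrentHashMap<>();

    // putIfAbsent style: two racing threads may both pay the serialization cost,
    // and the loser has to discard (and release) the copy it just built.
    V getOrSerializeWithPutIfAbsent(K version, Function<K, V> serialize) {
        V bytes = serializedStates.get(version);
        if (bytes == null) {
            bytes = serialize.apply(version);
            V existing = serializedStates.putIfAbsent(version, bytes);
            if (existing != null) {
                // Lost the race: the freshly serialized copy would be released here.
                bytes = existing;
            }
        }
        return bytes;
    }

    // computeIfAbsent style: the map runs the mapping function at most once per key,
    // so a concurrent caller blocks briefly instead of serializing the same state twice.
    V getOrSerializeWithComputeIfAbsent(K version, Function<K, V> serialize) {
        return serializedStates.computeIfAbsent(version, serialize);
    }
}
```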

Contributor Author


OK, see 13bfe0c (it triggered a few more changes/simplifications too).

Member

@original-brownbear original-brownbear left a comment


LGTM, if CI is happy I'm happy :) thanks David!

@DaveCTurner DaveCTurner merged commit 8b50fcd into elastic:master Sep 8, 2021
@DaveCTurner DaveCTurner deleted the 2021-09-06-recycle-cluster-state-publication-memory branch September 8, 2021 06:46
DaveCTurner added a commit that referenced this pull request Sep 8, 2021
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Sep 8, 2021
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Sep 8, 2021
elasticsearchmachine pushed a commit that referenced this pull request Sep 13, 2021
* Recycle pages used by outgoing publications (#77317)

Today `PublicationTransportHandler.PublicationContext` allocates a bunch
of memory for serialized cluster states and diffs, but it uses a plain
`BytesStreamOutput` which means that the backing pages are allocated by
the `BigArrays#NON_RECYCLING_INSTANCE`. With this commit we pass in a
proper `BigArrays` so that the pages being used can be recycled.

* Enable stricter tests always

* Fix one

* Better assertions (and longer run) in testSingleNodeDiscoveryStabilisesEvenWhenDisrupted

* Clean up warnings
elasticsearchmachine pushed a commit that referenced this pull request Sep 13, 2021