Recycle pages used by outgoing publications #77317
Conversation
Today `PublicationTransportHandler.PublicationContext` allocates a bunch of memory for serialized cluster states and diffs, but it uses a plain `BytesStreamOutput` which means that the backing pages are allocated by the `BigArrays#NON_RECYCLING_INSTANCE`. With this commit we pass in a proper `BigArrays` so that the pages being used can be recycled.
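Roughly, the change amounts to constructing the serialization stream from a pooled `BigArrays` instance instead of the default non-recycling one. Below is a minimal sketch of the idea, assuming the 7.x-era `ReleasableBytesStreamOutput` and `ReleasableBytesReference` APIs; the helper method and the way `bigArrays` is injected are illustrative stand-ins, not the PR's actual code:

```java
import java.io.IOException;

import org.elasticsearch.common.bytes.ReleasableBytesReference;
import org.elasticsearch.common.io.stream.ReleasableBytesStreamOutput;
import org.elasticsearch.common.util.BigArrays;

// Illustrative sketch only (not the PR diff): serialize into pooled pages that can
// be returned to the BigArrays recycler once the publication is done with them.
class PublicationSerializationSketch {

    static ReleasableBytesReference serialize(BigArrays bigArrays) throws IOException {
        // Pages now come from the injected (pooled) BigArrays rather than
        // BigArrays.NON_RECYCLING_INSTANCE, which a plain BytesStreamOutput uses.
        final ReleasableBytesStreamOutput out = new ReleasableBytesStreamOutput(bigArrays);
        boolean success = false;
        try {
            // ... write the full cluster state or diff to `out` here ...
            // Hand ownership of the pages to the returned reference; the caller
            // releases it (and thus the pages) when the publication completes.
            final ReleasableBytesReference bytes = new ReleasableBytesReference(out.bytes(), out);
            success = true;
            return bytes;
        } finally {
            if (success == false) {
                out.close(); // return the pooled pages if serialization failed
            }
        }
    }
}
```

The key point is that the pages are only freed when whoever holds the `ReleasableBytesReference` releases it, which is why the test-framework changes discussed below have to make sure blackholed or rebooted nodes still complete, and therefore release, their in-flight publication messages.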
Pinging @elastic/es-distributed (Team:Distributed)
Hmm, the failure is a genuine problem: we simulate node reboots in …
Thanks David, I always wanted this :)
I remember us running into this when you tried this previously. Maybe it's OK to leave this as a TODO and just not use pooled buffers in tests that block all activity on a node like that? Not the nicest solution, but realistically, if we want to block all activity on a node then that means no releasing of memory, and we do have all kinds of tests for all kinds of failure scenarios/disconnects/you-name-it that provide coverage for this. I suppose we could invest more effort and create another version of …
It's actually not that bad, it seems; see DaveCTurner@f5ce155#diff-0681cc9adce47fddd18bdfb39428f871bb4b592a40829fd6b58eeea7e1495358R1533-R1537. That's not all, though: we also have to clean up messages that went to a blackhole (waiting a day works but makes the tests super-slow), and there's another blackhole for messages that get sent to a node that has rebooted. Not sure what else I'll find yet; mostly I'm just working through failures where the cluster gets left in a broken state by the test.
- Track blackholed requests for timely completion
- Deliver blackholed requests at end of runRandomly
- Release memory from rebooted nodes
- Complete messages delivered to rebooted nodes
- Heal cluster from disruptions before end of tests
Force-pushed from f5ce155 to 7b4961d
Failure is #58946. @elasticmachine please run elasticsearch-ci/part-1
I've run through the …
I spent quite some time thinking this one through and couldn't find a leak, I think. This looks really nice, thanks David!
Just one quick question on the concurrency handling (mainly to make sure I'm not missing any detail :)).
```diff
 if (bytes == null) {
     try {
         bytes = serializeFullClusterState(newState, destination.getVersion());
-        serializedStates.put(destination.getVersion(), bytes);
+        final ReleasableBytesReference existingBytes = serializedStates.putIfAbsent(destination.getVersion(), bytes);
```
I wonder, can't we just do `computeIfAbsent` here instead and block on concurrent access? This is all on the generic pool anyway if there's contention, isn't it? In any case, no matter which thread this runs on, blocking on the locked CHM key seems cheaper for the system overall than serializing twice. (It's an edge case either way, but using `computeIfAbsent` is less code here, I think? :)
Yeah, that's a good point. I wasn't sure whether `computeIfAbsent` sometimes ran the supplier twice, but you made me go and read the docs, and they say that it won't, so it should be fine. It's definitely a corner case of a corner case to be blocking here, but I suppose we may as well not burn CPU while we wait, given that we'll be waiting about as long either way. No risk of deadlock either.
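For reference, here is a minimal sketch of the `computeIfAbsent` variant being discussed, reusing the names from the diff above; the exception handling is illustrative (and assumes the serialization call can throw `IOException`), since the mapping function cannot throw checked exceptions:

```java
// Sketch only: concurrent senders targeting the same wire version block on the
// ConcurrentHashMap entry, so the full cluster state is serialized at most once.
final ReleasableBytesReference bytes = serializedStates.computeIfAbsent(
    destination.getVersion(),
    version -> {
        try {
            return serializeFullClusterState(newState, version);
        } catch (IOException e) {
            // illustrative: surface serialization failures as unchecked
            throw new java.io.UncheckedIOException(e);
        }
    });
```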
Ok see 13bfe0c (it triggered a few more changes/simplifications too)
LGTM, if CI is happy I'm happy :) thanks David!
This reverts commit 8b50fcd.
* Recycle pages used by outgoing publications (#77317)

  Today `PublicationTransportHandler.PublicationContext` allocates a bunch of memory for serialized cluster states and diffs, but it uses a plain `BytesStreamOutput` which means that the backing pages are allocated by the `BigArrays#NON_RECYCLING_INSTANCE`. With this commit we pass in a proper `BigArrays` so that the pages being used can be recycled.

* Enable stricter tests always
* Fix one
* Better assertions (and longer run) in testSingleNodeDiscoveryStabilisesEvenWhenDisrupted
* Clean up warnings