
8343704: Bad GC parallelism with processing Cleaner queues #22043

Closed · wants to merge 19 commits

Conversation

@shipilev (Member) commented Nov 12, 2024

See the bug for more discussion and a reproducer. This PR replaces the ad-hoc linked list with a segmented list of arrays. Arrays are easy targets for GC. There are possible improvements here; the most glaring is parallelism, which is currently knee-capped by global synchronization. The synchronization scheme follows what we have in the original code, and I think it is safer to continue with it right now.

I'll put performance data in a separate comment.
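To make the data structure concrete, here is a minimal sketch of the segmented-list idea, under assumptions: it is illustrative, not the actual patch, and all names and the NODE_CAPACITY value here are made up (only the constant's name appears later in this thread). Entries live in flat reference arrays, so GC threads can scan whole slabs in parallel instead of chasing a pointer chain one node at a time; each element remembers its slot, keeping removal O(1) and the slabs dense.

```java
final class SegmentedList {
    static final int NODE_CAPACITY = 4096;          // per-node slab size (assumed value)

    static final class Node {
        final Element[] slots = new Element[NODE_CAPACITY];
        Node next;
    }

    static class Element {
        Node node;                                  // back-reference for O(1) removal
        int index;
    }

    private Node head = new Node();                 // pre-allocated: insert never null-checks
    private int used;                               // filled slots in the head node

    synchronized void insert(Element e) {
        if (used == NODE_CAPACITY) {                // head is full: push a fresh node
            Node n = new Node();
            n.next = head;
            head = n;
            used = 0;
        }
        head.slots[used] = e;
        e.node = head;
        e.index = used++;
    }

    synchronized void remove(Element e) {
        // Plug the hole with the last live element so the slabs stay dense.
        Element last = head.slots[--used];
        e.node.slots[e.index] = last;
        last.node = e.node;
        last.index = e.index;
        head.slots[used] = null;
        if (used == 0 && head.next != null) {       // head emptied: drop it
            head = head.next;
            used = NODE_CAPACITY;
        }
        e.node = null;
    }
}
```

The coarse synchronized mirrors the global synchronization scheme mentioned above; the GC win comes purely from the flat Element[] slabs.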

Additional testing:

  • Original reproducer improves drastically
  • New microbenchmark shows no regression on "churning" tests, which cover insertion/removal performance
  • New microbenchmark shows improvement on Full GC times (crude, but repeatable), which serves as a proxy for the reproducer
  • java/lang/ref tests in release
  • all tests in fastdebug

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8343704: Bad GC parallelism with processing Cleaner queues (Enhancement - P4)


Reviewing

Using git

Check out this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/22043/head:pull/22043
$ git checkout pull/22043

Update a local copy of the PR:
$ git checkout pull/22043
$ git pull https://git.openjdk.org/jdk.git pull/22043/head

Using Skara CLI tools

Check out this PR locally:
$ git pr checkout 22043

View PR using the GUI difftool:
$ git pr show -t 22043

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/22043.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper (bot) commented Nov 12, 2024

👋 Welcome back shade! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk (bot) commented Nov 12, 2024

@shipilev This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8343704: Bad GC parallelism with processing Cleaner queues

Reviewed-by: bchristi, vklang, ogillespie, kdnilsen

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 2 new commits pushed to the master branch:

  • 659f70b: 8343418: Unnecessary Hashtable usage in CSS.htmlAttrToCssAttrMap
  • 5c8cb2e: 8337199: Add jcmd Thread.vthread_scheduler and Thread.vthread_pollers diagnostic commands

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk bot added the "rfr" (ready for review) label Nov 12, 2024
@shipilev (Member, Author) commented Nov 12, 2024

The original reproducer on my M1 improves dramatically:

# Before
...
[8.989s][info ][gc       ] GC(50) Pause Young (Normal) (G1 Evacuation Pause) 608M->21M(1011M) 46.562ms
[9.187s][info ][gc       ] GC(51) Pause Young (Normal) (G1 Evacuation Pause) 608M->22M(1011M) 45.286ms
[9.387s][info ][gc       ] GC(52) Pause Young (Normal) (G1 Evacuation Pause) 609M->21M(1011M) 45.636ms
[9.592s][info ][gc       ] GC(53) Pause Young (Normal) (G1 Evacuation Pause) 608M->22M(1015M) 47.514ms
[9.794s][info ][gc       ] GC(54) Pause Young (Normal) (G1 Evacuation Pause) 612M->22M(1015M) 46.807ms
[9.993s][info ][gc       ] GC(55) Pause Young (Normal) (G1 Evacuation Pause) 612M->21M(1015M) 45.964ms

# After
...
[6.617s][info ][gc       ] GC(50) Pause Young (Normal) (G1 Evacuation Pause) 501M->20M(829M) 10.668ms
[6.758s][info ][gc       ] GC(51) Pause Young (Normal) (G1 Evacuation Pause) 501M->20M(829M) 10.712ms
[6.894s][info ][gc       ] GC(52) Pause Young (Normal) (G1 Evacuation Pause) 500M->20M(829M) 12.090ms
[7.031s][info ][gc       ] GC(53) Pause Young (Normal) (G1 Evacuation Pause) 501M->20M(829M) 11.026ms
[7.171s][info ][gc       ] GC(54) Pause Young (Normal) (G1 Evacuation Pause) 500M->20M(837M) 12.107ms

The closest reproducer in the form of a JMH test also improves in both GC times and noise. On a 5950X:

Benchmark       (count)  Mode  Cnt    Score    Error  Units

# Before
CleanerGC.test    16384  avgt    9    2.602 ±  0.041  ms/op
CleanerGC.test    65536  avgt    9    4.815 ±  0.038  ms/op
CleanerGC.test   262144  avgt    9   12.569 ±  0.200  ms/op
CleanerGC.test  1048576  avgt    9   52.579 ±  1.745  ms/op
CleanerGC.test  4194304  avgt    9  209.140 ± 17.027  ms/op

# After
CleanerGC.test    16384  avgt    9    2.499 ±  0.209  ms/op
CleanerGC.test    65536  avgt    9    3.269 ±  0.185  ms/op
CleanerGC.test   262144  avgt    9    6.108 ±  0.140  ms/op
CleanerGC.test  1048576  avgt    9   18.621 ±  0.150  ms/op
CleanerGC.test  4194304  avgt    9   67.267 ±  1.413  ms/op
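For context, a hypothetical sketch of the shape such a Full-GC proxy could take (the actual CleanerGC source is not shown in this thread; all names here are assumptions): keep count live registered cleanables and time an explicit full GC over them.

```java
import java.lang.ref.Cleaner;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class CleanerGCSketch {
    @Param({"16384", "65536", "262144", "1048576", "4194304"})
    int count;

    Cleaner cleaner;
    List<Object> referents;                 // keeps the referents strongly reachable

    @Setup
    public void setup() {
        cleaner = Cleaner.create();
        referents = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            Object o = new Object();
            referents.add(o);
            cleaner.register(o, () -> {});  // no-op cleaning action
        }
    }

    @Benchmark
    public void test() {
        System.gc();                        // the GC must walk the whole cleanable list
    }
}
```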

On the churn benchmark, which covers insertion/removal performance, the new implementation outperforms the original one:

Benchmark          (count)  (recipFreq)  Mode  Cnt   Score    Error  Units

# Before
CleanerChurn.test      N/A          128  avgt    9   7.063 ±  0.262  ns/op
CleanerChurn.test      N/A          256  avgt    9   5.669 ±  0.118  ns/op
CleanerChurn.test      N/A          512  avgt    9   5.025 ±  0.066  ns/op
CleanerChurn.test      N/A         1024  avgt    9   4.714 ±  0.086  ns/op
CleanerChurn.test      N/A         2048  avgt    9   4.595 ±  0.091  ns/op

# After
CleanerChurn.test      N/A          128  avgt    9   5.941 ± 0.133  ns/op
CleanerChurn.test      N/A          256  avgt    9   5.113 ± 0.091  ns/op
CleanerChurn.test      N/A          512  avgt    9   4.767 ± 0.109  ns/op
CleanerChurn.test      N/A         1024  avgt    9   4.551 ± 0.037  ns/op
CleanerChurn.test      N/A         2048  avgt    9   4.448 ± 0.103  ns/op
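The churn side could be sketched similarly (again hypothetical; the actual CleanerChurn source is not shown here): roughly one in recipFreq operations registers a cleanable and immediately cleans it, so one insert plus one remove is amortized over recipFreq calls, matching the shape of the table above.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class CleanerChurnSketch {
    @Param({"128", "256", "512", "1024", "2048"})
    int recipFreq;                          // one in recipFreq ops does insert+remove

    Cleaner cleaner;

    @Setup
    public void setup() {
        cleaner = Cleaner.create();
    }

    @Benchmark
    public void test() {
        if (ThreadLocalRandom.current().nextInt(recipFreq) == 0) {
            Cleaner.Cleanable c = cleaner.register(new Object(), () -> {});
            c.clean();                      // explicit clean exercises the removal path
        }
    }
}
```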

@openjdk (bot) commented Nov 12, 2024

@shipilev The following label will be automatically applied to this pull request:

  • core-libs

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.


@xmas92 (Member) commented Nov 14, 2024

This seems to handle excessive allocations when churning around an empty list by keeping the head node always allocated. I wonder if it is worth adding some hysteresis for churn around a multiple of NODE_CAPACITY, for example by pooling one node.

@shipilev (Member, Author)

> This seems to handle excessive allocations when churning around an empty list by keeping the head node always allocated.

Realistically, the list is almost never empty: there is the Cleaner instance itself recorded in the list. The only (?) state in which the list is fully empty is when the Cleaner itself is also dead, AFAICS. We pre-allocate the head node for implementation simplicity: if the head is always available, we don't need to null-check it, for example.

> I wonder if it is worth adding some hysteresis for churn around a multiple of NODE_CAPACITY, for example by pooling one node.

I don't think we should care about this case: the rare benefit does not seem to outweigh the cost for the common case. The goal for this implementation is to avoid wasting more space than necessary. Caching a node would take another bunch of KBs per Cleaner, at the very least.

@xmas92 (Member) commented Nov 14, 2024

> I don't think we should care about this case: the rare benefit does not seem to outweigh the cost for the common case. The goal for this implementation is to avoid wasting more space than necessary. Caching a node would take another bunch of KBs per Cleaner, at the very least.

That is probably correct. I was, however, thinking that it would only be pooled asymmetrically, as some type of hysteresis: you pool a node when you remove one (switch the head) and keep it for an arbitrary number of removals. So it would only really waste memory for cleaners that keep adding and removing cleanables around a NODE_CAPACITY boundary.

@shipilev (Member, Author)

> I was, however, thinking that it would only be pooled asymmetrically, as some type of hysteresis: you pool a node when you remove one (switch the head) and keep it for an arbitrary number of removals. So it would only really waste memory for cleaners that keep adding and removing cleanables around a NODE_CAPACITY boundary.

I really do not want to make an improvement that is more complicated than it needs to be :) As it stands, the current change improves performance across the board, so chasing minor inefficiencies looks like venturing into diminishing-returns territory. We can do this as a follow-up, if you want to explore it. I don't see clearly how this would work, and I would prefer to hold off on more advanced heuristics in favor of simplicity here.

@shipilev (Member, Author)

> That is probably correct. I was, however, thinking that it would only be pooled asymmetrically, as some type of hysteresis: you pool a node when you remove one (switch the head) and keep it for an arbitrary number of removals. So it would only really waste memory for cleaners that keep adding and removing cleanables around a NODE_CAPACITY boundary.

I have been playing with 8343704-node-cache.txt -- is that what you had in mind?

@xmas92 (Member) commented Nov 15, 2024

> That is probably correct. I was, however, thinking that it would only be pooled asymmetrically, as some type of hysteresis: you pool a node when you remove one (switch the head) and keep it for an arbitrary number of removals. So it would only really waste memory for cleaners that keep adding and removing cleanables around a NODE_CAPACITY boundary.

> I have been playing with 8343704-node-cache.txt -- is that what you had in mind?

Yes. It amortises the allocation over at least NODE_CAPACITY inserts, instead of 1 in the worst case.

I have very little experience with how this plays out in practice, or with how cleanables are used. This was purely an observation based on seeing a symmetrical grow/shrink behaviour of some resource.

@shipilev (Member, Author)

> I have been playing with 8343704-node-cache.txt -- is that what you had in mind?

> Yes. It amortises the allocation over at least NODE_CAPACITY inserts, instead of 1 in the worst case.

Pushed that into the PR. There is some measurable impact from dealing with this node cache during heavy churn, but it is still well within the improvement margin we get wholesale.

I am willing to accept this single-slot cache for three reasons: a) crossing the artificial boundary with just a single cleaner registering and unregistering looks like something we want to mitigate to avoid surprises; b) Cleaners are recommended by the Javadoc to be shared, so wasting another node should not be as painful; c) the implementation and maintenance cost is not high.

But I would draw the line at this heuristic, and do nothing smarter :)
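For illustration, a sketch of what such a single-slot cache could look like, extending the earlier SegmentedList sketch (assumed shape, not the actual patch): when a removal empties the head node, park it instead of dropping it, so churn around a NODE_CAPACITY boundary does not allocate on every crossing.

```java
// Inside the SegmentedList sketch from above:
private Node cachedNode;                    // at most one spare node is retained

private Node newNode() {                    // insert() would call this instead of new Node()
    Node n = cachedNode;
    if (n != null) {
        cachedNode = null;                  // reuse the parked node
        return n;
    }
    return new Node();                      // cache is empty: allocate
}

private void retireNode(Node n) {           // remove() would call this when popping the head
    n.next = null;                          // its slots are already nulled by remove()
    cachedNode = n;                         // park it; replaces any older spare
}
```

This amortizes the allocation over at least NODE_CAPACITY inserts, as described above, at the cost of keeping at most one spare node alive per Cleaner.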

@franz1981

Not that it was the main problem here (since the array-backed linked list already delivers the major improvement with fewer changes, kudos!), but in case a lock-free structure can be of any help: in JCTools we have https://github.com/JCTools/JCTools/blob/master/jctools-core/src/main/java/org/jctools/queues/MpmcUnboundedXaddArrayQueue.java, which is decently scalable on the offer side (less so on the consumer side) and can save on reusing chunks, becoming way simpler. If it is of any interest, I can create a gist with a simplified, non-Unsafe version.

Otherwise, https://github.com/pramalhe/ConcurrencyFreaks/blob/master/Java/com/concurrencyfreaks/queues/array/FAAArrayQueue.java is as simple as it looks, but it probably needs to be made linearizable, IDK.

@RogerRiggs (Contributor)

With SM removal, there is a doPrivileged call in Cleaner.java.
Should that be handled separately from this PR?

@shipilev (Member, Author)

> With SM removal, there is a doPrivileged call in Cleaner.java. Should that be handled separately from this PR?

Since this is not related to the problem at hand, I'd prefer to keep it out of this PR.

@shipilev (Member, Author)

> With SM removal, there is a doPrivileged call in Cleaner.java. Should that be handled separately from this PR?

> Since this is not related to the problem at hand, I'd prefer to keep it out of this PR.

Actually, I wonder if you mean Cleaner.java, which is not even affected by this patch? It is completely unrelated, and it would probably be gone with JDK-8344332.

@RogerRiggs (Contributor)

Right, I meant jdk.internal.ref.Cleaner.
So yes, it will be gone after 8344332; it can remain as-is until that's implemented.

@openjdk bot added the "ready" (ready to be integrated) label Dec 2, 2024
@shipilev (Member, Author) commented Dec 3, 2024

Testing looks clean for me here. I would like to integrate this soon. If anyone wants to test it through their CIs, please do so :)

@shipilev (Member, Author) commented Dec 4, 2024

There we go.

/integrate

@openjdk (bot) commented Dec 4, 2024

Going to push as commit 4000e92.
Since your change was applied there have been 49 commits pushed to the master branch:

  • 0c7451a: 8332686: InetAddress.ofLiteral can throw StringIndexOutOfBoundsException
  • 56d315d: 8343540: Report preview error for inherited effectively-preview methods
  • 994504c: 8329351: Add runtime/Monitor/TestRecursiveLocking.java for recursive Java monitor stress testing
  • 3d49665: 8345286: Remove use of SecurityManager API from misc areas
  • 38927fc: 8343213: TEST_BUG: [Graal] java/lang/ref/Basic.java fails
  • cf1eb58: 8344935: [ubsan]: javaThread.hpp:1241:52: runtime error: load of value 9831830, which is not a valid value for type 'freeze_result'
  • 943aa03: 8345404: [ppc64le] ProblemList TestAllocOutOfMemory.java#large
  • 9e2b66f: 8345178: RISC-V: Add gtests for narrow cmpxchg
  • 4c33caa: 8344609: Check ResourceMark nesting when allocating a GrowableArray on an alternative ResourceArea
  • 4b92816: 8345375: Improve debuggability of test/jdk/java/net/Socket/CloseAvailable.java
  • ... and 39 more: https://git.openjdk.org/jdk/compare/3eb54615783562f24e61983dfcc3cb823b27b0eb...master

Your commit was automatically rebased without conflicts.

@openjdk bot added the "integrated" (pull request has been integrated) label Dec 4, 2024
@openjdk bot closed this Dec 4, 2024
@openjdk bot removed the "ready" and "rfr" labels Dec 4, 2024
@openjdk (bot) commented Dec 4, 2024

@shipilev Pushed as commit 4000e92.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
