
Increase GCLocker retry allocation count to avoid OOM too early #19026

Merged · 1 commit merged into trinodb:master from the reduce_gclocker_starve branch on Sep 13, 2023

Conversation

@hackeryang (Member) commented on Sep 13, 2023

Description

JDK 9 through 21 have a GCLocker bug that can cause the JVM to throw an OOME too early, because the default value of -XX:GCLockerRetryAllocationCount is 2. Most Trino clusters in production run on machines with 8 to 16+ physical CPU cores (which usually means 32 vcores), so threads can be starved of memory allocations by the GCLocker. The relevant issue: https://bugs.openjdk.org/browse/JDK-8192647

The flow chart of how the GCLocker is acquired in the JVM is as follows:
[flow chart: GCLocker allocation and retry path in the JVM]
This OOM was also observed in our production environment:
[screenshot: OOM error in production]
After reviewing JVM tuning articles shared by several companies, I found that Tencent (the parent company behind League of Legends) also sets this parameter, to 100:
[screenshot: Tencent's JVM tuning recommendation]
I know our community will migrate to JDK 21 partly for auto-vectorization in Project Hummingbird, so we will probably need this parameter for quite a while.
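For anyone who wants to confirm the default on their own JDK, one quick check (assuming a Unix-like shell and a JDK release where the flag still exists, such as 17 or 21) is to print the final flag values and filter for it; on an unmodified JDK 17 this reports a value of 2:

java -XX:+PrintFlagsFinal -version | grep GCLockerRetryAllocationCount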

Additional context and related issues

Release notes

(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Increase GCLocker retry allocation count to avoid OOM too early. ({issue}`19026`)

@hackeryang force-pushed the reduce_gclocker_starve branch from 1f98668 to 5e8d26b on September 13, 2023 08:59
@hackeryang (Member, Author) commented:

It seems that all unit tests have passed. Hi @losipiuk @wendigo, could you please help merge when you have time? Thank you~

@wendigo merged commit 99153d9 into trinodb:master on Sep 13, 2023
@@ -17,3 +17,5 @@
-XX:+UseAESCTRIntrinsics
# Disable Preventive GC for performance reasons (JDK-8293861)
-XX:-G1UsePreventiveGC
# Reduce starvation of threads by GClocker, recommend to set about the number of cpu cores (JDK-8192647)
-XX:GCLockerRetryAllocationCount=32
Member commented:

update product test's jvm.config as well

@hackeryang (Member, Author) commented Sep 14, 2023:

> update product test's jvm.config as well

I actually considered at the beginning of this contribution whether I should add it to testing/trino-product-tests-launcher/src/main/resources/docker/presto-product-tests/conf/environment/multinode-all/jvm.config as well~

But I found that the heap size there is 2G, as shown below:
[screenshot: 2G heap setting in the product test jvm.config]
So I thought that most Trino nodes have fewer CPU cores than GB of memory (usually a 1:4 or 1:6 ratio), and this config might be used on a node with only 1 CPU core, which may not need the parameter and can keep the default value of 2~

I also saw this PR, which seems similar to what you need: #15632

Member commented:

I am not saying GCLockerRetryAllocationCount is needed for product tests, but the product tests' jvm.config should be as close to the recommended settings as possible. It will use a lower Xmx by necessity, but GCLockerRetryAllocationCount should be the same as recommended (unless there is a reason not to).
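For illustration, a rough sketch of how the product test jvm.config could follow this suggestion; the 2G heap comes from the screenshot discussed above, the retry count mirrors the recommended setting merged in this PR, and the remaining flags of that file are omitted here because they are not shown in this thread:

-Xmx2G
# Reduce starvation of threads by GCLocker, recommended to be about the number of CPU cores (JDK-8192647)
-XX:GCLockerRetryAllocationCount=32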

@github-actions bot added this to the 427 milestone on Sep 13, 2023
@hackeryang deleted the reduce_gclocker_starve branch on September 14, 2023 06:16
@hackeryang (Member, Author) commented Jun 19, 2024:

Hello @wendigo, could you please help add back the JVM config -XX:GCLockerRetryAllocationCount=32? I saw that our community removed it when moving to JDK 22:
[screenshot: the flag removed from jvm.config]
However, the JDK community still hasn't fixed the underlying bug as of JDK 24, and may not fix it for quite a while:
[screenshot: status of JDK-8192647]

@wendigo (Contributor) commented Jun 19, 2024:

In our experiments it's no longer needed. Is there any particular reason why it should be there by default?

@hackeryang (Member, Author) commented Jun 19, 2024:

> In our experiments it's no longer needed. Is there any particular reason why it should be there by default?

@wendigo It was because we have some small clusters with little memory that often hit this error:
[screenshot: "Retried waiting for GCLocker too often" error]

The Retried waiting for GCLocker too often error appears because the default value of GCLockerRetryAllocationCount is 2: when a thread requests memory and cannot be satisfied while waiting for a GC (the heap is almost full), all the threads (their number scaling with the CPU core count) share only 2 chances to retry, and other threads can also be blocked waiting for GC at the same time.

So workers in small clusters can easily crash with an OOM error, which hurts users' confidence in Trino's robustness.

I have been told that users at other companies also run small clusters and hit this error. Your clusters may be large, with plenty of memory, so the Retried waiting for GCLocker too often error is unlikely to appear there~

@findepi (Member) commented Jun 19, 2024:

@hackeryang the message above (Retried waiting for GCLocker too often ...), was this running with Java 21 or 22?

@hackeryang (Member, Author) commented:

> @hackeryang the message above (Retried waiting for GCLocker too often ...), was this running with Java 21 or 22?

@findepi We often hit it on JDK 17 in our small clusters in the past, but according to the JDK issue: https://bugs.openjdk.org/browse/JDK-8192647

It also affects JDK 21 through 23, and there is still no decision on when it will be fixed. We don't have a JDK 21 environment for now, but I believe the error will still occur in small clusters.

@findepi (Member) commented Jun 19, 2024:

Maybe, maybe not. AFAICT, GCLockerRetryAllocationCount was removed after Java 22 became required for runtime. If you're on an older Trino version and do not use Java 22, you are definitely recommended to keep using the GCLockerRetryAllocationCount flag.
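As a takeaway for deployments on an older Trino release running Java 17 or 21, a minimal sketch of keeping the flag in the coordinator and worker jvm.config follows; the value mirrors what was merged in this PR, following the commit comment's guideline of sizing it to roughly the number of CPU cores (the right value for a specific deployment is an assumption):

# Reduce starvation of threads by GCLocker (JDK-8192647); keep on Java versions before 22
-XX:GCLockerRetryAllocationCount=32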
