Increase GCLocker retry allocation count to avoid OOM too early #19026
@@ -17,3 +17,5 @@
 -XX:+UseAESCTRIntrinsics
 # Disable Preventive GC for performance reasons (JDK-8293861)
 -XX:-G1UsePreventiveGC
+# Reduce starvation of threads by GClocker, recommend to set about the number of cpu cores (JDK-8192647)
+-XX:GCLockerRetryAllocationCount=32
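The comment in the diff recommends setting the value to roughly the number of CPU cores. A minimal sketch of deriving the flag line that way is shown here; the helper name `gclocker_retry_flag` is hypothetical, and using `os.cpu_count()` (which reports logical CPUs) as the core count is an assumption, not something the PR prescribes.

```python
import os


def gclocker_retry_flag(cpu_cores: int) -> str:
    """Build the jvm.config line for a given core count (hypothetical helper)."""
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    return f"-XX:GCLockerRetryAllocationCount={cpu_cores}"


if __name__ == "__main__":
    # os.cpu_count() may return None when the count cannot be determined.
    cores = os.cpu_count() or 1
    print(gclocker_retry_flag(cores))
```

On a node with 32 logical CPUs this would print the same line the diff adds; on smaller nodes it yields a smaller value.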
update product test's jvm.config as well
> update product test's jvm.config as well
At the beginning of our contribution I did consider adding it to testing/trino-product-tests-launcher/src/main/resources/docker/presto-product-tests/conf/environment/multinode-all/jvm.config as well.
But I found that the heap size there is 2G, as shown below:
Most Trino nodes have fewer CPU cores than GB of memory (mostly a ratio of 1:4 or 1:6), so this config may be used on a node with a single CPU core. In that case the parameter may not be needed and the default value of 2 can be kept.
I also saw this PR, which seems similar to what you need: #15632
I am not saying GCLockerRetryAllocationCount is needed for product tests,
but the product tests' jvm.config should be as close to the recommended settings as possible.
So it will use a lower Xmx by necessity, but GCLockerRetryAllocationCount should be the same as recommended (unless there is a reason not to).
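One way to keep the two files aligned is to compare their flag sets and ignore only the settings that must differ. The sketch below assumes that `#` lines in jvm.config are comments (as in the diff above) and that only `-Xmx` is allowed to differ; the function names and the `-Xmx16G` sample value are hypothetical.

```python
from typing import List, Set, Tuple


def jvm_flags(config_text: str) -> Set[str]:
    """Collect non-empty, non-comment lines from jvm.config-style text."""
    flags = set()
    for line in config_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            flags.add(line)
    return flags


def missing_flags(recommended: str, product_test: str,
                  allowed_to_differ: Tuple[str, ...] = ("-Xmx",)) -> List[str]:
    """Flags in the recommended config that the product-test config lacks."""
    present = jvm_flags(product_test)
    return sorted(
        flag for flag in jvm_flags(recommended)
        if flag not in present and not flag.startswith(allowed_to_differ)
    )


# Sample inputs for illustration only; -Xmx16G is a made-up value.
recommended = """-Xmx16G
-XX:-G1UsePreventiveGC
# Reduce starvation of threads by GClocker
-XX:GCLockerRetryAllocationCount=32
"""
product_test = """-Xmx2G
-XX:-G1UsePreventiveGC
"""
print(missing_flags(recommended, product_test))
```

With these sample inputs the only reported difference is the GCLockerRetryAllocationCount line, since the heap size is deliberately excluded.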
Hello @wendigo, could you please help add back the jvm config
In our experiments it's no longer needed. Is there any particular reason why it should be there by default?
@wendigo It was because we have some small clusters which have The So workers in small clusters can easily be crashed by an OOM error, which hurts users' perception of robustness. I have been told that some users at other companies also run small clusters and hit this error. Maybe your cluster is large and has plenty of memory, so it does not easily throw
@hackeryang the message above (
@findepi We often hit it on JDK 17 in our small clusters in the past, but according to the JDK issue https://bugs.openjdk.org/browse/JDK-8192647 it also affects JDK 21 to 23, and it is still undecided when it will be fixed. We don't have a JDK 21 environment for now, but I believe the error will still occur in small clusters.
Might be or might be not. AFAICT, the
Description
JDK 9 to 21 have a GCLocker bug that causes the JVM to throw an OOME too early. The default value of `-XX:GCLockerRetryAllocationCount` is 2, while most Trino clusters in production environments have 8 to 16+ physical CPU cores (which mostly means 32 vcores), so threads may be starved of memory by the GCLocker. The relevant issue: https://bugs.openjdk.org/browse/JDK-8192647

The flow chart of how the JVM acquires the GCLocker is as follows:
![img_v2_51b8f20a-d7d3-47c5-8166-744da482895g](https://private-user-images.githubusercontent.com/26461591/267570171-4b8d1bbb-58b1-48e9-ad86-6fdd3cb5d4cb.jpg?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk0Mjc2MDIsIm5iZiI6MTczOTQyNzMwMiwicGF0aCI6Ii8yNjQ2MTU5MS8yNjc1NzAxNzEtNGI4ZDFiYmItNThiMS00OGU5LWFkODYtNmZkZDNjYjVkNGNiLmpwZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEzVDA2MTUwMlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWVlMGVjM2IyZDJmNjYyOGVhMjEyNDBiOGRmYTJiNWFkNzQxMzFhM2M4M2NlYTkwNDhjMzkyYWU4NzA2YTYwYzYmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.61eWr2y-tTsiIlSQ_95jMwOkcqaoCiFZ5O3nnj60UFk)
![BbdhuON6yf](https://private-user-images.githubusercontent.com/26461591/267574770-89cd12f5-06b4-4642-a7a5-883ee5565e4b.jpg?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk0Mjc2MDIsIm5iZiI6MTczOTQyNzMwMiwicGF0aCI6Ii8yNjQ2MTU5MS8yNjc1NzQ3NzAtODljZDEyZjUtMDZiNC00NjQyLWE3YTUtODgzZWU1NTY1ZTRiLmpwZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEzVDA2MTUwMlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTZmYzBjM2ZjYjE2MmMzMzhmZTk3ZDM4OWMzMjk4ZDFmZjkyMjk1ODAwY2Y4NGM0ZTZjMzg2ZTMzMzE4ZDBiMTAmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.oDYYB8fbD_y7mWlFtLgr-awDQkcNCzzscKM3RrzH9kI)
![image](https://private-user-images.githubusercontent.com/26461591/267574048-1313b93c-a831-4adc-884d-a4f7da42bcf8.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk0Mjc2MDIsIm5iZiI6MTczOTQyNzMwMiwicGF0aCI6Ii8yNjQ2MTU5MS8yNjc1NzQwNDgtMTMxM2I5M2MtYTgzMS00YWRjLTg4NGQtYTRmN2RhNDJiY2Y4LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEzVDA2MTUwMlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTk2YTA0NWIyM2NmOTY4NzViMzBlMTJiYmE2MDU1N2VmNDY5Mjc4OGUzOTU4ZjQ2M2EyOTgyYzQ5ZDM5YTkxYWEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.c11rbLhcapRRIDZz6sYdC4Fzl_THLBuQSCS-zyeKr6o)
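The effect of the retry limit can be illustrated with a toy model. This is not HotSpot code: it assumes a thread stalls and retries each time its allocation is blocked, and gives up (a simulated OOME) once its stall count exceeds the limit. The function name and the list of per-attempt outcomes are hypothetical.

```python
from typing import Iterable, Optional


def allocate_with_retries(attempt_succeeds: Iterable[bool],
                          retry_limit: int) -> Optional[int]:
    """Return the 1-based attempt that succeeded, or None for a simulated OOME.

    attempt_succeeds states, per attempt, whether this thread found free
    memory (True) or lost the race to other threads (False).
    """
    stalls = 0
    for attempt, succeeded in enumerate(attempt_succeeds, start=1):
        if succeeded:
            return attempt
        if stalls > retry_limit:
            return None  # too many blocked attempts: give up
        stalls += 1  # wait for the locker to clear, then retry
    return None  # ran out of modelled attempts


# The thread loses the race five times before memory is available to it.
outcomes = [False] * 5 + [True]
print(allocate_with_retries(outcomes, retry_limit=2))   # None
print(allocate_with_retries(outcomes, retry_limit=32))  # 6
```

In this model the default limit of 2 gives up on the fourth blocked attempt, while a limit of 32 lets the same thread succeed on its sixth attempt.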
And this OOM phenomenon was also seen in our production environment:
After investigating the JVM tuning articles shared by some companies, I found that Tencent (i.e. the parent company of League of Legends) also sets this parameter to 100:

I know our community will migrate to JDK 21, partly for auto-vectorization in Project Hummingbird, so we may need this parameter for quite a while.

Additional context and related issues
Release notes
(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: