ImageMetricsITCase.verifyImageMetrics failures in main #38919

Closed
zakkak opened this issue Feb 21, 2024 · 15 comments · Fixed by #39026
Comments

@zakkak
Contributor

zakkak commented Feb 21, 2024

Describe the bug

jpa-postgresql-withxml fails in the ImageMetricsITCase.verifyImageMetrics test

See https://github.com/quarkusio/quarkus/actions/runs/7985178432/job/21803563937#step:16:776

The regression was introduced in #38820

Note: as the changes are in vertx-http, this probably affects other tests as well, but the impact does not appear to be big enough to trigger a test failure for them.

Expected behavior

The test should pass.

Actual behavior

The test fails with:

org.opentest4j.AssertionFailedError: Expected analysis_results.fields.reflection to be within range [163 +- 3%] but was 168 ==> expected: <true> but was: <false>
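For context, the check compares values from the native build's stats JSON against a recorded baseline with a percentage tolerance. A minimal sketch of that kind of assertion (the names below are illustrative, not the actual ImageMetricsITCase code):

import java.util.List;

class ToleranceCheckSketch {
    // Hypothetical helper mirroring the message format above, not the real test code.
    static void assertWithinTolerance(String key, long expected, double tolerancePct, long actual) {
        double delta = expected * tolerancePct / 100.0;
        if (actual < expected - delta || actual > expected + delta) {
            throw new AssertionError("Expected " + key + " to be within range ["
                    + expected + " +- " + (int) tolerancePct + "%] but was " + actual);
        }
    }

    public static void main(String[] args) {
        // 163 * 1.03 ≈ 167.9, so an actual value of 168 is out of range and the check fails.
        assertWithinTolerance("analysis_results.fields.reflection", 163, 3, 168);
    }
}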

How to Reproduce?

./mvnw -Dnative -pl integration-tests/jpa-postgresql-withxml -Dnative.surefire.skip -Dformat.skip -Dno-descriptor-tests clean verify -Dtest-containers -Dstart-containers

Output of uname -a or ver

No response

Output of java -version

No response

Mandrel or GraalVM version (if different from Java)

No response

Quarkus version or git rev

No response

Build tool (i.e. output of mvnw --version or gradlew --version)

No response

Additional information

After #38820 I see the following differences:

diff --git a/old.json b/new.json
index c60df879106..b1c92fa16f2 100644
--- a/old.json
+++ b/new.json
@@ -2,30 +2,30 @@
   "resource_usage": {
     "memory": {
       "system_total": 66506358784,
-      "peak_rss_bytes": 6114426880
+      "peak_rss_bytes": 8380002304
     },
     "garbage_collection": {
-      "count": 119,
-      "total_secs": 3.413,
-      "max_heap": 8176271360
+      "count": 66,
+      "total_secs": 2.625,
+      "max_heap": 28445245440
     },
     "cpu": {
-      "load": 12.15073613200237,
+      "load": 12.75164113785558,
       "parallelism": 32,
       "total_cores": 32
     },
-    "total_secs": 43.448367164000004
+    "total_secs": 43.512096230000004
   },
   "image_details": {
     "code_area": {
-      "bytes": 44577920,
-      "compilation_units": 66680
+      "bytes": 44601456,
+      "compilation_units": 66734
     },
-    "total_bytes": 93915968,
+    "total_bytes": 93952832,
     "image_heap": {
-      "bytes": 48922624,
+      "bytes": 48934912,
       "objects": {
-        "count": 485976
+        "count": 486273
       },
       "resources": {
         "bytes": 1616088,
@@ -47,29 +47,29 @@
   },
   "analysis_results": {
     "types": {
-      "total": 23615,
+      "total": 23629,
       "reflection": 6697,
       "jni": 61,
-      "reachable": 20816
+      "reachable": 20830
     },
     "methods": {
       "foreign_downcalls": -1,
-      "total": 188214,
-      "reflection": 4899,
+      "total": 188269,
+      "reflection": 4900,
       "jni": 55,
-      "reachable": 102637
+      "reachable": 102708
     },
     "classes": {
-      "total": 23615,
+      "total": 23629,
       "reflection": 6697,
       "jni": 61,
-      "reachable": 20816
+      "reachable": 20830
     },
     "fields": {
-      "total": 49625,
-      "reflection": 166,
+      "total": 49643,
+      "reflection": 168,
       "jni": 61,
-      "reachable": 30008
+      "reachable": 30032
     }
   }
 }

@cescoffier do the above diffs look reasonable given your changes?


quarkus-bot commented Feb 21, 2024

/cc @Karm (mandrel), @ebullient (metrics), @galderz (mandrel), @jmartisk (metrics)

@gsmet
Member

gsmet commented Feb 22, 2024

This doesn't look pretty:

-      "peak_rss_bytes": 6114426880
+      "peak_rss_bytes": 8380002304

but are we sure they are solely due to Clément's PR?

@zakkak
Contributor Author

zakkak commented Feb 22, 2024

but are we sure they are solely due to Clément's PR?

That was the only difference between the two runs (I tested with 31fcfc1 and fd46a61), so I would say yes. Note, however, that peak RSS fluctuates between builds regardless of the PR. I am running on a machine with 62GB of RAM and no constraint on the max heap size, so the GC is free to use as much RAM as it wants according to its heuristics, which are not deterministic.

For instance, running with 31fcfc1 three times I get:

  1. 6837665792
  2. 7617609728
  3. 12387741696

(Notice that the last run has almost double the peak RSS of the first run without any changes in the code.)
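
For reference (a suggestion, not something tried in this thread): capping the heap available to the native-image build, e.g. via the quarkus.native.native-image-xmx property, should make peak RSS more comparable across runs by taking the GC heuristics out of the equation. The 8g value below is only an example:

./mvnw -Dnative -pl integration-tests/jpa-postgresql-withxml clean verify -Dquarkus.native.native-image-xmx=8g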

@gsmet
Member

gsmet commented Feb 26, 2024

@zakkak could we either raise the threshold or disable the test for now? It's very annoying to have all our builds failing because of this.

@zakkak
Contributor Author

zakkak commented Feb 26, 2024

I guess the lesser evil would be to adjust the thresholds and keep this issue open until we verify the new thresholds are OK. This way we won't risk missing another increase due to other changes.

Note that the best option would have been not to merge #38820 without handling this in the first place, since the test was already failing in the PR.

@cescoffier
Member

The PR in question does not seem related. I will have another look, but basically it is just tests.

My guess is that the true cause is the hot reload, which must keep a few objects in memory, including the content of the certificates. Unfortunately, there is no way around that.

That being said, this only happens if the certificate hot reload is enabled, which is not the case in this test. But there are a few more static methods and fields that could explain it (even if they should be very small when hot reload is not enabled).

@cescoffier
Member

cescoffier commented Feb 27, 2024

OK, so it's a build-time RSS increase, not runtime. That makes more sense.

How can I produce the reports from above?

The PR introduces an early usage of the Mutiny API and Vert.x scheduled tasks (which then uses Netty underneath). That may explain it.
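
For illustration only, the kind of early Mutiny + Vert.x usage being referred to might look like the sketch below; this is not the actual code from #38820 and the names are made up:

import io.smallrye.mutiny.Uni;
import io.vertx.core.Vertx;

class EarlyUsageSketch {
    // A periodic Vert.x task (backed by Netty's event loop) wrapping its work in a Uni.
    void schedule(Vertx vertx) {
        vertx.setPeriodic(30_000, id ->
                Uni.createFrom().item("reload")
                        .subscribe().with(v -> { /* e.g. refresh certificates */ }));
    }
}

Pulling these APIs in can be enough to make additional types, methods, and fields reachable for the analysis, which is what the numbers above measure.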

@cescoffier
Member

I'm confused - but I may not understand what that test is actually doing.

I get:
ImageMetricsITCase.verifyImageMetrics:15 Expected analysis_results.types.jni to be within range [63 +- 1%] but was 61 ==> expected: <true> but was: <false>

Seems related to JNI? I'm under the previous value (isn't that a good thing?)

@cescoffier
Member

So... with a more recent version of GraalVM, I got something closer to what I expected.

@cescoffier
Member

So, I'm unable to reproduce it locally with:

native-image 21 2023-09-19
GraalVM Runtime Environment GraalVM CE 21+35.1 (build 21+35-jvmci-23.1-b15)
Substrate VM GraalVM CE 21+35.1 (build 21+35, serial gc)

I can try to remove the usage of Mutiny, but it's somewhat weird, as I would need to use CompletionStage/CompletableFuture, which are WAY more error-prone.

@zakkak
Contributor Author

zakkak commented Feb 27, 2024

How can I produce the reports from above?

They are generated when building the native image, in a file called quarkus-integration-test-jpa-postgresql-withxml-999-SNAPSHOT-runner-build-output-stats.json

I used:

./mvnw -Dnative -pl integration-tests/jpa-postgresql-withxml clean package -Dnative.surefire.skip -Dquarkus.native.container-build=true -Dquarkus.native.container-runtime=podman

So... with a more recent version of GraalVM, I got something closer to what I expected.

Please use the jdk-21 Mandrel builder image, as the thresholds are version specific.
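
In case it helps, the builder image can be pinned explicitly for the container build, for example (the exact image tag below is an assumption for illustration):

./mvnw -Dnative -pl integration-tests/jpa-postgresql-withxml clean package -Dnative.surefire.skip -Dquarkus.native.container-build=true -Dquarkus.native.container-runtime=podman -Dquarkus.native.builder-image=quay.io/quarkus/ubi-quarkus-mandrel-builder-image:jdk-21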

@cescoffier
Member

OK, I now get the same exception as you do. So, it's not a problem with GraalVM CE, but only Mandrel (or the in-container build using Mandrel). That will not be easy to understand.

The removal of Mutiny does not help. So, it's something else.

@cescoffier
Member

@zakkak OK, got the test to pass:

I removed: Mutiny usage (sad), a record (useless anyway), and Java stream usage.
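
For anyone curious, the general shape of such a change (purely illustrative, not the actual diff in #39026) is replacing stream pipelines with plain loops, for example:

import java.util.ArrayList;
import java.util.List;

class StreamRemovalSketch {
    // Stream-based version (the kind of code that was removed):
    static List<String> upperWithStreams(List<String> names) {
        return names.stream().map(String::toUpperCase).toList();
    }

    // Plain-loop replacement (the kind of code that stays):
    static List<String> upperWithLoop(List<String> names) {
        List<String> result = new ArrayList<>(names.size());
        for (String name : names) {
            result.add(name.toUpperCase());
        }
        return result;
    }
}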

@cescoffier
Member

@zakkak See #39026

@zakkak
Contributor Author

zakkak commented Feb 27, 2024

So, it's not a problem with GraalVM CE, but only Mandrel (or the in-container build using Mandrel). That will not be easy to understand.

That's because Mandrel uses a different JDK as a base (OpenJDK vs. LabsJDK), so there are some more differences in the JSON. The thresholds are meant to test Quarkus' default builder image (i.e. Mandrel at the moment), however.

@zakkak See #39026

quarkus-bot added this to the 3.9 - main milestone Feb 27, 2024