Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

arm64 ubi-quarkus-graalvmce-builder-image is very slow #260

Closed
ryanemerson opened this issue Dec 13, 2023 · 20 comments
Closed

arm64 ubi-quarkus-graalvmce-builder-image is very slow #260

ryanemerson opened this issue Dec 13, 2023 · 20 comments

Comments

@ryanemerson
Copy link

The Infinispan project previously used quay.io/quarkus/ubi-quarkus-native-image:22.3-java17 as a builder image to create various native components for both arm64 and amd64 architectures. The total time taken for all of our images was ~ 30 mins.

In order to use the latest GraalVM JDK 21 distribution, I have updated the builder image to be based upon quay.io/quarkus/ubi-quarkus-graalvmce-builder-image:jdk-21. However this has dramatically slowed down our image build time, with all our images now taking ~ 4 hours. Upon further investigation, it seems this is specifically caused by the arm64 builder image, as only building for amd64 results in the build time coming back down to ~ 16 mins.

Has anything changed between ubi-quarkus-native-image and ubi-quarkus-graalvmce-builder-image that could explain this increased build time, or is the culprit more likely to be GraalVM itself?

@zakkak
Copy link
Collaborator

zakkak commented Dec 13, 2023

@ryanemerson is this happening with quay.io/quarkus/ubi-quarkus-mandrel-builder-image:jdk-21 as well? Also what Quarkus version are you using?

cc @Karm

@ryanemerson
Copy link
Author

We're using Quarkus 3.6.0.

I tried the mandrel based images and the build time is now down to ~ 2.5 hours, so there's some improvement.

@zakkak
Copy link
Collaborator

zakkak commented Jan 18, 2024

Hi @ryanemerson could you please also provide a few more info on your setup?

E.g. what are the hardware specs of the amd64 machine you are using for the builds, and what are those of the arm64 machine you are comparing with?

Upon further investigation, it seems this is specifically caused by the arm64 builder image, as only building for amd64 results in the build time coming back down to ~ 16 mins.

Did you also try building only for arm64? What are the results?

@ryanemerson
Copy link
Author

Hardware is a m4.xlarge AWS instance with 4 vCPU 2.4 GHz Intel Xeon E5-2676 v3** Processor. We're using qemu emulation so that we can build the image with docker buildx.

Exact workflow:

docker run --rm --privileged quay.io/infinispan-test/binfmt:qemu-v8.0.4-33  --install arm64
docker buildx create --name multiarch --use
docker buildx build --platform linux/arm64 -t ${imageFQN} target-${name}/image

Did you also try building only for arm64? What are the results?

Yes. An arm64 only build takes 1h 46min with quay.io/quarkus/ubi-quarkus-mandrel-builder-image:jdk-21 whereas our old build with quay.io/quarkus/ubi-quarkus-native-image:22.3-java17 takes 10min 6s.

@zakkak
Copy link
Collaborator

zakkak commented Jan 22, 2024

Thank you for the extra information @ryanemerson. We will try to replicate and investigate the issue.

@zakkak
Copy link
Collaborator

zakkak commented Jan 31, 2024

As a first step, I tried reproducing the issue on a local AMD64 machine (using podman run --platform linux/arm64) and I get the following results:

Mandrel Version Total Build time on QEMU arm64 Total Build Time natively on amd64
22.3.5.0 (Java 17) 6m 44s 26.2s
23.0.3.0 (Java 17) 8m 56s 21.5s
23.1.2.0 (Java 21) 8m 9s 20.1s

Some interesting differences between the arm64 runs:

  1. GCs have gone up by 10x, however the total time spent on GC (10-15s) doesn't correspond to the total increase (1.5-2m). The number of GCs has gone up in amd64 as well (but by 2-4x).
  2. Memory usage during build time (RSS) has dropped significantly (from 6.35G to 3G, which could explain the increase in the number of GCs (we observe the same on amd64)
  3. The phase with the largest increase in execution time is "compiling methods" (from 90s to 140-200s) which seems to cover the biggest part of the increase.

I will investigate further...

@zakkak
Copy link
Collaborator

zakkak commented Feb 14, 2024

I will investigate further...

After some more experimentation it looks like the slow down is related to the initial heap size. Setting Xms to 7g (i.e., passing -Dquarkus.native.additional-build-args=-J-Xms7g) I was able to get 23.1 to build in similar time with 22.3 (6m 15s).

The issue seems related to oracle/graal#6432

@ryanemerson could you please give this a try, while I try to better understand why that's happening? Please use as the initial heap size something a bit higher than the Peak RSS you get when building with Mandrel 22.3.

@ryanemerson
Copy link
Author

Thanks for looking into this.

The most recent builder images have significantly reduced the latencies we were experiencing when I first created this issue, however the total build time is still almost double what we experienced with quay.io/quarkus/ubi-quarkus-native-image:22.3-java17.

Adding -Dquarkus.native.additional-build-args=-J-Xms7g only reduced the build time slightly, 49min 43s vs 52min 46s.

@zakkak
Copy link
Collaborator

zakkak commented Feb 14, 2024

Adding -Dquarkus.native.additional-build-args=-J-Xms7g only reduced the build time slightly, 49min 43s vs 52min 46s.

What is the Peak RSS reported when building with 22.3-java17 without using this option?

while I try to better understand why that's happening?

So the actual issue is that oracle/graal#6432 is setting GCTime to 9 instead of the default 99 and that seems to lead the JVM to perform more heap size adaptations which in turn seems to lead to the slowdown you are observing.

@ryanemerson may I ask you to give this a go with -Dquarkus.native.additional-build-args=J-XX:GCTimeRatio=99 as well and report back? Since the differences I observe are only ~1m on my machine I would really like to know if this resolves the issue in your case which seems to have a much bigger impact.

cc @fniephaus

@ryanemerson
Copy link
Author

What is the Peak RSS reported when building with 22.3-java17 without using this option?

Peak RSS: 7.21GB

@ryanemerson may I ask you to give this a go with -Dquarkus.native.additional-build-args=J-XX:GCTimeRatio=99 as well and report back?

Sure np.

@fniephaus
Copy link

So the actual issue is that oracle/graal#6432 is setting GCTime to 9 instead of the default 99 and that seems to lead the JVM to perform more heap size adaptations which in turn seems to lead to the slowdown you are observing.

GCTime=99 is mostly for latency and leads to the build process quickly using as much memory as it is allowed to use (bigger peak RSS). With GCTime=9, we tweak the GC more towards throughput, allowing it to spend more time cleaning up while not allocating more memory (lower peak RSS).

I haven't seen any actual build output in this issue, but "very slow" sounds to me like the app simply requires more memory to be built with GraalVM. How much memory/CPU is the build process allowed to use? A build time of ~50min could mean that 7GB is simply not enough.

@ryanemerson
Copy link
Author

ryanemerson commented Feb 15, 2024

Setting GCTime=99 made no noticeable difference to build time.

I haven't seen any actual build output in this issue, but "very slow" sounds to me like the app simply requires more memory to be built with GraalVM. How much memory/CPU is the build process allowed to use? A build time of ~50min could mean that 7GB is simply not enough.

We have Xmx set to 8g before and after the increased latency was observed between the two different builder images. I increased this to 16g and the build time remains the same.

@zakkak
Copy link
Collaborator

zakkak commented Feb 15, 2024

@ryanemerson can you share the output of native-image when building with 23.1 (jdk-21 tag) and when building with 22.3 (22.3-java17 tag)?

And just to make sure are you using Quarkus 3.6.0 in both cases? If not you might be hitting quarkusio/quarkus#38683 (although if that was the case it shouldn't show up only on aarch64)

A reproducer could also be handy if you can share.

@fniephaus
Copy link

I increased this to 16g and the build time remains the same.

Are you sure that you also bumped the Xmx value? If you did, it seems memory is not the bottleneck, maybe it's CPU. You could try by increasing the number of cores available in your container.

@ryanemerson
Copy link
Author

I've created a standalone reproducer to simplify things: https://github.com/ryanemerson/quarkus-arm64-slow-reproducer

Here's the output for building with the two different builder images, on the same machine using the same args with Quarkus 3.7.3:

You can see that the total build time for 22.3-java17 was only 04:49 min, whereas jdk-21 was 36:36 min.

@zakkak
Copy link
Collaborator

zakkak commented Feb 16, 2024

Thanks for the reproducer and the extra info @ryanemerson

At first sight it still looks like a GC-related issue to me:

  1. The build reports similar sizes for the generated binary files and similar numbers of reachable code and data, so this doesn't look like the jdk-21 image is trying to do more work.
  2. The jdk-21 image reports that it's allowed to use up to 23.5GB of memory, yet the peak RSS is only 1.88GB in contrast to 5.93GB when using 22.3-java17.
  3. It's clear that the number of GCs is significantly higher and despite the output saying that only 110 seconds out of the total 36 minutes were spent in GC, I am afraid that's not true (based on my earlier experiments)

I will try the reproducer and have another look next week.

@zakkak
Copy link
Collaborator

zakkak commented Feb 22, 2024

@ryanemerson thanks again for the reproducer and output results. I was finally able to see what's wrong.

After a closer inspection of the logs I noticed that the build is actually running on x86_64 instead of aarch64 when using 22.3-java17.

This led me to have a second look at your Dockerfiles and the images they use.
Dockerfile.22.3-java17 uses quay.io/quarkus/ubi-quarkus-native-image:22.3-java17 which is not a multi-arch image (and is also no longer supported). Multi-arch images were introduced in #200 and the image naming changed. As a result if you want to use 22.3-java17 you should use one of the following images:

At this point it might be worth mentioning that the latest builder image for Java 17 is tagged with jdk-17 so please use that instead of 22.3-java17 unless you have a good reason not to.

Applying the following patch to the reproducer I am getting more consistent results (the jdk-17 is still slower by ~1m due to the reason explained in #260 (comment)). The bad news is that the correct results are the slow ones (and emulation should be the one to blame here).

diff --git a/src/main/docker/Dockerfile.22.3-java17 b/src/main/docker/Dockerfile.22.3-java17
index d108905..b6c718f 100644
--- a/src/main/docker/Dockerfile.22.3-java17
+++ b/src/main/docker/Dockerfile.22.3-java17
@@ -1,4 +1,4 @@
-FROM quay.io/quarkus/ubi-quarkus-native-image:22.3-java17 as build
+FROM quay.io/quarkus/ubi-quarkus-mandrel-builder-image:22.3-java17 as build
 COPY --chown=quarkus:quarkus mvnw /code/mvnw
 COPY --chown=quarkus:quarkus .mvn /code/.mvn
 COPY --chown=quarkus:quarkus pom.xml /code/
diff --git a/src/main/docker/Dockerfile.jdk-21 b/src/main/docker/Dockerfile.jdk-21
index b5ab2a8..b7c63a9 100644
--- a/src/main/docker/Dockerfile.jdk-21
+++ b/src/main/docker/Dockerfile.jdk-21
@@ -1,4 +1,4 @@
-FROM quay.io/quarkus/ubi-quarkus-graalvmce-builder-image:jdk-21 as build
+FROM quay.io/quarkus/ubi-quarkus-mandrel-builder-image:jdk-21 as build
 COPY --chown=quarkus:quarkus mvnw /code/mvnw
 COPY --chown=quarkus:quarkus .mvn /code/.mvn
 COPY --chown=quarkus:quarkus pom.xml /code/

As a follow up question, I am curious whether you actually test the images you build with 22.3-java17 on aarch64 architectures? If you do how come the tests don't fail? Are they silently running on x86 as well?

I am closing this issue as it's actually not an issue with the images themselves. For the record I am adding the build outputs I get from the correct images below:

22.3-java17

#15 162.1 ========================================================================================================================
#15 162.1 GraalVM Native Image: Generating 'native-arm-slow-reproducer-1.0.0-SNAPSHOT-runner' (executable)...
#15 162.1 ========================================================================================================================
#15 183.6 [1/7] Initializing...                                                                                   (32.4s @ 0.23GB)
#15 183.6  Version info: 'GraalVM 22.3.5.0-Final Java 17 Mandrel Distribution'
#15 183.6  Java version info: '17.0.10+7'
#15 183.6  C compiler: gcc (redhat, aarch64, 8.5.0)
#15 183.6  Garbage collector: Serial GC
#15 183.6  2 user-specific feature(s)
#15 183.6  - io.quarkus.runner.Feature: Auto-generated class by Quarkus from the existing extensions
#15 183.6  - io.quarkus.runtime.graal.DisableLoggingFeature: Disables INFO logging during the analysis phase
#15 409.9 [2/7] Performing analysis...  [************]                                                           (224.8s @ 1.07GB)
#15 409.9   10,736 (88.70%) of 12,104 classes reachable
#15 409.9   15,484 (58.14%) of 26,632 fields reachable
#15 410.0   53,185 (55.72%) of 95,456 methods reachable
#15 410.0      571 classes,   109 fields, and 3,015 methods registered for reflection
#15 410.0       63 classes,    68 fields, and    55 methods registered for JNI access
#15 410.0        4 native libraries: dl, pthread, rt, z
#15 422.2 [3/7] Building universe...                                                                              (12.2s @ 0.93GB)
#15 431.6 [4/7] Parsing methods...      [***]                                                                      (9.2s @ 1.44GB)
#15 436.9 [5/7] Inlining methods...     [***]                                                                      (4.5s @ 3.02GB)
#15 587.3 [6/7] Compiling methods...    [************]                                                           (150.3s @ 4.14GB)
#15 614.7 [7/7] Creating image...                                                                                 (16.0s @ 1.60GB)
#15 614.7   20.54MB (47.53%) for code area:    34,309 compilation units
#15 614.7   22.29MB (51.56%) for image heap:  280,629 objects and 8 resources
#15 614.7  400.01KB ( 0.90%) for other data
#15 614.7   43.22MB in total
#15 614.7 ------------------------------------------------------------------------------------------------------------------------
#15 614.7 Top 10 packages in code area:                               Top 10 object types in image heap:
#15 614.8    1.65MB sun.security.ssl                                     4.47MB byte[] for code metadata
#15 614.8  990.43KB java.util                                            2.93MB byte[] for general heap data
#15 614.8  727.77KB java.lang.invoke                                     2.56MB java.lang.String
#15 614.8  716.03KB com.sun.crypto.provider                              2.53MB java.lang.Class
#15 614.8  530.26KB io.netty.buffer                                      2.02MB byte[] for java.lang.String
#15 614.8  508.54KB java.lang                                          922.63KB com.oracle.svm.core.hub.DynamicHubCompanion
#15 614.8  450.63KB sun.security.x509                                  660.56KB java.util.HashMap$Node
#15 614.8  400.80KB java.io                                            597.69KB byte[] for reflection metadata
#15 614.8  396.05KB io.quarkus.runtime.generated                       490.62KB java.lang.String[]
#15 614.8  368.73KB java.util.concurrent                               391.26KB byte[] for embedded resources
#15 614.8   13.73MB for 398 more packages                                4.48MB for 2583 more object types
#15 614.8 ------------------------------------------------------------------------------------------------------------------------
#15 614.8                         5.2s (1.1% of total time) in 44 GCs | Peak RSS: 6.60GB | CPU load: 5.20
#15 614.8 ------------------------------------------------------------------------------------------------------------------------
#15 614.8 Produced artifacts:
#15 614.9  /code/target/native-arm-slow-reproducer-1.0.0-SNAPSHOT-native-image-source-jar/native-arm-slow-reproducer-1.0.0-SNAPSHOT-runner (executable)
#15 614.9  /code/target/native-arm-slow-reproducer-1.0.0-SNAPSHOT-native-image-source-jar/native-arm-slow-reproducer-1.0.0-SNAPSHOT-runner-build-output-stats.json (json)
#15 614.9  /code/target/native-arm-slow-reproducer-1.0.0-SNAPSHOT-native-image-source-jar/native-arm-slow-reproducer-1.0.0-SNAPSHOT-runner-timing-stats.json (raw)
#15 614.9  /code/target/native-arm-slow-reproducer-1.0.0-SNAPSHOT-native-image-source-jar/native-arm-slow-reproducer-1.0.0-SNAPSHOT-runner.build_artifacts.txt (txt)
#15 614.9 ========================================================================================================================
#15 614.9 Finished generating 'native-arm-slow-reproducer-1.0.0-SNAPSHOT-runner' in 7m 43s.

jdk-21

#15 158.2 ========================================================================================================================
#15 158.2 GraalVM Native Image: Generating 'native-arm-slow-reproducer-1.0.0-SNAPSHOT-runner' (executable)...
#15 158.2 ========================================================================================================================
#15 158.2 For detailed information and explanations on the build output, visit:
#15 158.2 https://github.com/oracle/graal/blob/master/docs/reference-manual/native-image/BuildOutput.md
#15 158.2 ------------------------------------------------------------------------------------------------------------------------
#15 179.7 [1/8] Initializing...                                                                                   (29.3s @ 0.12GB)
#15 179.7  Java version: 21.0.2+13-LTS, vendor version: Mandrel-23.1.2.0-Final
#15 179.7  Graal compiler: optimization level: 2, target machine: armv8-a
#15 179.7  C compiler: gcc (redhat, aarch64, 8.5.0)
#15 179.7  Garbage collector: Serial GC (max heap size: 80% of RAM)
#15 179.7  3 user-specific feature(s):
#15 179.7  - com.oracle.svm.thirdparty.gson.GsonFeature
#15 179.7  - io.quarkus.runner.Feature: Auto-generated class by Quarkus from the existing extensions
#15 179.7  - io.quarkus.runtime.graal.DisableLoggingFeature: Disables INFO logging during the analysis phase
#15 179.8 ------------------------------------------------------------------------------------------------------------------------
#15 179.8  3 experimental option(s) unlocked:
#15 179.8  - '-H:+AllowFoldMethods' (origin(s): command line)
#15 179.8  - '-H:BuildOutputJSONFile' (origin(s): command line)
#15 179.8  - '-H:-UseServiceLoaderFeature' (origin(s): command line)
#15 179.8 ------------------------------------------------------------------------------------------------------------------------
#15 179.8 Build resources:
#15 179.8  - 23.14GB of memory (75.6% of 30.60GB system memory, determined at start)
#15 179.8  - 32 thread(s) (100.0% of 32 available processor(s), determined at start)
#15 445.7 [2/8] Performing analysis...  [******]                                                                 (264.7s @ 0.79GB)
#15 445.8    10,792 reachable types   (86.0% of   12,545 total)
#15 445.8    15,433 reachable fields  (57.2% of   26,979 total)
#15 445.9    54,505 reachable methods (56.0% of   97,287 total)
#15 445.9     3,472 types,   107 fields, and 3,164 methods registered for reflection
#15 445.9        61 types,    61 fields, and    55 methods registered for JNI access
#15 445.9         4 native libraries: dl, pthread, rt, z
#15 463.5 [3/8] Building universe...                                                                              (17.6s @ 0.96GB)
#15 474.9 [4/8] Parsing methods...      [***]                                                                     (11.3s @ 0.57GB)
#15 484.0 [5/8] Inlining methods...     [***]                                                                      (8.3s @ 0.64GB)
#15 653.0 [6/8] Compiling methods...    [*************]                                                          (168.9s @ 0.88GB)
#15 667.3 [7/8] Layouting methods...    [****]                                                                    (14.2s @ 0.95GB)
#15 684.3 [8/8] Creating image...       [****]                                                                    (16.7s @ 1.10GB)
#15 684.3   20.59MB (44.91%) for code area:    34,685 compilation units
#15 684.3   24.87MB (54.23%) for image heap:  290,435 objects and 55 resources
#15 684.3  401.52kB ( 0.86%) for other data
#15 684.3   45.85MB in total
#15 684.3 ------------------------------------------------------------------------------------------------------------------------
#15 684.3 Top 10 origins of code area:                                Top 10 object types in image heap:
#15 684.3   11.62MB java.base                                            6.87MB byte[] for code metadata
#15 684.3    1.31MB svm.jar (Native Image)                               3.79MB byte[] for java.lang.String
#15 684.3  875.07kB modified-io.vertx.vertx-core-4.5.3.jar               2.64MB java.lang.Class
#15 684.3  840.23kB n.0.0-SNAPSHOT-runner.jar                            2.64MB java.lang.String
#15 684.3  536.34kB io.netty.netty-buffer-4.1.106.Final.jar              1.01MB byte[] for general heap data
#15 684.3  474.51kB io.netty.netty-common-4.1.106.Final.jar            927.44kB com.oracle.svm.core.hub.DynamicHubCompanion
#15 684.3  401.97kB io.netty.netty-codec-http-4.1.106.Final.jar        623.77kB byte[] for reflection metadata
#15 684.3  364.87kB io.netty.netty-transport-4.1.106.Final.jar         518.61kB java.lang.String[]
#15 684.3  363.66kB io.netty.netty-codec-http2-4.1.106.Final.jar       454.59kB java.util.HashMap$Node
#15 684.3  337.29kB io.smallrye.config.smallrye-config-core-3.5.4.jar  396.86kB byte[] for embedded resources
#15 684.3    3.36MB for 72 more packages                                 5.07MB for 2708 more object types
#15 684.3 ------------------------------------------------------------------------------------------------------------------------
#15 684.3 Recommendations:
#15 684.3  HEAP: Set max heap for improved and more predictable memory usage.
#15 684.3  CPU:  Enable more CPU features with '-march=native' for improved performance.
#15 684.4 ------------------------------------------------------------------------------------------------------------------------
#15 684.4                        26.3s (4.9% of total time) in 651 GCs | Peak RSS: 2.38GB | CPU load: 4.82
#15 684.4 ------------------------------------------------------------------------------------------------------------------------
#15 684.4 Produced artifacts:
#15 684.4  /code/target/native-arm-slow-reproducer-1.0.0-SNAPSHOT-native-image-source-jar/native-arm-slow-reproducer-1.0.0-SNAPSHOT-runner (executable)
#15 684.4  /code/target/native-arm-slow-reproducer-1.0.0-SNAPSHOT-native-image-source-jar/native-arm-slow-reproducer-1.0.0-SNAPSHOT-runner-build-output-stats.json (build_info)
#15 684.4 ========================================================================================================================
#15 684.4 Finished generating 'native-arm-slow-reproducer-1.0.0-SNAPSHOT-runner' in 8m 54s.

@ryanemerson
Copy link
Author

After a closer inspection of the logs I noticed that the build is actually running on x86_64 instead of aarch64 when using 22.3-java17.

This led me to have a second look at your Dockerfiles and the images they use. Dockerfile.22.3-java17 uses quay.io/quarkus/ubi-quarkus-native-image:22.3-java17 which is not a multi-arch image (and is also no longer supported). Multi-arch images were introduced in #200 and the image naming changed

Well I feel dumb 😅 We don't have any automated testing for our arm images, they're provided on a best effort basis for community user, which is why this wasn't detected. It seems nobody is actually using these images.

Thanks for looking into this @zakkak, much appreciated.

@fniephaus
Copy link

Is there still an issue on the Native Image side? #260 (comment) is somewhat expected and #260 (comment) shows the result: while the build takes ~1min longer on JDK 17, it only needs 2.38GB as opposed to 6.60GB of memory, even on a machine with 75.6% of 30.60GB of memory available.

@zakkak
Copy link
Collaborator

zakkak commented Feb 26, 2024

Is there still an issue on the Native Image side?

I think not.

#260 (comment) is somewhat expected and #260 (comment) shows the result: while the build takes ~1min longer on JDK 17, it only needs 2.38GB as opposed to 6.60GB of memory, even on a machine with 75.6% of 30.60GB of memory available.

True, but whether that's good or bad really depends on the use case. I have opened quarkusio/quarkus#38968 to give some options to Quarkus users, perhaps it would make sense to implement something similar directly on GraalVM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants