Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HttpAdvancedIT fails with Keyclaok container insufficient memory error in Podman Jenkins jobs #2106

Closed
jcarranzan opened this issue Oct 21, 2024 · 16 comments
Labels
priority/urgent Issue that needs to be resolved by next RHBQ release.

Comments

@jcarranzan
Copy link
Contributor

It's just failing in Podman Jenkins jobs and HttpAdvancedIT, but we use the same Keycloak service container for other tests in http-module.

Some points about my previous investigation:

@michalvavrik
Copy link
Member

  • IIUC Keycloak consumes all available memory it needs, therefore we need to check that Set memory limit for Keycloak container quarkus-test-framework#1351 has effect and it reduces available memory inside container. Let's try to set it to 1 MiB and expect failure. Let's try to set it to raise default value in failing tests. Trouble with settings Java heap limit is that metaspace could still raise and that is why I think it is preferable to limit total container memory
  • we need to test how much memory is available on the executor in total before this test is started, so that we know whether there is not something leaking before this test starts

@jcarranzan
Copy link
Contributor Author

Summary of Investigations:

  1. Attempt with 1 MiB Memory Limit:
    @KeycloakContainer( command = { "start-dev", "--import-realm", "--hostname-strict=false" }, memoryLimitMiB = 1 )
    Outcome:
    The Keycloak container failed to start because the minimum required memory is 6 MiB.
    I then increased the memory limit to 10 MiB:
    @KeycloakContainer( command = { "start-dev", "--import-realm", "--hostname-strict=false" }, memoryLimitMiB = 10 )
    but the container crashed with the following error:
2024-10-21 12:54:04,523 ERROR [tc.qua.io/.0] (main) Could not start container: java.lang.IllegalStateException: Wait strategy failed. Container crashed with out-of-memory (OOMKilled)
2024-10-21 12:54:04,527 ERROR [tc.qua.io/.0] (main) Log output from the failed container:
/opt/keycloak/bin/kc.sh: line 169:    84 Killed                  'java' -Dkc.config.build-and-exit=true -Djava.util.concurrent.ForkJoinPool.common.threadFactory=io.quarkus.bootstrap.forkjoin.QuarkusForkJoinWorkerThreadFactory '-XX:MetaspaceSize=96M' '-XX:MaxMetaspaceSize=256m' '-Dfile.encoding=UTF-8' '-Dsun.stdout.encoding=UTF-8' '-Dsun.err.encoding=UTF-8' '-Dstdout.encoding=UTF-8' '-Dstderr.encoding=UTF-8' '-XX:+ExitOnOutOfMemoryError' '-Djava.security.egd=file:/dev/urandom' '-XX:+UseG1GC' '-XX:FlightRecorderOptions=stackdepth=512' '-XX:MaxRAMPercentage=70' '-XX:MinRAMPercentage=70' '-XX:InitialRAMPercentage=50' '--add-opens=java.base/java.util=ALL-UNNAMED' '--add-opens=java.base/java.util.concurrent=ALL-UNNAMED' '--add-opens=java.base/java.security=ALL-UNNAMED' '-Duser.language=en' '-Duser.country=US' -Dkc.home.dir='/opt/keycloak/bin/..' -Djboss.server.config.dir='/opt/keycloak/bin/../conf' -Djava.util.logging.manager=org.jboss.logmanager.LogManager -Dpicocli.disable.closures=true -Dquarkus-log-max-startup-records=10000 -cp '/opt/keycloak/bin/../lib/quarkus-run.jar' io.quarkus.bootstrap.runner.QuarkusEntryPoint --profile=dev start-dev '--import-realm' '--hostname-strict=false'

  1. Attempt with Increased Memory Limit (2000 MiB):
    And I also modified the test to print the memory logs before container is starting

@KeycloakContainer(
    command = { "start-dev", "--import-realm", "--hostname-strict=false" },
    memoryLimitMiB = 2000
)

Executed in our Podman-based Jenkins job.
Log Jenkins job output:

13:16:02 [INFO] Running io.quarkus.ts.http.advanced.HttpAdvancedIT
13:16:02 === JVM Memory Info ===
13:16:02 Free Memory: 453 MB
13:16:02 Total Memory: 1788 MB
13:16:02 Max Memory: 1924 MB
13:16:02 === System Memory Info ===
13:16:02 Total Physical Memory: 7695 MB
13:16:02 Free Physical Memory: 1263 MB
13:16:02 === Disk Space Info ===
13:16:02 Total Disk Space: 40847 MB
13:16:02 Free Disk Space: 32480 MB
13:16:12 11:16:12,469 Container quay.io/keycloak/keycloak:25.0 is starting: 410b311947561a85de756b388a52cf266b40f549eb1976b4155f44f034d4a015
13:21:12 11:21:12,785 Could not start container: java.lang.IllegalStateException: Wait strategy failed. Container exited with code 1
13:21:12 	at org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:533)
13:21:12 	at org.testcontainers.containers.GenericContainer.lambda$doStart$0(GenericContainer.java:354)
3:21:12 11:21:12,813 Log output from the failed container:
13:21:12 OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000006af400000, 4034920448, 0) failed; error='Not enough space' (errno=12)
13:21:12 #
13:21:12 # There is insufficient memory for the Java Runtime Environment to continue.
13:21:12 # Native memory allocation (mmap) failed to map 4034920448 bytes. Error detail: committing reserved memory.
13:21:12 # An error report file with more information is saved as:
13:21:12 # /tmp/hs_err_pid84.log


@michalvavrik michalvavrik added the priority/urgent Issue that needs to be resolved by next RHBQ release. label Oct 21, 2024
@rsvoboda
Copy link
Member

What about " memoryLimitMiB = 768?"

What === JVM Memory Info === is printed out when .withProperty("JAVA_OPTS", "-Xms512m -Xmx1g") approach is used ?

@jcarranzan
Copy link
Contributor Author

  • Just with @KeycloakContainer(command = { "start-dev", "--import-realm", "--hostname-strict=false" }, memoryLimitMiB = 768)

Output in Jenkins:

09:36:46 [INFO] Running io.quarkus.ts.http.advanced.HttpAdvancedIT
09:36:47 === JVM Memory Info ===
**09:36:47 Free Memory: 433 MB**
09:36:47 Total Memory: 1537 MB
09:36:47 Max Memory: 1924 MB
09:36:47 === System Memory Info ===
09:36:47 Total Physical Memory: 7695 MB
09:36:47 Free Physical Memory: 1254 MB
09:36:47 === Disk Space Info ===
09:36:47 Total Disk Space: 40847 MB
09:36:47 Free Disk Space: 32480 MB
  • With just withProperty("JAVA_OPTS", "-Xms512m -Xmx1g") approach:

Output in Jenkins:

0:01:47 === JVM Memory Info ===
**10:01:47 Free Memory: 553 MB**
10:01:47 Total Memory: 1664 MB
10:01:47 Max Memory: 1924 MB
10:01:47 === System Memory Info ===
10:01:47 Total Physical Memory: 7695 MB
10:01:47 Free Physical Memory: 1402 MB
10:01:47 === Disk Space Info ===
10:01:47 Total Disk Space: 40847 MB
10:01:47 Free Disk Space: 32480 MB

and test passed OK

@rsvoboda
Copy link
Member

and test passed OK

only the last one or both cases?

@jcarranzan
Copy link
Contributor Author

and test passed OK

only the last one or both cases?

Only the last one, with withProperty("JAVA_OPTS", "-Xms512m -Xmx1g")

@rsvoboda
Copy link
Member

Both are using similar memory, so what's the difference?

@rsvoboda
Copy link
Member

@mabartos do you have any idea why the withProperty("JAVA_OPTS", "-Xms512m -Xmx1g") approach would work better than your quarkus-qe/quarkus-test-framework#1351 change?

@michalvavrik
Copy link
Member

That's all interesting, but as far as actual failure goes, this sentence from the issue description is what matters most:

The issue occurs before tests start. The failure is nor related to the HttpAdvancedIT tests. When Keycloak container tries to allocate memory ,ther's already insufficient memory due to a resource leak.

Here is what you need to do: implement org.junit.platform.launcher.TestExecutionListener (here is example how I did it in TFW ), and print out memory information in executionFinished, executionStarted, testPlanExecutionFinished, testPlanExecutionStarted. Ideally save it so that you can scroll through results and identify exact place where memory jumped or wasn't cleaned.

@mabartos
Copy link

There are two things:

  1. What other services/processes are started when these tests are executed? Seems odd to me that for the test execution, there's available only 1250MB out of 8GB of the system memory. Is the container properly destroyed after the test class execution and does it release the memory? Does the issue occur even when only the HttpAdvancedIT test is executed in Jenkins? Or is it reproducible only when all these tests are executed and possibly some resources are held over the tests execution?
  2. It seems the container limit is not set, or it is not handled well in some cases from the testcontainers side. The maximum heap size for Keycloak in a container should be 70% of the whole available container memory. Specifically, when the limit is set, the cgroup memory.limit_in_bytes should be the limit - when limit is set to 1GB, the max heap size should be ~700MB. However, you can see in the log, that it tries to allocate ~4GB, which is weird.

After my experiments on my local machine, the container limit was always set and I haven't spotted any issue with that. Wondering why it behaves differently in Jenkins.

As values -Xms, and -Xmx are static (hardcoded), they are always respected.

@jcarranzan Until this issue is resolved, and your workaround sets these values, it'd be good to change the JAVA_OPTS to JAVA_OPTS_KC_HEAP(Keycloak takes it into consideration), in order to not override all JVM opts.

@jcarranzan
Copy link
Contributor Author

jcarranzan commented Oct 25, 2024

After implementing local changes in our test framework based on @michalvavrik's suggestions, I've identified that the GzipMaxInputIT test in the http-advanced module causes a rapid spike in memory consumption from 1858MB down to 726MB of free physical memory.

Additionally, the Keycloak container consistently tries to allocate 4GB of memory instead of the intended 1GB. Despite attempts to adjust the test framework in my branch(https://github.com/jcarranzan/quarkus-test-framework/pull/new/keycloak-memory-improvements) to enforce a 1GB limit, these efforts haven't been successful.

As previously mentioned, this issue could be resolved by configuring the JAVA_OPTS_KC_HEAP environment variable. You can find more details on setting JVM options for Keycloak here:

https://www.keycloak.org/keycloak-benchmark/kubernetes-guide/latest/running/jvm/jvm_options

But also, disable the GzipMaxInputIT tests, Keycloak container started successfully:

16:25:42 14:25:42,025 INFO  ## Running test HeadersIT.testWithNoAcceptHeader()
16:25:42 [Execution finished: testWithNoAcceptHeader()]
16:25:42 === JVM Memory Info ===
16:25:42 Free Memory: 168 MB
16:25:42 Total Memory: 637 MB
16:25:42 Max Memory: 1924 MB
16:25:42 === System Memory Info ===
16:25:42 Total Physical Memory: 7695 MB
16:25:42 Free Physical Memory: 1711 MB
16:25:42 === Disk Space Info ===
16:25:42 Total Disk Space: 40847 MB
16:25:42 Free Disk Space: 32067 MB
16:25:42 --------------------------------------------------
16:25:44 14:25:44,105 INFO  [app] Service stopped (Quarkus JVM mode)
16:25:44 [INFO] Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 8.577 s -- in io.quarkus.ts.http.advanced.HeadersIT
16:25:44 [Execution finished: HeadersIT]
16:25:44 === JVM Memory Info ===
16:25:44 Free Memory: 165 MB
16:25:44 Total Memory: 637 MB
16:25:44 Max Memory: 1924 MB
16:25:44 === System Memory Info ===
16:25:44 Total Physical Memory: 7695 MB
16:25:44 Free Physical Memory: 1880 MB
16:25:44 === Disk Space Info ===
16:25:44 Total Disk Space: 40847 MB
16:25:44 Free Disk Space: 32115 MB
16:25:44 --------------------------------------------------
16:25:44 [Execution started: HttpAdvancedIT]
16:25:44 === JVM Memory Info ===
16:25:44 Free Memory: 164 MB
16:25:44 Total Memory: 637 MB
16:25:44 Max Memory: 1924 MB
16:25:44 === System Memory Info ===
16:25:44 Total Physical Memory: 7695 MB
16:25:44 Free Physical Memory: 1880 MB
16:25:44 === Disk Space Info ===
16:25:44 Total Disk Space: 40847 MB
16:25:44 Free Disk Space: 32115 MB
16:25:44 --------------------------------------------------
16:25:44 [INFO] Running io.quarkus.ts.http.advanced.HttpAdvancedIT
16:25:44 14:25:44,786 INFO  ⭐  PKCS12 keystore and truststore generated successfully!
16:25:44 14:25:44,786 INFO  🔐  Key Store File: /tmp/quarkus-qe-certs13130330053228786909/quarkus-qe-keystore.p12
16:25:44 14:25:44,787 INFO  🔓  Trust Store File: /tmp/quarkus-qe-certs13130330053228786909/quarkus-qe-truststore.p12
16:25:44 14:25:44,841 INFO  [keycloak] Initialize service (quay.io/keycloak/keycloak:25.0)
16:25:44 14:25:44,869 Image pull policy will be performed by: DefaultPullPolicy()
16:25:44 14:25:44,872 Image name substitution will be performed by: DefaultImageNameSubstitutor (composite of 'ConfigurationFileImageNameSubstitutor' and 'PrefixingImageNameSubstitutor')
16:25:44 14:25:44,889 Testcontainers version: 1.20.3
16:25:45 14:25:45,468 Found Docker environment with Environment variables, system properties and defaults. Resolved dockerHost=unix:///run/user/600/podman/podman.sock
16:25:45 14:25:45,469 Docker host IP address is localhost
16:25:45 14:25:45,519 Connected to docker: 
16:25:45   Server Version: 4.9.4-rhel
16:25:45   API Version: 1.41
16:25:45   Operating System: rhel
16:25:45   Total Memory: 7695 MB
16:25:45 14:25:45,542 Pulling docker image: testcontainers/ryuk:0.9.0. Please be patient; this may take some time but only needs to be done once.
16:25:45 14:25:45,765 Starting to pull image
16:25:45 14:25:45,788 Pulling image layers:  1 pending,  1 downloaded,  0 extracted, (0 bytes/? MB)
16:25:45 14:25:45,789 Pulling image layers:  1 pending,  2 downloaded,  0 extracted, (0 bytes/? MB)
16:25:46 14:25:46,069 Pulling image layers:  0 pending,  3 downloaded,  0 extracted, (0 bytes/0 bytes)
16:25:46 14:25:46,072 Pulling image layers:  0 pending,  4 downloaded,  0 extracted, (0 bytes/0 bytes)
16:25:46 14:25:46,078 Pulling image layers:  0 pending,  4 downloaded,  0 extracted, (0 bytes/0 bytes)
16:25:46 14:25:46,079 Image testcontainers/ryuk:0.9.0 pull took PT0.536755073S
16:25:46 14:25:46,098 Creating container for image: testcontainers/ryuk:0.9.0
16:25:46 14:25:46,158 Container testcontainers/ryuk:0.9.0 is starting: 2336473375ad30c7024f350dd63fafd8e03a5754751695c6d1ed6a5674a12316
16:25:46 14:25:46,559 Container testcontainers/ryuk:0.9.0 started in PT0.460703204S
16:25:46 14:25:46,562 Ryuk started - will monitor and terminate Testcontainers containers on JVM exit
16:25:46 14:25:46,562 Checking the system...
16:25:46 14:25:46,563 ✔︎ Docker server version should be at least 1.6.0
16:25:46 14:25:46,565 Pulling docker image: quay.io/keycloak/keycloak:25.0. Please be patient; this may take some time but only needs to be done once.
16:25:50 14:25:50,662 Starting to pull image
16:25:50 14:25:50,672 Pulling image layers:  0 pending,  1 downloaded,  0 extracted, (0 bytes/0 bytes)
16:25:51 14:25:51,130 Pulling image layers:  2 pending,  2 downloaded,  0 extracted, (0 bytes/? MB)
16:25:53 14:25:53,427 Pulling image layers:  1 pending,  3 downloaded,  0 extracted, (131 MB/? MB)
16:25:55 14:25:55,403 Pulling image layers:  0 pending,  4 downloaded,  0 extracted, (131 MB/217 MB)
16:25:55 14:25:55,410 Pulling image layers:  0 pending,  5 downloaded,  0 extracted, (131 MB/217 MB)
16:25:55 14:25:55,435 Pulling image layers:  0 pending,  5 downloaded,  0 extracted, (131 MB/217 MB)
16:25:55 14:25:55,435 Pull complete. 5 layers, pulled in 4s (downloaded 131 MB at 32 MB/s)
16:25:55 14:25:55,435 Image quay.io/keycloak/keycloak:25.0 pull took PT8.87001428S
16:25:55 14:25:55,442 Creating container for image: quay.io/keycloak/keycloak:25.0
16:25:55 14:25:55,615 Container quay.io/keycloak/keycloak:25.0 is starting: f13b9bb0029d9ff53e8b6338e2673268fef8e4332a429a4cc38048804e631fdf
16:26:14 14:26:14,029 Container quay.io/keycloak/keycloak:25.0 started in PT18.586993393S
16:26:16 14:26:16,036 INFO  [keycloak] Service started (quay.io/keycloak/keycloak:25.0)
16:26:16 14:26:16,036 INFO  [app] Initialize service (Quarkus JVM mode)

@rsvoboda
Copy link
Member

@jcarranzan sounds like quarkus-test-service-keycloak/src/main/java/io/quarkus/test/services/containers/KeycloakGenericDockerContainerManagedResource.java should use JAVA_OPTS_KC_HEAP approach instead of config.withMemory

GzipMaxInputIT - please create QQE task to investigate memory consumption characteristics of Quarkus app using gzip, something to look into before next lts.

@mabartos
Copy link

@jcarranzan sounds like quarkus-test-service-keycloak/src/main/java/io/quarkus/test/services/containers/KeycloakGenericDockerContainerManagedResource.java should use JAVA_OPTS_KC_HEAP approach instead of config.withMemory

I'm not really sure why the container limit is not propagated in the test containers. However, as it works after experiments on my local machine, additionally there are no issues on the Keycloak side around the memory parameters, it's very unlikely it's a Keycloak issue. In production envs, the testcontainers are not used and the docker/podman memory limits are properly used.

The issue should be related to the testcontainers, or handling KeycloakGenericDockerContainerManagedResource itself.

I'd say for purposes of this testsuite, setting the mentioned JAVA_OPTS_KC_HEAP is suitable.

@mabartos
Copy link

mabartos commented Oct 29, 2024

Probably more suitable solution would be:

container.withEnv("JAVA_OPTS_APPEND", "-XX:MaxRAM=1g");

As we would create an artificial memory limit, which will be applied and not fragile as by setting the testcontainer container memory limit.

We'd be still able to use the relative heap settings defined in Keycloak, as we'd define the parameter in JAVA_OPTS_APPEND var. It could help even us in Keycloak to find potential issues in the future.

@jcarranzan Does it work for you?

@jcarranzan
Copy link
Contributor Author

Yes @mabartos it works, I will test it and open a PR in our TF with my tweaks. Thanks.

@jcarranzan
Copy link
Contributor Author

Fixed with quarkus-qe/quarkus-test-framework#1389

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority/urgent Issue that needs to be resolved by next RHBQ release.
Projects
None yet
Development

No branches or pull requests

4 participants