Metaspace improvements in QuarkusUnitTest (and dev mode!) - round 3 #36560

yrodiere · 2023-10-18T15:16:42Z

Follows up on #35407

Disclaimer: I didn't get the chance to run the whole test suite locally, so... fingers crossed :/

With these changes, I was able to run the Hibernate ORM quickstart, and after a few initial leaks (can't be avoided since we don't restart Vertx), new classloaders consistently got GC'd and I never got above 6 QuarkusClassLoader instances (believe me, that's not a lot for dev mode).

As to our own test suite... it's getting better, but we're far from being done. There are some annoying circular references caused by io.netty.util.concurrent.GlobalEventExecutor#terminationFuture and I think io.quarkus.test.ClearCache#clearAnnotationCache is not working as well as we want it to (or maybe should be used more often, e.g. in continuous testing). And there's probably more.

Regarding the content of this PR:

The first two commits are about improving the information we get when debugging or looking at heap dumps. Because believe me, when there's a dozen different "Augmentation Class Loader: TEST" in the same heap dump, it's not funny at all.

The next commit fixes a JDBC URL that seemed suspicious, but doesn't really impact anything (the test that uses this config triggers a startup failure on purpose before the database is started anyway).

The next commit changes a few JDBC URLs to avoid a leak caused by H2 when it registers a JVM Shutdown Hook (which lasts forever) that indirectly references a QuarkusClassLoader (which is supposed to be short-lived). => Removed, see #36560 (comment)
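
To illustrate why a JVM Shutdown Hook can pin short-lived objects, here is a minimal sketch using only `java.lang.Runtime`. It is not H2's code: the class `ShutdownHookSketch` and its methods are hypothetical names made up for this example. The JVM keeps every registered hook thread reachable until exit, so whatever the hook's `Runnable` captures stays reachable too, unless the hook is explicitly removed.

```java
class ShutdownHookSketch {

    /** Registers a hook whose lambda captures {@code resource}, keeping it reachable until JVM exit. */
    static Thread registerHook(Object resource) {
        Thread hook = new Thread(() -> System.out.println("closing " + resource), "sketch-shutdown-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    /** Unregisters the hook so the JVM no longer references it; returns false if it was not registered. */
    static boolean unregister(Thread hook) {
        return Runtime.getRuntime().removeShutdownHook(hook);
    }

    public static void main(String[] args) {
        Thread hook = registerHook(new Object());
        System.out.println("removed on first call: " + unregister(hook));
        System.out.println("removed on second call: " + unregister(hook));
    }
}
```

A library that registers such a hook internally usually does not expose the hook thread, which is why the workaround discussed here went through a JDBC URL setting rather than `removeShutdownHook`.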

Then come a few commits that impact dev mode, so might have an impact in the real world out there. Because it turns out a lot of the leaks in our test suite are actually caused by tests of dev mode, since dev mode is very leaky: basically each dev mode restart leaks a whole application worth of classloaders.

These last commits, in order:

~~We apply the same fix for H2 JVM Shutdown Hooks, but this time for dev services~~ => Removed, see #36560 (comment)
We make sure to reset the TCCL of ForkJoinPool#commonPool threads for every CuratedApplication#run method (for some reason there are 3 of them)
We simplify Smallrye config setting/release so that we don't end up with a classloader high in the hierarchy referencing a version of io.smallrye.config.SmallRyeConfigProviderResolver that will forever reference a classloader lower in the hierarchy, making it impossible to GC. cc @mkouba : this is the problem we were investigating the other day; I tried a few solutions and this one seems to be both safe and effective.
And finally, we fix a very extensive leak that probably affects every application in dev mode: java.lang.Thread#inheritedAccessControlContext has an array of ProtectionDomain with references to classloaders. In effect, any time we create a thread, it has a high chance of preventing several classloaders from being GC'd. This is very annoying for Vertx threads and ForkJoinPool#commonPool threads, in particular. cc @Sanne, who noticed this independently, but much sooner than I did :)
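
As a rough illustration of the TCCL part of the list above, the sketch below shows a long-lived thread inheriting the thread context classloader of whoever created it, and how resetting it drops that reference. This is not the code from this PR: `TcclResetSketch`, `startWorkerUnder` and `resetTccl` are hypothetical names, and the sketch does not touch `inheritedAccessControlContext`, which is a private JDK field.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.concurrent.CountDownLatch;

class TcclResetSketch {

    /** Starts a parked daemon thread while {@code loader} is the caller's TCCL, so the new thread inherits it. */
    static Thread startWorkerUnder(ClassLoader loader, CountDownLatch stop) {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        current.setContextClassLoader(loader);
        try {
            Thread worker = new Thread(() -> {
                try {
                    stop.await(); // stay alive, like a pooled thread waiting for work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "long-lived-worker");
            worker.setDaemon(true);
            worker.start();
            return worker;
        } finally {
            // Restore the caller's own TCCL; the worker keeps the inherited one.
            current.setContextClassLoader(previous);
        }
    }

    /** Replaces the thread's TCCL so it no longer references the short-lived loader. */
    static void resetTccl(Thread thread, ClassLoader replacement) {
        thread.setContextClassLoader(replacement);
    }

    public static void main(String[] args) throws Exception {
        CountDownLatch stop = new CountDownLatch(1);
        try (URLClassLoader shortLived = new URLClassLoader(new URL[0], ClassLoader.getSystemClassLoader())) {
            Thread worker = startWorkerUnder(shortLived, stop);
            System.out.println("worker inherited short-lived loader: " + (worker.getContextClassLoader() == shortLived));
            resetTccl(worker, ClassLoader.getSystemClassLoader());
            System.out.println("worker still references it after reset: " + (worker.getContextClassLoader() == shortLived));
        } finally {
            stop.countDown();
        }
    }
}
```

In the sketch the worker is a plain thread; for pooled threads the same reset would have to run on each pool thread, which is presumably why it needs to happen on every application start.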

quarkus-bot · 2023-10-18T15:17:24Z

/cc @Sanne (hibernate-orm), @gsmet (hibernate-orm)

yrodiere · 2023-10-18T15:29:09Z

Ok, now I understand why QuarkusClassLoader doesn't pass its name to the super constructor: to avoid ugly stacktraces. I'll change the first commit.

gsmet · 2023-10-18T15:30:42Z

@radcortez could you have a look at the commit related to config?

gsmet

Added a small comment.

The changes look good to me, but I would like @radcortez to have a closer look at the config commit.

test-framework/junit5-internal/src/main/java/io/quarkus/test/QuarkusUnitTest.java

gsmet · 2023-10-18T15:34:59Z

Also, @stuartwdouglas could you have a look at the ProtectionDomain commit, just to make sure we haven't missed something?

Sanne · 2023-10-18T15:41:56Z

+1 for the H2 property. We're using the same in Hibernate ORM - for the same reasons.

Also, @stuartwdouglas could you have a look at the ProtectionDomain commit, just to make sure we haven't missed something?

This should be fine, as mentioned elsewhere we never designed Quarkus to be used with the SecurityManager. In fact I suggested this same change to Yoann and plan to apply the same to the other classloaders, with some additional optimisations.

Sanne

I approve, pending Roberto's opinion as suggested by Guillaume.

stuartwdouglas · 2023-10-18T19:27:26Z

I know that DB_CLOSE_DELAY=-1 has bitten us before with race conditions in tests. Basically if the number of connections drops to zero then the DB gets deleted in the middle of the test. It was a long time ago that we were seeing this, so I don't remember all the details.

stuartwdouglas · 2023-10-18T19:30:21Z

Actually ignore that last comment, I misremembered what the different options do, and this looks fine.

stuartwdouglas · 2023-10-18T20:08:14Z

I think the ProtectionDomain changes are fine.

core/runtime/src/main/java/io/quarkus/runtime/configuration/QuarkusConfigFactory.java

radcortez

I'm not a big fan of cleaning up mess like this, but I guess it is easier this way.

My recommendation is for tests that are not QuarkusTests to just use SmallRyeConfigBuilder and SmallRyeConfig directly instead of ConfigProvider.getConfig.

...ns/config-yaml/runtime/src/test/java/io/quarkus/config/yaml/runtime/ApplicationYamlTest.java

...t-client-config/runtime/src/test/java/io/quarkus/restclient/config/RestClientConfigTest.java

test-framework/junit5-internal/src/main/java/io/quarkus/test/ConfigUtil.java

Some config was lazily created and bound to the dev class loader through the configsForClassLoader map io.smallrye.config.SmallRyeConfigProviderResolver#getConfig(java.lang.ClassLoader), and the corresponding map entry was never cleared, resulting in the corresponding classloader never being garbage collected.

This removes a whole class of metaspace leaks caused by Thread.inheritedAccessControlContext referencing ProtectionDomains which reference older classloaders. Of course this may have impacts on how the SecurityManager behaves, but as I understand it, absolutely no part of Quarkus is ready to run with a SecurityManager enabled anyway.

Since a previous commit, we now remember the lazily-created config in QuarkusConfigFactory, which is good because in dev mode it allows us to release it upon restart (when we call setConfig(null)). However, in *tests* that rely on this lazily initialized config, just calling releaseConfig(ConfigProvider.getConfig()) is no longer enough, because of that remembered config in QuarkusConfigFactory: we need to reset that reference too. If we forget to reset it, then when KafkaDevServicesDevModeTestCase executes, TestHTTPResourceManager#getUri will retrieve the leaked config from DefaultSerdeConfigTest or DefaultSchemaConfigTest, the injected URI in KafkaDevServicesDevModeTestCase will be wrong, and the test will fail (with very unhelpful error messages).

…tensions

If someone calls ConfigProvider.getConfig() out of the blue in a test that doesn't use any Quarkus*Test extension, this will call QuarkusConfigFactory and leak config in two ways:

1. In QuarkusConfigFactory#config
2. In SmallRyeConfigProviderResolver, registering config for the TCCL, which in such a case is most likely the system CL.

A well-behaved test would call QuarkusConfigFactory.setConfig(null) to clean up all that, but it's easy to miss and there is potential for ConfigProvider.getConfig() being called indirectly, so there's no way we can guarantee all tests are well-behaved. This should at least guarantee that after a badly behaving test executes, the next test using a Quarkus*Test extension will clean up the mess.

yrodiere · 2023-10-26T07:01:23Z

Thanks for the review.

I'm not a big fan of cleaning up mess like this, but I guess it is easier this way.

Right, it's more of a safety net. One that's apparently necessary as I saw several similar cleanups throughout the test modules.
Ideally we'd just have tests that don't leak, but I don't think anyone wants to deal with that right now.

Anyway... I addressed your comments and switched this PR to draft. Now I'll have a look at the test failures... This time I won't mark as ready for review until my fork's build passes.

If someone calls ConfigProvider.getConfig() out of the blue in a test that doesn't use any Quarkus*Test extension, this will indirectly call QuarkusConfigFactory and leak config in two ways:

1. In QuarkusConfigFactory#config
2. In SmallRyeConfigProviderResolver, registering config for the TCCL, which in such a case is most likely the system CL.

Thus, a well-behaved test should call QuarkusConfigFactory.setConfig(null) to clean up all that, not just SmallRyeConfigProviderResolver.releaseConfig(). Similarly, tests that register configuration explicitly can just call QuarkusConfigFactory.setConfig(config) at the beginning and QuarkusConfigFactory.setConfig(null) at the end, which will properly simulate how a real Quarkus application behaves, and should cover all edge cases involving multiple classloaders, properly cleaning up everything at the end of the test.
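
The two leak locations described above can be modelled with a small self-contained sketch. Everything in it is hypothetical: `Resolver` and `Factory` are simplified stand-ins written for this example, not the real SmallRyeConfigProviderResolver or QuarkusConfigFactory, and their methods only mimic the behavior described in this commit message.

```java
import java.util.HashMap;
import java.util.Map;

class ConfigLeakSketch {

    /** Hypothetical stand-in for a resolver that caches one config per classloader. */
    static final class Resolver {
        private final Map<ClassLoader, Object> configsForClassLoader = new HashMap<>();

        void registerConfig(Object config, ClassLoader classLoader) {
            configsForClassLoader.put(classLoader, config);
        }

        /** Drops every map entry pointing at this config instance. */
        void releaseConfig(Object config) {
            configsForClassLoader.values().removeIf(c -> c == config);
        }

        boolean holds(ClassLoader classLoader) {
            return configsForClassLoader.containsKey(classLoader);
        }
    }

    /** Hypothetical stand-in for a factory that also remembers the current config. */
    static final class Factory {
        private final Resolver resolver;
        private Object config;

        Factory(Resolver resolver) {
            this.resolver = resolver;
        }

        /** Releases the previous config (if any), then registers the new one for the TCCL (if non-null). */
        void setConfig(Object newConfig) {
            if (config != null) {
                resolver.releaseConfig(config);
            }
            config = newConfig;
            if (newConfig != null) {
                resolver.registerConfig(newConfig, Thread.currentThread().getContextClassLoader());
            }
        }

        boolean remembersConfig() {
            return config != null;
        }
    }

    public static void main(String[] args) {
        Resolver resolver = new Resolver();
        Factory factory = new Factory(resolver);
        ClassLoader tccl = Thread.currentThread().getContextClassLoader();
        Object config = new Object();

        factory.setConfig(config);
        resolver.releaseConfig(config); // clears the map entry only
        System.out.println("resolver still holds TCCL entry: " + resolver.holds(tccl));
        System.out.println("factory still remembers config: " + factory.remembersConfig());

        factory.setConfig(null); // clears the remembered reference as well
        System.out.println("factory remembers config after setConfig(null): " + factory.remembersConfig());
    }
}
```

The point of the sketch is only that releasing through the resolver leaves the factory's reference behind, whereas going through the factory cleans up both places.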

yrodiere · 2023-10-27T11:35:03Z

Alright the build passed on my fork (if we ignore some flakes).

@radcortez I had to revert a change to TestResourceManager; apparently using QuarkusTestFactory there does more harm than good. I'm not sure even using releaseConfig there makes sense now, but that at least doesn't seem to hurt so I left it and added a comment: e76a0e5#diff-a050ea6d2087e6491a6cfcadd0ef58e0ffbe09935ab12850ca3adbc38ce14bffR182-R188

quarkus-bot · 2023-10-27T18:10:57Z

Failing Jobs - Building `683741b`

Status	Name	Step	Failures	Logs	Raw logs
✔️	JVM Tests - JDK 11
✖	JVM Tests - JDK 17	`Build`	⚠️ Check →	Logs	Raw logs
✖	JVM Tests - JDK 17 Windows	`Upload gc.log`	⚠️ Check →	Logs	Raw logs
✔️	JVM Tests - JDK 21
✖	Native Tests - Virtual Thread - Main		⚠️ Check →	Logs	Raw logs
✖	Native Tests - Windows - RESTEasy Jackson	`Setup GraalVM`	⚠️ Check →	Logs	Raw logs

yrodiere · 2023-10-30T07:04:59Z

From what I can see, the failing builds are failing on other PRs as well:

Native Tests - Virtual Thread - Main
- Fix order of defaults recording #36753
- Bump org.mvnpm.at.lit-labs:ssr-dom-shim from 1.1.1 to 1.1.2 #36732
Native Tests - Windows - RESTEasy Jackson
- Remove unsupported Jakarta Persistence Security extension guides from downstream YAML file #36771
- Fix order of defaults recording #36753

I looked at the logs and they don't seem to point at any problem that would obviously come from this PR. JVM Tests - JDK 17 and JVM Tests - JDK 17 Windows, in particular, just took more time than usual for no obvious reason.

So, I'll merge. Thanks for the reviews!

yrodiere requested review from Sanne and mkouba October 18, 2023 15:16

quarkus-bot bot added the area/config, area/core, area/devtools, area/hibernate-orm, area/panache, area/persistence, area/spring and area/testing labels Oct 18, 2023

yrodiere force-pushed the metaspace-leaks-3 branch from 8eea1d4 to e9b279b Compare October 18, 2023 15:30

gsmet reviewed Oct 18, 2023

yrodiere force-pushed the metaspace-leaks-3 branch from e9b279b to 8981763 Compare October 18, 2023 15:40

Sanne approved these changes Oct 18, 2023

gsmet requested a review from radcortez October 18, 2023 15:49

stuartwdouglas approved these changes Oct 18, 2023

radcortez reviewed Oct 18, 2023

yrodiere force-pushed the metaspace-leaks-3 branch from 8981763 to 4a68784 Compare October 19, 2023 07:09

quarkus-bot bot added the area/agroal label Oct 19, 2023

mkouba approved these changes Oct 19, 2023

yrodiere marked this pull request as ready for review October 25, 2023 12:05

yrodiere force-pushed the metaspace-leaks-3 branch from 02e2966 to c73b031 Compare October 25, 2023 12:05

radcortez requested changes Oct 25, 2023

yrodiere added 9 commits October 26, 2023 08:38

Clarify why QuarkusClassLoader doesn't pass its name to the super constructor

6bd7546

Enrich classloader names with the name of the application/QuarkusUnitTest

83a426a

Fix a suspicious H2 URL in Hibernate ORM test config

ba3d3b1

It doesn't matter in practice since the only test using this config reproduces a startup failure that happens before we start H2; but still, let's fix this for correctness.

Always reset the ForkJoinPool TCCL on startup

3374799

Improve error messages on KafkaDevServicesDevModeTestCase failures

2e9bfde

yrodiere marked this pull request as draft October 26, 2023 07:01

yrodiere force-pushed the metaspace-leaks-3 branch 2 times, most recently from d8044eb to 4b70a5a Compare October 26, 2023 07:02

yrodiere added 2 commits October 26, 2023 09:11

Merge test-related ConfigUtils into a single class

683741b

yrodiere force-pushed the metaspace-leaks-3 branch from 4b70a5a to 683741b Compare October 26, 2023 07:13

yrodiere marked this pull request as ready for review October 27, 2023 11:30

yrodiere requested a review from radcortez October 27, 2023 11:30

radcortez approved these changes Oct 27, 2023

yrodiere merged commit 8d6d112 into quarkusio:main Oct 30, 2023
60 of 64 checks passed

quarkus-bot bot added this to the 3.6 - main milestone Oct 30, 2023

yrodiere mentioned this pull request Nov 24, 2023

QUARKUS_PROFILE=dev and '-Dquarkus.profile=dev' not working for native image #37177

Closed

yrodiere deleted the metaspace-leaks-3 branch January 29, 2024 11:25
