-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Metaspace improvements in QuarkusUnitTest (and dev mode!) - round 3 #36560
Conversation
Ok, now I understand why |
8eea1d4
to
e9b279b
Compare
@radcortez could you have a look at the commit related to config? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a small comment.
The changes looks good to me but I would like to have @radcortez to have a closer look at the config commit.
test-framework/junit5-internal/src/main/java/io/quarkus/test/QuarkusUnitTest.java
Show resolved
Hide resolved
Also, @stuartwdouglas could you have a look at the |
e9b279b
to
8981763
Compare
+1 for the H2 property. We're using the same in Hibernate ORM - for the same reasons.
This should be fine, as mentioned elsewhere we never designed Quarkus to be used with the SecurityManager. In fact I suggested this same change to Yoann and plan to apply the same to the other classloaders, with some additional optimisations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I approve, pending Roberto's opinion as suggested by Guillaume.
I know that DB_CLOSE_DELAY=-1 has bit us before with race conditions in tests. Basically if the number of connections drops to zero then the DB gets deleted in the middle of the test. This was a long time ago that we were seeing this so I don't remember all the details. |
Actually ignore that last comment, I misremembered what the different options do, and this looks fine. |
I think the ProtectionDomain changes are fine. |
This comment has been minimized.
This comment has been minimized.
core/runtime/src/main/java/io/quarkus/runtime/configuration/QuarkusConfigFactory.java
Show resolved
Hide resolved
8981763
to
4a68784
Compare
This comment has been minimized.
This comment has been minimized.
02e2966
to
c73b031
Compare
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a big fan of cleaning up mess like this, but I guess it is easier this way.
My recommendation is for tests that are not QuarkusTests to just use SmallRyeConfigBuilder
and SmallRyeConfig
directly instead of CofigProvider.getConfig
.
...ns/config-yaml/runtime/src/test/java/io/quarkus/config/yaml/runtime/ApplicationYamlTest.java
Outdated
Show resolved
Hide resolved
...t-client-config/runtime/src/test/java/io/quarkus/restclient/config/RestClientConfigTest.java
Show resolved
Hide resolved
test-framework/junit5-internal/src/main/java/io/quarkus/test/ConfigUtil.java
Outdated
Show resolved
Hide resolved
It doesn't matter in practice since the only test using this config reproduces a startup failure that happens before we start H2; but still, let's fix this for correctness.
Some config was lazily created and bound to the dev class loader through the configsForClassLoader map io.smallrye.config.SmallRyeConfigProviderResolver#getConfig(java.lang.ClassLoader), and the corresponding map entry was never cleared, resulting in the corresponding classloader never being garbage collected.
This removes a whole class of metaspace leaks caused by Thread.inheritedAccessControlContext referencing ProtectionDomains which reference older classloaders. Of course this may have impacts on how the SecurityManager behaves, but as I understand it, absolutely no part of Quarkus is ready to run with a SecurityManager enabled anyway.
Since a previous commit, we now remember the lazily-created config in QuarkusConfigFactory, which is good because in dev mode it allows us to release it upon restart (when we call setConfig(null)). However, in *tests* that rely on this lazily initialized config, just calling releaseConfig(ConfigProvider.getConfig()) is no longer enough, because of that remembered config in QuarkusConfigFactory: we need to reset that reference too. If we forget to reset it, then when KafkaDevServicesDevModeTestCase will execute, TestHTTPResourceManager#getUri will retrieve the leaked config from DefaultSerdeConfigTest or DefaultSchemaConfigTest, the injected URI in KafkaDevServicesDevModeTestCase will be wrong, and the test will fail (with very unhelpful error messages).
…tensions If someone calls ConfigProvider.getConfig() out of the blue in a test that doesn't use any Quarkus*Test extension, this will call QuarkusConfigFactory and leak config in two ways: 1. In QuarkusConfigFactory#config 2. In SmallRyeConfigProviderResolver, registering config for the TCCL, which in such a case is most likely the system CL. A well-behaved test would call QuarkusConfigFactory.setConfig(null) to clean up all that, but it's easy to miss and there is potential for ConfigProvider.getConfig() being called indirectly, so there's no way we can guarantee all tests are well-behaved. This should at least guarantee that after a badly behaving test executes, the next test using a Quarkus*Test extension will clean up the mess.
Thanks for the review.
Right, it's more of a safety net. One that's apparently necessary as I saw several similar cleanups throughout the test modules. Anyway... I addressed your comments and switched this PR to draft. Now I'll have a look at the test failures... This time I won't mark as ready for review until my fork's build passes. |
d8044eb
to
4b70a5a
Compare
If someone calls ConfigProvider.getConfig() out of the blue in a test that doesn't use any Quarkus*Test extension, this will indirectly call QuarkusConfigFactory and leak config in two ways: 1. In QuarkusConfigFactory#config 2. In SmallRyeConfigProviderResolver, registering config for the TCCL, which in such a case is most likely the system CL. Thus, a well-behaved test should call QuarkusConfigFactory.setConfig(null) to clean up all that, no just SmallRyeConfigProviderResolver.releaseConfig(). Similarly, tests that register configuration explicitly can just call QuarkusConfigFactory.setConfig(config) at the beginning and QuarkusConfigFactory.setConfig(null) at the end, which will properly simulate how a real Quarkus application behaves, and should cover all edge cases involving multiple classloaders, properly cleaning up everything at the end of the test.
4b70a5a
to
683741b
Compare
Alright the build passed on my fork (if we ignore some flakes). @radcortez I had to revert a change to |
Failing Jobs - Building 683741b
|
From what I can see, the failing builds are failing on other PRs as well:
I looked at the logs and they don't seem to point at any problem that would obviously come from from this PR. So, I'll merge. Thanks for the reviews! |
Follows up on #35407
Disclaimer: I didn't get the chance to run the whole test suite locally, so... fingers crossed :/
With these changes, I was able to run the Hibernate ORM quickstart, and after a few initial leaks (can't be avoided since we don't restart Vertx), new classloaders consistently got GC'd and I never got above 6
QuarkusClassLoader
instances (believe me, that's not a lot for dev mode).As to our own test suite... it's getting better, but we're far from being done. There are some annoying circular references caused by
io.netty.util.concurrent.GlobalEventExecutor#terminationFuture
and I thinkio.quarkus.test.ClearCache#clearAnnotationCache
is not working as well as we want it to (or maybe should be used more often, e.g. in continuous testing). And there's probably more.Regarding the content of this PR:
The first two commits are about improving the information we get when debugging or looking at heap dumps. Because believe me, when there's a dozen different "Augmentation Class Loader: TEST" in the same heap dump, it's not funny at all.
The next commit fixes a JDBC URL that seemed suspicious, but doesn't really impact anything (the test that uses this config triggers a startup failure on purpose before the database is started anyway).
The next commit change a few JDBC URLs to avoid a leak caused by H2 when it registerers a JVM Shutdown Hook (which lasts forever) which indirectly references a=> Removed, see #36560 (comment)QuarkusClassLoader
(which is supposed to be short-lived).Then come a few commits that impact dev mode, so might have an impact in the real world out there. Because it turns out a lot of the leaks in our test suite are actually caused by tests of dev mode, since dev mode is very leaky: basically each dev mode restart leaks a whole application worth of classloaders.
These last commits, in order:
We apply the same fix for H2 JVM Shutdown Hooks, but this time for dev services=> Removed, see Metaspace improvements in QuarkusUnitTest (and dev mode!) - round 3 #36560 (comment)ForkJoinPool#commonPool
threads for everyCuratedApplication#run
method (for some reason there are 3 of them)io.smallrye.config.SmallRyeConfigProviderResolver
that will forever reference a classloader lower in the hierarchy, making it impossible to GC. cc @mkouba : this is the problem we've been investigating the other day; I tried a few solutions and this one seems to be both safe and effective.java.lang.Thread#inheritedAccessControlContext
has an array ofProtectionDomain
with references to classloaders. In effect, any time we create a thread, it has a high chance of preventing several classloaders from being GC. This is very annoying for Vertx threads andForkJoinPool#commonPool
threads, in particular. cc @Sanne:, who noticed this independently, but much sooner than I did :)