Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metaspace improvements in QuarkusUnitTest (and dev mode!) - round 3 #36560

Merged
merged 11 commits into from
Oct 30, 2023

Conversation

yrodiere
Copy link
Member

@yrodiere yrodiere commented Oct 18, 2023

Follows up on #35407

Disclaimer: I didn't get the chance to run the whole test suite locally, so... fingers crossed :/

With these changes, I was able to run the Hibernate ORM quickstart, and after a few initial leaks (can't be avoided since we don't restart Vertx), new classloaders consistently got GC'd and I never got above 6 QuarkusClassLoader instances (believe me, that's not a lot for dev mode).

As to our own test suite... it's getting better, but we're far from being done. There are some annoying circular references caused by io.netty.util.concurrent.GlobalEventExecutor#terminationFuture and I think io.quarkus.test.ClearCache#clearAnnotationCache is not working as well as we want it to (or maybe should be used more often, e.g. in continuous testing). And there's probably more.


Regarding the content of this PR:

The first two commits are about improving the information we get when debugging or looking at heap dumps. Because believe me, when there's a dozen different "Augmentation Class Loader: TEST" in the same heap dump, it's not funny at all.

The next commit fixes a JDBC URL that seemed suspicious, but doesn't really impact anything (the test that uses this config triggers a startup failure on purpose before the database is started anyway).

The next commit change a few JDBC URLs to avoid a leak caused by H2 when it registerers a JVM Shutdown Hook (which lasts forever) which indirectly references a QuarkusClassLoader (which is supposed to be short-lived). => Removed, see #36560 (comment)

Then come a few commits that impact dev mode, so might have an impact in the real world out there. Because it turns out a lot of the leaks in our test suite are actually caused by tests of dev mode, since dev mode is very leaky: basically each dev mode restart leaks a whole application worth of classloaders.

These last commits, in order:

  • We apply the same fix for H2 JVM Shutdown Hooks, but this time for dev services => Removed, see Metaspace improvements in QuarkusUnitTest (and dev mode!) - round 3 #36560 (comment)
  • We make sure to reset the TCCL of ForkJoinPool#commonPool threads for every CuratedApplication#run method (for some reason there are 3 of them)
  • We simplify Smallrye config setting/release so that we don't end up with a classloader high in the hierarchy referencing a version of io.smallrye.config.SmallRyeConfigProviderResolver that will forever reference a classloader lower in the hierarchy, making it impossible to GC. cc @mkouba : this is the problem we've been investigating the other day; I tried a few solutions and this one seems to be both safe and effective.
  • And finally, we fix a very extensive leak that probably affects every application in dev mode: java.lang.Thread#inheritedAccessControlContext has an array of ProtectionDomain with references to classloaders. In effect, any time we create a thread, it has a high chance of preventing several classloaders from being GC. This is very annoying for Vertx threads and ForkJoinPool#commonPool threads, in particular. cc @Sanne:, who noticed this independently, but much sooner than I did :)

@yrodiere yrodiere requested review from Sanne and mkouba October 18, 2023 15:16
@quarkus-bot quarkus-bot bot added area/config area/core area/devtools Issues/PR related to maven, gradle, platform and cli tooling/plugins area/hibernate-orm Hibernate ORM area/panache area/persistence OBSOLETE, DO NOT USE area/spring Issues relating to the Spring integration area/testing labels Oct 18, 2023
@quarkus-bot
Copy link

quarkus-bot bot commented Oct 18, 2023

/cc @Sanne (hibernate-orm), @gsmet (hibernate-orm)

@yrodiere
Copy link
Member Author

Ok, now I understand why QuarkusClassLoader doesn't pass its name to the super constructor: to avoid ugly stacktraces. I'll change the first commit.

@gsmet
Copy link
Member

gsmet commented Oct 18, 2023

@radcortez could you have a look at the commit related to config?

Copy link
Member

@gsmet gsmet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a small comment.

The changes looks good to me but I would like to have @radcortez to have a closer look at the config commit.

@gsmet
Copy link
Member

gsmet commented Oct 18, 2023

Also, @stuartwdouglas could you have a look at the ProtectionDomain commit, just to make sure we haven't missed something?

@Sanne
Copy link
Member

Sanne commented Oct 18, 2023

+1 for the H2 property. We're using the same in Hibernate ORM - for the same reasons.

Also, @stuartwdouglas could you have a look at the ProtectionDomain commit, just to make sure we haven't missed something?

This should be fine, as mentioned elsewhere we never designed Quarkus to be used with the SecurityManager. In fact I suggested this same change to Yoann and plan to apply the same to the other classloaders, with some additional optimisations.

Copy link
Member

@Sanne Sanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve, pending Roberto's opinion as suggested by Guillaume.

@gsmet gsmet requested a review from radcortez October 18, 2023 15:49
@stuartwdouglas
Copy link
Member

I know that DB_CLOSE_DELAY=-1 has bit us before with race conditions in tests. Basically if the number of connections drops to zero then the DB gets deleted in the middle of the test. This was a long time ago that we were seeing this so I don't remember all the details.

@stuartwdouglas
Copy link
Member

Actually ignore that last comment, I misremembered what the different options do, and this looks fine.

@stuartwdouglas
Copy link
Member

I think the ProtectionDomain changes are fine.

@quarkus-bot

This comment has been minimized.

@quarkus-bot

This comment has been minimized.

@yrodiere yrodiere marked this pull request as ready for review October 25, 2023 12:05
@quarkus-bot

This comment has been minimized.

Copy link
Member

@radcortez radcortez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a big fan of cleaning up mess like this, but I guess it is easier this way.

My recommendation is for tests that are not QuarkusTests to just use SmallRyeConfigBuilder and SmallRyeConfig directly instead of CofigProvider.getConfig.

It doesn't matter in practice since the only test using this config
reproduces a startup failure that happens before we start H2;
but still, let's fix this for correctness.
Some config was lazily created and bound to the dev class loader
through the configsForClassLoader map
io.smallrye.config.SmallRyeConfigProviderResolver#getConfig(java.lang.ClassLoader),
and the corresponding map entry was never cleared,
resulting in the corresponding classloader never being garbage
collected.
This removes a whole class of metaspace leaks caused by
Thread.inheritedAccessControlContext referencing ProtectionDomains which
reference older classloaders.

Of course this may have impacts on how the SecurityManager behaves, but
as I understand it, absolutely no part of Quarkus is ready to run with a
SecurityManager enabled anyway.
Since a previous commit, we now remember the lazily-created config in
QuarkusConfigFactory, which is good because in dev mode it allows us to
release it upon restart (when we call setConfig(null)).

However, in *tests* that rely on this lazily initialized config, just
calling releaseConfig(ConfigProvider.getConfig()) is no longer enough,
because of that remembered config in QuarkusConfigFactory: we need to
reset that reference too.

If we forget to reset it, then when KafkaDevServicesDevModeTestCase
will execute, TestHTTPResourceManager#getUri will retrieve the leaked
config from DefaultSerdeConfigTest or DefaultSchemaConfigTest,
the injected URI in KafkaDevServicesDevModeTestCase will be wrong,
and the test will fail (with very unhelpful error messages).
…tensions

If someone calls ConfigProvider.getConfig() out of the blue in a test
that doesn't use any Quarkus*Test extension, this will call
QuarkusConfigFactory and leak config in two ways:

1. In QuarkusConfigFactory#config
2. In SmallRyeConfigProviderResolver, registering config for the TCCL,
   which in such a case is most likely the system CL.

A well-behaved test would call QuarkusConfigFactory.setConfig(null) to
clean up all that, but it's easy to miss and there is potential for
ConfigProvider.getConfig() being called indirectly, so there's no way we
can guarantee all tests are well-behaved.

This should at least guarantee that after a badly behaving test
executes, the next test using a Quarkus*Test extension will clean up the
mess.
@yrodiere yrodiere marked this pull request as draft October 26, 2023 07:01
@yrodiere
Copy link
Member Author

Thanks for the review.

I'm not a big fan of cleaning up mess like this, but I guess it is easier this way.

Right, it's more of a safety net. One that's apparently necessary as I saw several similar cleanups throughout the test modules.
Ideally we'd just have tests that don't leak, but I don't think anyone wants to deal with that right now.

Anyway... I addressed your comments and switched this PR to draft. Now I'll have a look at the test failures... This time I won't mark as ready for review until my fork's build passes.

@yrodiere yrodiere force-pushed the metaspace-leaks-3 branch 2 times, most recently from d8044eb to 4b70a5a Compare October 26, 2023 07:02
If someone calls ConfigProvider.getConfig() out of the blue in a test
that doesn't use any Quarkus*Test extension, this will indirectly call
QuarkusConfigFactory and leak config in two ways:

1. In QuarkusConfigFactory#config
2. In SmallRyeConfigProviderResolver, registering config for the TCCL,
   which in such a case is most likely the system CL.

Thus, a well-behaved test should call QuarkusConfigFactory.setConfig(null)
to clean up all that, no just SmallRyeConfigProviderResolver.releaseConfig().

Similarly, tests that register configuration explicitly can just call
QuarkusConfigFactory.setConfig(config) at the beginning and
QuarkusConfigFactory.setConfig(null) at the end,
which will properly simulate how a real Quarkus application behaves,
and should cover all edge cases involving multiple classloaders,
properly cleaning up everything at the end of the test.
@yrodiere yrodiere marked this pull request as ready for review October 27, 2023 11:30
@yrodiere yrodiere requested a review from radcortez October 27, 2023 11:30
@yrodiere
Copy link
Member Author

Alright the build passed on my fork (if we ignore some flakes).

@radcortez I had to revert a change to TestResourceManager; apparently using QuarkusTestFactory there does more harm than good. I'm not sure even using releaseConfig there makes sense now, but that at least doesn't seem to hurt so I left it and added a comment: e76a0e5#diff-a050ea6d2087e6491a6cfcadd0ef58e0ffbe09935ab12850ca3adbc38ce14bffR182-R188

@quarkus-bot
Copy link

quarkus-bot bot commented Oct 27, 2023

Failing Jobs - Building 683741b

Status Name Step Failures Logs Raw logs Build scan
✔️ JVM Tests - JDK 11
JVM Tests - JDK 17 Build ⚠️ Check → Logs Raw logs
JVM Tests - JDK 17 Windows Upload gc.log ⚠️ Check → Logs Raw logs
✔️ JVM Tests - JDK 21
Native Tests - Virtual Thread - Main ⚠️ Check → Logs Raw logs
Native Tests - Windows - RESTEasy Jackson Setup GraalVM ⚠️ Check → Logs Raw logs

@yrodiere
Copy link
Member Author

From what I can see, the failing builds are failing on other PRs as well:

I looked at the logs and they don't seem to point at any problem that would obviously come from from this PR. JVM Tests - JDK 17 and JVM Tests - JDK 17 Windows, in particular, just took more time than usual for no obvious reason.

So, I'll merge. Thanks for the reviews!

@yrodiere yrodiere merged commit 8d6d112 into quarkusio:main Oct 30, 2023
60 of 64 checks passed
@quarkus-bot quarkus-bot bot added this to the 3.6 - main milestone Oct 30, 2023
@yrodiere yrodiere deleted the metaspace-leaks-3 branch January 29, 2024 11:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants