-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Test no thread leaks in engine and connectors #21470
Conversation
76829cf
to
8469dcd
Compare
6ff24cd
to
318600b
Compare
3a715e7
to
c81c83a
Compare
@@ -175,4 +178,16 @@ public boolean translateHiveViews(HiveConfig hiveConfig) | |||
{ | |||
return hiveConfig.isTranslateHiveViews(); | |||
} | |||
|
|||
public record ShutdownExecutors( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we have wrapper around Executor, ScheduledExecutor, ExecutorService, ScheduledExecutorService ...
which delegates all the methods and puts @PreDestroy
on shutdown?
Would that be nicer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like #20960?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's probably why i couldn't find the Cleanup class (got removed 0d4f855#r1514739127)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM % suggestion on different way to close executors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cc: @kokosing you recently merged something around this area, might be useful to reuse.
7411aca
to
0bc8341
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have concerns about the new "assertNoThreadLeaked" infrastructure.
testing/trino-testing/src/main/java/io/trino/testing/ThreadAssertions.java
Show resolved
Hide resolved
try (T resource = resourceCreator.get()) { | ||
exerciseResource.accept(resource); | ||
} | ||
Thread[] threads = new Thread[clamp(threadGroup.activeCount(), 10, 1000)]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why 10? 1000?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is brittle and will result in false positives, false negatives, and potentially flaky tests. How does it account for all the JVM platform threads or framework threads that are not under Trino's control?
Also, relying on an unspecified naming convention ("ForkJoinPool.commonPool-worker-xxx") is not a good idea, as there's no guarantee it won't change arbitrarily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can do this to get the threads:
List<Thread> list = Thread.getAllStackTraces().keySet().stream()
.filter(thread -> threadGroup.parentOf(thread.getThreadGroup()))
.toList();
It is slower because the JVM gets the stack traces also, but it is more accurate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why 10? 1000?
10 -- at least no-zero, so that the next call is authoritative on live threads.
1000 -- arbitrarily chosen cap so that we don't print too many threads, if too many threads stay alive
This is brittle and will result in false positives, false negatives, and potentially flaky tests. How does it account for all the JVM platform threads or framework threads that are not under Trino's control?
This may be brittle and we may need to improve or remove the test.
false positives (successes) are acceptable from this test perspective.
false negatives (failures) are not acceptable and we may need to improve or remove the test.
potentially flaky tests -- thanks for pointing this out. You know how much i care about non-flaky tests and now i don't feel alone
account for JVM platform threads -- by inspecting only one ThraedGroup, created privately within the test.
framework threads -- by performing a warmup first (might not be sufficient if framework expires the threads, then we may eg need to add excludes for known non-leaks -- i had it initially)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can do this to get the threads:
thanks @dain for the suggestion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's generic, because i wanted to test cover both query runner and PlanTester.
If i introduce the public methods that expect these types, would it help?
If not, what would you do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be better. However, we should look into getting rid of PlanTester. There's nothing it's doing that shouldn't be doable with StandaloneQueryRunner. Maybe the only thing is being able to specify which optimizers are instantiated, but I would argue that that's an improper way of testing optimizations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want me to remove PlanTester within this PR? Not sure I know exactly what this entails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that would be a separate change. But that wouldn’t be a pre requisite for testing query runners alone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so i kept the test with PlanTester and made it impossible to provide anything other than PlanTester or QueryRunner
0bc8341
to
07946b0
Compare
8c3dc8d
to
d480a6d
Compare
AC, PTAL |
String stackTraces = Arrays.stream(threads, 0, count) | ||
.filter(thread -> thread != Thread.currentThread()) | ||
// Common ForkJoinPool threads are statically managed, not considered a leak | ||
.filter(thread -> !thread.getName().startsWith("ForkJoinPool.commonPool-worker-")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would fork join pool thread show up in the list of threads for the thread group?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good question. Maybe the common FJP has a thread factory that doesn't set TG explicitly. i suspect it might be that new FJP threads inherit ThreadGroup from the thread interacting with the FJP.
// Common ForkJoinPool threads are statically managed, not considered a leak | ||
.filter(thread -> !thread.getName().startsWith("ForkJoinPool.commonPool-worker-")) | ||
// OkHttp TaskRunner is statically managed (okhttp3.internal.concurrent.TaskRunner), not considered a leak | ||
.filter(thread -> !thread.getName().equals("OkHttp TaskRunner")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does "statically managed" mean in this context? If the okhttp thread pools are not being closed, then how is that not a leak?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a pool living on a static field in the library, started lazily by the library. It would be a problem if we wanted to unload the classloader, but we already know classloader unloading won't really work (due to various libraries employing this pattern), and that's why duplicating plugin classloader was removed.
test failures are not related |
d480a6d
to
16fd619
Compare
AC. rebased to re-run, previous build failed due to flaky tests |
Stress test green % "Upload test results" failing in every job. |
Stress test green % workflow verification |
another stress test round
|
running TestIcebergQueryRunner locally 1000 times i was able to get two failures, different than on the IC --
Here too, like in many other places, we do Fixing such leaks would be a bigger effort. |
94aac1a
to
b9b68cd
Compare
b9b68cd
to
cea89b3
Compare
cea89b3
to
b813d23
Compare
b813d23
to
2477e6b
Compare
Continued at #21913 |
follows #21466