Test no thread leaks in engine and connectors #21470

Closed
wants to merge 1 commit into from

Conversation

findepi
Member

@findepi findepi commented Apr 9, 2024

follows #21466

@cla-bot cla-bot bot added the cla-signed label Apr 9, 2024
@github-actions github-actions bot added tests:hive hudi Hudi connector iceberg Iceberg connector delta-lake Delta Lake connector hive Hive connector bigquery BigQuery connector mongodb MongoDB connector labels Apr 9, 2024
@findepi findepi force-pushed the findepi/thread-leaks-13 branch 3 times, most recently from 76829cf to 8469dcd Compare April 9, 2024 15:31
@findepi findepi changed the title Test for thread leaks Fix thread leaks in Hive and Delta connectors Apr 9, 2024
@findepi findepi removed bigquery BigQuery connector mongodb MongoDB connector labels Apr 9, 2024
@findepi findepi force-pushed the findepi/thread-leaks-13 branch 2 times, most recently from 6ff24cd to 318600b Compare April 9, 2024 16:16
@findepi findepi changed the title Fix thread leaks in Hive and Delta connectors Fix thread leaks in Hive, Iceberg, Delta and Hudi connectors Apr 9, 2024
@findepi findepi force-pushed the findepi/thread-leaks-13 branch 2 times, most recently from 3a715e7 to c81c83a Compare April 9, 2024 18:44
@@ -175,4 +178,16 @@ public boolean translateHiveViews(HiveConfig hiveConfig)
    {
        return hiveConfig.isTranslateHiveViews();
    }

    public record ShutdownExecutors(
Member

Can we have a wrapper around Executor, ScheduledExecutor, ExecutorService, ScheduledExecutorService ...
that delegates all the methods and puts @PreDestroy on shutdown?

Would that be nicer?
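
For illustration, the wrapper idea could be sketched roughly as follows. This is not the code from this PR or from #20960. `ManagedExecutorService` is a hypothetical name, and the local `PreDestroy` annotation is only a stand-in for `jakarta.annotation.PreDestroy` so that the sketch compiles without dependencies. To stay short it extends `AbstractExecutorService`, so `submit` and `invokeAll` are routed through `execute` rather than delegated one by one.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.List;
import java.util.Objects;
import java.util.concurrent.AbstractExecutorService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Assumption: stand-in for jakarta.annotation.PreDestroy, declared locally
// so the sketch has no external dependencies.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface PreDestroy {}

// Hypothetical wrapper: forwards lifecycle calls to the delegate and marks
// shutdown() as the hook a lifecycle manager would invoke on teardown.
final class ManagedExecutorService
        extends AbstractExecutorService
{
    private final ExecutorService delegate;

    ManagedExecutorService(ExecutorService delegate)
    {
        this.delegate = Objects.requireNonNull(delegate, "delegate is null");
    }

    @Override
    public void execute(Runnable command)
    {
        delegate.execute(command);
    }

    @PreDestroy
    @Override
    public void shutdown()
    {
        delegate.shutdown();
    }

    @Override
    public List<Runnable> shutdownNow()
    {
        return delegate.shutdownNow();
    }

    @Override
    public boolean isShutdown()
    {
        return delegate.isShutdown();
    }

    @Override
    public boolean isTerminated()
    {
        return delegate.isTerminated();
    }

    @Override
    public boolean awaitTermination(long timeout, TimeUnit unit)
            throws InterruptedException
    {
        return delegate.awaitTermination(timeout, unit);
    }

    public static void main(String[] args)
            throws Exception
    {
        ManagedExecutorService executor = new ManagedExecutorService(Executors.newSingleThreadExecutor());
        System.out.println(executor.submit(() -> "done").get());
        // A lifecycle manager would call this via the annotation; here it is called directly.
        executor.shutdown();
    }
}
```

With such a wrapper, binding the executor itself would be enough for the container to shut it down, without a separate record that collects executors for cleanup.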

Member

Something like #20960?

Member Author

nice

Member Author

that's probably why i couldn't find the Cleanup class (got removed 0d4f855#r1514739127)

Member

@losipiuk losipiuk left a comment

LGTM % suggestion on different way to close executors

Member

@hashhar hashhar left a comment

Cc: @kokosing you recently merged something around this area, might be useful to reuse.

@findepi findepi force-pushed the findepi/thread-leaks-13 branch 2 times, most recently from 7411aca to 0bc8341 Compare April 9, 2024 21:43
martint
martint previously requested changes Apr 9, 2024
Member

@martint martint left a comment

I have concerns about the new "assertNoThreadLeaked" infrastructure.

try (T resource = resourceCreator.get()) {
    exerciseResource.accept(resource);
}
Thread[] threads = new Thread[clamp(threadGroup.activeCount(), 10, 1000)];
Member

Why 10? 1000?

Member

This is brittle and will result in false positives, false negatives, and potentially flaky tests. How does it account for all the JVM platform threads or framework threads that are not under Trino's control?

Also, relying on an unspecified naming convention ("ForkJoinPool.commonPool-worker-xxx") is not a good idea, as there's no guarantee it won't change arbitrarily.

Member

You can do this to get the threads:

        List<Thread> list = Thread.getAllStackTraces().keySet().stream()
                .filter(thread -> threadGroup.parentOf(thread.getThreadGroup()))
                .toList();

It is slower because the JVM gets the stack traces also, but it is more accurate.
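
A runnable version of that suggestion might look like the sketch below. The class name `ThreadGroupSnapshot`, the group name, and the thread name are made up for the example. It starts one thread in a private group, lists the group's live threads through `Thread.getAllStackTraces()`, then lists them again after the thread has finished.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;

final class ThreadGroupSnapshot
{
    private ThreadGroupSnapshot() {}

    // Returns live threads whose group is `group` or one of its descendants.
    // Terminated threads report a null group, for which parentOf returns false.
    static List<Thread> liveThreadsIn(ThreadGroup group)
    {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(thread -> group.parentOf(thread.getThreadGroup()))
                .toList();
    }

    public static void main(String[] args)
            throws InterruptedException
    {
        ThreadGroup group = new ThreadGroup("leak-test");
        CountDownLatch release = new CountDownLatch(1);
        Thread worker = new Thread(group, () -> {
            try {
                release.await();
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "worker");
        worker.start();

        // While the worker is blocked, it is part of the snapshot.
        System.out.println(liveThreadsIn(group));

        release.countDown();
        worker.join();

        // After the worker has terminated, it no longer appears.
        System.out.println(liveThreadsIn(group));
    }
}
```

Unlike `ThreadGroup.enumerate`, this needs no pre-sized array, so there is no guess to make about how many threads might be alive.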

Member Author

Why 10? 1000?

10 -- at least non-zero, so that the next call is authoritative on live threads.
1000 -- arbitrarily chosen cap so that we don't print too many threads if too many stay alive.

This is brittle and will result in false positives, false negatives, and potentially flaky tests. How does it account for all the JVM platform threads or framework threads that are not under Trino's control?

This may be brittle and we may need to improve or remove the test.

false positives (successes) are acceptable from this test's perspective.
false negatives (failures) are not acceptable and we may need to improve or remove the test.
potentially flaky tests -- thanks for pointing this out. You know how much I care about non-flaky tests, and now I don't feel alone.
account for JVM platform threads -- by inspecting only one ThreadGroup, created privately within the test.
framework threads -- by performing a warmup first (might not be sufficient if the framework expires the threads; then we may, e.g., need to add excludes for known non-leaks -- I had it initially).
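
The private-ThreadGroup approach described in this reply can be sketched as below. It is a simplified illustration, not the PR's test code: `PrivateThreadGroupCheck`, `snapshot`, and `threadsLeftBehindBy` are hypothetical names, and `Math.max`/`Math.min` stand in for the `clamp` call. The work runs on a thread inside a fresh group, so threads it starts without an explicit group inherit that group and can be enumerated afterwards.

```java
import java.util.Arrays;
import java.util.concurrent.CountDownLatch;

final class PrivateThreadGroupCheck
{
    private PrivateThreadGroupCheck() {}

    // activeCount() is only an estimate and enumerate() silently drops threads
    // that do not fit, so the buffer gets a floor of 10 (never zero-length)
    // and a cap of 1000 (bounds how many threads end up in a failure message).
    static Thread[] snapshot(ThreadGroup group)
    {
        int capacity = Math.max(10, Math.min(1000, group.activeCount()));
        Thread[] buffer = new Thread[capacity];
        int count = group.enumerate(buffer);
        return Arrays.copyOf(buffer, count);
    }

    // Runs `action` on a thread in a fresh group and returns the threads of
    // that group which are still alive once the action has completed.
    static Thread[] threadsLeftBehindBy(Runnable action)
            throws InterruptedException
    {
        ThreadGroup group = new ThreadGroup("leak-check");
        Thread runner = new Thread(group, action, "leak-check-runner");
        runner.start();
        runner.join();
        return snapshot(group);
    }

    public static void main(String[] args)
            throws InterruptedException
    {
        CountDownLatch release = new CountDownLatch(1);
        // The action "forgets" a background thread, simulating a leak.
        Thread[] leftBehind = threadsLeftBehindBy(() -> new Thread(() -> {
            try {
                release.await();
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "forgotten-worker").start());

        for (Thread thread : leftBehind) {
            System.out.println("still alive: " + thread.getName());
        }
        release.countDown();
    }
}
```

Because only the private group is inspected, JVM service threads and threads started before the check are outside its scope. Threads that a library starts lazily from inside the action still land in the group, which is what the warmup and the name-based excludes in the PR are meant to address.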

Member Author

You can do this to get the threads:

thanks @dain for the suggestion

Member Author

It's generic because I wanted the test to cover both the query runner and PlanTester.
If I introduce public methods that expect these types, would it help?
If not, what would you do?

Member

That would be better. However, we should look into getting rid of PlanTester. There's nothing it's doing that shouldn't be doable with StandaloneQueryRunner. Maybe the only thing is being able to specify which optimizers are instantiated, but I would argue that that's an improper way of testing optimizations.

Member Author

Do you want me to remove PlanTester within this PR? Not sure I know exactly what this entails.

Member

No, that would be a separate change. But that wouldn’t be a prerequisite for testing query runners alone.

Member Author

OK, so i kept the test with PlanTester and made it impossible to provide anything other than PlanTester or QueryRunner

@findepi findepi changed the title Fix thread leaks in Hive, Iceberg, Delta and Hudi connectors Fix thread leaks in Trino Apr 10, 2024
@findepi findepi force-pushed the findepi/thread-leaks-13 branch 3 times, most recently from 8c3dc8d to d480a6d Compare April 10, 2024 21:58
@findepi findepi changed the title Fix thread leaks in Trino Test no thread leaks in engine and connectors Apr 10, 2024
@findepi
Member Author

findepi commented Apr 10, 2024

AC, PTAL

@findepi findepi added the no-release-notes This pull request does not require release notes entry label Apr 10, 2024
String stackTraces = Arrays.stream(threads, 0, count)
        .filter(thread -> thread != Thread.currentThread())
        // Common ForkJoinPool threads are statically managed, not considered a leak
        .filter(thread -> !thread.getName().startsWith("ForkJoinPool.commonPool-worker-"))
Member

Why would fork join pool thread show up in the list of threads for the thread group?

Member Author

good question. Maybe the common FJP has a thread factory that doesn't set TG explicitly. i suspect it might be that new FJP threads inherit ThreadGroup from the thread interacting with the FJP.

        // Common ForkJoinPool threads are statically managed, not considered a leak
        .filter(thread -> !thread.getName().startsWith("ForkJoinPool.commonPool-worker-"))
        // OkHttp TaskRunner is statically managed (okhttp3.internal.concurrent.TaskRunner), not considered a leak
        .filter(thread -> !thread.getName().equals("OkHttp TaskRunner"))
Member

What does "statically managed" mean in this context? If the okhttp thread pools are not being closed, then how is that not a leak?

Member Author

This is a pool living on a static field in the library, started lazily by the library. It would be a problem if we wanted to unload the classloader, but we already know classloader unloading won't really work (due to various libraries employing this pattern), and that's why duplicating plugin classloader was removed.
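
To make "a pool living on a static field, started lazily" concrete, here is a minimal sketch of that pattern. `StaticTaskRunner` and `LibraryClient` are hypothetical classes written for this example and are not OkHttp's implementation. The point is that closing the client cannot stop the thread, because the pool belongs to the class and not to any instance.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical library class: one shared pool on a static field, created on first use.
final class StaticTaskRunner
{
    private static ExecutorService shared;

    private StaticTaskRunner() {}

    static synchronized ExecutorService shared()
    {
        if (shared == null) {
            shared = Executors.newSingleThreadExecutor(runnable -> {
                // The worker is created on the calling thread, so it inherits
                // that thread's ThreadGroup unless one is passed explicitly.
                Thread thread = new Thread(runnable, "Library TaskRunner");
                thread.setDaemon(true);
                return thread;
            });
        }
        return shared;
    }
}

// Hypothetical client: uses the shared pool but does not own it.
final class LibraryClient
        implements AutoCloseable
{
    void ping()
            throws Exception
    {
        StaticTaskRunner.shared().submit(() -> {}).get();
    }

    @Override
    public void close()
    {
        // Nothing to release here: the pool outlives every client instance.
    }

    public static void main(String[] args)
            throws Exception
    {
        try (LibraryClient client = new LibraryClient()) {
            client.ping();
        }
        boolean stillRunning = Thread.getAllStackTraces().keySet().stream()
                .anyMatch(thread -> thread.getName().equals("Library TaskRunner"));
        System.out.println("pool thread alive after close: " + stillRunning);
    }
}
```

A thread like this stays alive for the lifetime of the JVM (or of the classloader), which is why the test filters it out by name instead of reporting it as a leak of the resource under test.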

@findepi
Member Author

findepi commented Apr 11, 2024

test failures are not related

@findepi findepi force-pushed the findepi/thread-leaks-13 branch from d480a6d to 16fd619 Compare April 13, 2024 05:52
@findepi
Member Author

findepi commented Apr 13, 2024

AC. rebased to re-run, previous build failed due to flaky tests

@findepi
Member Author

findepi commented Apr 16, 2024

Stress test green % "Upload test results" failing in every job.

@findepi
Member Author

findepi commented Apr 16, 2024

Stress test green % workflow verification

@findepi
Member Author

findepi commented Apr 16, 2024

another stress test round

@findepi
Member Author

findepi commented Apr 16, 2024

Running TestIcebergQueryRunner locally 1000 times, I was able to get two failures, different from those on CI: "task-driver-timeout-1" leaked and "worker-dynamic-catalog-manager-2" leaked. The first is a ScheduledExecutorService closed here:

Here too, like in many other places, we do shutdownNow() without awaiting termination.
Second one is a similar case.

Fixing such leaks would be a bigger effort.
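
The difference between calling `shutdownNow()` alone and also awaiting termination can be sketched as follows. This is a generic illustration with a hypothetical helper name (`shutdownAndAwait`), not a proposed change to the Trino code referenced above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class ExecutorShutdown
{
    private ExecutorShutdown() {}

    // shutdownNow() only requests the stop: it interrupts workers and returns
    // immediately. awaitTermination() blocks until the pool reports itself
    // terminated, or the timeout elapses, and returns false in the latter case.
    static boolean shutdownAndAwait(ExecutorService executor, long timeout, TimeUnit unit)
            throws InterruptedException
    {
        executor.shutdownNow();
        return executor.awaitTermination(timeout, unit);
    }

    public static void main(String[] args)
            throws InterruptedException
    {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        // Scheduling a task starts the worker thread, which then waits for the delay.
        executor.schedule(() -> {}, 1, TimeUnit.HOURS);

        boolean terminated = shutdownAndAwait(executor, 10, TimeUnit.SECONDS);
        System.out.println("terminated: " + terminated);
    }
}
```

Even with the wait, a worker thread may remain visible for a brief moment after the pool reports termination, because the pool signals termination from the exiting worker itself. A thread-leak check that runs immediately after close may therefore still need a short retry or grace period.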

@findepi findepi force-pushed the findepi/thread-leaks-13 branch from 94aac1a to b9b68cd Compare April 16, 2024 19:38
@findepi findepi marked this pull request as draft April 16, 2024 19:39
@findepi findepi force-pushed the findepi/thread-leaks-13 branch from b9b68cd to cea89b3 Compare April 23, 2024 09:29
@findepi findepi force-pushed the findepi/thread-leaks-13 branch from cea89b3 to b813d23 Compare May 9, 2024 08:57
@findepi findepi force-pushed the findepi/thread-leaks-13 branch from b813d23 to 2477e6b Compare May 9, 2024 09:04
@findepi findepi mentioned this pull request May 10, 2024
@findepi findepi closed this May 10, 2024
@findepi findepi deleted the findepi/thread-leaks-13 branch May 10, 2024 09:07
@findepi
Member Author

findepi commented May 10, 2024

Continued at #21913

Labels
cla-signed delta-lake Delta Lake connector hive Hive connector hudi Hudi connector iceberg Iceberg connector no-release-notes This pull request does not require release notes entry

5 participants