
RADICAL executor MPI test leaves threads running at shutdown #3708

Open
benclifford opened this issue Nov 29, 2024 · 6 comments

@benclifford (Collaborator)

Describe the bug

In PR #3397 I've been pushing slowly but aggressively on making sure all parts of Parsl close down the OS-level resources they have allocated (primarily threads and file descriptors, but ideally processes too) rather than relying on some later OS cleanup. Work from that PR has slowly trickled into master over time.

Right now I'm getting the threads below left behind at the end of the RADICAL-Pilot MPI test. The assert failure comes from the aggressive thread checking I have added; the leftover threads are, I think, from inside the radical.pilot codebase.

Is there a way to shut these down at executor shutdown?

$ pytest parsl/tests/test_radical/test_mpi_funcs.py --config local

>           assert threading.active_count() == 1, "test left threads running: " + repr(threading.enumerate())
E           AssertionError: test left threads running: [<_MainThread(MainThread, started 139787084908352)>, <Thread(Thread-1 (_work), started daemon 139786984863424)>, <Thread(Thread-2 (_run_proxy), started daemon 139786740291264)>, <Thread(Thread-3 (_monitor), started daemon 139786731898560)>, <Thread(Thread-4 (_work), started daemon 139786723505856)>, <Thread(Thread-6 (_watch), started daemon 139786312459968)>, <Thread(Thread-7 (_listener), started daemon 139786304067264)>, <Thread(Thread-8 (_listener), started daemon 139785934984896)>, <Thread(Thread-9 (_listener), started daemon 139785926592192)>, <Thread(Thread-10 (_listener), started daemon 139785918199488)>, <Thread(Thread-11 (_listener), started daemon 139785909806784)>, <Thread(Thread-12 (_listener), started daemon 139785901414080)>, <Thread(Thread-14 (_listener), started daemon 139785540724416)>, <Thread(Thread-15 (_listener), started daemon 139785515546304)>, <Thread(Thread-16 (_listener), started daemon 139785062569664)>, <Thread(Thread-18 (_listener), started daemon 139785045784256)>, <Thread(Thread-19 (_listener), started daemon 139784676701888)>, <Thread(Thread-21 (_listener), started daemon 139784643131072)>, <Thread(Thread-22 (_listener), started daemon 139784634738368)>, <Thread(Thread-23 (_bulk_collector), started daemon 139784626345664)>]
E           assert 20 == 1
E            +  where 20 = <function active_count at 0x7f22b6d37ec0>()
E            +    where <function active_count at 0x7f22b6d37ec0> = threading.active_count
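
For context, the check producing that failure is an autouse pytest fixture of roughly this shape (a minimal sketch, not the exact code from #3397; the fixture name is hypothetical, but the assert matches the one shown above):

import threading

import pytest

@pytest.fixture(autouse=True)
def assert_no_leftover_threads():
    yield  # run the test body first
    # after the test, insist that only the main thread is still alive;
    # any daemon thread surviving to this point counts as a leak
    assert threading.active_count() == 1, \
        "test left threads running: " + repr(threading.enumerate())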

cc @AymenFJA @andre-merzky

To Reproduce
Check out PR #3397 and run the above command.

Expected behavior
The assert passes because all threads are cleaned up.

Environment
commit 254d8ba of PR #3397

@andre-merzky

Hi @benclifford - that's really good timing, as we are working on those code paths right now anyway :-D The work related to this ticket is now tracked in radical-cybertools/radical.pilot#3272.

One thing: we are a bit optimistic about thread termination, in the sense that we set termination flags but don't join the threads. So there is usually a delay of up to a second until the threads are actually gone (the slowest one needs at most 0.5 seconds to pick up that signal). Is that something you would worry about, or is it ok for you to add a delay to the test to capture that behavior?
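
For illustration, that flag-based pattern looks roughly like this (a hypothetical sketch, not the actual radical.pilot code):

import threading

class FlagStoppedWorker:
    def __init__(self):
        self._term = threading.Event()  # termination flag
        self._thread = threading.Thread(target=self._work, daemon=True)
        self._thread.start()

    def _work(self):
        # re-check the flag at most every 0.5 seconds, so the thread
        # can outlive stop() by up to that interval
        while not self._term.wait(timeout=0.5):
            pass  # ... do one unit of work ...

    def stop(self):
        # optimistic shutdown: set the flag but do not join, so the
        # thread may still appear in threading.enumerate() afterwards
        self._term.set()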

@benclifford (Collaborator, Author)

@andre-merzky it's preferable for my work that everything is gone by the time the executor says it is shut down, rather than adding delays to the test.
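
In terms of the sketch above, that asks for a stop() that also joins (again hypothetical code, extending the FlagStoppedWorker example):

class JoiningWorker(FlagStoppedWorker):
    def stop(self):
        # deterministic shutdown: set the flag and block until the
        # thread has actually exited, so nothing outlives shutdown
        self._term.set()
        self._thread.join()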

@andre-merzky

Yeah, I thought so. All right, this might need a bit more time then, but we'll get it done.

github-merge-queue bot pushed a commit that referenced this issue Dec 14, 2024
In addition to the work happening [here](radical-cybertools/radical.pilot#3269) to address issue #3708, this PR ensures the cleanup of threads generated by RPEX.
- Bug fix (partially #3708)

---------

Co-authored-by: Ben Clifford <[email protected]>
@benclifford (Collaborator, Author)

#3718 removes one of the 20 threads listed above (the bulk collector one), leaving 18 leftovers plus the 1 expected main thread.

@andre-merzky

@benclifford: the remaining threads are now cleanly terminated by radical-cybertools/radical.pilot#3269, which goes into the next RP release, before the break.

@benclifford (Collaborator, Author)

I've just tried RADICAL-Pilot 1.90 (as in PR #3725) and I see all the threads shut down now. I'll close this issue when #3725 is merged.
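
For anyone reproducing that check, the steps are roughly as follows (the exact install command is an assumption, not given in this thread):

$ pip install 'radical.pilot==1.90'
$ pytest parsl/tests/test_radical/test_mpi_funcs.py --config local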

github-merge-queue bot pushed a commit that referenced this issue Dec 17, 2024
This brings in fixes for a few issues that were resolved on the radical side of things:

#3722 - a race condition on task completion
#3708 - cleaner shutdown handling as part of #3397
#3646 - Python 3.13 support

# Changed Behaviour

whatever has changed in radical-pilot

# Fixes

#3722

## Type of change

- Bug fix