flux-in-parsl-CI testing is very hangy #3484

Closed
benclifford opened this issue Jun 12, 2024 · 4 comments · Fixed by #3517

@benclifford
Collaborator

benclifford commented Jun 12, 2024

Describe the bug

PR #3159 introduces per-PR flux testing. That testing hangs often, and it's not immediately clear why.

This doesn't stop PRs from being merged, because the flux test is not a mandatory test (there isn't enough confidence that it can pass reliably) - but it means that we will be ignoring flux test failures, which makes the test pointless.

See the notes on PR #3259 for some earlier investigations.

To Reproduce
Look at some recent CI builds.

@benclifford
Collaborator Author

I've dug into this a bit more on my laptop (where I can use the flux testing container to see this bug happening).

Running PYTHONFAULTHANDLER=1 python3 -m pytest parsl/tests/test_flux.py --config local --random-order, I often get hangs.

Sending SIGSEGV to that process causes the enabled faulthandler to dump a stack trace, which basically shows that ZMQ is hanging in cleanup during a garbage collection.
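
For reference, a minimal sketch of this faulthandler setup (the SIGUSR1 registration is an optional extra for illustration, not something used in the reproduction above):

  import faulthandler
  import signal

  # Equivalent to running with PYTHONFAULTHANDLER=1: install handlers for
  # fatal signals (SIGSEGV, SIGABRT, ...) that dump the stack of every thread.
  faulthandler.enable(all_threads=True)

  # Optionally, also dump on a harmless signal so the process survives the
  # dump - trigger it with `kill -USR1 <pid>` from another terminal.
  faulthandler.register(signal.SIGUSR1, all_threads=True)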

All stack traces basically look like this, with the garbage collection happening at an arbitrary point in the test and FluxExecutor code - so __del__ appears in the stack trace invoked from an arbitrary piece of code, because that's where the garbage collection happened to execute:

Current thread 0x000076c1e68f01c0 (most recent call first):
  Garbage-collecting
  File "/usr/local/lib/python3.10/dist-packages/zmq/sugar/context.py", line 266 in term
  File "/usr/local/lib/python3.10/dist-packages/zmq/sugar/context.py", line 324 in destroy
  File "/usr/local/lib/python3.10/dist-packages/zmq/sugar/context.py", line 142 in __del__
  File "/usr/lib/python3.10/threading.py", line 247 in __init__
  File "/usr/lib/python3.10/threading.py", line 546 in __init__
  File "/parsl/parsl/executors/flux/executor.py", line 207 in __init__
  File "/parsl/parsl/tests/test_flux.py", line 37 in test_multiply

ZMQ's term will (as documented) wait until all sockets are closed, but I don't see anything that deterministically closes the ZMQ socket used in FluxExecutor - it might sometimes happen to be garbage collected and closed that way, but I haven't dug into which socket is being waited for, or into the garbage collection sequence.

My experience, both with __del__ (used in the Python ZMQ implementation) and with PyZMQ in general elsewhere in Parsl (in the High Throughput Executor and monitoring), is that leaving this cleanup to garbage collection time works out badly, and that it would be better to close things explicitly, precisely because of random hangs like this.
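
As a rough sketch of the explicit-cleanup pattern I mean (illustrative names and endpoint, not the actual FluxExecutor code):

  import zmq

  context = zmq.Context()
  socket = context.socket(zmq.REQ)
  try:
      socket.connect("tcp://127.0.0.1:5555")  # placeholder endpoint
      # ... use the socket ...
  finally:
      socket.close(linger=0)  # close deterministically, don't wait for the GC
      context.term()          # term() returns promptly once all sockets are closed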

I'm slowly working through the rest of the codebase trying to put in proper shutdowns for all sorts of things (see PR #3397).

CC @jameshcorbett, who implemented this use of ZMQ.

benclifford added a commit that referenced this issue Jul 10, 2024
This is to avoid a race-prone, garbage-collection-driven shutdown
of ZMQ - see issue #3484 for more details.
benclifford added a commit that referenced this issue Jul 10, 2024
Prior to this PR, there were frequent hangs in CI at cleanup of the
ZMQ objects used by the FluxExecutor. See issue #3484 for some more
information.

This PR attempts to remove some dangerous behaviour there:

i) Creation of the ZMQ context and socket is moved into the thread which
makes use of them - previously the socket was created on the main thread
and passed into the submission thread which uses it. This removes a
thread-safety problem: a ZMQ socket cannot be safely moved between
threads.

ii) The ZMQ context and socket are closed explicitly (using with-blocks)
rather than leaving that to the garbage collector. In the hung tests,
the ZMQ context was being garbage collected in the main thread, which is
documented as being unsafe while sockets belonging to another thread
(the submission thread) are still open.

On my laptop I could see a hang in around 50% of test runs before this PR.
After this PR, I have run about 100 iterations of the flux tests without
seeing any hangs.
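
A minimal sketch of the pattern described in that commit message (illustrative names and endpoint, not the actual FluxExecutor code):

  import threading
  import zmq

  def _submission_loop(endpoint: str) -> None:
      # Create the context and socket inside the thread that uses them, and
      # close both via with-blocks instead of leaving cleanup to the garbage
      # collector on some other thread.
      with zmq.Context() as context:
          with context.socket(zmq.REQ) as socket:
              socket.connect(endpoint)
              # ... submission loop ...

  # hypothetical endpoint, for illustration only
  thread = threading.Thread(target=_submission_loop, args=("tcp://127.0.0.1:5555",))
  thread.start()
  thread.join()
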
@jameshcorbett
Contributor

Sorry about this. Working on the executor was one of my first uses of ZMQ; I should have been more careful about resource cleanup.

@benclifford
Collaborator Author

@jameshcorbett can you sanity check #3517?

@jameshcorbett
Contributor

Yup! Done. It looks sensible to me.

benclifford added a commit that referenced this issue Jul 23, 2024
Prior to this PR, there were frequent hangs in CI at cleanup of the ZMQ objects used by the FluxExecutor. See issue #3484 for some more information.

This PR attempts to remove some dangerous behaviour there:

i) Creation of the ZMQ context and socket is moved into the thread which makes use of them - before this PR, the socket was created on the main thread and passed into the submission thread which uses it. This removes a thread-safety problem: a ZMQ socket cannot be safely moved between threads.

ii) The ZMQ context and socket are closed explicitly (using with-blocks) rather than leaving that to the garbage collector. In the hung tests, the ZMQ context was being garbage collected in the main thread, which is documented as being unsafe while sockets belonging to another thread (the submission thread) are still open.

On my laptop I could see a hang in around 50% of test runs before this PR. After this PR, I have run about 100 iterations of the flux tests without seeing any hangs.