flux-in-parsl-CI testing is very hangy #3484

Closed
benclifford opened this issue Jun 12, 2024 · 4 comments · Fixed by #3517

@benclifford
Collaborator

benclifford commented Jun 12, 2024

Describe the bug

PR #3159 introduces per-PR flux testing. That testing hangs often, and it's not immediately clear why.

This doesn't stop PRs from being merged, because the flux test is not a mandatory test (there isn't enough confidence that it can pass reliably) - but it means that we will be ignoring flux test failures, which makes the test pointless.

See the notes on PR #3259 for some earlier investigations.

To Reproduce
Look at some recent CI builds.

@benclifford
Collaborator Author

I've dug into this a bit more on my laptop (where I can use the flux testing container to see this bug happening).

Running PYTHONFAULTHANDLER=1 python3 -m pytest parsl/tests/test_flux.py --config local --random-order, I often get hangs.

Sending SIGSEGV to that process causes the enabled faulthandler to dump a stack trace, which basically shows that ZMQ is hanging in cleanup during a garbage collection.
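
For reference, a minimal sketch of this faulthandler setup (the SIGUSR1 registration is an optional extra for illustration, not something used in the reproduction above):

  import faulthandler
  import signal

  # Equivalent to running with PYTHONFAULTHANDLER=1: install handlers for
  # fatal signals (SIGSEGV, SIGABRT, ...) that dump the stack of every thread.
  faulthandler.enable(all_threads=True)

  # Optionally, also dump on a harmless signal so the process survives the
  # dump - trigger it with `kill -USR1 <pid>` from another terminal.
  faulthandler.register(signal.SIGUSR1, all_threads=True)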

All stack traces basically look like this, with the garbage collection happening at an arbitrary point in the test and FluxExecutor code - so __del__ appears in the stack trace invoked from an arbitrary piece of code, because that's where the garbage collection happened to execute:

Current thread 0x000076c1e68f01c0 (most recent call first):
  Garbage-collecting
  File "/usr/local/lib/python3.10/dist-packages/zmq/sugar/context.py", line 266 in term
  File "/usr/local/lib/python3.10/dist-packages/zmq/sugar/context.py", line 324 in destroy
  File "/usr/local/lib/python3.10/dist-packages/zmq/sugar/context.py", line 142 in __del__
  File "/usr/lib/python3.10/threading.py", line 247 in __init__
  File "/usr/lib/python3.10/threading.py", line 546 in __init__
  File "/parsl/parsl/executors/flux/executor.py", line 207 in __init__
  File "/parsl/parsl/tests/test_flux.py", line 37 in test_multiply

ZMQ's term will (as documented) wait until all sockets are closed, but I don't see anything that deterministically closes the ZMQ socket used in FluxExecutor - it might sometimes happen to be garbage collected and closed that way, but I haven't dug into which socket is being waited for, or into the garbage collection sequence.

My experience, both with __del__ (used in the Python ZMQ implementation) and with PyZMQ in general elsewhere in Parsl (in the High Throughput Executor and monitoring), is that leaving this cleanup to garbage collection time works out badly, and that it would be better to close things explicitly, precisely because of random hangs like this.
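
As a rough sketch of the explicit-cleanup pattern I mean (illustrative names and endpoint, not the actual FluxExecutor code):

  import zmq

  context = zmq.Context()
  socket = context.socket(zmq.REQ)
  try:
      socket.connect("tcp://127.0.0.1:5555")  # placeholder endpoint
      # ... use the socket ...
  finally:
      socket.close(linger=0)  # close deterministically, don't wait for the GC
      context.term()          # term() returns promptly once all sockets are closed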

I'm slowly working through the rest of the codebase trying to put in proper shutdowns for all sorts of things (see PR #3397).

CC @jameshcorbett, who implemented this use of ZMQ.

benclifford added a commit that referenced this issue Jul 10, 2024
This is to avoid a race-prone, garbage-collection-driven shutdown
of ZMQ - see issue #3484 for more details.
benclifford added a commit that referenced this issue Jul 10, 2024
Prior to this PR, there were frequent hangs in CI at cleanup of the
ZMQ objects used by the FluxExecutor. See issue #3484 for some more
information.

This PR attempts to remove some dangerous behaviour there:

i) Creation of the ZMQ context and socket is moved into the thread which
makes use of them - previously the socket was created on the main thread
and passed into the submission thread which uses it. This removes a
thread-safety problem: a ZMQ socket cannot be safely moved between
threads.

ii) The ZMQ context and socket are closed explicitly (using with-blocks)
rather than leaving that to the garbage collector. In the hung tests,
the ZMQ context was being garbage collected in the main thread, which is
documented as being unsafe while sockets belonging to another thread
(the submission thread) are still open.

On my laptop I could see a hang in around 50% of test runs before this PR.
After this PR, I have run about 100 iterations of the flux tests without
seeing any hangs.
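
A minimal sketch of the pattern described in that commit message (illustrative names and endpoint, not the actual FluxExecutor code):

  import threading
  import zmq

  def _submission_loop(endpoint: str) -> None:
      # Create the context and socket inside the thread that uses them, and
      # close both via with-blocks instead of leaving cleanup to the garbage
      # collector on some other thread.
      with zmq.Context() as context:
          with context.socket(zmq.REQ) as socket:
              socket.connect(endpoint)
              # ... submission loop ...

  # hypothetical endpoint, for illustration only
  thread = threading.Thread(target=_submission_loop, args=("tcp://127.0.0.1:5555",))
  thread.start()
  thread.join()
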
@jameshcorbett
Contributor

Sorry about this. Working on the executor was one of my first uses of ZMQ; I should have been more careful about resource cleanup.

@benclifford
Collaborator Author

@jameshcorbett can you sanity check #3517?

@jameshcorbett
Contributor

Yup! Done. It looks sensible to me.

benclifford added a commit that referenced this issue Jul 23, 2024
Prior to this PR, there were frequent hangs in CI at cleanup of the ZMQ objects used by the FluxExecutor. See issue #3484 for some more information.

This PR attempts to remove some dangerous behaviour there:

i) Creation of the ZMQ context and socket is moved into the thread which makes use of them - before this PR, the socket was created on the main thread and passed into the submission thread which uses it. This removes a thread-safety problem: a ZMQ socket cannot be safely moved between threads.

ii) The ZMQ context and socket are closed explicitly (using with-blocks) rather than leaving that to the garbage collector. In the hung tests, the ZMQ context was being garbage collected in the main thread, which is documented as being unsafe while sockets belonging to another thread (the submission thread) are still open.

On my laptop I could see a hang in around 50% of test runs before this PR. After this PR, I have run about 100 iterations of the flux tests without seeing any hangs.