flux-in-parsl-CI testing is very hangy #3484
Comments
I've dug into this a bit more on my laptop (where I can use the flux testing container to see this bug happening). With this Sending a SIGSEGV into that process causes the enabled All stack traces basically look like this (with the garbage collection happening at an arbitrary point in the test and FluxExecutor code - so
ZMQ My experience both with I'm slowly working through the rest of the codebase trying to put in proper shutdowns for all sorts of things (see PR #3397). CC @jameshcorbett who implemented this use of zmq
This is to avoid a race-prone garbage collection driven shutdown of ZMQ - see issue #3484 for more details.
Prior to this PR, there were frequent hangs in CI at cleanup of the ZMQ objects used by the FluxExecutor. See issue #3484 for some more information.

This PR attempts to remove some dangerous behaviour there:

i) Creation of the ZMQ context and socket is moved into the thread which makes use of them. Previously, the socket was created on the main thread and passed into the submission thread which uses it. This removes some thread safety issues where a socket cannot be safely moved between threads.

ii) The ZMQ context and socket are more explicitly closed (using with-blocks) rather than leaving that to the garbage collector. In the hung tests, the ZMQ context was being garbage collected in the main thread, which is documented as being unsafe when sockets belonging to another thread (the submission thread) are open.

On my laptop I could see a hang in around 50% of test runs before this PR. After this PR, I have run about 100 iterations of the flux tests without seeing any hangs.
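For readers unfamiliar with the pattern, here is a minimal sketch of points (i) and (ii) using pyzmq. It is not the FluxExecutor code: the function names `submission_thread` and `run_demo`, the REQ/REP socket types, the loopback endpoint and the message contents are all assumptions made for illustration. The only things it tries to show are that each thread creates its own context and socket, and that with-blocks close the socket and terminate the context deterministically instead of leaving it to the garbage collector.

```python
import queue
import threading

import zmq


def submission_thread(endpoint_q, results, n_messages):
    """Hypothetical stand-in for a submission thread (not Parsl's code)."""
    # (i) The context and socket are created inside the thread that uses
    # them, so no socket is ever handed from one thread to another.
    # (ii) The with-blocks close the socket first and then terminate the
    # context, in this thread, at a well-defined point.
    with zmq.Context() as ctx:
        with ctx.socket(zmq.REP) as sock:
            port = sock.bind_to_random_port("tcp://127.0.0.1")
            # Only the endpoint string crosses the thread boundary.
            endpoint_q.put(f"tcp://127.0.0.1:{port}")
            for _ in range(n_messages):
                results.append(sock.recv_pyobj())
                sock.send_pyobj("ack")


def run_demo(n_messages=3):
    """Start the thread, send it a few messages, and shut down cleanly."""
    endpoint_q = queue.Queue()
    results = []
    thread = threading.Thread(
        target=submission_thread, args=(endpoint_q, results, n_messages)
    )
    thread.start()
    endpoint = endpoint_q.get(timeout=10)

    # The main thread uses its own context and socket, also closed explicitly.
    with zmq.Context() as ctx:
        with ctx.socket(zmq.REQ) as sock:
            sock.connect(endpoint)
            for i in range(n_messages):
                sock.send_pyobj({"task": i})
                assert sock.recv_pyobj() == "ack"

    thread.join(timeout=10)
    return results


if __name__ == "__main__":
    print(run_demo())
```

Because the socket's with-block exits before the context's, the context is terminated only after its socket is closed, which is the ordering that garbage collection on another thread does not guarantee.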
Sorry about this. Working on the executor was one of my first uses of ZMQ, and I should have been more careful about resource cleanup.
@jameshcorbett can you sanity check #3517?
Yup! Done. It looks sensible to me.
Describe the bug
PR #3159 introduces per-PR flux testing. This hangs often - it's not immediately clear why.
This doesn't stop PRs being merged, because that flux test is not a mandatory test (because of lack of confidence in it being able to pass often) - but it means that we will be ignoring flux test failures, which makes the test pointless.
See the notes on PR #3259 for some earlier investigations.
To Reproduce
Look at some recent CI builds.