htex command channel use is not thread safe #3376
Comments
Another observation I made that I think is relevant to this issue:

Further to that last comment, I'm unclear in which situations that re-establish code will work. I think it makes the following assumptions:
This is preparation, in an upcoming PR, for replacing the current multiprocessing.Queue-based report of interchange ports (which has a 120s timeout) with command-client-based retrieval of that information (so the command client now needs to implement a 120s timeout to closely replicate that behaviour). That work is itself part of using fork/exec to launch the interchange, rather than multiprocessing fork (issue #3373).

When the command client times out after sending a command, it marks itself as permanently bad, because the state of the channel is now unknown. For example: should the next action be to receive a response to a previously timed-out command that was eventually executed? Or should the channel be recreated, assuming the previously sent command was never delivered?

Tagging issue #3376 (command client is not thread safe) because I feel like reworking this timeout behaviour and reworking that thread safety might be a single piece of deeper work.
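As a hedged sketch of that timeout-then-permanently-bad behaviour (not parsl's actual implementation; the class name and command shapes here are invented for illustration):

```python
# Minimal sketch, assuming a pyzmq REQ/REP command channel: after a send,
# wait up to 120s for a reply; on timeout, mark the client permanently bad,
# because the REQ/REP state is now unknown.
import zmq

class SketchCommandClient:
    def __init__(self, address: str, timeout_s: float = 120.0):
        self._ctx = zmq.Context()
        self._socket = self._ctx.socket(zmq.REQ)
        self._socket.connect(address)
        self._timeout_ms = int(timeout_s * 1000)
        self._bad = False  # set once the channel state becomes unknown

    def run(self, command):
        if self._bad:
            raise RuntimeError("command channel in unknown state after an earlier timeout")
        self._socket.send_pyobj(command)
        if self._socket.poll(self._timeout_ms) == 0:
            # No reply in time: we cannot know whether the command executed,
            # nor whether a late reply will still arrive for it, so give up
            # on the channel entirely rather than desynchronise REQ/REP.
            self._bad = True
            raise TimeoutError(f"no reply to {command!r} within timeout")
        return self._socket.recv_pyobj()
```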
Before this PR, the interchange used a multiprocessing.Queue to send a single message, containing the ports it is listening on, back to the submitting process. This ties the interchange into being forked via multiprocessing, even though the rest of the interchange was architected, as part of earlier remote-interchange work, to avoid that dependency.

After this PR, the CommandClient used for other submit-side-to-interchange communication is used to retrieve this information, removing that reliance on multiprocessing and reducing the number of distinct communication channels between the interchange and the submit side by one. See issue #3373 for more context on launching the interchange via fork/exec rather than via multiprocessing.

The CommandClient is not threadsafe - see #3376 - and it is possible that this will introduce a new thread-unsafety: this retrieval is called from the main thread of execution, while most invocations happen (later on) from the status poller thread.
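A hedged before/after sketch of that port handoff; the function names and the command name are illustrative, not parsl's actual API:

```python
def get_ports_before(q):
    # Before: the interchange put its listening ports on a
    # multiprocessing.Queue and the submit side blocked here for up to
    # 120s. This only works if the interchange was started by the same
    # process via multiprocessing.
    return q.get(timeout=120)  # e.g. (task_port, result_port)

def get_ports_after(command_client):
    # After: ask over the already-existing command channel instead, so the
    # interchange can be launched by plain fork/exec. The 120s timeout now
    # has to live inside the command client (see the sketch above).
    return command_client.run("WORKER_PORTS")  # hypothetical command name
```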
Describe the bug
The interchange command channel is presented on the submit side as an object, executor.command_client, that can be invoked from any thread. However, it uses a ZMQ channel, which is not thread safe. When two threads hit this too close together in time, it causes weird behavior in ZMQ: at least sometimes a segfault-style crash, and I suspect possibly also some hangs.

The command channel accesses should be made thread safe. I don't have a favoured solution. Some ideas that I am not super fond of, but I guess would be OK:
i) make a new ZMQ connection (in the opposite TCP-direction to the current command channel) for every thread that makes a command client invocation.
ii) use an internal cross-thread RPC mechanism and gateway those RPCs in the thread that owns the ZMQ connection - see the sketch after this list.
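A minimal sketch of idea (ii), assuming a queue.Queue handoff; the CommandGateway name and structure are illustrative, not an existing parsl API, and error propagation is omitted for brevity:

```python
# All ZMQ traffic stays on one owner thread; other threads submit
# (command, reply_queue) pairs through a thread-safe queue.Queue.
import queue
import threading

class CommandGateway:
    def __init__(self, zmq_run):
        # zmq_run: callable performing the actual ZMQ REQ/REP exchange;
        # it is only ever called from the owner thread below.
        self._zmq_run = zmq_run
        self._requests: queue.Queue = queue.Queue()
        threading.Thread(target=self._owner_loop, daemon=True).start()

    def _owner_loop(self):
        # The only thread that ever touches the ZMQ socket.
        while True:
            command, reply_q = self._requests.get()
            reply_q.put(self._zmq_run(command))

    def run(self, command):
        # Safe to call from any thread: the ZMQ socket is never touched here.
        reply_q: queue.Queue = queue.Queue()
        self._requests.put((command, reply_q))
        return reply_q.get()
```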
In the desc branch of parsl, I've put a lot of locks around ZMQ command channel accesses, but I don't believe this is sufficient to guarantee thread safety, given ZMQ's discussion of e.g. memory barriers.
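That locking approach looks roughly like this sketch (illustrative, not the actual desc-branch code):

```python
# Serialise every command exchange under one lock. This makes concurrent
# access less likely to crash, but the socket still migrates between
# threads, which the ZMQ guide only sanctions across a full memory fence -
# hence the doubt above that this guarantees thread safety.
import threading
import zmq

_lock = threading.Lock()

def run_command(socket: zmq.Socket, command):
    with _lock:
        socket.send_pyobj(command)
        return socket.recv_pyobj()
```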
To Reproduce
It's a race condition.
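A hedged reproduction sketch of the underlying hazard - two threads sharing one REQ socket, which ZMQ forbids; the port and message contents are arbitrary:

```python
# Run long enough, this can assert, segfault, raise REQ/REP state errors,
# or hang, rather than fail cleanly - matching the behaviour described above.
import threading
import zmq

ctx = zmq.Context()

def echo_server():
    rep = ctx.socket(zmq.REP)
    rep.bind("tcp://127.0.0.1:5599")
    while True:
        rep.send(rep.recv())

threading.Thread(target=echo_server, daemon=True).start()

req = ctx.socket(zmq.REQ)  # one socket, shared across threads: unsafe
req.connect("tcp://127.0.0.1:5599")

def hammer():
    for _ in range(10_000):
        req.send(b"ping")  # unsynchronised concurrent access
        req.recv()

threads = [threading.Thread(target=hammer) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```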
Expected behavior
This thread unsafety should not happen.