htex command channel use is not thread safe #3376

Open

benclifford opened this issue Apr 20, 2024 · 2 comments

Describe the bug

The interchange command channel is presented on the submit side as an object, executor.command_client, which is invoked from any thread. However, it uses a ZMQ channel, which is not thread safe. When two threads hit it too close together in time, this causes weird behaviour in ZMQ: at least sometimes a segfault-style crash, and I suspect possibly also some hangs.
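
For illustration, the problematic pattern is roughly the following sketch (not Parsl code; the names and port are invented). A single zmq.REQ socket is created once and then driven from several threads, which pyzmq sockets do not support, so concurrent send/recv can crash or hang rather than fail cleanly:

    import threading
    import zmq

    context = zmq.Context()
    command_socket = context.socket(zmq.REQ)        # one socket, shared by every thread below
    command_socket.connect("tcp://127.0.0.1:9000")  # hypothetical interchange command port

    def run_command(cmd):
        command_socket.send_pyobj(cmd)              # unsafe: several threads enter here concurrently
        return command_socket.recv_pyobj()

    threads = [threading.Thread(target=run_command, args=(f"CMD_{i}",)) for i in range(4)]
    for t in threads:
        t.start()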

Command channel access should be made thread safe. I don't have a favoured solution. Some ideas that I am not super fond of, but which I guess would be OK:
i) make a new ZMQ connection (in the opposite TCP direction to the current command channel) for every thread that makes a command client invocation.
ii) use an internal cross-thread RPC mechanism and gateway those RPCs in the thread that owns the ZMQ connection (see the sketch after this list).
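
A minimal sketch of idea (ii), with hypothetical names: worker threads never touch the ZMQ socket; they hand commands to a single gateway thread that owns the socket and receive the reply through a per-call queue, so both the strict REQ/REP alternation and single-thread ownership of the socket are preserved.

    import queue
    import threading
    import zmq

    class GatewayedCommandClient:
        def __init__(self, address):
            self._requests = queue.Queue()
            self._thread = threading.Thread(target=self._serve, args=(address,), daemon=True)
            self._thread.start()

        def run(self, command):
            """Send a command from any thread and wait for its reply."""
            reply_box = queue.Queue(maxsize=1)
            self._requests.put((command, reply_box))
            return reply_box.get()

        def _serve(self, address):
            # Only this thread ever touches the ZMQ context and socket.
            context = zmq.Context()
            sock = context.socket(zmq.REQ)
            sock.connect(address)
            while True:
                command, reply_box = self._requests.get()
                sock.send_pyobj(command)            # one command at a time: REQ then REP
                reply_box.put(sock.recv_pyobj())

A real implementation would also need shutdown handling and per-call timeouts, but the point is that exactly one thread owns the socket.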

In the DESC branch of Parsl, I've put a lot of locks around ZMQ command channel accesses (roughly the pattern sketched below), but I don't believe this is sufficient to guarantee thread safety, given ZMQ's discussion of e.g. memory barriers.
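
The lock-based mitigation looks roughly like this (hypothetical names, not the DESC branch code). A threading.Lock serialises the REQ/REP exchanges, but the socket still migrates between threads, which ZMQ's documentation only sanctions with a full memory barrier between uses:

    import threading

    class LockedCommandClient:
        def __init__(self, sock):
            self._sock = sock              # a zmq.REQ socket created elsewhere
            self._lock = threading.Lock()

        def run(self, command):
            with self._lock:               # at most one REQ/REP exchange in flight
                self._sock.send_pyobj(command)
                return self._sock.recv_pyobj()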

To Reproduce
It's a race condition, so there is no deterministic reproducer.

Expected behavior
This thread-unsafety should not happen.


benclifford commented May 15, 2024

Another observation I made that I think is relevant to this issue:

The interchange command client contains this:

                except zmq.ZMQError:
                    logger.exception("Potential ZMQ REQ-REP deadlock caught")
                    logger.info("Trying to reestablish context")
                    self.zmq_context.recreate()
                    self.create_socket_and_bind()

The only reason I can figure out for this is that the command client is used in a non-thread-safe manner, with multiple commands invoked from multiple threads without waiting for responses. That is unsafe in two ways: 1) you can't send two REQs on the same ZMQ socket in a row without waiting for the first REP; and 2) you can't use the same ZMQ socket across multiple threads even if you sequence your operations properly.
#3376 talks about non-thread-safe use of the command client, and I guess I hope that if #3376 is fixed, the above can go away.
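
Constraint (1) can be hit from a single thread, as in this small sketch (hypothetical endpoint; nothing needs to be listening, because connect is asynchronous and the first send is simply queued). On a plain REQ socket, a second send before the matching recv is rejected with a ZMQError whose errno is EFSM:

    import zmq

    context = zmq.Context()
    req = context.socket(zmq.REQ)
    req.connect("tcp://127.0.0.1:9000")   # hypothetical endpoint

    req.send(b"first command")            # queued; the socket now expects a REP
    try:
        req.send(b"second command")       # out of step with the REQ/REP state machine
    except zmq.ZMQError as e:
        print("second send rejected, EFSM:", e.errno == zmq.EFSM)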


@benclifford

Further to that last comment, I'm unclear in which situations that re-establish code will work. I think it makes the following assumptions:
i) the previous TCP-level listening socket has been closed as a result of this ZMQError, because otherwise the rebind in this except block will fail;
ii) the interchange-side command ZMQ socket is in a "waiting for REQ" state, so that it will rebind in a way that is compatible with the above protocol flow. What happens, for example, if the rebind happens while a reply is waiting to be sent back to the command client? (Does ZMQ reconnect and then attempt to send the REP into the reconnected socket, which is not expecting a REP?)

benclifford added a commit that referenced this issue May 17, 2024
This is preparation for an upcoming PR that will replace the current
multiprocessing.Queue-based report of interchange ports (which has a 120s
timeout) with command-client-based retrieval of that information (so the
command client now needs to implement a 120s timeout to closely replicate
that behaviour). That work is itself part of using fork/exec to launch the
interchange, rather than multiprocessing.fork (issue #3373).

When the command client times out after sending a command, it marks itself as
permanently bad, because the state of the channel is now unknown. For example:
should the next action be to receive a response from a previously timed-out
command that was eventually executed? Should the channel be recreated, on the
assumption that a previously sent command was never sent?

Tagging issue #3376 (command client is not thread safe) because I feel like
reworking this timeout behaviour and reworking that thread safety might be a
single piece of deeper work.
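
The timeout-then-permanently-bad behaviour described in that commit message could look roughly like the following sketch (hypothetical names, not the actual Parsl implementation): poll for the reply with a deadline after each send, and refuse all further use of the client if the reply never arrives, since the REQ/REP channel is then in an unknown state.

    import zmq

    class TimeoutCommandClient:
        def __init__(self, sock, timeout_s=120):
            self._sock = sock                      # a zmq.REQ socket created elsewhere
            self._timeout_ms = int(timeout_s * 1000)
            self._ok = True

        def run(self, command):
            if not self._ok:
                raise RuntimeError("command channel is in an unknown state")
            self._sock.send_pyobj(command)
            poller = zmq.Poller()
            poller.register(self._sock, zmq.POLLIN)
            if not poller.poll(self._timeout_ms):  # empty result: no reply within the deadline
                self._ok = False                   # a late or missing REP desynchronises REQ/REP
                raise RuntimeError("timed out waiting for command reply")
            return self._sock.recv_pyobj()
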
benclifford added a commit that referenced this issue May 28, 2024

Before this PR, the interchange used a multiprocessing.Queue to send a single
message, containing the ports it is listening on, back to the submitting
process. This tied the interchange into being forked via multiprocessing,
even though the rest of the interchange is architected, as part of earlier
remote-interchange work, to be launched in any manner.

After this PR, the CommandClient used for other submit-side to interchange
communication is used to retrieve this information, removing that reliance
on multiprocessing and reducing the number of different communication channels
used between the interchange and submit side by one.

See issue #3373 for more context on launching the interchange via fork/exec
rather than using multiprocessing.

The CommandClient is not thread safe (see #3376), and it is possible that this will introduce a new thread-unsafety: this retrieval is called from the main thread of execution, while most invocations happen (later on) from the status poller thread.