htex command channel use is not thread safe #3376

Open

benclifford opened this issue Apr 20, 2024 · 2 comments

Describe the bug

The interchange command channel is presented on the submit side as an object, executor.command_client, which is invoked from any thread. However, it uses a ZMQ channel, which is not thread safe. When two threads hit it too close together in time, this causes weird behaviour in ZMQ: at least sometimes a segfault-style crash, and I suspect possibly also some hangs.
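
For illustration, the problematic pattern is roughly the following sketch (not Parsl code; the names and port are invented). A single zmq.REQ socket is created once and then driven from several threads, which pyzmq sockets do not support, so concurrent send/recv can crash or hang rather than fail cleanly:

    import threading
    import zmq

    context = zmq.Context()
    command_socket = context.socket(zmq.REQ)        # one socket, shared by every thread below
    command_socket.connect("tcp://127.0.0.1:9000")  # hypothetical interchange command port

    def run_command(cmd):
        command_socket.send_pyobj(cmd)              # unsafe: several threads enter here concurrently
        return command_socket.recv_pyobj()

    threads = [threading.Thread(target=run_command, args=(f"CMD_{i}",)) for i in range(4)]
    for t in threads:
        t.start()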

Command channel access should be made thread safe. I don't have a favoured solution. Some ideas that I am not super fond of, but which I guess would be OK:
i) make a new ZMQ connection (in the opposite TCP direction to the current command channel) for every thread that makes a command client invocation.
ii) use an internal cross-thread RPC mechanism and gateway those RPCs in the thread that owns the ZMQ connection (see the sketch after this list).
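
A minimal sketch of idea (ii), with hypothetical names: worker threads never touch the ZMQ socket; they hand commands to a single gateway thread that owns the socket and receive the reply through a per-call queue, so both the strict REQ/REP alternation and single-thread ownership of the socket are preserved.

    import queue
    import threading
    import zmq

    class GatewayedCommandClient:
        def __init__(self, address):
            self._requests = queue.Queue()
            self._thread = threading.Thread(target=self._serve, args=(address,), daemon=True)
            self._thread.start()

        def run(self, command):
            """Send a command from any thread and wait for its reply."""
            reply_box = queue.Queue(maxsize=1)
            self._requests.put((command, reply_box))
            return reply_box.get()

        def _serve(self, address):
            # Only this thread ever touches the ZMQ context and socket.
            context = zmq.Context()
            sock = context.socket(zmq.REQ)
            sock.connect(address)
            while True:
                command, reply_box = self._requests.get()
                sock.send_pyobj(command)            # one command at a time: REQ then REP
                reply_box.put(sock.recv_pyobj())

A real implementation would also need shutdown handling and per-call timeouts, but the point is that exactly one thread owns the socket.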

In the DESC branch of Parsl, I've put a lot of locks around ZMQ command channel accesses (roughly the pattern sketched below), but I don't believe this is sufficient to guarantee thread safety, given ZMQ's discussion of e.g. memory barriers.
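
The lock-based mitigation looks roughly like this (hypothetical names, not the DESC branch code). A threading.Lock serialises the REQ/REP exchanges, but the socket still migrates between threads, which ZMQ's documentation only sanctions with a full memory barrier between uses:

    import threading

    class LockedCommandClient:
        def __init__(self, sock):
            self._sock = sock              # a zmq.REQ socket created elsewhere
            self._lock = threading.Lock()

        def run(self, command):
            with self._lock:               # at most one REQ/REP exchange in flight
                self._sock.send_pyobj(command)
                return self._sock.recv_pyobj()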

To Reproduce
It's a race condition, so there is no deterministic reproducer.

Expected behavior
This thread-unsafety should not happen.


benclifford commented May 15, 2024

Another observation I made that I think is relevant to this issue:

The interchange command client contains this:

                except zmq.ZMQError:
                    logger.exception("Potential ZMQ REQ-REP deadlock caught")
                    logger.info("Trying to reestablish context")
                    self.zmq_context.recreate()
                    self.create_socket_and_bind()

The only reason I can figure out for this is that the command client is used in a non-thread-safe manner, with multiple commands invoked from multiple threads without waiting for responses. That is unsafe in two ways: 1) you can't send two REQs on the same ZMQ socket in a row without waiting for the first REP; and 2) you can't use the same ZMQ socket across multiple threads even if you sequence your operations properly.
#3376 talks about non-thread-safe use of the command client, and I guess I hope that if #3376 is fixed, the above can go away.
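
Constraint (1) can be hit from a single thread, as in this small sketch (hypothetical endpoint; nothing needs to be listening, because connect is asynchronous and the first send is simply queued). On a plain REQ socket, a second send before the matching recv is rejected with a ZMQError whose errno is EFSM:

    import zmq

    context = zmq.Context()
    req = context.socket(zmq.REQ)
    req.connect("tcp://127.0.0.1:9000")   # hypothetical endpoint

    req.send(b"first command")            # queued; the socket now expects a REP
    try:
        req.send(b"second command")       # out of step with the REQ/REP state machine
    except zmq.ZMQError as e:
        print("second send rejected, EFSM:", e.errno == zmq.EFSM)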


@benclifford

Further to that last comment, I'm unclear in which situations that re-establish code will work. I think it makes the following assumptions:
i) the previous TCP-level listening socket has been closed as a result of this ZMQError, because otherwise the rebind in this except block will fail;
ii) the interchange-side command ZMQ socket is in a "waiting for REQ" state, so that it will rebind in a way that is compatible with the above protocol flow. What happens, for example, if the rebind happens while a reply is waiting to be sent back to the command client? (Does ZMQ reconnect and then attempt to send the REP into the reconnected socket, which is not expecting a REP?)

benclifford added a commit that referenced this issue May 17, 2024
This is preparation for an upcoming PR that will replace the current
multiprocessing.Queue-based report of interchange ports (which has a 120s
timeout) with command-client-based retrieval of that information (so the
command client now needs to implement a 120s timeout to closely replicate
that behaviour). That work is itself part of using fork/exec to launch the
interchange, rather than multiprocessing.fork (issue #3373).

When the command client times out after sending a command, it marks itself as
permanently bad, because the state of the channel is now unknown. For example:
should the next action be to receive a response from a previously timed-out
command that was eventually executed? Should the channel be recreated, on the
assumption that a previously sent command was never sent?

Tagging issue #3376 (command client is not thread safe) because I feel like
reworking this timeout behaviour and reworking that thread safety might be a
single piece of deeper work.
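
The timeout-then-permanently-bad behaviour described in that commit message could look roughly like the following sketch (hypothetical names, not the actual Parsl implementation): poll for the reply with a deadline after each send, and refuse all further use of the client if the reply never arrives, since the REQ/REP channel is then in an unknown state.

    import zmq

    class TimeoutCommandClient:
        def __init__(self, sock, timeout_s=120):
            self._sock = sock                      # a zmq.REQ socket created elsewhere
            self._timeout_ms = int(timeout_s * 1000)
            self._ok = True

        def run(self, command):
            if not self._ok:
                raise RuntimeError("command channel is in an unknown state")
            self._sock.send_pyobj(command)
            poller = zmq.Poller()
            poller.register(self._sock, zmq.POLLIN)
            if not poller.poll(self._timeout_ms):  # empty result: no reply within the deadline
                self._ok = False                   # a late or missing REP desynchronises REQ/REP
                raise RuntimeError("timed out waiting for command reply")
            return self._sock.recv_pyobj()
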
benclifford added a commit that referenced this issue May 28, 2024

Before this PR, the interchange used a multiprocessing.Queue to send a single
message, containing the ports it is listening on, back to the submitting
process. This tied the interchange into being forked via multiprocessing,
even though the rest of the interchange is architected, as part of earlier
remote-interchange work, to be launched in any manner.

After this PR, the CommandClient used for other submit-side to interchange
communication is used to retrieve this information, removing that reliance
on multiprocessing and reducing the number of different communication channels
used between the interchange and submit side by one.

See issue #3373 for more context on launching the interchange via fork/exec
rather than using multiprocessing.

The CommandClient is not thread safe (see #3376), and it is possible that this will introduce a new thread-unsafety: this retrieval is called from the main thread of execution, while most invocations happen (later on) from the status poller thread.