Interchange: launch as a fork/exec not Python multiprocessing #3373
Comments
Regarding this point, I think I have encountered a situation (although it's quite awkward to trace) where some changes I would like to contribute to Parsl make use of regular Parsl code that logs to a parsl.* logging channel, and then hangs. I think (although I am not completely convinced) that this is because of a fork-but-no-exec lock, held over the fork by some part of the regular Parsl logging setup. Implementing this issue would remove that possibility.
In the immediately preceding comment, the hang looks like this:
always on the lock belonging to
This failure is extremely sensitive to perturbation: adding a log statement right before launching the interchange makes this hang not happen, for example. I could go further and modify the C source code to track which process actually took that lock, to confirm that this lock really does come from the parent already locked; or I could add an assert near startup that it is not locked. I'm not going to dig into that further, because this is good enough evidence for me, based on what I've seen debugging this sort of thing previously, that this is a logging lock held across a fork. So I'll proceed to work on this issue #3373, because I need to be able to work on the interchange for further DESC monitoring work.
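For reference, here is a generic, self-contained sketch of the fork-but-no-exec hazard being described. This is illustrative Python, not Parsl's code, and it uses a plain threading.Lock rather than logging's internal locks (recent CPython versions reinitialise logging's own handler locks around fork, but the underlying mechanism is the same):

```python
import os
import threading
import time

lock = threading.Lock()

def hold_lock_briefly():
    # Simulates e.g. a logging/monitoring thread that happens to hold a
    # lock at the moment the parent process forks.
    with lock:
        time.sleep(5)

threading.Thread(target=hold_lock_briefly, daemon=True).start()
time.sleep(0.5)          # make sure the background thread owns the lock now

pid = os.fork()          # fork (no exec) while another thread holds `lock`
if pid == 0:
    # Child: only the forking thread was copied, so nothing in this process
    # will ever release the inherited, already-locked lock. A plain
    # lock.acquire() here would block forever; the timeout only exists so
    # that this demonstration terminates.
    acquired = lock.acquire(timeout=10)
    print("child acquired lock:", acquired)   # prints False: this is the hang
    os._exit(0)
else:
    os.waitpid(pid, 0)
```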
This is in preparation, in an upcoming PR, for replacing the current multiprocessing.Queue based report of interchange ports (which has a 120s timeout) with a command client based retrieval of that information (so the command client now needs to implement a 120s timeout to closely replicate that behaviour). That work is itself part of using fork/exec to launch the interchange, rather than multiprocessing fork (issue #3373).
When the command client times out after sending a command, it sets itself to permanently bad: this is because the state of the channel is now unknown. For example, should the next action be to receive a response from a previously timed-out command that was eventually executed? Should the channel be recreated, assuming a previously sent command was never sent?
Tagging issue #3376 (command client is not thread safe) because I feel like reworking this timeout behaviour and reworking that thread safety might be a single piece of deeper work.
Co-authored-by: rjmello <[email protected]>
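A minimal sketch of that timeout-then-permanently-bad behaviour, assuming a plain pyzmq REQ socket; CommandClientSketch here is a hypothetical stand-in, not Parsl's actual CommandClient:

```python
import zmq


class CommandClientSketch:
    """Hypothetical ZMQ REQ-based command client that marks itself
    permanently bad after a timeout, because the REQ/REP state of the
    channel is then unknown."""

    def __init__(self, address: str, timeout_s: float = 120):
        self._context = zmq.Context()
        self._socket = self._context.socket(zmq.REQ)
        self._socket.connect(address)
        self._timeout_ms = int(timeout_s * 1000)
        self._bad = False

    def run(self, command: str):
        if self._bad:
            raise RuntimeError("command channel is in an unknown state")
        self._socket.send_pyobj(command)
        if self._socket.poll(self._timeout_ms, zmq.POLLIN) == 0:
            # No reply within the timeout: we do not know whether the command
            # was ever received, or whether a reply will arrive later, so
            # refuse to reuse this channel from now on.
            self._bad = True
            raise TimeoutError(f"no reply to {command!r} within timeout")
        return self._socket.recv_pyobj()
```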
Before this PR, the interchange used a multiprocessing.Queue to send a single message containing the ports it is listening on back to the submitting process. This ties the interchange into being forked via multiprocessing, even though the rest of the interchange is architected, as part of earlier remote-interchange work, to not need that. After this PR, the CommandClient used for other submit-side to interchange communication is used to retrieve this information, removing that reliance on multiprocessing and reducing by one the number of different communication channels used between the interchange and the submit side. See issue #3373 for more context on launching the interchange via fork/exec rather than using multiprocessing.
The CommandClient is not threadsafe (see #3376), and it is possible that this will introduce a new thread-unsafety, as this will be called from the main thread of execution, while most invocations happen (later on) from the status poller thread.
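As a hedged illustration of that retrieval path, reusing the hypothetical CommandClientSketch and "WORKER_PORTS" command from the sketch above (neither is Parsl's real API), the submit side could fetch the ports like this:

```python
# Assumes CommandClientSketch from the earlier sketch is in scope; the
# address, command name, and port tuple are illustrative assumptions.
client = CommandClientSketch("tcp://127.0.0.1:50055", timeout_s=120)
try:
    task_port, result_port = client.run("WORKER_PORTS")
except TimeoutError:
    # Mirrors the old multiprocessing.Queue behaviour: give up after 120s
    # if the interchange never reports its ports.
    raise RuntimeError("interchange did not report its ports within 120s")
```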
…l, not multiprocessing
Any downstream packaging will need to be aware of the presence of interchange.py as a new command-line invocable script, and this might break some build instructions which do not configure installed scripts onto the path.
This PR replaces keyword arguments with argparse command line parameters. It does not attempt to make those command line arguments differently-optional than the constructor of the Interchange class (for example, worker_ports and worker_port_range are both mandatory, because they are both specified before this PR).
I'm somewhat uncomfortable with this seeming like an ad-hoc serialise/deserialise protocol for what was previously effectively a dict of typed Python objects... but it's what the process worker pool does.
See issue #3373 for the interchange-specific issue, issue #2343 for the general Parsl fork vs. threads issue, and possibly issue #3378?
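A minimal sketch of what such an argparse-based, separately invocable interchange.py entry point could look like; the flag names and defaults here are illustrative assumptions, not Parsl's actual command line:

```python
import argparse
import logging


def main() -> None:
    parser = argparse.ArgumentParser(description="interchange entry point sketch")
    parser.add_argument("--client-address", required=True)
    # Mirror the pre-existing constructor behaviour described above:
    # the worker ports stay mandatory rather than becoming differently-optional.
    parser.add_argument("--worker-task-port", type=int, required=True)
    parser.add_argument("--worker-result-port", type=int, required=True)
    parser.add_argument("--logdir", default=".")
    args = parser.parse_args()

    # Because this process is exec'd fresh, it can configure the plain
    # "parsl" logger itself instead of avoiding handlers inherited over fork.
    logging.basicConfig(filename=f"{args.logdir}/interchange.log",
                        level=logging.DEBUG)
    logging.getLogger("parsl").info("interchange starting with %s", vars(args))


if __name__ == "__main__":
    main()
```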
Implemented by #3463
Is your feature request related to a problem? Please describe.
The interchange is launched using Python multiprocessing, but mostly wants to be its own freestanding process: for example, it awkwardly works around inheriting the main process logger by avoiding the "parsl" logging topic, and it communicates with the main process mostly via ZMQ (with only a small bit of multiprocessing-based communication near the start). Multiprocessing-fork is a Bad Thing (see issue #2343), so launching the interchange this way should go away.
Launching as a fresh process would also allow more normal logging (for example, logging everything to the interchange log file rather than carefully avoiding what was inherited).
The only use of multiprocessing queues is to communicate the ZMQ port numbers to be used by workers; that information could be transmitted in a different way, for example by adding a WORKER_PORTS command to the command channel.
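A minimal sketch of how such a WORKER_PORTS command could be answered on the interchange side, assuming a ZMQ REP command socket; the command name, function, and ports are illustrative, not Parsl's actual code:

```python
import zmq


def serve_commands(task_port: int, result_port: int,
                   bind_address: str = "tcp://127.0.0.1:50055") -> None:
    context = zmq.Context()
    command_socket = context.socket(zmq.REP)
    command_socket.bind(bind_address)

    while True:
        command = command_socket.recv_pyobj()
        if command == "WORKER_PORTS":
            # Report the ports the interchange is listening on, replacing the
            # single message previously sent over a multiprocessing.Queue.
            command_socket.send_pyobj((task_port, result_port))
        else:
            command_socket.send_pyobj(("UNKNOWN", command))
```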
Describe the solution you'd like
Use normal fork/exec so this behaves like a freestanding process.
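A minimal sketch of that launch path, building on the argparse entry point sketched earlier in this thread; the module path and flags are illustrative assumptions, not Parsl's actual launch code:

```python
import subprocess
import sys

# Start the interchange as a fresh fork/exec'd process via subprocess,
# instead of multiprocessing.Process.
interchange_proc = subprocess.Popen(
    [sys.executable, "-m", "parsl.executors.high_throughput.interchange",
     "--client-address", "127.0.0.1",
     "--worker-task-port", "54000",
     "--worker-result-port", "54001"],
)

# Unlike multiprocessing's fork, the child here execs a fresh interpreter:
# no inherited logging handlers, no locks copied in a locked state, and the
# port report can come back over the command channel instead of a Queue.
```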