Interchange: launch as a fork/exec not Python multiprocessing #3373
Comments
Regarding this point, I think I have encountered a situation (although it's quite awkward to trace) where some changes I would like to contribute to Parsl make use of regular Parsl code that logs to a parsl.* logging channel, and then hangs. I think (although I am not completely convinced) that this is because of a fork-but-no-exec lock, held over the fork by some part of the regular Parsl logging setup. Implementing this issue would remove that possibility.
In the immediately preceding comment, the hang looks like this:
always on the lock belonging to
This failure is extremely sensitive to perturbation: adding a log statement right before launching the interchange makes this hang not happen, for example. I could go further and modify the C source code to track which process actually took that lock, to confirm that this lock really does come from the parent already locked; or I could add an assert near startup that it is not locked. I'm not going to dig into that further, because this is good enough evidence for me, based on what I've seen debugging this sort of thing previously, that this is a logging lock held across a fork. So I'll proceed to work on this issue #3373, because I need to be able to work on the interchange for further DESC monitoring work.
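For reference, here is a generic, self-contained sketch of the fork-but-no-exec hazard being described. This is illustrative Python, not Parsl's code, and it uses a plain threading.Lock rather than logging's internal locks (recent CPython versions reinitialise logging's own handler locks around fork, but the underlying mechanism is the same):

```python
import os
import threading
import time

lock = threading.Lock()

def hold_lock_briefly():
    # Simulates e.g. a logging/monitoring thread that happens to hold a
    # lock at the moment the parent process forks.
    with lock:
        time.sleep(5)

threading.Thread(target=hold_lock_briefly, daemon=True).start()
time.sleep(0.5)          # make sure the background thread owns the lock now

pid = os.fork()          # fork (no exec) while another thread holds `lock`
if pid == 0:
    # Child: only the forking thread was copied, so nothing in this process
    # will ever release the inherited, already-locked lock. A plain
    # lock.acquire() here would block forever; the timeout only exists so
    # that this demonstration terminates.
    acquired = lock.acquire(timeout=10)
    print("child acquired lock:", acquired)   # prints False: this is the hang
    os._exit(0)
else:
    os.waitpid(pid, 0)
```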
This is in preparation, in an upcoming PR, for replacing the current multiprocessing.Queue based report of interchange ports (which has a 120s timeout) with a command client based retrieval of that information (so the command client now needs to implement a 120s timeout to closely replicate that behaviour). That work is itself part of using fork/exec to launch the interchange, rather than multiprocessing fork (issue #3373).
When the command client times out after sending a command, it sets itself to permanently bad: this is because the state of the channel is now unknown. For example, should the next action be to receive a response from a previously timed-out command that was eventually executed? Should the channel be recreated, assuming a previously sent command was never sent?
Tagging issue #3376 (command client is not thread safe) because I feel like reworking this timeout behaviour and reworking that thread safety might be a single piece of deeper work.
Co-authored-by: rjmello <[email protected]>
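A minimal sketch of that timeout-then-permanently-bad behaviour, assuming a plain pyzmq REQ socket; CommandClientSketch here is a hypothetical stand-in, not Parsl's actual CommandClient:

```python
import zmq


class CommandClientSketch:
    """Hypothetical ZMQ REQ-based command client that marks itself
    permanently bad after a timeout, because the REQ/REP state of the
    channel is then unknown."""

    def __init__(self, address: str, timeout_s: float = 120):
        self._context = zmq.Context()
        self._socket = self._context.socket(zmq.REQ)
        self._socket.connect(address)
        self._timeout_ms = int(timeout_s * 1000)
        self._bad = False

    def run(self, command: str):
        if self._bad:
            raise RuntimeError("command channel is in an unknown state")
        self._socket.send_pyobj(command)
        if self._socket.poll(self._timeout_ms, zmq.POLLIN) == 0:
            # No reply within the timeout: we do not know whether the command
            # was ever received, or whether a reply will arrive later, so
            # refuse to reuse this channel from now on.
            self._bad = True
            raise TimeoutError(f"no reply to {command!r} within timeout")
        return self._socket.recv_pyobj()
```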
Before this PR, the interchange used a multiprocessing.Queue to send a single message containing the ports it is listening on back to the submitting process. This ties the interchange into being forked via multiprocessing, even though the rest of the interchange is architected, as part of earlier remote-interchange work, to not need that. After this PR, the CommandClient used for other submit-side to interchange communication is used to retrieve this information, removing that reliance on multiprocessing and reducing by one the number of different communication channels used between the interchange and the submit side. See issue #3373 for more context on launching the interchange via fork/exec rather than using multiprocessing.
The CommandClient is not threadsafe (see #3376), and it is possible that this will introduce a new thread-unsafety, as this will be called from the main thread of execution, while most invocations happen (later on) from the status poller thread.
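As a hedged illustration of that retrieval path, reusing the hypothetical CommandClientSketch and "WORKER_PORTS" command from the sketch above (neither is Parsl's real API), the submit side could fetch the ports like this:

```python
# Assumes CommandClientSketch from the earlier sketch is in scope; the
# address, command name, and port tuple are illustrative assumptions.
client = CommandClientSketch("tcp://127.0.0.1:50055", timeout_s=120)
try:
    task_port, result_port = client.run("WORKER_PORTS")
except TimeoutError:
    # Mirrors the old multiprocessing.Queue behaviour: give up after 120s
    # if the interchange never reports its ports.
    raise RuntimeError("interchange did not report its ports within 120s")
```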
…l, not multiprocessing
Any downstream packaging will need to be aware of the presence of interchange.py as a new command-line invocable script, and this might break some build instructions which do not configure installed scripts onto the path.
This PR replaces keyword arguments with argparse command line parameters. It does not attempt to make those command line arguments differently-optional than the constructor of the Interchange class (for example, worker_ports and worker_port_range are both mandatory, because they are both specified before this PR).
I'm somewhat uncomfortable with this seeming like an ad-hoc serialise/deserialise protocol for what was previously effectively a dict of typed Python objects... but it's what the process worker pool does.
See issue #3373 for the interchange-specific issue, issue #2343 for the general Parsl fork vs. threads issue, and possibly issue #3378?
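A minimal sketch of what such an argparse-based, separately invocable interchange.py entry point could look like; the flag names and defaults here are illustrative assumptions, not Parsl's actual command line:

```python
import argparse
import logging


def main() -> None:
    parser = argparse.ArgumentParser(description="interchange entry point sketch")
    parser.add_argument("--client-address", required=True)
    # Mirror the pre-existing constructor behaviour described above:
    # the worker ports stay mandatory rather than becoming differently-optional.
    parser.add_argument("--worker-task-port", type=int, required=True)
    parser.add_argument("--worker-result-port", type=int, required=True)
    parser.add_argument("--logdir", default=".")
    args = parser.parse_args()

    # Because this process is exec'd fresh, it can configure the plain
    # "parsl" logger itself instead of avoiding handlers inherited over fork.
    logging.basicConfig(filename=f"{args.logdir}/interchange.log",
                        level=logging.DEBUG)
    logging.getLogger("parsl").info("interchange starting with %s", vars(args))


if __name__ == "__main__":
    main()
```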
Implemented by #3463
Is your feature request related to a problem? Please describe.
The interchange is launched using Python multiprocessing, but mostly wants to be its own freestanding process: for example, it awkwardly works around inheriting the main process logger by avoiding the "parsl" logging topic, and it communicates with the main process mostly via ZMQ (with only a small bit of multiprocessing-based communication near the start). Multiprocessing-fork is a Bad Thing (see issue #2343), so launching the interchange this way should go away.
Launching as a fresh process would also allow more normal logging (for example, logging everything to the interchange log file rather than carefully avoiding what was inherited).
The only use of multiprocessing queues is to communicate the ZMQ port numbers to be used by workers; that information could be transmitted in a different way, for example by adding a WORKER_PORTS command to the command channel.
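A minimal sketch of how such a WORKER_PORTS command could be answered on the interchange side, assuming a ZMQ REP command socket; the command name, function, and ports are illustrative, not Parsl's actual code:

```python
import zmq


def serve_commands(task_port: int, result_port: int,
                   bind_address: str = "tcp://127.0.0.1:50055") -> None:
    context = zmq.Context()
    command_socket = context.socket(zmq.REP)
    command_socket.bind(bind_address)

    while True:
        command = command_socket.recv_pyobj()
        if command == "WORKER_PORTS":
            # Report the ports the interchange is listening on, replacing the
            # single message previously sent over a multiprocessing.Queue.
            command_socket.send_pyobj((task_port, result_port))
        else:
            command_socket.send_pyobj(("UNKNOWN", command))
```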
Describe the solution you'd like
Use normal fork/exec so this behaves like a freestanding process.
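A minimal sketch of that launch path, building on the argparse entry point sketched earlier in this thread; the module path and flags are illustrative assumptions, not Parsl's actual launch code:

```python
import subprocess
import sys

# Start the interchange as a fresh fork/exec'd process via subprocess,
# instead of multiprocessing.Process.
interchange_proc = subprocess.Popen(
    [sys.executable, "-m", "parsl.executors.high_throughput.interchange",
     "--client-address", "127.0.0.1",
     "--worker-task-port", "54000",
     "--worker-result-port", "54001"],
)

# Unlike multiprocessing's fork, the child here execs a fresh interpreter:
# no inherited logging handlers, no locks copied in a locked state, and the
# port report can come back over the command channel instead of a Queue.
```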