Describe the bug
Workers are not cleaned up from a launch node after a job is terminated by the scheduler.
My app places the FuncX manager on the MOM node of a Cray supercomputer so that workers can launch MPI applications via system calls. The manager process is killed when the job exits, but the workers keep running afterwards.
Looking at the logs, I note the workers report receiving a Signal 15 but do not exit. Is that expected?
(miniconda-3/latest//home/lward/exalearn/edw/env) lward@thetalogin6:~/.funcx/nwchem/HighThroughputExecutor/worker_logs/70f647195873> more funcx_worker_32.log
1649883351.811096 2022-04-13 20:55:51 INFO MainProcess-70029 MainThread-140634291058496 funcx_endpoint.executors.high_throughput.funcx_worker:85 __init__ Initializing worker 32
1649883351.813936 2022-04-13 20:55:51 INFO MainProcess-70029 MainThread-140634291058496 funcx_endpoint.executors.high_throughput.funcx_worker:86 __init__ Worker is of type: RAW
1649883351.814639 2022-04-13 20:55:51 INFO MainProcess-70029 MainThread-140634291058496 funcx_endpoint.executors.high_throughput.funcx_worker:95 __init__ Trying to connect to : tcp://127.0.0.1:52075
1649883351.815704 2022-04-13 20:55:51 INFO MainProcess-70029 MainThread-140634291058496 funcx_endpoint.executors.high_throughput.funcx_worker:109 start Starting worker
1649884526.686283 2022-04-13 21:15:26 ERROR MainProcess-70029 MainThread-140634291058496 funcx_endpoint.executors.high_throughput.funcx_worker:101 handler Signal handler called with signal 15
1649884762.299373 2022-04-13 21:19:22 ERROR MainProcess-70029 MainThread-140634291058496 funcx_endpoint.executors.high_throughput.funcx_worker:101 handler Signal handler called with signal 15
To Reproduce
TBD. My app has a complex setup, but I can create a minimal example on request.
Expected behavior
Every process (manager and workers) exits when Cobalt terminates the job.
Environment
OS: CentOS
OS & Container technology: None
Python version @ 3.8.12
Python version @ 3.8.12
funcx version @ 58493f
funcx-endpoint version @ 58493f
Distributed Environment
Where are you running the funcX script from? Login node
Where does the endpoint run? Login node
What is your endpoint-uuid? ff59d7d1-e2f5-4a38-8bb8-ba6de588c7c7
Attach endpoint logs at ~/.funcx/<ENDPOINT_NAME> if this is an endpoint issue.
Please let us know if you'd prefer to share logs privately.
worker-no-die.tar.gz