Exception in WorkerPlugin is not handled correctly #6906

Open
vepadulano opened this issue Aug 18, 2022 · 2 comments

@vepadulano
Contributor

I have an application that may write errors to stderr (which I then save to a file). These errors can sometimes cause the worker to die and the Nanny to restart it. I was hoping to use a worker plugin that reads the errors the worker wrote to the file and raises an exception accordingly. After reading #4297, it seems that the WorkerPlugin's setup method is the only place where exceptions are actually handled. Yet with the following reproducer, even a simple exception raised there is not propagated properly:

import inspect
import os

from dask.distributed import LocalCluster, Client, WorkerPlugin
from dask import delayed


# Plugin that logs every hook invocation to worker_plugin.txt; setup() raises
# if a previous run of the task wrote error_log.txt (i.e. after a worker restart).
class MyWorkerPlugin(WorkerPlugin):
    def setup(self, *args, **kwargs):
        with open("worker_plugin.txt", "a") as f:
            f.write("#"*20 + "\n")
            f.write(f"BEGIN {inspect.stack()[0][3]}\n")
            f.write(f"{args}\n")
            f.write(f"{kwargs}\n")
            f.write(f"END {inspect.stack()[0][3]}\n")
            f.write("#"*20 + "\n")

        if os.path.exists("error_log.txt"):
            with open("worker_plugin.txt", "a") as f:
                f.write("FOUND ERROR WHEN STARTING WORKER\n")
            raise RuntimeError("Error during worker setup!")

        super().setup(*args, **kwargs)

    def transition(self, *args, **kwargs):
        with open("worker_plugin.txt", "a") as f:
            f.write("#"*20 + "\n")
            f.write(f"BEGIN {inspect.stack()[0][3]}\n")
            f.write(f"{args}\n")
            f.write(f"{kwargs}\n")
            f.write(f"END {inspect.stack()[0][3]}\n")
            f.write("#"*20 + "\n")

        super().transition(*args, **kwargs)

    def teardown(self, *args, **kwargs):
        with open("worker_plugin.txt", "a") as f:
            f.write("#"*20 + "\n")
            f.write(f"BEGIN {inspect.stack()[0][3]}\n")
            f.write(f"{args}\n")
            f.write(f"{kwargs}\n")
            f.write(f"END {inspect.stack()[0][3]}\n")
            f.write("#"*20 + "\n")

        super().teardown(*args, **kwargs)


def create_connection():
    cluster = LocalCluster(n_workers=1, threads_per_worker=1, processes=True)
    client = Client(cluster)
    client.register_worker_plugin(MyWorkerPlugin())
    return client


@delayed
def task():
    print("IN TASK")
    with open("error_log.txt", "w") as f:
        f.write("An error was triggered.\n")
    os._exit(0)  # Used to trigger worker restart


def main():
    client = create_connection()
    task().compute()


if __name__ == "__main__":
    raise SystemExit(main())

Notice that the os._exit call is made inside the delayed task to reproduce the scenario of a worker dying and being restarted after a crash. As you can see, the WorkerPlugin's setup method checks whether the error log exists and, if so, raises an exception. Unfortunately, this doesn't work as intended:

$: python test.py
IN TASK
2022-08-18 15:38:27,372 - distributed.nanny - WARNING - Restarting worker
IN TASK
2022-08-18 15:38:27,842 - distributed.nanny - WARNING - Restarting worker
IN TASK
2022-08-18 15:38:28,303 - distributed.nanny - WARNING - Restarting worker
2022-08-18 15:38:28,599 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-bc_8koau', purging
IN TASK
2022-08-18 15:38:28,778 - distributed.nanny - WARNING - Restarting worker
Traceback (most recent call last):
  File "/home/vpadulan/projects/rootcode/dask-segfault-repro/trigger-error-from-log-file/test.py", line 69, in <module>
    raise SystemExit(main())
  File "/home/vpadulan/projects/rootcode/dask-segfault-repro/trigger-error-from-log-file/test.py", line 65, in main
    task().compute()
  File "/home/vpadulan/.local/lib/python3.10/site-packages/dask/base.py", line 315, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/vpadulan/.local/lib/python3.10/site-packages/dask/base.py", line 598, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/vpadulan/.local/lib/python3.10/site-packages/distributed/client.py", line 3001, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/home/vpadulan/.local/lib/python3.10/site-packages/distributed/client.py", line 2175, in gather
    return self.sync(
  File "/home/vpadulan/.local/lib/python3.10/site-packages/distributed/utils.py", line 338, in sync
    return sync(
  File "/home/vpadulan/.local/lib/python3.10/site-packages/distributed/utils.py", line 405, in sync
    raise exc.with_traceback(tb)
  File "/home/vpadulan/.local/lib/python3.10/site-packages/distributed/utils.py", line 378, in f
    result = yield future
  File "/home/vpadulan/.local/lib/python3.10/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/home/vpadulan/.local/lib/python3.10/site-packages/distributed/client.py", line 2038, in _gather
    raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: ('task-54fed301-8571-4966-9392-0380265630ea', <WorkerState 'tcp://127.0.0.1:44289', name: 0, status: closed, memory: 0, processing: 1>)

The nanny restarts the worker, but the exception is never raised, and the application keeps retrying until the scheduler kills the worker after too many restarts.
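For context (this note and the snippet are mine, not part of the original report): the number of times a task may lose its worker before the scheduler gives up with KilledWorker is governed by the distributed.scheduler.allowed-failures config value, which defaults to 3; lowering it makes the failure above surface sooner:

import dask

# Give up on a task the first time the worker executing it dies, instead of
# retrying three times (the default) before raising KilledWorker.
dask.config.set({"distributed.scheduler.allowed-failures": 0})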
The output log file from the worker plugin correctly shows that the error was found:

####################
BEGIN setup
()
{'worker': <Worker 'tcp://127.0.0.1:40513', name: 0, status: running, stored: 0, running: 0/1, ready: 0, comm: 0, waiting: 0>}
END setup
####################
####################
BEGIN transition
('task-54fed301-8571-4966-9392-0380265630ea', 'released', 'waiting')
{}
END transition
####################
####################
BEGIN transition
('task-54fed301-8571-4966-9392-0380265630ea', 'waiting', 'ready')
{}
END transition
####################
####################
BEGIN transition
('task-54fed301-8571-4966-9392-0380265630ea', 'ready', 'executing')
{}
END transition
####################
####################
BEGIN setup
()
{'worker': <Worker 'tcp://127.0.0.1:38679', name: 0, status: running, stored: 0, running: 0/1, ready: 0, comm: 0, waiting: 0>}
END setup
####################
FOUND ERROR WHEN STARTING WORKER
[...]

Bottom line, I have the following two questions:

  1. Why is setup the only place where a WorkerPlugin can raise an exception that is properly handled? (A sketch of the path where this does work follows the list.)
  2. Why is the exception handled only if it is raised at the main scope of the function?
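
As a minimal sketch of the registration-time path (my illustration based on #4297, not code from the thread; assuming a local client): when register_worker_plugin targets workers that are already running, plugin_add is invoked with error catching disabled, so a failing setup propagates back to the caller:

from distributed import Client, WorkerPlugin

class FailingPlugin(WorkerPlugin):
    def setup(self, worker):
        raise RuntimeError("Error during worker setup!")

if __name__ == "__main__":
    client = Client(n_workers=1)
    # The RuntimeError from setup() surfaces here, at registration time.
    client.register_worker_plugin(FailingPlugin())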

Setup

Python 3.10
Dask 2022.7.1

@vepadulano
Contributor Author

I've found out that the offending code is in the _register_with_scheduler function:

await asyncio.gather(
    *(
        self.plugin_add(name=name, plugin=plugin)
        for name, plugin in response["worker-plugins"].items()
    )
)
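
The reason the gather above never raises is that plugin_add catches setup errors by default and returns them as error messages instead of propagating them. Roughly (my paraphrase of distributed/worker.py around this version; details may differ):

async def plugin_add(self, plugin=None, name=None, catch_errors=True):
    ...
    try:
        result = plugin.setup(worker=self)
        if inspect.isawaitable(result):
            result = await result
    except Exception as e:
        if not catch_errors:
            raise
        return error_message(e)  # swallowed: returned as a message, not raised
    return {"status": "OK"}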

Similarly to #4297, exceptions coming from the plugin's setup method are not propagated correctly. I tried to copy the same logic into this part of the function; the diff is:

--- a/distributed/worker.py
+++ b/distributed/worker.py
@@ -1127,12 +1127,22 @@ class Worker(BaseWorker, ServerNode):
             logger.error(f"Unable to connect to scheduler: {msg}")
             raise ValueError(f"Unexpected response from register: {response!r}")
         else:
-            await asyncio.gather(
+            plugins_msgs = await asyncio.gather(
                 *(
-                    self.plugin_add(name=name, plugin=plugin)
+                    self.plugin_add(name=name, plugin=plugin, catch_errors=False)
                     for name, plugin in response["worker-plugins"].items()
-                )
+                ),
+                return_exceptions=True,
             )
+            plugins_exceptions = [msg for msg in plugins_msgs if isinstance(msg, Exception)]
+            if len(plugins_exceptions) >= 1:
+                if len(plugins_exceptions) > 1:
+                    logger.error(
+                        "Multiple plugin exceptions raised. All exceptions will be logged, the first is raised."
+                    )
+                    for exc in plugins_exceptions:
+                        logger.error(repr(exc))
+                raise plugins_exceptions[0]
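
Aside (mine, not from the thread): the key to this pattern is that asyncio.gather with return_exceptions=True hands exceptions back as ordinary results instead of propagating the first one, which is what allows logging them all before re-raising. A standalone illustration:

import asyncio

async def fail(msg):
    raise RuntimeError(msg)

async def main():
    # Exceptions are returned in task order rather than aborting the gather.
    results = await asyncio.gather(fail("a"), fail("b"), return_exceptions=True)
    print(results)  # [RuntimeError('a'), RuntimeError('b')]

asyncio.run(main())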

This diff makes things somewhat better, but doesn't fix the issue: the exception is now raised properly, yet the program never exits after the worker dies.

$: python test.py
[...]
2022-08-19 17:12:45,062 - distributed.nanny - ERROR - Failed to restart worker after its process exited
Traceback (most recent call last):
  File "/home/vpadulan/.local/lib/python3.10/site-packages/distributed/core.py", line 481, in start
    await asyncio.wait_for(self.start_unsafe(), timeout=timeout)
  File "/usr/lib64/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/home/vpadulan/.local/lib/python3.10/site-packages/distributed/worker.py", line 1418, in start_unsafe
    await self._register_with_scheduler()
  File "/home/vpadulan/.local/lib/python3.10/site-packages/distributed/worker.py", line 1147, in _register_with_scheduler
    raise plugins_exceptions[0]
  File "/home/vpadulan/.local/lib/python3.10/site-packages/distributed/utils.py", line 799, in wrapper
    return await func(*args, **kwargs)
  File "/home/vpadulan/.local/lib/python3.10/site-packages/distributed/worker.py", line 1805, in plugin_add
    result = plugin.setup(worker=self)
  File "/home/vpadulan/projects/rootcode/dask-segfault-repro/trigger-error-from-log-file/test.py", line 45, in setup
    raise RuntimeError("Error during worker setup!")
RuntimeError: Error during worker setup!

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/vpadulan/.local/lib/python3.10/site-packages/distributed/nanny.py", line 524, in _on_worker_exit
    await self.instantiate()
  File "/home/vpadulan/.local/lib/python3.10/site-packages/distributed/nanny.py", line 423, in instantiate
    result = await self.process.start()
  File "/home/vpadulan/.local/lib/python3.10/site-packages/distributed/nanny.py", line 671, in start
    msg = await self._wait_until_connected(uid)
  File "/home/vpadulan/.local/lib/python3.10/site-packages/distributed/nanny.py", line 799, in _wait_until_connected
    raise msg["exception"]
  File "/home/vpadulan/.local/lib/python3.10/site-packages/distributed/nanny.py", line 868, in run
    await worker
  File "/home/vpadulan/.local/lib/python3.10/site-packages/distributed/core.py", line 489, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.

There must be some place where this exception leads to the worker not closing itself properly (perhaps some extra communication with the nanny/scheduler is needed). Unfortunately, I haven't been able to find it so far.
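
Until the root cause is found, one way to keep the application from hanging forever (my suggestion, not something proposed in the thread) is to bound the wait on the client side:

# Hypothetical guard: compute through an explicit future and bound the wait,
# so the application fails with a TimeoutError instead of hanging indefinitely.
future = client.compute(task())
result = future.result(timeout=60)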

@jtlz2

jtlz2 commented Jun 15, 2023

@vepadulano Did you ever get to the end of the story on this one? :\
