Exception in WorkerPlugin is not handled correctly #6906
I've found out that the offending code is in `distributed/worker.py`, lines 1130 to 1135 (commit a1d2011).
Similarly to #4297, the exceptions coming from the plugin are not handled correctly. I tried the following patch:

```diff
--- a/distributed/worker.py
+++ b/distributed/worker.py
@@ -1127,12 +1127,22 @@ class Worker(BaseWorker, ServerNode):
             logger.error(f"Unable to connect to scheduler: {msg}")
             raise ValueError(f"Unexpected response from register: {response!r}")
         else:
-            await asyncio.gather(
+            plugins_msgs = await asyncio.gather(
                 *(
-                    self.plugin_add(name=name, plugin=plugin)
+                    self.plugin_add(name=name, plugin=plugin, catch_errors=False)
                     for name, plugin in response["worker-plugins"].items()
-                )
+                ),
+                return_exceptions=True,
             )
+            plugins_exceptions = [msg for msg in plugins_msgs if isinstance(msg, Exception)]
+            if len(plugins_exceptions) >= 1:
+                if len(plugins_exceptions) > 1:
+                    logger.error(
+                        "Multiple plugin exceptions raised. All exceptions will be logged, the first is raised."
+                    )
+                for exc in plugins_exceptions:
+                    logger.error(repr(exc))
+                raise plugins_exceptions[0]
```

This makes things a bit better but doesn't fix the issue: the exception is now properly raised, but the program never ends after the worker dies.
There must be some place where this exception leads to the worker not properly closing itself (maybe some extra communication with the nanny/scheduler is needed). Unfortunately, I haven't been able to find it so far.
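For reference, the patch above relies on `asyncio.gather(..., return_exceptions=True)`, which collects raised exceptions as ordinary results instead of propagating the first one. A minimal standalone illustration of that behaviour:

```python
import asyncio


async def ok():
    return "ok"


async def boom():
    raise ValueError("plugin failed")


async def main():
    # With return_exceptions=True, gather() does not raise; exceptions come
    # back in-place, so every failure can be inspected and logged.
    results = await asyncio.gather(ok(), boom(), return_exceptions=True)
    errors = [r for r in results if isinstance(r, Exception)]
    print(results)  # ['ok', ValueError('plugin failed')]
    print(errors)   # [ValueError('plugin failed')]


asyncio.run(main())
```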
@vepadulano Did you ever get to the end of the story on this one? :\
I have an application that may output errors to stderr (which I then save to a file). The errors may sometimes lead to the worker dying and the Nanny restarting the worker. I was hoping to use a worker plugin to read the errors that the worker wrote to the file and raise an exception accordingly. After reading #4297, it seems that the only place where exceptions are actually handled is in the WorkerPlugin's `setup` method. With the following reproducer, a simple exception is not raised properly.
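A minimal sketch of such a reproducer (the original code block is not shown here, so the plugin class, file path, and cluster setup below are assumptions based on the description):

```python
# Hypothetical reproducer; names and paths are assumptions, not the original code.
import os

import dask
from dask.distributed import Client, LocalCluster, WorkerPlugin

ERRLOG = "worker_errors.log"  # assumed path of the saved stderr output


class ErrorCheckPlugin(WorkerPlugin):
    def setup(self, worker):
        # Runs every time a worker starts, including after a Nanny restart:
        # if a previous run left errors behind, raise.
        if os.path.exists(ERRLOG):
            raise RuntimeError(f"Found errors in {ERRLOG}")


@dask.delayed
def crash():
    # Simulate an application error followed by a hard worker death.
    with open(ERRLOG, "w") as f:
        f.write("some application error\n")
    os._exit(1)  # kills the worker process; the Nanny restarts it


if __name__ == "__main__":
    cluster = LocalCluster(n_workers=1)
    client = Client(cluster)
    client.register_worker_plugin(ErrorCheckPlugin(), name="error-check")
    # Expectation: the restarted worker's setup() raises and the app stops.
    # Observed: the exception never surfaces and the computation keeps retrying.
    crash().compute()
```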
Notice that the `os._exit` call is made in the delayed task to reproduce the scenario of a worker dying and being restarted after a crash. As you can see, the WorkerPlugin's `setup` method checks whether the output error log exists and, if so, raises an exception. Unfortunately, this doesn't work as intended: the nanny restarts the worker, but the exception is never raised, and the app continues until the scheduler kills the worker for too many retries.
The output log file from the worker plugin correctly shows that the error was found.
Bottom line, I have the following two questions:

1. Is `setup` the only place where a WorkerPlugin can raise an exception that is properly handled?

Setup
Python 3.10
Dask 2022.7.1