With #6361, any temporary network disconnect will shut down the worker.
Currently, that won't be considered a safe closure. Any tasks running on the worker will be marked as suspicious. In an unreliable network environment, that could lead to tasks being errored (with a KilledWorker exception) just due to network disconnects.
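One coarse mitigation that already exists is raising the scheduler's allowed-failures threshold, so more disconnect-induced worker losses are tolerated per task before it is errored with KilledWorker. A sketch of user configuration (the key `distributed.scheduler.allowed-failures` exists in distributed's defaults; the value shown is only an example):

```yaml
# Sketch: tolerate more worker losses per task before erroring it.
# distributed.scheduler.allowed-failures defaults to 3.
distributed:
  scheduler:
    allowed-failures: 10
```

This doesn't distinguish network blips from crashes, it just raises the bar for both.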
Differentiating between a transient network failure and a worker crashing and disconnecting isn't possible on the scheduler side (until we re-implement reconnection). So there may be nothing we can do here.
But perhaps we could at least try to signal this from the Nanny? For example, if the worker is shutting down due to network interrupt, it could signal this to the Nanny, which could try to signal it to the scheduler? There are some race conditions here though around when the worker<->scheduler comm is broken, since the scheduler immediately removes the worker state and marks the tasks as suspicious.
> any temporary network disconnect will shut down the worker.
This is not entirely true. Network disconnects that are shorter than distributed.comm.timeouts.tcp (default 30s) will not even be noticed by users.
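As a concrete illustration, a user on a flaky network can raise this timeout so that longer disconnects also go unnoticed. Sketch of user configuration (key name as cited above; the value is an example):

```yaml
# Sketch: disconnects shorter than this TCP timeout are never surfaced.
# distributed.comm.timeouts.tcp defaults to 30s.
distributed:
  comm:
    timeouts:
      tcp: 120s
```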
> But perhaps we could at least try to signal this from the Nanny?
Our assumption in removing reconnection is that networks are reliable given a sufficiently large TCP timeout. Given that, I don't think we should add code complexity for a more sophisticated allowed-failures detection.
The one thing I would consider valuable is improving the logic of our suspicious counters. The counter is currently very crude: it only considers tasks in state processing (scheduler) and doesn't distinguish between waiting (worker) and executing (worker). We should probably only increase the counter for tasks that are in executing (worker).
With that change, if a worker repeatedly disconnected from the network while a given task was executing, I think it would be fine to raise a KilledWorker exception for that task.
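The refinement above can be sketched in plain Python. This is purely illustrative, not the real scheduler API: `handle_worker_loss`, the state strings, and the counter structure are hypothetical stand-ins for the scheduler's bookkeeping.

```python
# Hypothetical sketch: only tasks that were actually *executing* on a
# lost worker accumulate suspicion toward allowed-failures; tasks that
# were merely queued ("waiting") are rescheduled without penalty.
from collections import defaultdict

ALLOWED_FAILURES = 3  # mirrors distributed.scheduler.allowed-failures

suspicious = defaultdict(int)

def handle_worker_loss(tasks_on_worker):
    """tasks_on_worker: mapping of task key -> worker-side state."""
    errored = []
    for key, state in tasks_on_worker.items():
        if state == "executing":
            suspicious[key] += 1
            if suspicious[key] > ALLOWED_FAILURES:
                errored.append(key)  # would be errored with KilledWorker
        # "waiting" tasks are simply rescheduled, no suspicion added
    return errored

# Example: the same worker dies four times while "a" executes and "b" waits.
for _ in range(4):
    bad = handle_worker_loss({"a": "executing", "b": "waiting"})
print(bad)               # ['a']
print(dict(suspicious))  # {'a': 4}
```

The point of the sketch is the asymmetry: "b" never accumulates suspicion no matter how often the worker disconnects, so only tasks plausibly responsible for the loss can hit KilledWorker.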
See #6396