Consider connection-failure worker closures as safe? #6386

Open
Tracked by #6384
gjoseph92 opened this issue May 20, 2022 · 1 comment
Labels: discussion (Discussing a topic with no specific actions yet), networking

Comments

@gjoseph92 (Collaborator)

With #6361, any temporary network disconnect will shut down the worker.

Currently, that won't be considered a safe closure. Any tasks running on the worker will be marked as suspicious. In an unreliable network environment, that could lead to tasks being errored (with a KilledWorker exception) just due to network disconnects.

Differentiating between a transient network failure and a worker crashing and disconnecting isn't possible on the scheduler side (until we re-implement reconnection). So there may be nothing we can do here.

But perhaps we could at least try to signal this from the Nanny? For example, if the worker is shutting down due to a network interruption, it could signal this to the Nanny, which could in turn try to signal it to the scheduler. There are race conditions here, though, around when the worker<->scheduler comm breaks, since the scheduler immediately removes the worker state and marks its tasks as suspicious.

@fjetter (Member) commented May 20, 2022

> any temporary network disconnect will shut down the worker.

This is not entirely true. Network disconnects that are shorter than distributed.comm.timeouts.tcp (default 30s) will not even be noticed by users.
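For environments with flakier networks, that window can be widened in the distributed configuration. A sketch, assuming the standard `distributed.yaml` config file layout (the key path `distributed.comm.timeouts.tcp` is from the comment above; the `60s` value is just an illustrative choice):

```yaml
# ~/.config/dask/distributed.yaml
# Raise the TCP comm timeout (default 30s) so that network blips
# shorter than this window go unnoticed by the scheduler.
distributed:
  comm:
    timeouts:
      tcp: 60s
```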

> But perhaps we could at least try to signal this from the Nanny?

Our assumption in removing reconnect is that networks are reliable given a sufficiently large TCP timeout. Given that, I don't think we should add code complexity to support more sophisticated allowed-failures detection.

The one thing I would perceive as valuable is improving the logic of our suspicious counters. The counter is currently very crude: it considers only tasks in state processing (scheduler) and doesn't distinguish between waiting (worker) and executing (worker). We should probably only increment the counter for tasks that are in executing (worker).
If that were the case, and a worker repeatedly disconnected from the network while a certain task was executing, I think it would be fine to raise a KilledWorker exception.
See #6396
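A minimal sketch of the refined counting described above, using hypothetical names (`update_suspicious`, `ALLOWED_FAILURES`) rather than the actual scheduler internals: the counter is bumped only for tasks the dead worker was actually executing, never for tasks merely queued on it.

```python
ALLOWED_FAILURES = 3  # mirrors the distributed.scheduler.allowed-failures default


def update_suspicious(tasks, worker_states, counters):
    """Increment suspicious counters only for tasks the lost worker
    was executing, not for tasks merely waiting on it.

    tasks: keys that were in state 'processing' on the scheduler
    worker_states: key -> worker-side state ('waiting' or 'executing')
    counters: key -> suspicious count, mutated in place

    Returns the keys that should now fail with KilledWorker.
    """
    killed = []
    for key in tasks:
        if worker_states.get(key) == "executing":
            counters[key] = counters.get(key, 0) + 1
            if counters[key] > ALLOWED_FAILURES:
                killed.append(key)
    return killed
```

With this scheme, a task that happens to be queued on an unlucky worker accrues no suspicion at all; only repeated worker deaths while the same task is running lead to KilledWorker.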

@fjetter fjetter added networking discussion Discussing a topic with no specific actions yet labels Jun 15, 2022