Consider connection-failure worker closures as safe? #6386

Open
Tracked by #6384
gjoseph92 opened this issue May 20, 2022 · 1 comment
Labels: discussion (Discussing a topic with no specific actions yet), networking

Comments

@gjoseph92 (Collaborator)

With #6361, any temporary network disconnect will shut down the worker.

Currently, that won't be considered a safe closure. Any tasks running on the worker will be marked as suspicious. In an unreliable network environment, that could lead to tasks being errored (with a KilledWorker exception) just due to network disconnects.

Differentiating between a transient network failure and a worker crashing and disconnecting isn't possible on the scheduler side (until we re-implement reconnection). So there may be nothing we can do here.

But perhaps we could at least try to signal this from the Nanny? For example, if the worker is shutting down due to a network interruption, it could signal this to the Nanny, which could in turn try to signal it to the scheduler. There are race conditions here, though, around when the worker<->scheduler comm breaks, since the scheduler immediately removes the worker state and marks its tasks as suspicious.

@fjetter (Member) commented May 20, 2022

> any temporary network disconnect will shut down the worker.

This is not entirely true. Network disconnects that are shorter than distributed.comm.timeouts.tcp (default 30s) will not even be noticed by users.
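For environments with flakier networks, that window can be widened in the distributed configuration. A sketch, assuming the standard `distributed.yaml` config file layout (the key path `distributed.comm.timeouts.tcp` is from the comment above; the `60s` value is just an illustrative choice):

```yaml
# ~/.config/dask/distributed.yaml
# Raise the TCP comm timeout (default 30s) so that network blips
# shorter than this window go unnoticed by the scheduler.
distributed:
  comm:
    timeouts:
      tcp: 60s
```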

> But perhaps we could at least try to signal this from the Nanny?

Our assumption in removing reconnect is that networks are reliable given a sufficiently large TCP timeout. Given that, I don't think we should add code complexity to support more sophisticated allowed-failures detection.

The one thing I would perceive as valuable is improving the logic of our suspicious counters. The counter is currently very crude: it considers only tasks in state processing (scheduler) and doesn't distinguish between waiting (worker) and executing (worker). We should probably only increment the counter for tasks that are in executing (worker).
If that were the case, and a worker repeatedly disconnected from the network while a certain task was executing, I think it would be fine to raise a KilledWorker exception.
See #6396
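A minimal sketch of the refined counting described above, using hypothetical names (`update_suspicious`, `ALLOWED_FAILURES`) rather than the actual scheduler internals: the counter is bumped only for tasks the dead worker was actually executing, never for tasks merely queued on it.

```python
ALLOWED_FAILURES = 3  # mirrors the distributed.scheduler.allowed-failures default


def update_suspicious(tasks, worker_states, counters):
    """Increment suspicious counters only for tasks the lost worker
    was executing, not for tasks merely waiting on it.

    tasks: keys that were in state 'processing' on the scheduler
    worker_states: key -> worker-side state ('waiting' or 'executing')
    counters: key -> suspicious count, mutated in place

    Returns the keys that should now fail with KilledWorker.
    """
    killed = []
    for key in tasks:
        if worker_states.get(key) == "executing":
            counters[key] = counters.get(key, 0) + 1
            if counters[key] > ALLOWED_FAILURES:
                killed.append(key)
    return killed
```

With this scheme, a task that happens to be queued on an unlucky worker accrues no suspicion at all; only repeated worker deaths while the same task is running lead to KilledWorker.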

@fjetter fjetter added networking discussion Discussing a topic with no specific actions yet labels Jun 15, 2022