Fix critical race condition in graceful shutdown #8522
```diff
@@ -1272,12 +1272,24 @@
         self._update_latency(end - start)

         if response["status"] == "missing":
-            # Scheduler thought we left. Reconnection is not supported, so just shut down.
-            logger.error(
-                f"Scheduler was unaware of this worker {self.address!r}. Shutting down."
-            )
-            # Something is out of sync; have the nanny restart us if possible.
-            await self.close(nanny=False)
+            # Scheduler thought we left.
+            # Reconnection is not supported, so just shut down.
+
+            if self.status == Status.closing_gracefully:
+                # Called Scheduler.retire_workers(remove=True, close_workers=False)
+                # The worker will remain indefinitely in this state, unknown to the
+                # scheduler, until something else shuts it down.
+                # Stopping the heartbeat is just a nice-to-have to reduce
+                # unnecessary warnings on the scheduler log.
+                logger.info("Stopping heartbeat to the scheduler")
+                self.periodic_callbacks["heartbeat"].stop()
+            else:
+                logger.error(
+                    f"Scheduler was unaware of this worker {self.address!r}. "
+                    "Shutting down."
+                )
+                # Have the nanny restart us if possible
+                await self.close(nanny=False, reason="worker-heartbeat-missing")
             return

         self.scheduler_delay = response["time"] - middle
```

Review comment on lines -1279 to -1280:

> If I read your description correctly, this is the bug. What does it even mean that "something is out of sync"? In which cases would we want this worker to be restarted like this? The heartbeats only start once the worker is registered to the scheduler (see distributed/worker.py, lines 1477 to 1478 at 1211e79). I don't see what kind of "desync" would justify a restart, and adding more complexity to this logic feels like trouble. Your test also passes if we just shut down the nanny as well.

Author reply:

> The test is designed to verify that the worker is not accidentally restarted after it's retired. If something kills off the worker and the nanny, it will not perturb the test. I cannot come up with use cases. I'll remove the branch and see if anything breaks.
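The control flow of the patched "missing" branch can be sketched in isolation. This is a minimal, hypothetical stand-in (names such as `WorkerSketch` and `HeartbeatCallback` are illustrative, not part of distributed): a gracefully-closing worker only stops its heartbeat, while any other status triggers a shutdown.

```python
import asyncio
import logging
from enum import Enum

logger = logging.getLogger("worker-sketch")


class Status(Enum):
    running = "running"
    closing_gracefully = "closing_gracefully"


class HeartbeatCallback:
    """Illustrative stand-in for a periodic heartbeat callback."""

    def __init__(self):
        self.running = True

    def stop(self):
        self.running = False


class WorkerSketch:
    """Minimal stand-in for the relevant bits of Worker."""

    def __init__(self, status):
        self.address = "tcp://127.0.0.1:1234"
        self.status = status
        self.closed = False
        self.close_reason = None
        self.periodic_callbacks = {"heartbeat": HeartbeatCallback()}

    async def close(self, nanny=False, reason=None):
        self.closed = True
        self.close_reason = reason

    async def handle_missing(self):
        # Mirrors the patched branch: a worker that is closing gracefully
        # merely stops heartbeating; an unexpectedly unknown worker shuts down.
        if self.status == Status.closing_gracefully:
            logger.info("Stopping heartbeat to the scheduler")
            self.periodic_callbacks["heartbeat"].stop()
        else:
            logger.error(
                f"Scheduler was unaware of this worker {self.address!r}. Shutting down."
            )
            await self.close(nanny=False, reason="worker-heartbeat-missing")


retiring = WorkerSketch(Status.closing_gracefully)
asyncio.run(retiring.handle_missing())
print(retiring.closed, retiring.periodic_callbacks["heartbeat"].running)  # False False

unknown = WorkerSketch(Status.running)
asyncio.run(unknown.handle_missing())
print(unknown.closed, unknown.close_reason)  # True worker-heartbeat-missing
```

The key design point under review is exactly this asymmetry: the retired worker is left alive and silent rather than restarted.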
```diff
@@ -1290,7 +1302,7 @@
             logger.exception("Failed to communicate with scheduler during heartbeat.")
         except Exception:
             logger.exception("Unexpected exception during heartbeat. Closing worker.")
-            await self.close()
+            await self.close(reason="worker-heartbeat-error")
             raise

     @fail_hard
```

(fjetter marked a conversation on this change as resolved.)
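The pattern in this second hunk, tagging the shutdown with an explicit reason before re-raising, can be isolated as follows. This is a sketch with illustrative names (`MiniWorker`, `heartbeat_once`), not distributed's actual API:

```python
import asyncio


class MiniWorker:
    def __init__(self):
        self.close_reason = None

    async def close(self, reason=None):
        self.close_reason = reason

    async def heartbeat_once(self, fetch):
        try:
            await fetch()
        except Exception:
            # Record a machine-readable close reason, then propagate the
            # error so an outer handler (e.g. @fail_hard) still sees it.
            await self.close(reason="worker-heartbeat-error")
            raise


async def main():
    w = MiniWorker()

    async def boom():
        raise RuntimeError("scheduler unreachable")

    try:
        await w.heartbeat_once(boom)
    except RuntimeError:
        pass
    return w.close_reason


print(asyncio.run(main()))  # worker-heartbeat-error
```

Passing a `reason` string makes post-mortem logs distinguish a heartbeat failure from other shutdown paths without changing the control flow.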
Review comment:

> This will either be missed entirely on GitHub Actions or will make the test very flaky. If you are waiting for some condition to occur, please wait until that condition has actually occurred; just having a plain sleep in here is not sufficient. Besides, this also makes the test logic much harder to understand.

Author reply:

> Yes, getting a false negative on GitHub Actions is a possibility. If GitHub Actions is slower than my local machine, the sleep will simply not exercise the use case properly and return a false negative. Note, however, that I am waiting for a condition NOT to occur. I will rewrite the unit test to check for the heartbeat stop; it will no longer verify that the heartbeat does not call close().
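The reviewer's objection to plain sleeps is usually addressed with a poll-until helper. A standalone sketch (the `poll_for` name is illustrative; distributed has similar test utilities) waits for a predicate with a deadline instead of sleeping a fixed amount:

```python
import asyncio
import time


async def poll_for(predicate, timeout, interval=0.01):
    """Repeatedly evaluate `predicate` until it is true or `timeout` expires."""
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() > deadline:
            raise TimeoutError("condition not met in time")
        await asyncio.sleep(interval)


async def demo():
    state = {"heartbeat_stopped": False}

    async def stop_later():
        await asyncio.sleep(0.05)
        state["heartbeat_stopped"] = True

    asyncio.get_running_loop().create_task(stop_later())
    # Instead of sleeping a fixed amount, wait for the exact condition:
    await poll_for(lambda: state["heartbeat_stopped"], timeout=2)
    return state["heartbeat_stopped"]


print(asyncio.run(demo()))  # True
```

A "condition does NOT occur" assertion cannot be polled for directly, only bounded by a timeout, which is why rewriting the test to assert a positive condition (the heartbeat actually stopped) is the more robust choice.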