Fix critical race condition in graceful shutdown #8522
Conversation
Unit Test Results. See the test report for an extended history of previous test failures; this is useful for diagnosing flaky tests. 27 files ±0, 27 suites ±0, 9h 54m 53s ⏱️ +1m 1s. For more details on these failures, see this check. Results for commit d967330. ± Comparison against base commit 1211e79. This pull request removes 1 test and adds 1 test (renamed tests count towards both).
♻️ This comment has been updated with latest results.
Force-pushed from ca62f3a to 67bb4bf.
I think I found the code that calls retire_workers: distributed/deploy/spec.py, line 353 (at 1211e79).
The above uses the default parameters close_workers=False, remove=True.
I think it's not great that there are two different places in distributed.deploy that call retire_workers: see distributed/deploy/adaptive.py, lines 195-199 (at 1211e79).
As explained in my previous post, however, without the certainty that everyone using a SpecCluster is also using nannies, I'm not comfortable changing spec.py.
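For readers following along, here is a minimal sketch of the call shape being discussed; the close_workers and remove keywords come from the comment above, while the surrounding names (scheduler_comm, workers_to_remove) are made up for illustration:

```python
# Illustrative only; not the actual spec.py or adaptive.py code.
# Both call sites end up issuing a retirement request of this shape,
# relying on the defaults close_workers=False, remove=True.
await scheduler_comm.retire_workers(
    workers=workers_to_remove,  # hypothetical list of worker addresses
    close_workers=False,        # do not ask the worker processes to shut down
    remove=True,                # forget the workers on the scheduler side
)
```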
Force-pushed from 2f69e25 to 3d53fec.
Force-pushed from 3d53fec to 0e329dd.
# Something is out of sync; have the nanny restart us if possible.
await self.close(nanny=False)
If I read your description correctly, this is the bug. What does it even mean that "something is out of sync"? In which cases would we want this worker to be restarted like this?
The heartbeats only start once the worker is registered to the scheduler, see distributed/worker.py, lines 1477-1478 (at 1211e79):
await self._register_with_scheduler()
self.start_periodic_callbacks()
I don't see what kind of "desync" would justify a restart, and adding more complexity to this logic feels like trouble. Your test also passes if we just shut down the nanny as well.
The test is designed to verify that the worker is not accidentally restarted after it's retired. If something kills off the worker and the nanny, it will not perturb the test.
> In which cases would we want this worker to be restarted like this?
I cannot come up with use cases. I'll remove the branch and see if anything breaks.
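For context, here is a simplified sketch of the worker-side branch being debated; the method and response-key names are approximations of the real worker.py code, not verbatim:

```python
# Simplified sketch, not the actual distributed/worker.py implementation.
async def heartbeat(self):
    response = await self.scheduler.heartbeat_worker(
        address=self.contact_address,
        metrics=await self.get_metrics(),
    )
    if response["status"] == "missing":
        # The branch under discussion: the scheduler no longer knows this
        # worker, so the worker shuts down its own process while leaving the
        # nanny alive and free to start a replacement.
        await self.close(nanny=False)
```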
distributed/tests/test_nanny.py (outdated)
# On Linux, it takes ~3.5s for the nanny to resuscitate a worker
await asyncio.sleep(5)
This will either be missed entirely on GitHub Actions or will make the test very flaky. If you are waiting for some condition to occur, please wait until that condition has actually occurred; just having a plain sleep in here is not sufficient. Besides, it also makes the test logic much harder to understand.
> This will either be missed entirely on GitHub Actions
Yes, getting a false negative on GitHub Actions is a possibility.
> or will make the test very flaky.
No, because if GitHub Actions is slower than my local machine, it will simply not test the use case properly and return a false negative.
> If you are waiting for some condition to occur,
No, I am waiting for a condition NOT to occur.
I will rewrite the unit test to check that the heartbeat stops. It will no longer verify that the heartbeat does not call close().
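In the spirit of the suggestion above, here is a generic standalone helper for waiting on an observable condition instead of sleeping for a fixed time; this is a sketch, not one of distributed's own test utilities:

```python
import asyncio
import time


async def poll_until(predicate, timeout=5.0, interval=0.05):
    """Re-evaluate ``predicate`` until it returns True or ``timeout`` expires."""
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() > deadline:
            raise TimeoutError("condition was not met within the timeout")
        await asyncio.sleep(interval)
```

Asserting that something does not happen still needs a bounded observation window, but checking a positive, observable signal (such as the heartbeat callback being stopped) is far less sensitive to machine speed than waiting for a restart that may or may not occur in time.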
I would be very surprised if this was intentional.
I don't get what this has to do with nannies. This entire system should work without reliance on nannies. The only purpose of a nanny is to shut down a worker if it breaches the configured memory limit and restart it. Beyond that, the nanny should just do what it's been told, i.e. if the scheduler asks the worker to shut down, the nanny must oblige.
Not quite. The nanny has three purposes:
1. start and stop the worker process when the scheduler or the cluster manager asks for it;
2. restart the worker process if it unexpectedly dies;
3. kill and restart the worker process if it exceeds its memory limit.
Only the first task is performed upon the scheduler's command; the other two happen autonomously.
If you change spec.py to call
The design is certainly overcomplicated. There are some benefits to it; namely, a cluster manager could RPC into the worker and do some stuff. I would love to simplify it to a hardcoded
@fjetter All review comments have been addressed.
distributed/worker.py (outdated)
# This happens after one calls Scheduler.remove_worker(close=False).
# ***DO NOT call close()!***
# Read: https://github.com/dask/distributed/pull/8522
There is no longer a unit test that makes sure that future developers don't re-add self.close(nanny=False) and cause the nanny to restart the worker.
Let us hope that whoever touches this code next takes the time to read and understand the race condition that this PR resolves.
Sorry, but I'm having a hard time following. The above description is very long and somewhat off-topic. Why are we not closing a worker and its nanny if the scheduler says it's missing/unknown? The bug I'm seeing is the
for the follow-up,
Because if the worker does not use a nanny, then we do not know what is there instead of a nanny.
Force-pushed from fb6d2c4 to 1b4e6d2.
I can now see three possible outcomes of this PR:
This reverts commit 1b4e6d2.
I did some research into the git history to explain why that close statement was/is there. It was intentionally added in #6505 to fix #6494; #6494 (comment) mentions this explicitly and suggests not closing the nanny in this situation.
Since then, a lot has changed about how we deal internally with the status of our servers, and I'm not sure whether this condition will come back, since no proper test was added for it.
Client.restart seems to call Nanny.kill, not remove_worker, so it should not be an issue (see distributed/scheduler.py, lines 6211-6248 at 1211e79).
@fjetter PR has been changed to call close(nanny=True) and is ready for review again.
Executive summary
Coiled, as well as other dask clusters that can dynamically shrink down while there's data on the cluster, can accidentally lose vital data when shrinking. This will cause large parts of a computation to restart from scratch; scattered data will be lost forever, causing any computation that relies on it to fail.
Impact
Any cluster that calls SpecCluster.scale to shrink down while there is data on the workers; e.g. adaptive clusters.
Use case
Coiled is a cloud cluster which, to my understanding, handles scaling down as follows:
1. Scheduler.retire_workers(addresses, close_workers=False, remove=True) is called.
2. The scheduler moves the workers to status=closing_gracefully and copies over to other workers all unique data they contain. The recipient workers must have status=running.
...
6. Scheduler.retire_workers calls Scheduler.remove_worker(close=False).
7. The cloud manager reacts to the workers being removed from the cluster and shuts down their instances.
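Step 6 above corresponds roughly to the following scheduler-side call; only the close=False keyword is taken from the description, the other argument names are illustrative:

```python
# Sketch of step 6: the scheduler forgets the worker without asking the
# worker process to shut down. Argument names other than close are
# illustrative.
await scheduler.remove_worker(
    address=worker_address,
    close=False,
    stimulus_id="retire-workers",
)
```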
This PR fixes a critical race condition that happens if there's a few seconds delay between steps 6 and 7:
6.1 The worker, which is still in status=closing_gracefully, heartbeats the scheduler.
6.2 The scheduler responds {status: missing}: "I don't know you and I have no memory of you".
6.3 The worker interprets this as an irrecoverable desync (e.g. the scheduler process may have been restarted), so it shuts itself down.
6.4 The nanny notices the worker process died and restarts it.
6.5 A new worker from the same host connects to the scheduler, with status=running.
6.6 This brand new worker is noticed and immediately used by the Active Memory Manager (AMM), which in the meantime is evacuating data from other workers on the cluster (AMM reconsiders the list of recipients every 2 seconds).
6.7 The AMM instructs the new worker process to gather some data from other workers being retired.
6.8 The new worker obliges and acquires a replica of the data.
6.9 The other workers are removed from the cluster. The born-again worker is now the sole holder of important task outputs or, even worse, scattered data.
...and now we go back to step 7 from the bullet points above. The cloud manager finally reacts to the old worker being removed from the cluster, and implements the shutdown of the EC2 instance where the new worker is holding precious data.
All computed tasks on it are lost and need to be recomputed.
All scattered data on it is lost and cannot be recovered without the client sending it again explicitly.
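The 2-second cadence mentioned in step 6.6 is the Active Memory Manager's polling interval; it is exposed through the dask config. A sketch, assuming the standard config key (double-check against your version's defaults):

```python
import dask

# The AMM re-evaluates its policies, including the list of eligible
# recipient workers, on this interval; "2s" is the default referred to above.
dask.config.set({"distributed.scheduler.active-memory-manager.interval": "2s"})
```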
Aside
Note that the above process is not the only way to retire a worker.
One could alternatively call Scheduler.retire_workers(addresses, close_workers=True, remove=True). This would safely shut down the worker process and send the nanny to status=close_gracefully, which will prevent it from restarting the worker process.
I'm not too happy that there are multiple ways to shut down a worker; I guess this has to do with the possibility that one may not use a nanny at all and use a different, bespoke watchdog instead.
Reproducer
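The original reproducer script is not reproduced in this excerpt; the sketch below is a plausible reconstruction based on the description that follows. The Coiled cluster, the ~30 s task duration, the 2000-task count, and the adaptive bounds come from the text; everything else (imports, call pattern) is an assumption.

```python
# Plausible reconstruction of the reproducer, not the original script.
import time

import coiled
from dask.distributed import Client, wait


def slow_noop(i: int) -> None:
    # Each task burns ~30 seconds and returns None, so the cluster ends up
    # holding 2000 tiny task outputs in memory.
    time.sleep(30)


if __name__ == "__main__":
    cluster = coiled.Cluster(n_workers=1)
    cluster.adapt(minimum=1, maximum=500)
    client = Client(cluster)

    futures = client.map(slow_noop, range(2000))
    wait(futures)  # all 2000 results are now in cluster memory

    # Keep the futures alive so their results stay on the cluster while the
    # adaptive cluster shrinks back towards 1 worker. With the race condition,
    # results are lost mid-shrink, tasks rerun, and the cluster scales back up
    # instead of settling, so this loop never exits.
    while len(cluster.scheduler_info["workers"]) > 1:
        time.sleep(10)
```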
The above code starts 344 workers and, once there are 2000 tasks in memory, shrinks down to 1 worker, moving all the task outputs (None's) to the only survivor.
It never finishes because of the above race condition. Halfway through the cluster shrink down, tasks in memory are lost and are shunted back to queued state, which causes the cluster to enlarge again.
The issue disappears when the peak number of workers gets lower; this happens if you send either fewer than 2000 tasks or spend less than 30s per task. I could only reproduce it with cluster.adapt(minimum=1, maximum=500); for some reason I failed to reproduce it when manually scaling down with cluster.scale(1).
@ntabris
@jrbourbeau best effort for 2024.2.1