Test retire workers deadlock #6240
Conversation
```diff
@@ -9,13 +9,16 @@
 import pytest

-from distributed import Nanny, wait
+from distributed import Event, Nanny, wait
```
Suggested change:

```diff
-from distributed import Event, Nanny, wait
+from distributed import Event, Scheduler, Nanny, Worker, wait
```
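Presumably the extra `Scheduler`/`Worker` imports are for type-annotating the test signature. A minimal sketch of the usual pattern in `distributed` tests (the test name, body, and cluster layout here are illustrative, not taken from this PR):

```python
from distributed import Scheduler, Worker
from distributed.utils_test import gen_cluster


# Hypothetical example: gen_cluster(client=True) injects a client, the
# scheduler, and one object per worker, which the imports above let us annotate.
@gen_cluster(client=True, nthreads=[("", 1), ("", 1)])
async def test_example(c, s: Scheduler, a: Worker, b: Worker):
    fut = c.submit(sum, [1, 2, 3])
    assert await fut == 6
```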
```python
await asyncio.sleep(0.01)

# `_track_retire_worker` _should_ now be sleeping for 0.5s, because there were >=200 keys on A.
# In this test, everything from here on needs to happen within 0.5s.
```
Not quite true - everything from the very beginning of the replication needs to happen within 0.5s, which I suspect may cause flakiness on our CI.
```python
assert isinstance(policy, RetireWorker)

# This will drop all the `xs` from A (since they're already replicated on B).
amm.run_once()
```
Unnecessary. Dropping is not a precondition for done; it's just to reduce memory pressure in case of spilled keys.
Calling `run_once` is necessary, because what we want to test is:

- `_track_retire_worker` is sleeping
- the policy runs and removes itself because all keys have been replicated
- another key appears on the retiring worker
- `_track_retire_worker` wakes up

By running it once here, it will remove itself.
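A rough sketch of that sequence, with assumed fixture names and helper imports; this is illustrative only, not the verbatim test from this PR:

```python
import asyncio

from distributed import wait
from distributed.active_memory_manager import ActiveMemoryManagerExtension
from distributed.utils_test import gen_cluster, inc


# Illustrative outline of the sequence described above (not the actual test).
@gen_cluster(client=True)
async def test_retire_sequence_sketch(c, s, a, b):
    amm: ActiveMemoryManagerExtension = s.extensions["amm"]

    retire_task = asyncio.create_task(s.retire_workers(workers=[a.address]))
    while not amm.policies:  # wait for the RetireWorker policy to register
        await asyncio.sleep(0.01)
    # _track_retire_worker is now sleeping because A still holds keys.

    amm.run_once()  # keys replicated/dropped -> the policy removes itself

    # A new key lands on the retiring worker while _track_retire_worker sleeps.
    fut = c.submit(inc, 1, workers=[a.address])
    await wait(fut)

    # When _track_retire_worker wakes up, retirement must still complete
    # (the policy's done() is True once it removed itself) instead of hanging.
    await retire_task
```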
```python
# This will drop all the `xs` from A (since they're already replicated on B).
amm.run_once()

# The policy has removed itself, because there's no more data in need of replication.
```
This is not what happens if you don't manually force `run_once`, though.
Co-authored-by: crusaderky <[email protected]>
@crusaderky I've incorporated as much of your feedback as I could, but ended up having to walk back to more or less having the test look like I originally wrote it. That's the only way I can get it to actually test the condition from #6234.

I'm pretty confident this test won't become flaky:

- If you set `poll_interval` to 0 or a small value in `_track_retire_worker`, the test reliably skips itself ("Timing didn't work out"). This should make us confident it won't become flaky in CI; at worst, it just won't run.
- With `--count=1000 -n10` on my machine it's passed 100% of the time (no skips).
- If you remove the critical change (`RetireWorker` policy is done if removed #6234) from `RetireWorkerPolicy.done()`, it always fails.

The main downside is that it's extremely reliant on some very specific behavior, so if the `retire_workers` mechanism or policy is refactored significantly, it might become meaningless. In that case though, I would still expect it to consistently fail, or skip itself.
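For context, the skip-guard being referred to presumably follows the usual elapsed-time pattern, roughly like this (the exact condition and threshold checked in the real test may differ):

```python
import time

import pytest

start = time.monotonic()
# ... perform everything that must fit inside the 0.5 s window that
#     _track_retire_worker spends sleeping ...
if time.monotonic() - start >= 0.5:
    # The setup took too long, so the scenario we wanted to exercise never
    # happened; bail out instead of asserting on an unrelated outcome.
    pytest.skip("Timing didn't work out")
```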
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0, 15 suites ±0, 6h 36m 10s ⏱️ -23s

For more details on these failures, see this check.

Results for commit b09b1da. ± Comparison against base commit 344868a.
The test has passed (not skipped) on every CI run here. Flaky tests are:

Ready for final review.
Looks good to me. I'll leave some time for @crusaderky since he's obviously the expert, but will merge tomorrow morning (my time) in case there is no more feedback.
Tests for #6234. The test is very timing-sensitive, so I think I've set it up to `pytest.skip` itself if the timing doesn't work out to test the condition we want.

However, that won't work until #6239 is fixed. So I think we should probably hold off on merging, lest we introduce another flaky test.
`pre-commit run --all-files`