-
-
Notifications
You must be signed in to change notification settings - Fork 718
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Validate and debug state machine on handle_compute_task #6327
Conversation
crusaderky
commented
May 11, 2022
•
edited
Loading
edited
- Partially closes test_stress_scatter_death #6305
0a088bf
to
91886ad
Compare
Unit Test Results 15 files + 3 15 suites +3 6h 58m 2s ⏱️ + 1h 9m 0s For more details on these failures, see this check. Results for commit 79402f2. ± Comparison against base commit 4b81f06. ♻️ This comment has been updated with latest results. |
ce5476c
to
4de1517
Compare
distributed/worker.py
Outdated
@@ -652,7 +652,7 @@ def __init__( | |||
("ready", "released"): self.transition_generic_released, | |||
("released", "error"): self.transition_generic_error, | |||
("released", "fetch"): self.transition_released_fetch, | |||
("released", "missing"): self.transition_released_fetch, | |||
("released", "missing"): self.transition_generic_missing, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for dep_key, value in nbytes.items(): | ||
self.tasks[dep_key].nbytes = value | ||
|
||
self.update_who_has(who_has) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This move prevents deps to be created in resumed state by ensure_task_exists and then remain there because there's nothing actually needing them.
|
||
if ts.state in READY | {"executing", "waiting", "resumed"}: | ||
if ts.state in READY | {"executing", "long-running", "waiting", "resumed"}: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I omitted a unit test for this - something to write after the state machine refactor for sure
@@ -3417,7 +3434,6 @@ async def find_missing(self) -> None: | |||
self.scheduler.who_has, | |||
keys=[ts.key for ts in self._missing_dep_flight], | |||
) | |||
who_has = {k: v for k, v in who_has.items() if v} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Redundant - update_who_has already throws away empty lists of workers
@@ -121,58 +120,43 @@ async def create_and_destroy_worker(delay): | |||
assert await c.compute(z) == 8000884.93 | |||
|
|||
|
|||
@gen_cluster(nthreads=[("127.0.0.1", 1)] * 10, client=True, timeout=60) | |||
@gen_cluster(nthreads=[("", 1)] * 10, client=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is functionally identical to before - all changes are just cosmetic.
Self-review Self-review self-review
f0425fa
to
9408cab
Compare
The CI failure is on these new lines at the end of for dep_ts in ts.dependencies:
assert dep_ts.state != "released", self.story(dep_ts) I can reproduce it in 0.4% (4 out of 1000) of runs on my desktop. I'll investigate over the next few days. |
The one-liner fix for the infinite transition has been merged in #6331. What is left in this PR is a wealth of hardening, which removes some cases for corrupted state and makes other crop up sooner. |
# ensure_tasks_exists() have been transitioned to fetch or flight | ||
assert all( | ||
ts2.state != "released" for ts2 in (ts, *ts.dependencies) | ||
), self.story(ts, *ts.dependencies) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the moment of writing, this assertion fails in test_stress_scatter_death 0.4% of the times on a fast desktop.
Explanation in #6305. Resolution out of scope for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI this assert is not 100% correct. There is a case for valid tasks left in released in the case of cancelled/resumed tasks. I'll open a follow up PR with a case reproducing this condition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flowchart TD
A1[A1 - forgotten / not known] --> B1[B1 - flight]
A2[A1 - forgotten / not known] --> B1[B1 - flight]
B1 --> C1[C1 - waiting]
free-keys / cancel B1
flowchart TD
A1[A1 - forgotten / not known] --> B1[B1 - cancelled]
A2[A1 - forgotten / not known] --> B1[B1 - cancelled]
B1 --> C1[C1 - forgotten]
compute-task B1
flowchart TD
A1[A1 - released] --> B1[B1 - resumed]
A2[A1 - released] --> B1[B1 - resumed]
B1 --> C1[C1 - forgotten]
gather-dep finishes w/ Error
flowchart TD
A1[A1 - fetch] --> B1[B1 - waiting]
A2[A1 - fetch] --> B1[B1 - waiting]
B1 --> C1[C1 - forgotten]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I understand from the above diagram how this can end up with a released state by the end of compute-task?
Nothing here concerns me. The biggest change is the transition change, and that was from Guido anyway. The failing test is concerning, but it seems like it's just shining a light on something that was broken before. Merging. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was merged midway through my reviewing, but it also looks good to me. The failing test is just the 0.4% of cases where test_stress_scatter_death
is known to still fail, which is a regression that still needs to be addressed #6305.
Ah, my apologies for jumping ahead. |