Fix scheduler transition error on memory->erred
#8549
@@ -2505,14 +2512,13 @@ def _transition_released_erred(self, key: Key, stimulus_id: str) -> RecsMsgs:
assert ts.exception_blame
assert not ts.who_has
assert not ts.waiting_on
assert not ts.waiters
This assertion does not work in two-step transitions.
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
27 files +1, 27 suites +1, 10h 8m 49s ⏱️ +47m 57s
For more details on these failures, see this check.
Results for commit 449f447. ± Comparison against base commit e16a7af.
@@ -1964,6 +1964,7 @@ def _transition(
)

v = a_recs.get(key, finish)
# The inner rec has higher priority? Is that always desired?
I don't get this comment?
This is a general comment about the two-step transitions. The recommendations created by the first step are executed before the second step, which may create weird state (as it did in this case).
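To make the concern concrete, here is a minimal toy sketch of a two-step transition. It is not the scheduler's actual code: the table, the handler functions, and `two_step_transition` are all hypothetical names invented for illustration. The only line taken from the diff is the `a_recs.get(key, finish)` lookup, which lets a recommendation produced by the first step override the state that was originally requested.

```python
from typing import Callable

Recs = dict[str, str]  # key -> recommended next state


def memory_to_released(key: str) -> Recs:
    # Hypothetical first step: releasing the task also recommends that
    # the same key be moved to "waiting" (e.g. to recompute it).
    return {key: "waiting"}


def released_to_waiting(key: str) -> Recs:
    return {}


def released_to_erred(key: str) -> Recs:
    return {}


# Hypothetical transition table; there is no direct memory -> erred entry,
# so that transition has to go via "released".
TABLE: dict[tuple[str, str], Callable[[str], Recs]] = {
    ("memory", "released"): memory_to_released,
    ("released", "waiting"): released_to_waiting,
    ("released", "erred"): released_to_erred,
}


def two_step_transition(key: str, start: str, finish: str) -> tuple[str, Recs]:
    # Step 1: start -> released, collecting its recommendations.
    a_recs = TABLE[start, "released"](key)
    # The inner recommendation for this key, if any, wins over `finish`.
    v = a_recs.get(key, finish)
    # Step 2: released -> v (which may differ from the requested finish).
    b_recs = TABLE["released", v](key)
    return v, {**a_recs, **b_recs}


final_state, recs = two_step_transition("x", "memory", "erred")
print(final_state)  # waiting
```

In this toy setup the caller asked for `erred` but the task ends up in `waiting`, because the first step's recommendation took priority. Whether that priority is always desired is exactly the open question in the code comment.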
@@ -2547,6 +2553,9 @@ def _transition_erred_released(self, key: Key, stimulus_id: str) -> RecsMsgs:

for dts in ts.dependents:
    if dts.state == "erred":
        # Does this make sense?
        # This goes via released
        # dts -> released -> waiting
Agree this makes no sense to me either. Is there a unit test anywhere to shed light on it?
I haven't investigated this any further.
distributed/scheduler.py
Outdated
@@ -2621,8 +2630,8 @@ def _transition_processing_erred(
    self,
    key: Key,
    stimulus_id: str,
    worker: str | None = None,
How can a task be processing without a worker? Is it when the worker it was processing on died and it caused the task to increase its suspicious count too much? If so, it may be a good idea to note it in a comment?
It appears that this has been superseded by other changes in this PR. I'll remove it and see if CI complains.
Co-authored-by: crusaderky <[email protected]>
@crusaderky: All comments have been addressed.
Minor nits only
distributed/scheduler.py
Outdated
if worker:
    ts.erred_on.add(worker)
I think the type in the function declaration should change to worker: str | None?
My bad, I didn't clean this up properly.
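For readers following along, the mismatch discussed here can be sketched as follows. `record_erred_on` is a hypothetical standalone helper, not the scheduler method. It only illustrates that an `if worker:` guard implies the parameter may be `None`, so the annotation and the guard should agree: either keep both, or drop both.

```python
from __future__ import annotations


def record_erred_on(erred_on: set[str], worker: str | None = None) -> set[str]:
    # Hypothetical helper: the guard below is only meaningful because the
    # annotation admits None. With a plain `worker: str`, it would be dead code.
    if worker:
        erred_on.add(worker)
    return erred_on


print(record_erred_on(set(), "worker-1"))  # {'worker-1'}
print(record_erred_on(set()))              # set()
```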
Co-authored-by: crusaderky <[email protected]>
LGTM, however I have no idea if there are any regressions.
Closes #8548
pre-commit run --all-files