Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix scheduler transition error on memory->erred #8549

Merged
merged 26 commits into from
Mar 8, 2024

Conversation

hendrikmakait
Copy link
Member

Closes #8548

  • Tests added / passed
  • Passes pre-commit run --all-files

@@ -2505,14 +2512,13 @@ def _transition_released_erred(self, key: Key, stimulus_id: str) -> RecsMsgs:
assert ts.exception_blame
assert not ts.who_has
assert not ts.waiting_on
assert not ts.waiters
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion does not work in two-step transitions.

Copy link
Contributor

github-actions bot commented Mar 1, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    27 files  +    1      27 suites  +1   10h 8m 49s ⏱️ + 47m 57s
 4 050 tests +    2   3 935 ✅ +    2    110 💤 ± 0   5 ❌ ±0 
50 854 runs  +2 615  48 488 ✅ +2 531  2 335 💤 +84  31 ❌ ±0 

For more details on these failures, see this check.

Results for commit 449f447. ± Comparison against base commit e16a7af.

♻️ This comment has been updated with latest results.

@hendrikmakait hendrikmakait marked this pull request as ready for review March 5, 2024 09:20
@hendrikmakait hendrikmakait requested a review from fjetter as a code owner March 5, 2024 09:20
@hendrikmakait hendrikmakait requested review from crusaderky and removed request for fjetter March 5, 2024 09:21
@@ -1964,6 +1964,7 @@ def _transition(
)

v = a_recs.get(key, finish)
# The inner rec has higher priority? Is that always desired?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get this comment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a general comment about the two-step transitions. The recommendations created by the first step are executed before the second step, which may create weird state (as it did in this case).

distributed/scheduler.py Outdated Show resolved Hide resolved
distributed/scheduler.py Outdated Show resolved Hide resolved
@@ -2547,6 +2553,9 @@ def _transition_erred_released(self, key: Key, stimulus_id: str) -> RecsMsgs:

for dts in ts.dependents:
if dts.state == "erred":
# Does this make sense?
# This goes via released
# dts -> released -> waiting
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree this makes no sense to me either. Is there a unit test anywhere to shed light on it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't investigated this any further.

@@ -2621,8 +2630,8 @@ def _transition_processing_erred(
self,
key: Key,
stimulus_id: str,
worker: str | None = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can a task be processing without a worker? Is it when the worker it was processing on died and it caused the task to increase its suspicious count too much? If so, it may be a good idea to note it in a comment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears that this has been superseded by other changes in this PR. I'll remove it and see if CI complains.

distributed/tests/test_scheduler.py Outdated Show resolved Hide resolved
distributed/tests/test_scheduler.py Outdated Show resolved Hide resolved
distributed/tests/test_scheduler.py Outdated Show resolved Hide resolved
distributed/tests/test_scheduler.py Outdated Show resolved Hide resolved
distributed/tests/test_scheduler.py Outdated Show resolved Hide resolved
@hendrikmakait
Copy link
Member Author

@crusaderky: All comments have been addressed.

@hendrikmakait hendrikmakait requested a review from crusaderky March 5, 2024 17:39
Copy link
Collaborator

@crusaderky crusaderky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nits only

Comment on lines 2687 to 2688
if worker:
ts.erred_on.add(worker)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the type in the function declaration should change to worker: str | None?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, I didn't clean this up properly.

distributed/tests/test_scheduler.py Outdated Show resolved Hide resolved
distributed/tests/test_scheduler.py Outdated Show resolved Hide resolved
@crusaderky
Copy link
Collaborator

LGTM, however I have no idea if there are any regressions.
Do you want to merge it now or would you rather wait for #8560?

@hendrikmakait hendrikmakait mentioned this pull request Mar 8, 2024
4 tasks
@hendrikmakait hendrikmakait merged commit 91350ab into dask:main Mar 8, 2024
13 of 34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Scheduler transition error when tasks (transitively) transitions memory -> erred
2 participants