-
-
Notifications
You must be signed in to change notification settings - Fork 719
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alternative handling of killed/unexpected worker in transition_processing_erred
#6939
Conversation
Unit Test ResultsSee test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 15 files ±0 15 suites ±0 6h 34m 21s ⏱️ + 7m 20s For more details on these failures, see this check. Results for commit 3ca7e35. ± Comparison against base commit c15a10e. |
@@ -7323,7 +7342,7 @@ def request_remove_replicas( | |||
) | |||
|
|||
|
|||
def _remove_from_processing(state: SchedulerState, ts: TaskState) -> str: | |||
def _remove_from_processing(state: SchedulerState, ts: TaskState) -> str | None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After putting some more thought into it, I think that _remove_from_processing
should always return None
. Its job is to
Remove ts from the set of processing tasks.
At either return statement, it has successfully completed its task.
In your PR, transition_processing_released
is the only place where we care about the undocumented feature that it makes a distinction for us between returning None
if it encounters a stale worker state and ws.address
if it encounters a current worker state. IMO, this muddles up mutation and querying logic and introduces additional complexity that would at least need to be documented. Why not let transition_processing_released
worry about it?
From #6392 we also know that returning an address is most likely not a good idea since it's not a unique identifier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be fine with that. The one argument for it not always returning None is that it's kinda like a pop—it does ts.processing_on = None
, so after it's run, callers can't find out what worker the task was processing on.
It could be reasonable to say that callers need to pull out ts.processing_on
first, but then they also need to check if that WorkerState is stale, which is more places to get things wrong. Since _remove_from_processing
has to do the stale check anyway, I kinda like having it return the result of that check (in that it returns None if the worker is stale). Returning a WorkerState instead of an address in the non-stale case might be more practical though.
|
||
ts.erred_on.add(w) | ||
# NOTE: `worker` is None if it died, i.e. `KilledWorker` | ||
ts.erred_on.add(worker or maybe_stale_ws.address) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having two possible sources for the worker address we use here still feels like a consistency issue. Why not just use maybe_stale_ws.address
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because at the moment, it's possible that the task ran (and failed) on a worker we didn't expect it to. Since this PR was making the erred
case consistent with memory
, it's possible worker != maybe_stale_ws.address
, in which case worker
is the correct address since that's where the task actually ran and failed.
Closed by #6946 |
Alternative approach to #6884, just for discussion—easier than trying to write out what I was thinking.
_remove_from_processing
returns None again if the WorkerState is stale Always returnws.address
from_remove_from_processing
#6884 (comment)ws.address
from_remove_from_processing
#6884 (comment)I feel like we also need need a test triggering whatever caused #6874 to happen?
cc @hendrikmakait @fjetter
Closes #6874
pre-commit run --all-files