Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternative handling of killed/unexpected worker in transition_processing_erred #6939

Closed

Conversation

gjoseph92
Copy link
Collaborator

Alternative approach to #6884, just for discussion—easier than trying to write out what I was thinking.

I feel like we also need need a test triggering whatever caused #6874 to happen?

cc @hendrikmakait @fjetter

Closes #6874

  • Tests added / passed
  • Passes pre-commit run --all-files

@github-actions
Copy link
Contributor

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files  ±0         15 suites  ±0   6h 34m 21s ⏱️ + 7m 20s
  3 041 tests ±0    2 954 ✔️ +2    83 💤  - 1  4  - 1 
22 493 runs  ±0  21 517 ✔️ +3  972 💤  - 2  4  - 1 

For more details on these failures, see this check.

Results for commit 3ca7e35. ± Comparison against base commit c15a10e.

@@ -7323,7 +7342,7 @@ def request_remove_replicas(
)


def _remove_from_processing(state: SchedulerState, ts: TaskState) -> str:
def _remove_from_processing(state: SchedulerState, ts: TaskState) -> str | None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After putting some more thought into it, I think that _remove_from_processing should always return None. Its job is to

Remove ts from the set of processing tasks.

At either return statement, it has successfully completed its task.
In your PR, transition_processing_released is the only place where we care about the undocumented feature that it makes a distinction for us between returning None if it encounters a stale worker state and ws.address if it encounters a current worker state. IMO, this muddles up mutation and querying logic and introduces additional complexity that would at least need to be documented. Why not let transition_processing_released worry about it?

From #6392 we also know that returning an address is most likely not a good idea since it's not a unique identifier.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be fine with that. The one argument for it not always returning None is that it's kinda like a pop—it does ts.processing_on = None, so after it's run, callers can't find out what worker the task was processing on.

It could be reasonable to say that callers need to pull out ts.processing_on first, but then they also need to check if that WorkerState is stale, which is more places to get things wrong. Since _remove_from_processing has to do the stale check anyway, I kinda like having it return the result of that check (in that it returns None if the worker is stale). Returning a WorkerState instead of an address in the non-stale case might be more practical though.


ts.erred_on.add(w)
# NOTE: `worker` is None if it died, i.e. `KilledWorker`
ts.erred_on.add(worker or maybe_stale_ws.address)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having two possible sources for the worker address we use here still feels like a consistency issue. Why not just use maybe_stale_ws.address?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because at the moment, it's possible that the task ran (and failed) on a worker we didn't expect it to. Since this PR was making the erred case consistent with memory, it's possible worker != maybe_stale_ws.address, in which case worker is the correct address since that's where the task actually ran and failed.

@gjoseph92
Copy link
Collaborator Author

Closed by #6946

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TypeError in ExceptionsTable.update
2 participants