Alternative handling of killed/unexpected worker in `transition_processing_erred` #6939

gjoseph92 · 2022-08-23T20:29:15Z

Alternative approach to #6884, just for discussion—easier than trying to write out what I was thinking.

_remove_from_processing returns None again if the WorkerState is stale Always return ws.address from _remove_from_processing #6884 (comment)
Allows a worker unexpectedly error a task, just like they're currently allowed to unexpectedly complete a task Always return ws.address from _remove_from_processing #6884 (comment)

I feel like we also need need a test triggering whatever caused #6874 to happen?

cc @hendrikmakait @fjetter

Closes #6874

Tests added / passed
Passes pre-commit run --all-files

…ssing_erred`

github-actions · 2022-08-23T21:16:37Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      15 files ±0       15 suites ±0 6h 34m 21s ⏱️ + 7m 20s
  3 041 tests ±0   2 954 ✔️ +2   83 💤 - 1 4 ❌ - 1
22 493 runs ±0 21 517 ✔️ +3 972 💤 - 2 4 ❌ - 1

For more details on these failures, see this check.

Results for commit 3ca7e35. ± Comparison against base commit c15a10e.

hendrikmakait · 2022-08-24T06:24:23Z

distributed/scheduler.py

@@ -7323,7 +7342,7 @@ def request_remove_replicas(
        )


-def _remove_from_processing(state: SchedulerState, ts: TaskState) -> str:
+def _remove_from_processing(state: SchedulerState, ts: TaskState) -> str | None:


After putting some more thought into it, I think that _remove_from_processing should always return None. Its job is to

Remove ts from the set of processing tasks.

At either return statement, it has successfully completed its task.
In your PR, transition_processing_released is the only place where we care about the undocumented feature that it makes a distinction for us between returning None if it encounters a stale worker state and ws.address if it encounters a current worker state. IMO, this muddles up mutation and querying logic and introduces additional complexity that would at least need to be documented. Why not let transition_processing_released worry about it?

From #6392 we also know that returning an address is most likely not a good idea since it's not a unique identifier.

I'd be fine with that. The one argument for it not always returning None is that it's kinda like a pop—it does ts.processing_on = None, so after it's run, callers can't find out what worker the task was processing on.

It could be reasonable to say that callers need to pull out ts.processing_on first, but then they also need to check if that WorkerState is stale, which is more places to get things wrong. Since _remove_from_processing has to do the stale check anyway, I kinda like having it return the result of that check (in that it returns None if the worker is stale). Returning a WorkerState instead of an address in the non-stale case might be more practical though.

hendrikmakait · 2022-08-24T06:26:47Z

distributed/scheduler.py


-            ts.erred_on.add(w)
+            # NOTE: `worker` is None if it died, i.e. `KilledWorker`
+            ts.erred_on.add(worker or maybe_stale_ws.address)


Having two possible sources for the worker address we use here still feels like a consistency issue. Why not just use maybe_stale_ws.address?

Because at the moment, it's possible that the task ran (and failed) on a worker we didn't expect it to. Since this PR was making the erred case consistent with memory, it's possible worker != maybe_stale_ws.address, in which case worker is the correct address since that's where the task actually ran and failed.

gjoseph92 · 2022-08-25T00:53:35Z

Closed by #6946

Alternative handling of killed/unexpected worker in `transition_proce…

3ca7e35

…ssing_erred`

hendrikmakait reviewed Aug 24, 2022

View reviewed changes

This was referenced Aug 24, 2022

stimulus_task_erred is no-op when task-erred is not coming from the worker the task is actually processing on #6941

Closed

Propagate worker address and improve _remove_from_processing behavior #6946

Merged

gjoseph92 closed this Aug 25, 2022

gjoseph92 deleted the processing-erred-killed-worker branch August 25, 2022 00:53

hendrikmakait mentioned this pull request Aug 25, 2022

Improve handling of tasks failing/succeeding on unexpected workers #6956

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Alternative handling of killed/unexpected worker in `transition_processing_erred` #6939

Alternative handling of killed/unexpected worker in `transition_processing_erred` #6939

gjoseph92 commented Aug 23, 2022

github-actions bot commented Aug 23, 2022

hendrikmakait Aug 24, 2022

gjoseph92 Aug 24, 2022

hendrikmakait Aug 24, 2022

gjoseph92 Aug 24, 2022

gjoseph92 commented Aug 25, 2022

Alternative handling of killed/unexpected worker in transition_processing_erred #6939

Alternative handling of killed/unexpected worker in transition_processing_erred #6939

Conversation

gjoseph92 commented Aug 23, 2022

github-actions bot commented Aug 23, 2022

Unit Test Results

hendrikmakait Aug 24, 2022

Choose a reason for hiding this comment

gjoseph92 Aug 24, 2022

Choose a reason for hiding this comment

hendrikmakait Aug 24, 2022

Choose a reason for hiding this comment

gjoseph92 Aug 24, 2022

Choose a reason for hiding this comment

gjoseph92 commented Aug 25, 2022

Alternative handling of killed/unexpected worker in `transition_processing_erred` #6939

Alternative handling of killed/unexpected worker in `transition_processing_erred` #6939