Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor ensure_communicating #6165

Merged
merged 18 commits into from
May 11, 2022
Merged

Conversation

crusaderky
Copy link
Collaborator

@crusaderky crusaderky commented Apr 20, 2022

Partially closes #5896

In scope for this PR

  • Refactor ensure_communicating() -> None to _ensure_communicating() -> RecsInstr
  • Remove self.loop.add_callback(self.gather_dep, ...)
  • ensure_communicating is no longer called periodically "just in case" - neither from every_cycle nor from find_missing

Out of scope for this PR, but in scope for #5896

  • Refactor gather_dep
  • Get rid of GatherDepDoneEvent (introduced in this PR)
  • Refactor _readd_busy_worker as an async instruction

Out of scope for #5896

  • Refactor find_missing
  • Get rid of EnsureCommunicatingLater (introduced in this PR) and _select_keys_for_gather

distributed/worker.py Outdated Show resolved Hide resolved
distributed/worker.py Outdated Show resolved Hide resolved
@@ -3077,7 +3123,7 @@ async def gather_dep(
for k in self.in_flight_workers[worker]:
ts = self.tasks[k]
recommendations[ts] = tuple(msg.values())
raise
return GatherDepDoneEvent(stimulus_id=stimulus_id)
Copy link
Collaborator Author

@crusaderky crusaderky Apr 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exception is now shielded from @log_errors. Previously it was double reported as it is already logged by logger.exceptions(e) on line 3117.

@@ -3152,7 +3196,6 @@ async def find_missing(self) -> None:
self.periodic_callbacks[
"find-missing"
].callback_time = self.periodic_callbacks["heartbeat"].callback_time
self.ensure_communicating()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer necessary as _ensure_communicating is now in all transitions to fetch.

@crusaderky crusaderky self-assigned this Apr 20, 2022
@crusaderky crusaderky requested a review from fjetter April 20, 2022 17:15
@crusaderky
Copy link
Collaborator Author

@fjetter you may give it a look now or wait until after I've fixed the regressions

@github-actions
Copy link
Contributor

github-actions bot commented Apr 20, 2022

Unit Test Results

       16 files  ±    0         16 suites  ±0   7h 37m 30s ⏱️ + 46m 58s
  2 771 tests +    2    2 693 ✔️ +    6       78 💤  -   3  0  - 1 
22 130 runs  +308  21 111 ✔️ +290  1 019 💤 +20  0  - 2 

Results for commit 95ce604. ± Comparison against base commit e390609.

♻️ This comment has been updated with latest results.

@crusaderky
Copy link
Collaborator Author

crusaderky commented Apr 25, 2022

Current status: 8 tests failing. Pending investigation.

FAILED distributed/dashboard/tests/test_worker_bokeh.py::test_basic

This is because distributed.dashboard.components.worker.CrossFilter.update was never executed before and now for some reason is always executed - and its implementation is out of sync with the rest of the codebase. Need to investigate what it's meant to do.
[ EDIT ] logged in #6303

FAILED distributed/diagnostics/tests/test_eventstream.py::test_eventstream

FAILED distributed/tests/test_scheduler.py::test_decide_worker_coschedule_order_neighbors[nthreads1-1]

FAILED distributed/tests/test_worker.py::test_gather_many_small

This looks like a genuine regression in _select_keys_for_gather - concurrent fetches are being limited to 1 key per worker

FAILED distributed/tests/test_worker.py::test_acquire_replicas_already_in_flight

FAILED distributed/tests/test_worker.py::test_missing_released_zombie_tasks_2

FAILED distributed/tests/test_worker.py::test_gather_dep_cancelled_rescheduled

FAILED distributed/tests/test_worker.py::test_gather_dep_do_not_handle_response_of_not_requested_tasks

@crusaderky crusaderky force-pushed the ensure_communicating branch from 361b734 to 7569dd8 Compare May 6, 2022 21:46
# compute-task or acquire-replicas command from the scheduler, it allows
# clustering the transfers into less GatherDep instructions; see
# _select_keys_for_gather().
return {}, [EnsureCommunicatingLater(stimulus_id=stimulus_id)]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alternative to this was to delete _select_keys_for_gather and either

  1. Add logic to _handle_instructions to squash individual GatherDep instructions on the same worker
  2. Implement no grouping whatsoever and just rely on rpc pooling (needs performance testing)

Both systems would remove the out-of-priority fetch from workers and imply revisiting the limits for concurrent connections.

Either way it would be a major functional change so I opted for this somewhat dirty hack instead which is functionally identical to main.

@@ -2855,13 +2889,16 @@ def stimulus_story(
keys = {e.key if isinstance(e, TaskState) else e for e in keys_or_tasks}
return [ev for ev in self.stimulus_log if getattr(ev, "key", None) in keys]

def ensure_communicating(self) -> None:
def _ensure_communicating(self, *, stimulus_id: str) -> RecsInstrs:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fetching the stimulus_id from outside means that now gather_dep commands will have the stimulus_id of the event that triggered them, e.g.

  • compute-task
  • acquire-replicas
  • find_missing
  • GatherDepDoneEvent


recommendations: Recs = {}
response = {}
to_gather_keys: set[str] = set()
cancelled_keys: set[str] = set()

def done_event():
return GatherDepDoneEvent(stimulus_id=f"gather-dep-done-{time()}")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Temp hack, to be removed when refactoring gather_dep

self.ensure_communicating()
self.handle_stimulus(
GatherDepDoneEvent(stimulus_id=f"readd-busy-worker-{time()}")
)
Copy link
Collaborator Author

@crusaderky crusaderky May 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll change this method in a later PR to an async instruction

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pedantic, but it feels weird to issue a GatherDepDoneEvent when that isn't actually what happened. Something like

class BusyWorkerReAddedEvent(GatherDepDoneEvent):
    pass

might make it clearer that they're different things, just happen to be handled in the same way (for now).

But if GatherDepDoneEvent is itself a temporary hack and will be removed soon, then this isn't important.

@crusaderky
Copy link
Collaborator Author

All tests pass! This is ready for review and merge 🥳
Repeated failures in test_stress_scatter_death are unrelated (#6305).

@crusaderky crusaderky marked this pull request as ready for review May 9, 2022 10:06
@crusaderky crusaderky changed the title WIP: Refactor ensure_communicating Refactor ensure_communicating May 9, 2022
@hayesgb hayesgb requested a review from gjoseph92 May 9, 2022 14:51
Copy link
Collaborator

@gjoseph92 gjoseph92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some naming and design questions, but overall this seems good.

distributed/worker.py Outdated Show resolved Hide resolved
stimulus_id=inst.stimulus_id
)
self.transitions(recs, stimulus_id=inst.stimulus_id)
self._handle_instructions(instructions)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we recurse into _handle_instructions here, instead of adding the new instructions onto the end of the current instructions list (in a safe way)? I'm wondering why the new instructions are treated as "higher priority" than the current ones.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I overhauled the method, please have another look

@@ -292,6 +287,12 @@ def to_dict(self) -> dict[str, Any]:
return d


@dataclass
class EnsureCommunicatingLater(Instruction):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the "later" part of EnsureCommunicatingLater a little confusing. EnsureCommunicatingOnce? EnsureCommunicatingIdempotent?

AFAIU the point of doing this as an instruction (instead of calling _ensure_communicating directly in many places) is to allow batching of multiple EnsureCommunicating instructions into one, via special logic in _handle_instructions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not just a matter of doing it once; it must happen after all recommendations to transition to fetch have been enacted.
Renamed to EnsureCommunicatingAfterTransitions.

distributed/worker.py Show resolved Hide resolved
stimulus_id=stimulus_id,
# Note: given n tasks that must be fetched from the same worker, this method
# may generate anywhere between 1 and n GatherDep instructions, as multiple
# tasks may be clustered in the same instruction by _select_keys_for_gather
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# tasks may be clustered in the same instruction by _select_keys_for_gather
# tasks may be clustered in the same instruction by _select_keys_for_gather.
# The number will be greater than 1 when the tasks don't all fit in `target_message_size`.

This just took me a few reads to make sense of

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was incorrect to begin with - you'll never have more than one GatherDep from the same worker within the same iteration of ensure_communicating. I rewrote it.

self.ensure_communicating()
self.handle_stimulus(
GatherDepDoneEvent(stimulus_id=f"readd-busy-worker-{time()}")
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pedantic, but it feels weird to issue a GatherDepDoneEvent when that isn't actually what happened. Something like

class BusyWorkerReAddedEvent(GatherDepDoneEvent):
    pass

might make it clearer that they're different things, just happen to be handled in the same way (for now).

But if GatherDepDoneEvent is itself a temporary hack and will be removed soon, then this isn't important.

@@ -292,6 +287,12 @@ def to_dict(self) -> dict[str, Any]:
return d


@dataclass
class EnsureCommunicatingLater(Instruction):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also find it a little odd that unlike other instructions, EnsureCommunicatingLater doesn't contain any data. It's relying on the side effect of _add_to_data_needed having already mutated data_needed and data_needed_per_worker, but the instruction itself is pointless without that side effect having occurred. I can't think of a cleaner design that avoids this and still has batching though.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but the whole instruction is a hack to begin with

Copy link
Collaborator

@gjoseph92 gjoseph92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The renaming and _handle_instructions refactor helped, thank you. Just some comment-rewording and type annotations for clarity.

distributed/worker.py Outdated Show resolved Hide resolved
distributed/worker.py Outdated Show resolved Hide resolved
to_gather, total_nbytes = self._select_keys_for_gather(
worker, ts.key, all_keys_to_gather
)
all_keys_to_gather |= to_gather

self.log.append(
("gather-dependencies", worker, to_gather, stimulus_id, time())
)

self.comm_nbytes += total_nbytes
self.in_flight_workers[worker] = to_gather
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.in_flight_workers[worker] = to_gather
assert worker not in self.in_flight_workers, self.in_flight_workers[worker]
self.in_flight_workers[worker] = to_gather

Are we guaranteed that in_flight_workers[worker] is not already set? Because we'd be overwriting it if it is.

EDIT: I think we are because of the if w not in self.in_flight_workers above. Still might be nice to validate though? If this was not the case, it could probably cause a deadlock.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it's impossible to be there already due to the line you mentioned just above. I think validation should be overkill since it's directly above

distributed/worker.py Outdated Show resolved Hide resolved
distributed/worker.py Outdated Show resolved Hide resolved
# 1. there are many fetches queued because all workers are in flight
# 2. a single compute-task or acquire-replicas command just sent
# many dependencies to fetch at once.
ensure_communicating = inst
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it even necessary to store the instruction right now (since it's just a sentinel), or could this just be a bool?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need the stimulus_id

@crusaderky crusaderky merged commit 1937be7 into dask:main May 11, 2022
@crusaderky crusaderky deleted the ensure_communicating branch May 11, 2022 17:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Migrate ensure_communicating transitions to new WorkerState event mechanism
2 participants