Accurate occupancy calculation / occupancy replacement #7027
Labels
discussion
Discussing a topic with no specific actions yet
enhancement
Improve existing functionality or make things work better
performance
scheduler
scheduling
stealing
Context
Occupancy is a measure of expected work on a Worker given a couple of assumptions. If there was one thread on that worker and if network communication and compute could not overlap, the occupancy would give us the expected time this worker will require to compute all of its keys. It is an estimation of busyness.
Occupancy is currently used for two mechanisms
Scheduler.worker_objective
to sort worker from least busy to most busy. Typically we will select the least busy worker for a new task assignmentThis measure has currently a couple of flaws
worker_objective
is trying to compute a "earliest start time" for a given key and tries to minimize this. Exclusively considering runtimes is not sufficient since it neglects task priorities. A task that is assigned to a worker with very high occupancy and thousands of tasks queued up would still have the chance to cut the line. In fact, we are relying on this mechanism heavily to reduce memory pressure. Simply looking at expected start time might even act suppressingly to achieve this goal. See Respect priority in earliest start time heuristic #5253 for a deeper discussionreevaluate_occupancy
. Therefore, drastic changes to number of workers and/or timing measurements require time to be propagated to the entire cluster which until corrected can cause very poor scheduling (and stealing) decisions, e.g. Root-ish tasks all schedule onto one worker #6573 (comment)To address problem 1.) we'd need to rework fundamentally how we schedule. some of the possibilities are discussed in #5253 without a conclusion due to performance concerns.
Problem 2.) and 3.) is mostly about the implementation itself and there are already small increments available on how to improve this (see #7020). I believe we can overcome these limitations entirely (not for free) while paving the way for a solution to 1.) at the same time.
Drop in replacement?
Assuming that the variance of durations within a
TaskGroup
is not too large, a suitable approximation for CPU occupancy can be computed accurately by counting the number of tasks per group assigned to a worker.Using the duration_average we can calculate an accurate occupancy in
O(TG)
instead ofO(T)
whereTG
is the number ofTaskGroup
s currently assigned to a worker andT
is the number of tasks assigned to a worker. I expectTG
to typically be <10 and consider iterating over this collection to be basically constant time.(Note: This entire argument can be made for
TaskPrefix
es as well if performance doesn't play out but I believe we should start measureing duration_average on TG level. That's a bit off-topic)Pseudo code
This pseudo code would effectively allow us to
I believe this implementation would server as a drop-in replacement to our existing logic but would allow us to remove
Scheduler.reevaluate_occupancy
and therefore not allow occupancy calculations to drift. The cost we'd pay for this is twofoldTask
to aWorker
. However, this is already happening inScheduler._set_duration_estimate
to calculate the comm cost so this will not add any computational complexity but rather shift itscheduler._add_to_memory
) to compute follow up transitions.worker_objective
which already isn't constant time since it is looping over a tasks dependencies to compute transfer cost. Is the number of task groups on a worker significantly larger than the number of dependencies of a task? At the same time, even if this is larger, I would be surprised if the removal ofScheduler.reevaluate_occupancy
would amortize this additional cost.worker_objective
also includes a loop over the dependencies which could leverage the newly introducedWorkerState.needs_what
Further progress towards problem 1.)
With this implementation we could further and more simply investigate scheduling based on task group / prefixes and would allow us to pivot towards a "uniform number of tasks per group" scheduling approach that no longer accounts for durations at all, removing the need for unknown_durations. For instance, if we could define an ordering of task groups we could infer if a task can cut the line and make progress towards 1.)
cc @hendrikmakait @gjoseph92 @wence- @crusaderky
The text was updated successfully, but these errors were encountered: