Factor out and instrument task categorization logic - static graph analysis #6922
Thanks for writing this up. I agree that the way we categorize tasks for queuing and co-location is highly related to how we'd do it for STA. I'm not sure that the current root task detection logic could be done statically, though. (I think there could probably be a totally different static algorithm, but I don't know what it is yet.) I would actually like to change it to make the currently-static part cluster-size-dependent; e5175ce is that change. The commit message explains it:
I think that what the current root-task logic is actually trying to identify is when we're crossing the cluster-size boundary: when we switch from having more threads than tasks, to more tasks than threads.
I have also thought about this a lot for task queuing, though none of it made it into the current PR. I think there are two major factors we should consider more in scheduling, which would encompass the current root task detection logic, co-assignment logic, widely-shared dependencies, and STA:

1. Amortized cost / return on investment of data transfer

Generally, we want to minimize data transfers between workers. Of course, the trivial way to do this is to schedule all tasks onto one worker—then you'd never have transfers! But that is obviously a bad blanket policy. So we have to have some way of deciding when it's "worth it" to incur the short-term cost of copying data from one worker to another, because it would open up the opportunity for some long-term gain—namely, by copying the data, we can increase the parallelism of our overall computation.

Right now, we don't ask this question in a formalized way. "Normal-mode" scheduling is actually unwilling to do this at all: if a task has 1 dependency, for example, it will only go to a worker that holds that dependency. (We tried to change this once in #4925.) Currently, work stealing and root-ish task logic are the only ways tasks can be assigned to workers that don't have any of their dependencies. This leads to problems like the dogpile.

One way I've thought about this is to amortize the cost of moving a key over all the tasks that could then run on the new worker if it were moved—basically, how much it increases the opportunity for parallelism: b4ebbee. Lots more discussion of this in #5325 / #5326.

Maybe a better way to think about it would be to try to estimate the return on investment of moving a key. Rather than purely looking at transfers as a cost, in some cases they can be an investment. From that framework, some keys are a bad investment to copy: if a task only has one dependent, and it's on a multithreaded worker, copying it gains you almost no parallelism. But if a task has 100 dependents, duplicating it onto another worker doubles the parallelism of those 100 tasks. You could look at the wall-clock time you'd expect those 100 tasks to take with the current number of threads available to them, divided by the transfer time + the wall-clock time with the increased number of threads. Maybe that's your ROI.

From that perspective, the root-ish task metric (lots of tasks, few dependencies) is actually a crude way of doing this amortization. When 10k tasks share only a handful of small dependencies, copying those dependencies anywhere is cheap relative to the parallelism it unlocks. From this perspective, then, maybe root-ish tasks don't actually need to be special-cased? That is, in these big-fan-out cases, or cluster-size-boundary-crossing cases, the handful of dependencies that thousands of tasks share would look like good enough investments to duplicate that we'd naturally consider every worker as a candidate for those root-ish downstream tasks, not just the workers holding the few dependencies.

2. Identifying families of tasks to co-locate

The other major thing the root task scheduling does is try to pick the same worker for tasks whose outputs will be used together. If C takes both A and B as inputs, you want A and B to run on the same worker—that way, you don't have to transfer any data to run C. This only works right now because we assign all ready tasks greedily, so we happen to iterate through them in priority order. When you switch to scheduler-side queuing, the iteration order changes and you lose this co-assignment.
It turns out that for both STA and root-task withholding, it would be valuable to have a quick way to identify "families"/groups of sibling/cousin tasks which should ideally all run on the same worker. I think of a family as a set of tasks that are all inputs to a common downstream task. A corollary is that in most cases, all tasks in a family will have to be in memory on the same worker at the same time. But coming up with an actual metric for this is harder once you consider all-to-all graphs, widely-shared dependencies, linear chains, etc. Still, I think there's probably a pragmatic definition we could come up with, and a way we could compute it with minimal additional cost during graph submission. This, more than root-ish-ness, is the thing I'd like to explore formalizing and statically identifying in the graph.
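As a rough illustration of that family notion, here is a minimal sketch on a plain dict-of-dependencies graph rather than real scheduler state; the function name and structure are hypothetical, not anything that exists in distributed:

```python
# Minimal sketch: group tasks into "families" (tasks that feed a common
# dependent), using a union-find over siblings. Hypothetical helper, plain dicts.
from collections import defaultdict


def families(dependencies):
    """Group tasks into families: sets of tasks consumed by a common dependent.

    ``dependencies`` maps each task to the set of tasks it depends on.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for task, deps in dependencies.items():
        deps = list(deps)
        for a, b in zip(deps, deps[1:]):
            union(a, b)  # all inputs of `task` belong to the same family

    groups = defaultdict(set)
    for task in {d for deps in dependencies.values() for d in deps}:
        groups[find(task)].add(task)
    return list(groups.values())


# Example: c consumes a and b, so {a, b} form one family; x stands alone.
graph = {"c": {"a", "b"}, "y": {"x"}}
print(families(graph))  # [{'a', 'b'}, {'x'}] (order may vary)
```

A real version would have to deal with the harder cases mentioned above (all-to-all graphs, widely-shared dependencies, linear chains), where naive transitive grouping collapses everything into one family.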
Overall, my hypothesis is that these two special things that root-ish task logic considers would actually be good to think about for all tasks. Furthermore, those things might actually be the underlying problems in a number of different domains! So if we had the framework to measure these things easily, solutions to a variety of other problems (widely-shared dependencies dogpile, STA, co-assignment for queued tasks, etc.) might also pop out in a generic way.
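To make the ROI arithmetic in point 1 concrete, here is a back-of-the-envelope sketch; all names and numbers are hypothetical, and this is not a distributed API:

```python
# Rough ROI estimate for copying a widely shared key to another worker.
def copy_roi(n_dependents, task_duration, transfer_time, current_threads, added_threads):
    """ROI > 1 means the copy pays for itself in reduced wall-clock time.

    n_dependents:    tasks that consume the key
    task_duration:   average runtime of one dependent, in seconds
    transfer_time:   time to copy the key to the new worker, in seconds
    current_threads: threads currently able to run the dependents
    added_threads:   threads gained by duplicating the key
    """
    time_without_copy = n_dependents * task_duration / current_threads
    time_with_copy = (
        transfer_time
        + n_dependents * task_duration / (current_threads + added_threads)
    )
    return time_without_copy / time_with_copy


# 100 dependents of 1s each: doubling parallelism for a 5s transfer looks good...
print(copy_roi(100, 1.0, 5.0, current_threads=4, added_threads=4))  # ~1.43
# ...while copying a key with a single dependent on a multithreaded worker does not.
print(copy_roi(1, 1.0, 5.0, current_threads=4, added_threads=4))    # ~0.05
```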
Thanks @gjoseph92 for your thorough writeup. I think your ideas do have some merit but are already a couple of steps ahead of what I have in mind.
Generally speaking, what you are describing is already what work stealing is trying to do. I think this argument is sound, but our problem is that, generally, we do not have this kind of information/measurement available at scheduling time. IIUC, you are trying to make the point of "maybe we do not need static task classification"? I think what you are describing are further dynamic components that can weigh in on our scheduling decision (right now, it's basically mostly occupancy), but in this issue I wanted to discuss static graph analysis (similar to dask.order) that we could utilize further.
I think this is mostly a static graph property and also something I had in mind as a possible next step. Just to repeat, I'm not saying we should base scheduling decisions exclusively on static analysis, but I think static analysis is a major component we're not using sufficiently yet. Overall, I think we should try to get more out of dask.order than we are right now. If I take a look at "recent" improvements in this space, I think we should have a similar mechanism/visualization for root-(ish-)tasks. I think whatever we do here can also be used for further research (e.g. finding groups). @eriknw I think the dask.order stuff was created by you? Maybe you are interested in this space as well.
Thanks for the ping! Yes, lots of things here interest me very much, and yes, I've done a bunch of work on dask.order. One idea I've been playing with for a while (and discussed with some folks at SciPy) is to have another dask.order-style static analysis pass over the graph. In regard to detecting high-level patterns in the task graph to schedule better, it would be nice to know when all dependent tasks "split" an input into smaller chunks. In this case, scheduling order should prefer to do BFS so we can release the big dependency ASAP. Knowing expected relative sizes of tasks can also improve low-level task fusion.
This overhauls `decide_worker` into separate methods for different cases. More importantly, it explicitly turns `transition_waiting_processing` into the primary dispatch mechanism for ready tasks. All ready tasks (deps in memory) now always get recommended to processing, regardless of whether there are any workers in the cluster, whether they have restrictions, whether they're root-ish, etc. `transition_waiting_processing` then decides how to handle them (depending on whether they're root-ish or not), and calls the appropriate `decide_worker` method to search for a worker. If a worker isn't available, then it recommends them off to `queued` or `no-worker` (depending, again, on whether they're root-ish and the WORKER_SATURATION setting).

This also updates the `no-worker` state to better match `queued`. Before, `bulk_schedule_after_adding_worker` would send `no-worker` tasks to `waiting`, which would then send them to `processing`. This was weird, because in order to be in `no-worker`, they should already be ready to run (just in need of a worker). So going straight to `processing` makes more sense than sending a ready task back to waiting.

Finally, this adds a `SchedulerState.is_rootish` helper. Not quite the static field on a task @fjetter wants in dask#6922, but a step in that direction.
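A very loose sketch of the dispatch flow described above; the helper names on `state` are illustrative stand-ins, not the actual methods in distributed/scheduler.py:

```python
# Simplified picture of transition_waiting_processing as the single dispatch
# point for ready tasks. Hypothetical names throughout.
QUEUING_ENABLED = True  # stand-in for the worker-saturation / queuing setting


def transition_waiting_processing_sketch(state, ts):
    """Decide where a ready task (all dependencies in memory) goes next."""
    if state.is_rootish(ts):
        ws = state.pick_worker_for_rootish(ts)        # hypothetical helper
        if ws is None:
            # No worker has capacity: withhold on the scheduler, or park the
            # task in no-worker when queuing is disabled.
            return {ts.key: "queued" if QUEUING_ENABLED else "no-worker"}
    else:
        ws = state.pick_worker_by_dependencies(ts)     # hypothetical helper
        if ws is None:
            # No worker satisfies the task's restrictions (or cluster is empty).
            return {ts.key: "no-worker"}
    return {ts.key: ("processing", ws)}
```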
TL;DR: I think we should build out our instrumentation around root-ish tasks to improve visibility and UX, and to enable further research in this space.
When deciding which worker to schedule a task on, we're not treating all tasks equally.
Prior and ongoing work in this area includes the effort to enforce co-location and the effort to withhold tasks on the scheduler. Both cases single out a specific class of tasks and implement special scheduling heuristics for them. These issues introduce something called a "root-ish task", which refers to nodes in the graph that are likely exhibiting a fan-out / reduce pattern. The reason these need special treatment is an artifact of us assigning not-yet-runnable tasks (i.e. dependents) to workers just-in-time while assigning ready tasks greedily. This can temporarily bypass the depth-first-search ordering of dask.order, which causes suboptimal scheduling and significantly higher cluster-wide memory consumption for many use cases. This behavior is commonly referred to as root task overproduction.

There are commonly two approaches discussed to fix this problem: withholding ready root tasks on the scheduler until workers have capacity (scheduler-side queuing), and speculatively assigning not-yet-runnable dependents to workers ahead of time (STA).
We are approaching consensus that both solutions would address this problem, but they take almost orthogonal approaches to scheduling. Both solutions come with benefits, opportunities, shortcomings, costs and risks. The approach of task withholding is currently the most likely short-term fix for the situation since it requires comparatively few adjustments to the code base and can be implemented scheduler-side only.
A common theme between both approaches is how we detect "special" tasks and I would like to start a conversation about generalizing the approach taken for root-tasks and discuss how this could be expanded.
How are root tasks detected?
Root tasks can be detected trivially with a quadratic-runtime algorithm by walking up and down the task graph, but this is not feasible to perform for every task.
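A naive illustration of that walk, on plain dicts with hypothetical names; roughly O(n) per task, hence quadratic if evaluated for every task in the graph:

```python
# For a given task, walk up to its inputs and back down to count how many
# sibling tasks share them.
def fan_out_width(task, dependencies, dependents):
    """Number of tasks that share this task's inputs."""
    siblings = {task}
    for dep in dependencies.get(task, ()):       # walk up to each input
        siblings |= dependents.get(dep, set())   # walk back down to its consumers
    return len(siblings)


# Example: three tasks reading the same small input all report a width of 3.
dependencies = {"read-0": {"src"}, "read-1": {"src"}, "read-2": {"src"}}
dependents = {"src": {"read-0", "read-1", "read-2"}}
print(fan_out_width("read-0", dependencies, dependents))  # 3
```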
The current approach instead utilizes TaskGroups to infer the root task property (see distributed/distributed/scheduler.py, lines 1799 to 1806 in a1d2011).
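The embedded snippet did not survive the copy here; reconstructed from the fragments quoted below, the referenced check looks roughly like this:

```python
# Rough reconstruction of the referenced decide_worker check (not copied
# verbatim from scheduler.py); `tg` is the task's TaskGroup.
if (
    valid_workers is None
    and len(tg) > self.total_nthreads * 2     # dynamic: group larger than 2x the cluster's threads
    and len(tg.dependencies) < 2              # static: depends on fewer than two task groups
    and sum(map(len, tg.dependencies)) < 5    # static: fewer than five dependency tasks in total
):
    # ...treat the task group as root-ish and spread it across all workers
    ...
```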
The dynamic component of this if clause, `valid_workers is None and len(tg) > self.total_nthreads * 2`, is there to protect us from making a bad scheduling decision that would reduce our ability to parallelize. The static components, `len(tg.dependencies) < 2 and sum(map(len, tg.dependencies)) < 5`, are really what we use to classify the root tasks; they are a way of describing that we're dealing with a task that has "few, relatively small" dependencies.

This classification could be moved to become an actual TaskState property, which has a couple of trivial benefits.
1.) The classification could be computed once during update_graph or after dask.order. While this is not a costly computation, there is no need to do it at runtime.
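As a sketch of what such a precomputed property could look like (plain dicts instead of TaskState/TaskGroup objects; the thresholds mirror the heuristic above, but the names are otherwise hypothetical):

```python
# Compute a static "root-ish" flag per task group at graph-submission time.
def classify_rootish(group_sizes, group_dependencies):
    """group_sizes: group name -> number of tasks in the group.
    group_dependencies: group name -> set of groups it depends on.

    Only the static part of the current heuristic is evaluated here; the
    cluster-size comparison would stay a runtime check.
    """
    rootish = {}
    for group, deps in group_dependencies.items():
        n_dep_tasks = sum(group_sizes[d] for d in deps)
        rootish[group] = len(deps) < 2 and n_dep_tasks < 5
    return rootish


# Example: 1000 "load" tasks with no dependencies are root-ish; a group that
# fans out from those 1000 tasks, or combines two large groups, is not.
sizes = {"load": 1000, "process": 1000, "combine": 1}
deps = {"load": set(), "process": {"load"}, "combine": {"load", "process"}}
print(classify_rootish(sizes, deps))
# {'load': True, 'process': False, 'combine': False}
```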
How can we use this? What's the link to STA?

2.) will be very useful from a UX perspective. Our scheduling is already relatively opaque; if we now start to withhold some tasks because they are special, it would be nice if we could visualize this (e.g. as part of the dask.order visualization, the graph bokeh dashboard, or annotations when hovering on the task stream).
From a developer's perspective, I strongly believe this is equally helpful when talking about less trivial (non-root) root-ish task graphs.
Apart from UX, I consider 2.), 4.), and to a lesser extent also 3.) as valuable components for further research in this space. For instance, are all commonly appearing "root-ish" tasks part of the same category of subtopologies, or do they break up into further categories? Are there any topologies for which we know that we can afford exact solutions[1]?
How valuable would it be to introduce manual annotations that mark certain tasks as special (for instance, data generators, high memory use, reducers, ...)? I'm sure there is more.
Apart from worker-side state machine changes (which I consider quite doable after #5736), one of the big questions remaining in STA is "which tasks should we schedule ahead of time?"
As outlined above, I consider root task withholding and STA of dependents to be symmetrical problems when it comes to this decision. I believe further research into these task classifications could inform future decisions for both approaches.
[1] For example, think of trivial map-reduce/tree-reduce patterns. If we can detect that we're operating on these topologies, it is almost trivial to implement perfect scheduling. Once we have classified a task as root-ish, we could further probe/analyze whether we are dealing with a trivial reduction and implement a special scheduling decision for this.
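A sketch of the probe idea from [1]: given a root-ish group of tasks, check whether every consumer only combines siblings from that group. Plain dicts and hypothetical names again:

```python
# Probe whether the consumers of a root-ish group form a trivial reduction,
# i.e. every consumer's inputs all come from within the group.
def is_trivial_reduction(rootish_tasks, dependents, dependencies):
    """True if all consumers of ``rootish_tasks`` only reduce within the group."""
    consumers = set()
    for t in rootish_tasks:
        consumers |= dependents.get(t, set())
    return all(dependencies[c] <= rootish_tasks for c in consumers)


# Example: a tree reduction over four root tasks is detected...
roots = {"x0", "x1", "x2", "x3"}
deps = {"r0": {"x0", "x1"}, "r1": {"x2", "x3"}}
dents = {"x0": {"r0"}, "x1": {"r0"}, "x2": {"r1"}, "x3": {"r1"}}
print(is_trivial_reduction(roots, dents, deps))   # True
# ...whereas a consumer that also needs outside data is not a trivial reduction.
deps["r0"].add("other")
print(is_trivial_reduction(roots, dents, deps))   # False
```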
cc @gjoseph92