-
-
Notifications
You must be signed in to change notification settings - Fork 719
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use cases for work stealing #6600
Comments
There is currently a proposal open to introduce a third use case C) Redistribute tasks that are currently assigned to paused workers (see #3761) |
In practice, I feel like work stealing mostly exists to rebalance the 10s or 100s of thousands of root tasks in This is a pretty different use case than moving a handful of downstream tasks to a new worker to increase parallelism. I think if these root vs downstream cases were separate, we could have better algorithms for both. Holding back root tasks on the scheduler (and implementing an alternative load-balancing approach for those held-back tasks) would do that. I think that with #6560, we might be okay without work stealing at all. I think we'd still want it eventually for some scenarios, but the immediate need for it would be much less. |
Sorry for missing this issue (I stopped tracking Dask issues a couple weeks ago while on vacation) Some thoughts:
Anyway, I agree with @gjoseph92 that if we hold back tasks on the scheduler then work stealing becomes more optional than it is today. I'd be curious about up-scaling, but it's probably not a huge issue. |
Work stealing is a fairly complex machinery intended to redistribute tasks on a cluster to achieve a homogeneous occupancy, i.e. all workers will be busy for approximately the same time.
I'm currently aware of two use cases for it
A) Adaptive scaling or generally upscaling scenarios, i.e. adding more workers to a cluster require some kind of load balancing. Without work stealing, a newly added worker would sit idle until the scheduler might assign a task which is not even guaranteed or might work very poorly (e.g. #4471)
By having work stealing enabled, we are automatically ensuring that any newly added worker is able to start working since it gets tasks assigned via the stealing mechanism. However, this is known to not work well (list not exhaustive)
B) Another application would be a workload with vastly different runtimes in a TaskGroup. This is particularly concerning if there are few tasks in this task group or the runtime distribution is asymmetrical such that even after running many tasks the runtime differences would not cancel themselves and we'd have few workers with very large queues, effectively extending overall runtime by having a large tail in the computation.
I am not entirely sure if this usecase is actually very relevant and would appreciate some additional information around it. If this is indeed relevant we may benefit from an improved runtime tracking, e.g. with error measurement (e.g. #4028) in combination with a simpler, more selective algorithm.
The current work stealing algorithm has a couple of issues. Currently open issues can be filtered by the label stealing
Stealing also is known to be a trigger for deadlocks (at least four have been reported and fixed by now) since it requires a handshake that can cause timing issues (see e.g. https://github.com/dask/distributed/pulls?q=is%3Apr+is%3Aclosed+stealing+label%3Astealing+label%3Adeadlock)
There are even cases where work stealing is known to cause harm by reverting smart scheduler decisions, e.g. #6573
I'm currently trying to estimate whether we should pursue work stealing and try to make it robust or abandon this extension in favor of a less general but more robust solution for A and possibly B.
Thoughts?
cc @mrocklin @crusaderky @gjoseph92
The text was updated successfully, but these errors were encountered: