Improve scheduling algorithm for better throughput #249

Open
shonfeder opened this issue Sep 6, 2024 · 1 comment
Comments

@shonfeder
Contributor

We often see the current scheduler overload one worker with a massive queue because of cache hints, while other workers are starved for work.

We want the job scheduling algorithm to take cache hints into account, but also to consider worker capacity and availability. In some cases this should noticeably improve our throughput.

Related to, but more general than #168

@talex5
Contributor

talex5 commented Sep 6, 2024

> We want the job scheduling algorithm to take cache hints into account, but also to consider worker capacity and availability.

It should already be doing that:

ocluster/scheduler/pool.ml

Lines 368 to 373 in a27ef61

(* A worker is available for this item, but perhaps there is some other
   worker that should get it instead? e.g. that worker already has part of
   the work cached and will be able to get to it fairly soon. *)
let assign_preferred t ticket =
  let hint = Item.cache_hint ticket.item in
  let cost = Item.cost_estimate ticket.item in
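
To illustrate the trade-off that comment describes, here is a minimal, self-contained sketch. It is not the code in pool.ml: the `worker` record, `finish_time` and `choose` are hypothetical names, and it assumes the scheduler can estimate each worker's queued work in seconds. It picks whichever worker is expected to finish the job soonest, so a warm cache only wins while that worker's queue is short enough.

```ocaml
(* Hypothetical types for illustration; not the ones used in ocluster. *)
type estimate = { cached : int; non_cached : int } (* seconds *)

type worker = {
  name : string;
  queued_cost : int;           (* estimated seconds of work already queued *)
  cached_hints : string list;  (* cache hints we believe this worker has warm *)
}

let default_estimate = { cached = 10; non_cached = 600 }

(* Estimated seconds until [w] would finish a job with cache hint [hint]. *)
let finish_time est hint w =
  let run =
    if List.mem hint w.cached_hints then est.cached else est.non_cached
  in
  w.queued_cost + run

(* Pick the worker with the smallest estimated finish time.
   Earlier workers in the list win ties. *)
let choose est hint = function
  | [] -> None
  | w :: ws ->
    Some
      (List.fold_left
         (fun best w ->
            if finish_time est hint w < finish_time est hint best then w
            else best)
         w ws)
```

With the default estimates, a warm worker with 300 seconds queued still beats an idle cold worker (310 vs 600), but a warm worker with 900 seconds queued does not.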

However, there are a few problems (or were, last time I looked):

  • The estimates of the times for cold vs warm caches are hard-coded and arbitrary:
    let default_estimate = S.{
      cached = 10; (* A build with cached dependencies usually only takes about 10 seconds. *)
      non_cached = 600; (* If we have to install dependencies, it'll probably take about 10 minutes. *)
    }
  • The cache may have been cleared on the worker, but the scheduler doesn't know.
  • If a machine gets stuck but remains connected, the scheduler will wait indefinitely for it to do the work it claims it can do, rather than deciding that it's broken and reassigning the work. (If the worker is well enough to be performing health checks, it can pause itself, which fixes the problem.)
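
One possible direction for the first problem, sketched under assumptions: instead of fixed constants, keep running estimates and update them from observed job durations with an exponentially weighted moving average. Nothing here is existing ocluster API; `running_estimate`, `ewma` and `observe` are hypothetical names, and the sketch assumes the scheduler learns each job's duration and whether its cache was warm.

```ocaml
(* Hypothetical running estimates, in seconds; seeded with the current constants. *)
type running_estimate = { cached : float; non_cached : float }

let initial = { cached = 10.; non_cached = 600. }

(* Exponentially weighted moving average.
   [alpha] in (0, 1] is the weight given to the new sample. *)
let ewma ~alpha old sample = (alpha *. sample) +. ((1. -. alpha) *. old)

(* Fold one observed job duration into the estimate for the matching case. *)
let observe ?(alpha = 0.2) est ~was_cached ~duration =
  if was_cached then { est with cached = ewma ~alpha est.cached duration }
  else { est with non_cached = ewma ~alpha est.non_cached duration }
```

For example, `observe ~alpha:0.5 initial ~was_cached:true ~duration:20.` moves the cached estimate from 10 to 15 seconds and leaves the non-cached estimate alone. This does not address the other two problems: a cleared cache or a stuck worker would still need some separate signal.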

@shonfeder shonfeder changed the title Improve scheduling algorithm Improve scheduling algorithm to improve throughput Sep 24, 2024
@shonfeder shonfeder changed the title Improve scheduling algorithm to improve throughput Improve scheduling algorithm for better throughput Oct 2, 2024