Improve scheduling algorithm for better throughput #249

shonfeder · 2024-09-06T01:14:38Z

We often see the current scheduler overload one worker with a massive queue because of cache hints, while other workers are starved for work.

We want the job scheduling algorithm to take cache hints into account, but also to consider worker capacity and availability. At times this should substantially improve our throughput.

Related to, but more general than #168

talex5 · 2024-09-06T15:45:29Z

> We want the job scheduling algorithm to take cache hints into account, but also to consider worker capacity and availability.

It should already be doing that:

ocluster/scheduler/pool.ml

Lines 368 to 373 in a27ef61

    
    (* A worker is available for this item, but perhaps there is some other
       worker that should get it instead? e.g. that worker already has part of
       the work cached and will be able to get to it fairly soon. *)
    let assign_preferred t ticket =
      let hint = Item.cache_hint ticket.item in
      let cost = Item.cost_estimate ticket.item in
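
To make the trade-off in that comment concrete, here is a minimal standalone sketch of the kind of comparison involved: an idle worker with a cold cache versus a busy worker with a warm one. This is not the ocluster implementation. The `worker` record, `backlog`, `has_cache`, `finish_time` and `choose` are all hypothetical names invented for illustration, and the numbers reuse the 10 s / 600 s defaults.

```ocaml
(* Hypothetical types: none of these names come from ocluster. *)
type estimate = { cached : int; non_cached : int } (* seconds *)

type worker = {
  name : string;
  backlog : int;    (* estimated seconds of work already queued *)
  has_cache : bool; (* whether the scheduler believes the cache is warm *)
}

(* Estimated seconds until [w] would finish a new job with cost [est]. *)
let finish_time est w =
  w.backlog + (if w.has_cache then est.cached else est.non_cached)

(* Pick the worker expected to finish soonest; ties keep the earlier one. *)
let choose est = function
  | [] -> None
  | w :: ws ->
    Some
      (List.fold_left
         (fun best c ->
            if finish_time est c < finish_time est best then c else best)
         w ws)

let () =
  let est = { cached = 10; non_cached = 600 } in
  let busy_warm = { name = "warm"; backlog = 2000; has_cache = true } in
  let idle_cold = { name = "cold"; backlog = 0; has_cache = false } in
  match choose est [ busy_warm; idle_cold ] with
  | Some w -> Printf.printf "assign to %s\n" w.name
  | None -> print_endline "no workers"
```

With a 2000 s backlog on the warm worker, the cold worker wins (600 s versus 2010 s). A comparison like this is only as good as the backlog and cache estimates fed into it.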

However, there are a few problems (or were, last time I looked):

The estimates of the times for cold vs warm caches are hard-coded and arbitrary:

ocluster/scheduler/cluster_scheduler.ml

Lines 33 to 36 in a27ef61

    
    let default_estimate = S.{
        cached = 10;                (* A build with cached dependencies usually only takes about 10 seconds. *)
        non_cached = 600;           (* If we have to install dependencies, it'll probably take about 10 minutes. *)
    }
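
One possible way to make these numbers less arbitrary, sketched here purely as an assumption and not as something ocluster does, is to start from the defaults and fold in observed build durations with an exponential moving average. The float-valued `estimate` record, `ema`, `observe` and the `alpha` weight are hypothetical.

```ocaml
(* Hypothetical sketch: learn estimates from observed build times. *)
type estimate = { cached : float; non_cached : float } (* seconds *)

(* Starting point, taken from the hard-coded defaults quoted above. *)
let default_estimate = { cached = 10.; non_cached = 600. }

(* Exponential moving average; [alpha] in (0, 1] weights the newest sample. *)
let ema ~alpha old sample = (alpha *. sample) +. ((1. -. alpha) *. old)

(* Fold one finished build into the estimate for its cache state. *)
let observe ?(alpha = 0.2) est ~was_cached ~duration =
  if was_cached then { est with cached = ema ~alpha est.cached duration }
  else { est with non_cached = ema ~alpha est.non_cached duration }

let () =
  (* A cold build that took 5 minutes pulls the cold estimate down. *)
  let est = observe default_estimate ~was_cached:false ~duration:300. in
  Printf.printf "cached=%.0fs non_cached=%.0fs\n" est.cached est.non_cached
```

The value of `alpha` is itself a tuning knob, so this moves the arbitrariness rather than removing it, but the estimates would at least track real build times.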

The cache may have been cleared on the worker, but the scheduler doesn't know.
If a machine gets stuck but remains connected, the scheduler will wait indefinitely for it to do the work it claims it can do, rather than deciding that it's broken and reassigning the work (though if the worker is well enough to perform health checks, it can pause itself, which fixes the problem).
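
As a rough illustration of the stuck-worker case, a scheduler could compare how long a job has been running against its estimate and treat large overruns as a reason to reassign. This is a hypothetical sketch under stated assumptions, not existing ocluster behaviour: the `running` record, `is_stalled`, `partition_stalled` and the `slack` factor of 3.0 are all invented for the example.

```ocaml
(* Hypothetical sketch: flag jobs that have badly overrun their estimate. *)
type running = {
  worker : string;
  started : float;  (* Unix timestamp when the job was handed out *)
  estimate : float; (* expected duration in seconds *)
}

(* A job counts as stalled once it has run more than [slack] times its
   estimate. The default factor 3.0 is an arbitrary placeholder. *)
let is_stalled ?(slack = 3.0) ~now job =
  now -. job.started > slack *. job.estimate

(* Split running jobs into those to reassign and those still within budget. *)
let partition_stalled ~now jobs =
  List.partition (fun j -> is_stalled ~now j) jobs

let () =
  let jobs =
    [
      { worker = "a"; started = 0.; estimate = 600. };
      { worker = "b"; started = 1500.; estimate = 600. };
    ]
  in
  let stalled, _ok = partition_stalled ~now:2000. jobs in
  List.iter (fun j -> Printf.printf "reassign job from %s\n" j.worker) stalled
```

Any such timeout depends on the estimates being reasonable, so it interacts with the hard-coded defaults mentioned above.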

shonfeder changed the title ~~Improve scheduling algorithm~~ Improve scheduling algorithm to improve throughput Sep 24, 2024

shonfeder changed the title ~~Improve scheduling algorithm to improve throughput~~ Improve scheduling algorithm for better throughput Oct 2, 2024
