Multi-scheduler Job Starvation #585

thinkharderdev · 2023-01-05T16:50:03Z

Describe the bug
When multiple schedulers are running concurrently, a job can be starved indefinitely through the following sequence:

Job A gets submitted to scheduler A and is scheduled on all available task slots.
Job B gets submitted to scheduler B and there are no available task slots for scheduling.
All task updates from Job A go back to scheduler A, which cannot schedule any tasks for Job B (because that job is owned by scheduler B).
Because no task updates land on scheduler B, nothing triggers it to try scheduling again, so Job B will never be scheduled anywhere.
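
The sequence above can be reproduced in a toy model. This is only a sketch, not Ballista's scheduler code: the `Cluster` and `Scheduler` classes and all of their methods are hypothetical, and the model assumes a scheduler attempts to schedule only when it handles an event for a job it owns.

```python
class Cluster:
    """Shared pool of executor task slots (hypothetical toy model)."""

    def __init__(self, slots):
        self.free_slots = slots


class Scheduler:
    """Schedules only its own jobs, and only in reaction to an event."""

    def __init__(self, name, cluster):
        self.name = name
        self.cluster = cluster
        self.pending = 0    # tasks of owned jobs still waiting for a slot
        self.running = 0    # tasks of owned jobs currently holding a slot
        self.completed = 0  # tasks of owned jobs that have finished

    def _try_schedule(self):
        # Assumption: this is the only place where slots are claimed.
        n = min(self.pending, self.cluster.free_slots)
        self.pending -= n
        self.running += n
        self.cluster.free_slots -= n

    def submit_job(self, num_tasks):
        self.pending += num_tasks
        self._try_schedule()

    def task_finished(self):
        # A task update is delivered only to the scheduler that owns the job.
        self.running -= 1
        self.completed += 1
        self.cluster.free_slots += 1
        self._try_schedule()


def simulate():
    cluster = Cluster(slots=4)
    a = Scheduler("A", cluster)
    b = Scheduler("B", cluster)
    a.submit_job(4)  # Job A takes every slot
    b.submit_job(2)  # Job B finds no free slot
    while a.running > 0:
        a.task_finished()  # updates go to scheduler A only
    return cluster.free_slots, a.completed, b.pending, b.running


free_slots, a_completed, b_pending, b_running = simulate()
print(free_slots, a_completed, b_pending, b_running)
```

In this model all four slots end up free again, yet Job B's two tasks remain pending because scheduler B never receives another event after the failed submission.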

To Reproduce
Steps to reproduce the behavior:

Start a cluster with two schedulers
Submit a job to scheduler 1 that consumes all available executor slots
Before any task in job 1 completes, submit a job to scheduler 2
Job 2 will never run

Expected behavior
Job 2 should start running whenever executor task slots become available

Additional context
The fix here is simple. In the event loop, if a job is submitted and there are no task slots available, resubmit the job to the event loop (with a small delay to prevent excessive CPU consumption).
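
A minimal sketch of that idea, using a simulated clock so the behavior is deterministic. This is not the implementation from the linked PR: `EventLoop`, `make_handler`, the event tuples, and the 100 ms `RETRY_DELAY_MS` are all assumptions made for illustration.

```python
import heapq
import itertools

RETRY_DELAY_MS = 100  # hypothetical delay; the issue only says "small"


class EventLoop:
    """Tiny timer-ordered event loop over simulated milliseconds."""

    def __init__(self):
        self._queue = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order
        self.now = 0

    def post(self, event, delay=0):
        heapq.heappush(self._queue, (self.now + delay, next(self._seq), event))

    def run(self, handler):
        while self._queue:
            self.now, _, event = heapq.heappop(self._queue)
            handler(self, event)


def make_handler(state):
    def handle(loop, event):
        kind = event[0]
        if kind == "slots_freed":
            # Stands in for another scheduler's tasks finishing.
            state["free_slots"] += event[1]
        elif kind == "job_submitted":
            _, job_id, tasks = event
            if state["free_slots"] == 0:
                # No capacity: put the same event back with a delay
                # instead of dropping it or spinning.
                state["resubmissions"] += 1
                loop.post(event, delay=RETRY_DELAY_MS)
            else:
                n = min(tasks, state["free_slots"])
                state["free_slots"] -= n
                state["started"][job_id] = (loop.now, n)
    return handle


def simulate():
    state = {"free_slots": 0, "resubmissions": 0, "started": {}}
    loop = EventLoop()
    loop.post(("job_submitted", "job-b", 2))  # arrives while the cluster is full
    loop.post(("slots_freed", 4), delay=350)  # slots open up later
    loop.run(make_handler(state))
    return state


result = simulate()
print(result)
```

The delay bounds the cost of waiting: in this run the job is retried once per 100 ms of simulated time and is scheduled on the first retry after slots are freed, rather than never.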

thinkharderdev added the bug label Jan 5, 2023

thinkharderdev mentioned this issue Jan 5, 2023

Handle job resubmission #586

Merged

thinkharderdev closed this as completed in #586 Feb 4, 2023
