Describe the bug
When multiple schedulers run concurrently, a job can be starved indefinitely:
Job A is submitted to scheduler A and its tasks are scheduled on all available task slots.
Job B is submitted to scheduler B, but no task slots are available for scheduling.
All task updates for Job A go back to scheduler A, which cannot schedule any tasks for Job B because that job is owned by scheduler B.
Because no task updates ever reach scheduler B, it never makes another scheduling attempt, and Job B is never scheduled anywhere.
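The sequence above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the project's actual code: the `Scheduler` class, the event names, and the shared `slots` dictionary are all hypothetical. The one property it tries to capture is that a scheduler attempts scheduling only when an event arrives on its own queue.

```python
from collections import deque


class Scheduler:
    """Hypothetical scheduler that only reacts to events on its own queue."""

    def __init__(self, name, slots):
        self.name = name
        self.slots = slots      # shared executor capacity: {"free": int}
        self.events = deque()
        self.pending = 0        # tasks waiting for a slot
        self.running = 0        # tasks currently holding a slot

    def submit(self, num_tasks):
        self.pending += num_tasks
        self.events.append("job_submitted")

    def step(self):
        """Process one event; return False when the queue is empty."""
        if not self.events:
            return False
        event = self.events.popleft()
        if event == "task_finished":
            self.running -= 1
            self.slots["free"] += 1
        # Every event triggers a scheduling attempt for this scheduler's own jobs.
        while self.pending and self.slots["free"]:
            self.pending -= 1
            self.running += 1
            self.slots["free"] -= 1
        return True


def simulate():
    slots = {"free": 4}
    a = Scheduler("A", slots)
    b = Scheduler("B", slots)
    a.submit(4)
    a.step()                    # Job A takes every slot
    b.submit(2)
    b.step()                    # Job B finds no free slot and stays pending
    # Task updates are delivered to the scheduler that owns the job: A.
    for _ in range(4):
        a.events.append("task_finished")
    while a.step() or b.step():
        pass
    return slots["free"], b.pending, b.running


print(simulate())
```

In this model all four slots end up free, yet both of Job B's tasks remain pending, because nothing ever wakes scheduler B again.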
To Reproduce
Start a cluster with two schedulers
Submit a job to scheduler 1 that consumes all available executor slots
Before any task of job 1 completes, submit a job to scheduler 2
Job 2 will never run
Expected behavior
Job 2 should start running whenever executor task slots become available
Additional context
The fix here is simple: in the event loop, if a job is submitted and no task slots are available, resubmit the job to the event loop (with a small delay to prevent excessive CPU consumption).
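A minimal sketch of that idea follows. It is not the scheduler's real event loop: `EventLoop`, the event tuples, and `RETRY_DELAY_MS` are hypothetical names, and a simulated clock in integer milliseconds stands in for real timers so the example runs deterministically.

```python
import heapq
import itertools

RETRY_DELAY_MS = 100  # hypothetical back-off between scheduling attempts


class EventLoop:
    """Hypothetical event loop driven by a simulated clock (milliseconds)."""

    def __init__(self):
        self.now = 0
        self._queue = []
        self._seq = itertools.count()  # tie-breaker keeps ordering stable

    def post(self, event, delay=0):
        heapq.heappush(self._queue, (self.now + delay, next(self._seq), event))

    def run(self, handler):
        while self._queue:
            self.now, _, event = heapq.heappop(self._queue)
            handler(self, event)


def run_scenario():
    slots = {"free": 0}  # another scheduler's job currently holds every slot
    log = []

    def handler(loop, event):
        kind, payload = event
        if kind == "slots_freed":
            slots["free"] += payload
        elif kind == "job_submitted":
            if slots["free"] == 0:
                # No capacity: put the same event back on the loop after a
                # short delay instead of dropping the scheduling attempt.
                loop.post(event, delay=RETRY_DELAY_MS)
                log.append(("retry", payload))
            else:
                slots["free"] -= 1
                log.append(("scheduled", payload))

    loop = EventLoop()
    loop.post(("job_submitted", "job-B"))
    loop.post(("slots_freed", 2), delay=250)  # slots free up at t = 250 ms
    loop.run(handler)
    return log


print(run_scenario())
```

Here the submission is retried at 0, 100, and 200 ms and succeeds at 300 ms, once the slots have been released. The delay bounds how often an idle scheduler polls for capacity, which is the CPU concern mentioned above.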