Support for workload preemption #83
/lifecycle stale
/lifecycle frozen
I feel this is a very useful feature. On top of priority and reclaiming borrowed resources, this feature is useful for maintaining fairness by assigning some notion of a time slice to each workload. This way, long-running workloads will not block other workloads, especially short-running ones (in a namespace with tight resource quotas). With this approach, we don't need to think about preemption triggers: after the time slice, a workload is put back into the queue, allowing higher-priority jobs to be scheduled and borrowed resources to be reclaimed.
@anirudhjayakumar are you thinking of a max runtime per job such that, when it is reached, instead of declaring the job failed, we place it back into the queue?
Yes, that is correct.
I would think of that as a separate feature from preemption. It's actually much easier to implement, as we don't have to decide which workloads should be suspended. We have all the information in the Workload object (start time and max runtime).
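To make the idea concrete, here is a minimal sketch of the requeue-on-timeout check, assuming a hypothetical max-runtime field on the workload and an admission timestamp on its status (neither is part of the current Kueue API); a real controller would integrate this into Kueue's reconciler instead.

```go
package timeout

import "time"

// workloadTimeout is a hypothetical view of the two pieces of information
// mentioned above: when the workload was admitted and its max runtime.
type workloadTimeout struct {
	AdmittedAt time.Time     // start time, already tracked on the Workload
	MaxRuntime time.Duration // hypothetical per-workload max runtime
}

// shouldRequeue reports whether the workload exceeded its max runtime and
// should be suspended and placed back into the queue instead of being failed.
func shouldRequeue(w workloadTimeout, now time.Time) bool {
	if w.MaxRuntime <= 0 {
		return false // no timeout configured
	}
	return now.Sub(w.AdmittedAt) >= w.MaxRuntime
}
```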
Time-slice round-robin would require us to switch between different workloads, and the cost of switching is really high: when we suspend a job we delete all of its running pods, and when we unsuspend it we create them again, so I doubt the gains are worth it. For preemption, we can consider two dimensions: one is preemption between workloads in the same queue, where we already have a priority for each workload. The other is preemption between ClusterQueues that share resources; currently we can borrow resources from another ClusterQueue when they're in the same cohort, but when the ClusterQueue whose resources were borrowed runs short, we should reclaim the borrowed resources.
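For reference, this is roughly what the suspend/unsuspend cycle looks like against the batch/v1 Job API (the `suspend` field is real Kubernetes API; the client setup and function here are just a sketch): setting the flag deletes the Job's running pods, and clearing it recreates them, which is the switching cost described above.

```go
package suspend

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/utils/ptr"
)

// setSuspended toggles the batch/v1 Job suspend flag. Setting it to true
// causes the job controller to delete the job's running pods; setting it
// back to false recreates them.
func setSuspended(ctx context.Context, c kubernetes.Interface, namespace, name string, suspended bool) error {
	job, err := c.BatchV1().Jobs(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	job.Spec.Suspend = ptr.To(suspended)
	_, err = c.BatchV1().Jobs(namespace).Update(ctx, job, metav1.UpdateOptions{})
	return err
}
```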
@anirudhjayakumar can you open a separate issue for time-based suspension? I don't see any problem with the feature request. The user intent is clear: if a job doesn't finish within X time, suspend and requeue. I guess there could be an option to just terminate.
I think there are two different understandings here around the time slice:
The good thing here is that they are actually the same thing. Option 2 seems better to me.
I'm thinking more along the lines of 2. However, I would have it as a field in the Workload. And if an administrator wants to enforce a timeout for the cluster (or a namespace), they can add an admission webhook. Or maybe we can justify adding the timeout in the LocalQueue or ClusterQueue. Although if we add it in the ClusterQueue, I guess it's closer to option 1, and thus it's a form of preemption.
@anirudhjayakumar maybe you can describe your use case a bit more, in light of the suggestions above.
My specific use case is long-running jobs hogging resources while other jobs (short and long) keep waiting. The problem is mostly around user experience, where users see no progress on their jobs for long periods of time. This also prevents the system from providing any loose guarantees around job execution. Example: a job that creates a daily report only needs 10 minutes of execution time, but it cannot be run in this setup because there is no guarantee that it will complete one run each day. My solution to this problem is to allow each job to get a slice of the resources. For my use case, I feel it is okay for a low_pri job to keep waiting in the presence of high_pri jobs, but I'm not sure if that is acceptable as a general rule. Maybe low_pri jobs could have a smaller time slice compared to high_pri jobs.
I see the appeal of having the timeout be defined by the priority; however, I'm not sure where exactly we would add it. We could add it as an annotation to the PriorityClass object. Alternatives are adding it to the LocalQueue (@ahg-g has toyed with the idea of adding "priority" parameters here) or the ClusterQueue.
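As a rough sketch of the PriorityClass-annotation alternative, assuming a hypothetical annotation key such as `kueue.x-k8s.io/timeout` (not part of any current API), the controller could resolve a workload's timeout from its priority class like this:

```go
package prioritytimeout

import (
	"time"

	schedulingv1 "k8s.io/api/scheduling/v1"
)

// timeoutAnnotation is a hypothetical annotation key used only to illustrate
// the idea; it does not exist in Kueue today.
const timeoutAnnotation = "kueue.x-k8s.io/timeout"

// timeoutFromPriorityClass returns the timeout configured on the workload's
// PriorityClass, or (0, false) if none is set or the value doesn't parse.
func timeoutFromPriorityClass(pc *schedulingv1.PriorityClass) (time.Duration, bool) {
	raw, ok := pc.Annotations[timeoutAnnotation]
	if !ok {
		return 0, false
	}
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return 0, false
	}
	return d, true
}
```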
@anirudhjayakumar the degree to which the approach you propose will work depends on the workload, specifically whether it can restart without losing too much progress. If on each suspend/preemption it needs to start over, and it is a long-running job, then the reward of suspending to give way to other jobs is not obvious.
For my use case, a job needs to suspend (and go back to the queue) only when there are other jobs waiting. But if that is not possible, I don't mind jobs being put back into the queue once the timeout is reached. The assumption here is that the system will soon figure out (within a few seconds) that there are enough resources to run the suspended job, so that it resumes execution within a few seconds/minutes.
A global timeout is nice, but in its absence I would end up setting a timeout on each workload/job that gets scheduled.
I totally overlooked that. For my use case, we do checkpoint every few minutes, so progress is not lost. But I do understand that this may not be true for all batch use cases. Having said that, K8s could move pods between nodes for various reasons; shouldn't we assume that most workloads have some mechanism to store intermediate state to make the job reliable?
Coupling the priorityClass with the timeout doesn't seem like a good idea to me, because different types of workloads may have different timeouts. If we define the timeout in the priorityClass, we have to create multiple priorityClasses, but as we know, that is usually a privilege controlled by the administrator. Defining the timeout in the LocalQueue may be a good idea, since the LocalQueue is tenant-isolated and we usually submit similar jobs to the same queue. The ClusterQueue usually works for resource sharing, so we may submit different types of jobs to the same ClusterQueue for fine-grained resource management; I think it's not a good choice. If we want more flexibility, the Workload could also have a timeout that by default is inherited from the LocalQueue.
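A minimal sketch of the inheritance rule suggested here, assuming hypothetical `Timeout` fields on both the LocalQueue spec and the Workload spec (neither exists in the current Kueue API): the workload value, if set, overrides the queue default.

```go
package queuetimeout

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical additions to the specs discussed above; the existing Kueue
// types do not carry these fields.
type localQueueSpec struct {
	// Timeout is the default max runtime for workloads submitted to this queue.
	Timeout *metav1.Duration
}

type workloadSpec struct {
	// Timeout overrides the LocalQueue default when set.
	Timeout *metav1.Duration
}

// effectiveTimeout resolves the timeout for a workload: the workload's own
// value wins, otherwise it falls back to the LocalQueue default, otherwise nil.
func effectiveTimeout(lq localQueueSpec, wl workloadSpec) *metav1.Duration {
	if wl.Timeout != nil {
		return wl.Timeout
	}
	return lq.Timeout
}
```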
This makes me think of DRF, which is also a common strategy. Maybe we can add this as another choice in #312.
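For readers unfamiliar with DRF (Dominant Resource Fairness), here is a minimal sketch of the core computation, independent of Kueue's types: each tenant's dominant share is its largest allocation fraction across resources, and the scheduler favors the tenant with the smallest dominant share.

```go
package drf

// dominantShare returns the tenant's dominant share under DRF: the maximum,
// over all resources, of allocated/capacity. The tenant with the smallest
// dominant share is served next.
func dominantShare(allocated, capacity map[string]float64) float64 {
	share := 0.0
	for res, total := range capacity {
		if total <= 0 {
			continue
		}
		if s := allocated[res] / total; s > share {
			share = s
		}
	}
	return share
}
```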
I agree that adding the timeout to the LocalQueue seems like a good idea. We should copy the value into the Workload like we do for priority. This seems like a very explicit way of doing preemption (only preempt if certain conditions are true). This is good for avoiding surprises, but we probably need other rules for preemption. Otherwise there could be cases where jobs with higher priority need to run, but no jobs are past the deadline.
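As a sketch of how the explicit deadline rule could combine with a priority-based fallback (all names here are hypothetical, not the current Kueue implementation): pick victims that are past their deadline first, and only then fall back to the lowest-priority running workloads.

```go
package victims

import (
	"sort"
	"time"
)

// candidate is a simplified, hypothetical view of an admitted workload.
type candidate struct {
	Name       string
	Priority   int32
	AdmittedAt time.Time
	Timeout    time.Duration // zero means no deadline configured
}

func pastDeadline(c candidate, now time.Time) bool {
	return c.Timeout > 0 && now.After(c.AdmittedAt.Add(c.Timeout))
}

// orderVictims sorts candidates so that workloads past their deadline come
// first, and within each group lower priority is preempted before higher.
func orderVictims(cands []candidate, now time.Time) []candidate {
	out := append([]candidate(nil), cands...)
	sort.SliceStable(out, func(i, j int) bool {
		di, dj := pastDeadline(out[i], now), pastDeadline(out[j], now)
		if di != dj {
			return di // past-deadline workloads are preferred victims
		}
		return out[i].Priority < out[j].Priority
	})
	return out
}
```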
A few points: why do we want to preempt? How do we want to preempt?
Also, what about workloads/jobs which have dynamic resource allocation?
One thing we should potentially add to the current proposal is cooperative preemption. If the workload does checkpointing, then we can assume it is able to communicate the latest checkpoint via a status condition. We can take that into account when selecting victims and prioritize the ones that checkpointed recently.
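A rough sketch of how that signal could feed into victim selection, assuming a hypothetical `Checkpointed` status condition whose `lastTransitionTime` records the latest checkpoint (no such condition exists today): among otherwise equal candidates, prefer preempting the one that checkpointed most recently, since it loses the least progress.

```go
package checkpoint

import (
	"time"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// checkpointedCondition is a hypothetical condition type used only to
// illustrate the cooperative-preemption idea discussed above.
const checkpointedCondition = "Checkpointed"

// lastCheckpoint returns the time of the most recent checkpoint reported via
// the hypothetical status condition, or false if none was ever reported.
func lastCheckpoint(conds []metav1.Condition) (time.Time, bool) {
	c := meta.FindStatusCondition(conds, checkpointedCondition)
	if c == nil || c.Status != metav1.ConditionTrue {
		return time.Time{}, false
	}
	return c.LastTransitionTime.Time, true
}

// preferAsVictim returns true if a should be preempted before b: a workload
// that checkpointed more recently loses less progress, so it goes first.
func preferAsVictim(a, b []metav1.Condition) bool {
	ta, okA := lastCheckpoint(a)
	tb, okB := lastCheckpoint(b)
	if okA != okB {
		return okA // workloads with a known recent checkpoint are preferred
	}
	return ta.After(tb)
}
```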
I think we can treat that as a separate requirement. Do you mind creating a new issue?
I mean it is not separate from workload preemption; I feel we can amend the KEP to take this into consideration. Or do you prefer we do it as a follow-up?
I can use the same KEP, to save some process. But I'll do so after I finish implementing the design we already have.
Preemption can be useful to reclaim borrowed capacity; however, the obvious tradeoff is interrupting workloads and potentially losing significant progress.
There are two high-level design decisions we need to make, and we need to decide whether they should be tunable: