Use information about last checkpoint on preemption #477
Comments
That could create a perverse incentive for low-priority jobs not to publish checkpoints, since a job that publishes them would have a higher chance of being preempted than one that doesn't.
We could devise a policy that gives an incentive to set the checkpoint. For example: the assumed checkpoint of a workload that doesn't define one is the maximum of its startTime and the median of the startTime of the workloads that do define one.
Although that might give an incentive to publish one checkpoint and never do it again. But any system with cooperative preemption has the same issue. I suppose it is called cooperative for a reason :)
Right, cooperative preemption by design assumes that the jobs play nicely. This is not uncommon in environments where researchers share a cluster and use common libraries in their jobs that have built-in support for checkpointing.
I'm wondering how much cooperativeness should be assumed in the system. In the extreme, exaggerated case we wouldn't need any quotas and queues if everyone tried to play nicely.
You will still need quotas and queues to automate "playing nice".
Users have a strong incentive to checkpoint if their jobs run for a long time. As for setting the status, a common setup is that users deploy their workloads through SDKs; those SDKs are generally controlled by the batch admin / platform team and probably use common checkpointing libraries that will force setting this value. Having said that, I think we want to distinguish between having the status and the incentives for setting it; the latter can be improved as a follow-up if needed, based on user feedback.
/remove-lifecycle stale
/lifecycle frozen
This is known as cooperative preemption.
If a workload does checkpointing, we can assume it is able to communicate its latest checkpoint via a status condition. We can take that into account when selecting victims and prioritize the ones that checkpointed most recently.
We can update the existing design doc for preemption to include this.
Originally posted by @ahg-g in #83 (comment)