Use information about last checkpoint on preemption #477

alculquicondor · 2022-12-14T16:11:14Z

This is known as cooperative preemption

If a workload does checkpointing, we can assume it is able to communicate its latest checkpoint via a status condition. We can take that into account when selecting victims and prioritize the ones that checkpointed recently.

We can update the existing design doc for preemption to include this.

Originally posted by @ahg-g in #83 (comment)
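To make the victim-selection idea above concrete, here is a minimal sketch. It is not Kueue's implementation: the `Workload` type, its fields, and the ordering rule (lowest priority first, then least work lost since the last reported checkpoint) are assumptions made for illustration only. A workload that never reported a checkpoint is treated as if it last checkpointed at its start time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Workload:
    # Hypothetical, simplified stand-in for an admitted workload.
    name: str
    priority: int
    start_time: datetime
    # Time of the last checkpoint reported by the workload, if any.
    last_checkpoint: Optional[datetime] = None


def lost_work(w: Workload, now: datetime) -> timedelta:
    """Work that would be thrown away if `w` were preempted at `now`."""
    reference = w.last_checkpoint or w.start_time
    return now - reference


def order_victims(candidates: List[Workload], now: datetime) -> List[Workload]:
    """Order preemption candidates: lowest priority first, and within the
    same priority, the workload that would lose the least work first."""
    return sorted(candidates, key=lambda w: (w.priority, lost_work(w, now)))
```

With this ordering, a workload that checkpointed ten minutes ago is preempted before an equal-priority workload whose last checkpoint is two hours old.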

mwielgus · 2023-01-03T15:11:46Z

That may create a bad incentive for low-priority jobs not to publish their checkpoints, because a job that publishes them will have a higher chance of being preempted than those that don't.

alculquicondor · 2023-01-03T16:29:59Z

We could devise a policy to provide incentive to setting the checkpoint.

For example: the assumed checkpoint of a workload that doesn't define one is equal to the maximum of its startTime and the median of the startTime of the workloads that define one.
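The policy in the example above can be sketched as follows. This is only an illustration of the rule as stated in the comment, not an existing API: the function name, its parameters, and the fallback to the workload's own start time when no workload reports a checkpoint are assumptions. Timestamps are expected to be timezone-aware.

```python
from datetime import datetime, timezone
from statistics import median
from typing import Optional, Sequence


def assumed_checkpoint(
    start_time: datetime,
    last_checkpoint: Optional[datetime],
    reporting_start_times: Sequence[datetime],
) -> datetime:
    """Return the checkpoint time to use when ranking a workload.

    `reporting_start_times` holds the start times of the workloads that
    do define a checkpoint (hypothetical input, gathered by the caller).
    """
    if last_checkpoint is not None:
        # The workload reported a checkpoint, so use it as is.
        return last_checkpoint
    if not reporting_start_times:
        # Assumption: with nothing to compare against, fall back to startTime.
        return start_time
    # Take the median over POSIX timestamps so that an even number of
    # entries (which requires averaging the two middle values) also works.
    median_ts = median(t.timestamp() for t in reporting_start_times)
    median_start = datetime.fromtimestamp(median_ts, tz=timezone.utc)
    return max(start_time, median_start)
```

Under this rule, a workload that never publishes a checkpoint is not treated as if all of its work since startup were at risk, which weakens the incentive to stay silent.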

alculquicondor · 2023-01-03T16:33:18Z

Although that might give an incentive to publish one checkpoint and never do it again. But any system where there is cooperative preemption has the same issue. I suppose it is called cooperative for a reason :)

ahg-g · 2023-01-03T18:47:47Z

Right, cooperative preemption by design assumes that the jobs play nicely. This is not uncommon in environments where researchers share a cluster and use common libraries in their jobs that have builtin support for checkpointing.

mwielgus · 2023-01-03T23:55:58Z

I'm wondering how much cooperativeness should be assumed in the system. In the extreme, exaggerated case we wouldn't need any quotas and queues if everyone tried to play nicely.
People are nice up to a point when they learn that their goodwill is being exploited to their disadvantage. And here, publishing the status works against them, unless there is some other benefit that can balance the chances of being preempted first.

ahg-g · 2023-01-04T06:08:01Z

> In the extreme, exaggerated case we wouldn't need any quotas and queues if everyone tried to play nicely.

You will still need quotas and queues to automate "playing nice".

> People are nice up to a point when they learn that their goodwill is being exploited to their disadvantage. And here, publishing the status works against them, unless there is some other benefit that can balance the chances of being preempted first.

Users have a strong incentive to checkpoint if their jobs run for a long time.

As for setting the status, a common setup is that users deploy their workloads through SDKs. Those SDKs are generally controlled by the batch admin / platform team and probably use common libraries for checkpointing that will force setting this value.

Having said that, I think we want to distinguish between having the status and the incentives for setting it. The latter can be improved as a follow-up if needed, based on user feedback.

k8s-triage-robot · 2023-04-04T16:41:45Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

kerthcet · 2023-04-05T09:15:32Z

/remove-lifecycle stale

alculquicondor · 2023-04-05T12:21:52Z

/lifecycle frozen

alculquicondor mentioned this issue Dec 22, 2022

Add roadmap to kueue #438

Merged

alculquicondor mentioned this issue Jan 26, 2023

Add support for max runtime for workloads #405

Closed

3 tasks

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 4, 2023

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 5, 2023

k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Apr 5, 2023
