Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MaxWaitingTime in case of job starvation of lower priority #754

Open
3 tasks done
kerthcet opened this issue May 9, 2023 · 31 comments
Open
3 tasks done

Add MaxWaitingTime in case of job starvation of lower priority #754

kerthcet opened this issue May 9, 2023 · 31 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@kerthcet
Copy link
Contributor

kerthcet commented May 9, 2023

We enqueue the jobs via the priorities mainly, which will cause job starvation of lower priorities,we can provide a mechanism to avoid this.

What would you like to be added:

A new field MaxWaitingTime to avoid job starvation.

Why is this needed:

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

@kerthcet kerthcet added the kind/feature Categorizes issue or PR as related to a new feature. label May 9, 2023
@alculquicondor
Copy link
Contributor

what would happen when the workload hits the maxWaitingTime?

@kerthcet
Copy link
Contributor Author

kerthcet commented May 9, 2023

It will be popped out for scheduling whatever the priority.

@alculquicondor
Copy link
Contributor

You mean it will go to the head of the ClusterQueue?

It seems a bit aggressive. Another way this can be modeled is that the priority actually goes up.
Another important question is at which object this is controlled. Workload? ClusterQueue? global?

And how to prevent abuse?

@trasc
Copy link
Contributor

trasc commented May 10, 2023

Only moving the workload to the head of the queue will be pointless when preemption is enabled, it will just be evicted at the next scheduler cycle.

@tenzen-y
Copy link
Member

tenzen-y commented May 10, 2023

It seems a bit aggressive. Another way this can be modeled is that the priority actually goes up.

I like this.

Another important question is at which object this is controlled. Workload? ClusterQueue? global?

Maybe, ClusterQueue?
If Workload controls this, it means batch users can modify this.

@kerthcet
Copy link
Contributor Author

It seems a bit aggressive.

Yes, a bit aggressive but straightforward if some workloads have the deadline to run. But I'm not quite sure about this, I think we can wait for more feedbacks from the community.

Another way this can be modeled is that the priority actually goes up.

But how much to go up each time?

Another important question is at which object this is controlled. Workload? ClusterQueue? global?

localQueue or clusterQueue both make sense to me.

And how to prevent abuse?

I think it's hard to answer, this is configurable but considering localQueue should be created by admin, we can be optimistic about this?

@kerthcet
Copy link
Contributor Author

kerthcet commented May 10, 2023

Only moving the workload to the head of the queue will be pointless when preemption is enabled, it will just be evicted at the next scheduler cycle.

We're not talking about preemption, just in case low priority jobs are starved by continuous high priority jobs.

If resources are insufficient, this is a problem.

@Gekko0114
Copy link
Member

Hi,
It looks interesting, can I work on this issue?
If it is ok, I will clarify how to implement it later.

@kerthcet
Copy link
Contributor Author

kerthcet commented Jun 7, 2023

Yes, you can take it if you like, but we haven't reached an agreement about the implementation, like a maxWaitingTime, or increate the priority per scheduling cycle, or something else. And we should also consider the preemption.

@Gekko0114
Copy link
Member

Gekko0114 commented Jun 7, 2023

Sure, thanks! I will read this thread in detail and clarify it.
If I get an agreement, I will assign it to me.

@alculquicondor
Copy link
Contributor

I would also consider adding an interface somewhere that allows us to implement different policies for ordering/priorities.

If you have a concrete use case, that would help us understand what the API can look like.

@Gekko0114
Copy link
Member

Hi,

I did some research this weekend and summarized my proposal.
Could you take a look?


I am considering adding the following fields to clusterQueue.spec:
labels: Users can change the priority of specific jobs by setting labels.
maxWaitingTime: Set maxWaitingTime for timeout.
strategy: Set the action for handling timeouts in strategy field. You can choose "First" or "Prioritize" options (it might be good for adding "custom" option and preparing interface for customizing logic by developers, but I don't consider it now.)
addingPriority: If the "Prioritize" is chosen, set addingPriority. This value is added to each job's priority when timeout happens.

Use cases:
scenarios like raising the priority only for prod jobs or executing recommendation batches within 3 hours etc

What do you think? As discussed, since it is set for clusterQueue, developers wouldn't abuse this.

spec:
  timeoutStrategy:
      labels:
         environment: prod
         system: recommendation         
      maxWaitingTime: 1hour
      strategy: Prioritize
      addingPriority: 10  

@alculquicondor
Copy link
Contributor

Maybe we should consider adding a separate object WorkloadPriorityClass. Then workloads refer to a priority class that have a base priority and a "dynamicPolicy".

  basePriority: 100
  dynamicPolicy:
    delay: 1h
    action: PriorityDelta
    value: 10

Open questions:

  • what could another action be?
  • how to keep record of the last time the priority was increased? Do we actually update the priority in the object or we just track in memory?

Definitely worth a KEP: https://github.com/kubernetes-sigs/kueue/tree/main/keps

@Gekko0114
Copy link
Member

Hi @alculquicondor,

Thanks for your response.

I have a question.
Why do you think it's better to add a separate object WorkloadPriorityClass rather than adding a new field to clusterQueue? Is it because it would be easier for batch users to manage priority by choosing a PriorityClass for each job?

what could another action be?

I think we should consider implementing another action going to the head of the ClusterQueue if it times out, which was the initial idea @kerthcet mentioned.

how to keep record of the last time the priority was increased? Do we actually update the priority in the object or we just track in memory?

I don't have a strong opinion on this, but I think it's better to update the priority in the object. If so, admins can easily change the priority manually.

Definitely worth a KEP:

I agree. I will create it.

@alculquicondor
Copy link
Contributor

Why do you think it's better to add a separate object WorkloadPriorityClass rather than adding a new field to clusterQueue? Is it because it would be easier for batch users to manage priority by choosing a PriorityClass for each job?

Yes, I don't think it's realistic that all jobs in a ClusterQueue would have the same behavior.

I think we should consider implementing another action going to the head of the ClusterQueue if it times out, which was the initial idea @kerthcet mentioned.

I'm not too sure. Because this workload could be immediately preempted in the next iteration. It's better if it's priority changes.

I don't have a strong opinion on this, but I think it's better to update the priority in the object. If so, admins can easily change the priority manually.

Yes, but also to my point above about not being easily preempted.
Another thing to think about is how do we know when the priority was bumped for the last time. In memory is not enough, because the controller could be restarted.

Can you look around in other queueing systems if they have dynamic priority and how it works?

@Gekko0114
Copy link
Member

Gekko0114 commented Jun 14, 2023

Thanks for your explanation! @alculquicondor

I'm not too sure. Because this workload could be immediately preempted in the next iteration. It's better if it's priority changes.
Yes, but also to my point above about not being easily preempted.

I have one question about preemption.
If we update the priority in the object and a pod starts running, my understanding is that the pod would not be easily preempted since it would have a higher priority. Is that correct?

Another thing to think about is how do we know when the priority was bumped for the last time. In memory is not enough, because the controller could be restarted.

I agree. It's important.

Can you look around in other queueing systems if they have dynamic priority and how it works?

Sure. I will look around and rethink the design based on your explanation.

@alculquicondor
Copy link
Contributor

If we update the priority in the object and a pod starts running, my understanding is that the pod would not be easily preempted since it would have a higher priority. Is that correct?

Exactly. That's probably desired? Which means we should probably add a maxPriority field in the priorityClass.
So we have base, step, delayForStep, and max fields (exact names TBD).

@Gekko0114
Copy link
Member

Sounds good!

I wasn't quite clear on the intention behind step and maxPriority.

So, if I understand correctly, base represents the priority when the job is initially created, delayForStep indicates the time (e.g., 1 hour) to wait before changing the priority after a timeout.
step determines the adding priority to set when a timeout occurs, and maxPriority is the priority used to override and prevent preemption of actively running pods, is that correct?

@alculquicondor
Copy link
Contributor

Almost there. So after 1h, we have p = base + step. What happens after 2h? My thinking is that p = base + step * 2. Then we need a maxPriority, as we don't want it to go to infinity. By default we can probably cap it at MAX_INT32.

@Gekko0114
Copy link
Member

I see. Thanks.

Then I will create KEP (and also investigate on other queue systems).
If there's anything else we should discuss, please leave comments.

/assign

@alculquicondor
Copy link
Contributor

I created a simplified version of this feature request #973

@kerthcet
Copy link
Contributor Author

I think we can leverage the workload priority to mitigate this issue. The general idea would like when waiting time ends, we'll raise the workload priority to a higher value and job will enqueue. Next scheduling cycle, the job will not be preempted as designed.

cc @B1F030

@alculquicondor
Copy link
Contributor

More generally, we will be needing a policy for dynamic priority. Another use case that popped out is to decrease priority based on evictions #1311 (comment)

@kerthcet
Copy link
Contributor Author

kerthcet commented Jan 3, 2024

For this case, maybe we can be more straightforward here like:

  1. Set the maxWaitingTime in localQueue/clusterQueue/Job
  2. When hit the deadline, raise the priority to the top of the queue

Intuitive and easy to control the job queueing.

@B1F030
Copy link
Member

B1F030 commented Jan 3, 2024

Here I drawed a simple design of this mechanism:

image

The prototype represents that there is a basePriority, when this workload hits the maxWaitingTime, we just simply change its priority to MAX_INT32.

And based on the prototype, here I have another design:
First we use the workload's priority as basePriority, and after delayTime(for example, 30m), add its priority by step(100), then after another delayForStep(10m), add step again.
Until the workload hits the maxWaitingTime, its priority will be add up straight to the maxPriority(MAX_INT32).

But the second design may be a little complex, and I think we should not expose the step and delayForStep to user in alpha version. We can define it properly, provide a default policy, that will be easy to use.

What do you think?

@alculquicondor
Copy link
Contributor

Let's take a step back... do you have a real requirement from an end-user that you can share? What is the behavior they expect, regardless of how the API looks like.

To me, the idea of a Job getting the highest priority after some timeout sounds odd. Is the job really the most important of all? Additionally, this implies that the job is potentially never going to be preempted. Is this acceptable? Have you looked into older systems for inspiration?

@kerthcet
Copy link
Contributor Author

kerthcet commented Jan 4, 2024

The goal is to prevent job starvation, which is also part of fair-sharing, low priority job will pending in the queue forever if higher priority jobs enqueueing continuously.

Volcano has similar design https://github.com/volcano-sh/volcano/blob/master/docs/design/sla-plugin.md, however, the mechanism is when hitting the max waiting time, it will admit the queue and try to reserve resources for it until ready.

Some of our users are using Volcano, we hope to provide smooth migrations for them. At the least, I think this demand still sounds reasonable, hope I'm not the only one. But we can think more about the API design.

this implies that the job is potentially never going to be preempted

This is tricky... but we're somehow a preemption based system. Seems no better ways, I mean this design #754 (comment).

@kerthcet
Copy link
Contributor Author

kerthcet commented Jan 4, 2024

But the second design may be a little complex, and I think we should not expose the step and delayForStep to user in alpha version. We can define it properly, provide a default policy, that will be easy to use.

What do you think?

Complex + 1 .., let's discuss the rationality first.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 5, 2024
@kerthcet
Copy link
Contributor Author

kerthcet commented Apr 6, 2024

/lifecycle froze

@kerthcet
Copy link
Contributor Author

kerthcet commented Apr 6, 2024

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.
Projects
None yet
Development

No branches or pull requests

8 participants