Add documentation for fair sharing (#2322)
* Add documentation for fair sharing

Change-Id: I3ec6245556c567d9630624f3ddd3d39e1e9fdd9d

* Review comments

Change-Id: Ifee0fcd664f437f6b056fc6082cba11ea098a573
alculquicondor authored May 31, 2024
1 parent 0584b86 commit 69b407c
Showing 5 changed files with 217 additions and 57 deletions.
20 changes: 14 additions & 6 deletions site/content/en/docs/concepts/_index.md
@@ -50,22 +50,30 @@ A mechanism allowing internal or external components to influence the timing of

### Quota Reservation

Sometimes referred to as _workload scheduling_ or _job scheduling_
(not to be confused with [pod scheduling](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/)).
Is the process during which the kueue scheduler locks the resources needed by a workload within the targeted [ClusterQueues ResourceGroups](/docs/concepts/cluster_queue/#resource-groups)
_Quota reservation_ is the process through which the Kueue scheduler locks the resources needed by a workload within the targeted
[ClusterQueues ResourceGroups](/docs/concepts/cluster_queue/#resource-groups).

Quota reservation is sometimes referred to as _workload scheduling_ or _job scheduling_,
but it should not be confused with [pod scheduling](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/).

### Admission

The process of admitting a Workload to start (Pods to be created). A Workload
_Admission_ is the process of allowing a Workload to start (Pods to be created). A Workload
is admitted when it has a Quota Reservation and all its [AdmissionCheckStates](/docs/concepts/admission_check)
are `Ready`.

### [Cohort](/docs/concepts/cluster_queue#cohort)

A group of ClusterQueues that can borrow unused quota from each other.
A _cohort_ is a group of ClusterQueues that can borrow unused quota from each other.

### Queueing

The time between a Workload is created until it is admitted by a ClusterQueue.
_Queueing_ is the state of a Workload from the time it is created until Kueue admits it on a ClusterQueue.
Typically, the Workload will compete with other Workloads for available
quota based on the fair sharing rules of the ClusterQueue.

### [Preemption](/docs/concepts/preemption)

_Preemption_ is the process of evicting one or more admitted Workloads to accommodate another Workload.
The Workload being evicted might be of a lower priority or might be borrowing
resources that are now required by the owning ClusterQueue.
57 changes: 9 additions & 48 deletions site/content/en/docs/concepts/cluster_queue.md
@@ -56,6 +56,13 @@ For each resource, you can define quotas for multiple _flavors_.
Flavors represent different variations of a resource (for example, different GPU
models). You can define a flavor using a [ResourceFlavor object](/docs/concepts/resource_flavor).
When defining quotas for a ClusterQueue, you can set the following values:
- `nominalQuota` is the quantity of this resource that is available for a ClusterQueue at a specific time.
- `borrowingLimit` is the maximum amount of quota that this ClusterQueue is allowed to borrow from the unused
nominal quota of other ClusterQueues in the same [cohort](#cohort).
- `lendingLimit` is the maximum amount of quota that this ClusterQueue allows other
ClusterQueues in the cohort to borrow when this ClusterQueue is not using its nominal quota.
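
As a rough illustration of how the three values above interact, the following Python sketch computes an upper bound on the usage of a ClusterQueue and the amount of quota it lends to its cohort. The `Quota` class and its method names are hypothetical and not part of the Kueue API; treating an unset limit (`None`) as "no limit" and working with a single resource are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Quota:
    """Hypothetical model of one resource/flavor quota in a ClusterQueue."""
    nominal: int
    borrowing_limit: Optional[int] = None  # assumption: None means no limit
    lending_limit: Optional[int] = None    # assumption: None means no limit

    def max_usage(self, cohort_lendable: int) -> int:
        """Upper bound on usage: nominal quota plus what may be borrowed."""
        borrowable = cohort_lendable
        if self.borrowing_limit is not None:
            borrowable = min(borrowable, self.borrowing_limit)
        return self.nominal + borrowable

    def lendable(self, usage: int) -> int:
        """Unused nominal quota offered to the cohort, capped by lending_limit."""
        unused = max(self.nominal - usage, 0)
        if self.lending_limit is not None:
            unused = min(unused, self.lending_limit)
        return unused


team_a = Quota(nominal=10, borrowing_limit=4, lending_limit=3)
print(team_a.max_usage(cohort_lendable=6))  # 14
print(team_a.lendable(usage=2))             # 3
```

In this example the cohort offers 6 units, but `borrowingLimit` caps borrowing at 4; the queue has 8 unused units, but `lendingLimit` caps lending at 3.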

In a process called [admission](/docs/concepts#admission), Kueue assigns to the
[Workload pod sets](/docs/concepts/workload#pod-sets) a flavor for each resource the pod set
requests.
@@ -453,54 +460,8 @@ The fields above do the following:
Note that an incoming Workload can preempt Workloads both within the
ClusterQueue and the cohort.

Kueue implements heuristics to preempt as few Workloads as possible.
Below we present a more detailed description of the algorithm.

### Preemption Algorithm overview

An incoming Workload, which does not fit within the unused quota, is eligible
to issue preemptions when one of the following
is true:
- the requests of the Workload are below the flavor's nominal quota, or
- `borrowWithinCohort` is enabled.

#### Candidates

The list of preemption candidates is compiled from Workloads within the Cluster
Queue satisfying the `withinClusterQueue` policy, and Workloads within the
cohort which satisfy the `reclaimWithinCohort` policy.

The list of candidates is sorted based on the following preference checks for
tie-breaking:
- Workloads from borrowing queues in the cohort,
- Workloads with the lowest priority,
- Workloads which got admitted the most recently.

#### Targets

The algorithm qualifies the candidates as preemption targets using the heuristics
below:

1. If all candidates belong to the target queue, then Kueue greedily
qualifies candidates until the incoming Workload can fit, allowing the usage of
the ClusterQueue to be above the nominal quota, up to the `borrowingLimit`.
This is referred as "borrowing" in the points below.

2. If `borrowWithinCohort` is enabled, then Kueue greedily qualifies
candidates (respecting the `borrowWithinCohort.maxPriorityThreshold` threshold),
until the incoming Workload can fit, allowing for borrowing.

3. If the current usage of the target queue is below nominal quota, then
Kueue greedily qualifies the candidates, until the incoming workload can fit,
disallowing for borrowing.

4. Kueue tries to greedily qualifies a subset of candidates which belong to the
target Cluster Queue, until the incoming Workload can fit, allowing for borrowing.

The last step of the algorithm is to optimize the set of targets. For this
purpose Kueue greedily traverses the list of initial targets in reverse and
removes them from the list of targets if the incoming Workload still can be
admitted when they are accounted back for quota usage.
Read [Preemption](/docs/concepts/preemption) to learn more about
the heuristics that Kueue implements to preempt as few Workloads as possible.

## FlavorFungibility

2 changes: 1 addition & 1 deletion site/content/en/docs/concepts/multikueue.md
@@ -1,7 +1,7 @@
---
title: "MultiKueue"
date: 2024-02-26
weight: 7
weight: 8
description: >
  Kueue multi cluster job dispatching.
---
192 changes: 192 additions & 0 deletions site/content/en/docs/concepts/preemption.md
@@ -0,0 +1,192 @@
---
title: "Preemption"
date: 2024-05-28
weight: 7
description: >
  Preemption is the process of evicting one or more admitted Workloads to accommodate another Workload.
---

In a preemption, the following terms are relevant:
- **Preemptees**: The preempted Workloads.
- **Target ClusterQueues**: The ClusterQueues to which the preemptees belong.
- **Preemptor**: The Workload being accommodated.
- **Preempting ClusterQueue**: The ClusterQueue to which the preemptor belongs.

## Reasons for preemption

A Workload can preempt one or more Workloads if it is admitted in a [ClusterQueue with preemption enabled](/docs/concepts/cluster_queue/#preemption)
and any of the following events happen:
- The preemptee belongs to the same [ClusterQueue](/docs/concepts/cluster_queue) as the preemptor and the preemptee has a lower priority.
- The preemptee belongs to the same [cohort](/docs/concepts/cluster_queue#cohort) as the preemptor and the preemptee's ClusterQueue has a usage above
the [nominal quota](/docs/concepts/cluster_queue#resources) for at least one resource that the preemptee and preemptor require.
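
The two conditions above can be read as a simple predicate. The Python sketch below is only an illustration: the `Workload` class and `may_preempt` function are hypothetical, the second condition is assumed to apply to preemptees in a different ClusterQueue of the same cohort, and the preemption policies configured in the ClusterQueue or the Kueue Configuration are deliberately ignored.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical, simplified view of a Workload for this example."""
    name: str
    cluster_queue: str
    cohort: str
    priority: int


def may_preempt(preemptor, preemptee, queue_over_nominal):
    """Return True if one of the two preemption reasons applies.

    queue_over_nominal: whether the preemptee's ClusterQueue uses more than
    its nominal quota for a resource that both Workloads require.
    """
    same_queue = preemptor.cluster_queue == preemptee.cluster_queue
    if same_queue:
        # Reason 1: same ClusterQueue and the preemptee has a lower priority.
        return preemptee.priority < preemptor.priority
    # Reason 2 (assumed to cover other queues in the cohort): the preemptee's
    # ClusterQueue is using more than its nominal quota.
    same_cohort = preemptor.cohort == preemptee.cohort
    return same_cohort and queue_over_nominal


a = Workload("a", "team-a", "research", priority=10)
b = Workload("b", "team-a", "research", priority=1)
c = Workload("c", "team-b", "research", priority=50)
print(may_preempt(a, b, queue_over_nominal=False))  # True
print(may_preempt(a, c, queue_over_nominal=True))   # True
print(may_preempt(a, c, queue_over_nominal=False))  # False
```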

The configured settings for preemption in the [Kueue Configuration](/docs/reference/kueue-config.v1beta1#FairSharing)
and in the [ClusterQueue](/docs/concepts/cluster_queue#preemption) can limit whether a Workload can preempt others, in addition
to the criteria above.

When preempting a Workload, Kueue adds entries in the `.status.conditions` field of the preempted Workload
that are similar to the following:

```yaml
status:
  conditions:
  - lastTransitionTime: "2024-05-31T18:42:33Z"
    message: 'Preempted to accommodate a workload (UID: 5515f7da-d2ea-4851-9e9c-6b8b3333734d)
      in the ClusterQueue'
    observedGeneration: 1
    reason: Preempted
    status: "True"
    type: Evicted
  - lastTransitionTime: "2024-05-31T18:42:33Z"
    message: 'Preempted to accommodate a workload (UID: 5515f7da-d2ea-4851-9e9c-6b8b3333734d)
      in the ClusterQueue'
    reason: InClusterQueue
    status: "True"
    type: Preempted
```
The `Evicted` condition indicates that the Workload was evicted with a reason `Preempted`,
whereas the `Preempted` condition gives more details about the preemption reason.
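
To illustrate how the two conditions complement each other, the following Python sketch inspects a Workload status that has already been parsed into a dictionary (for example, from YAML or JSON). The function name `preemption_details` is hypothetical, and the sample messages are shortened; only the condition types, statuses, and reasons shown above are relied upon.

```python
def preemption_details(workload_status):
    """Return (was_preempted, reason, message) from a parsed Workload status."""
    conditions = {c["type"]: c for c in workload_status.get("conditions", [])}
    evicted = conditions.get("Evicted", {})
    preempted = conditions.get("Preempted", {})
    # The Evicted condition says whether the eviction was due to preemption.
    was_preempted = (
        evicted.get("status") == "True" and evicted.get("reason") == "Preempted"
    )
    if not was_preempted:
        return (False, None, None)
    # The Preempted condition carries the more specific reason.
    return (True, preempted.get("reason"), preempted.get("message"))


status = {
    "conditions": [
        {"type": "Evicted", "status": "True", "reason": "Preempted",
         "message": "Preempted to accommodate a workload"},
        {"type": "Preempted", "status": "True", "reason": "InClusterQueue",
         "message": "Preempted to accommodate a workload"},
    ]
}
print(preemption_details(status))
```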

## Preemption algorithms

Kueue offers two preemption algorithms. The main difference between them is the criteria to allow
preemptions from a ClusterQueue to others in the Cohort, when the usage of the preempting ClusterQueue is
already above the nominal quota. The algorithms are:

- **[Classic Preemption](#classic-preemption)**: Preemption in the cohort only happens when the usage of the preempting ClusterQueue
will be under the nominal quota after the ongoing admission process, or when all the candidates for preemption belong to
the same ClusterQueue as the preempting Workload. In other words, ClusterQueues
can only borrow quota from others in the cohort if they do not preempt admitted Workloads from
other ClusterQueues. ClusterQueues in a cohort borrow resources in a first-come first-served fashion.
This algorithm is the most lightweight of the two.
- **[Fair sharing](#fair-sharing)**: ClusterQueues with pending Workloads can preempt other Workloads in their cohort
until the preempting ClusterQueue obtains an equal or weighted share of the borrowable resources.
The borrowable resources are the unused nominal quota of all the ClusterQueues in the cohort.

## Classic Preemption

An incoming Workload, which does not fit within the unused quota, is eligible
to issue preemptions when one of the following is true:
- the requests of the Workload are below the flavor's nominal quota, or
- `borrowWithinCohort` is enabled.

### Candidates

The list of preemption candidates is compiled from Workloads within the Cluster
Queue satisfying the `withinClusterQueue` policy, and Workloads within the
cohort which satisfy the `reclaimWithinCohort` policy.

The list of candidates is sorted based on the following preference checks for
tie-breaking:
- Workloads from borrowing queues in the cohort
- Workloads with the lowest priority
- Workloads which got admitted the most recently.
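
The ordering above can be sketched as a sort with a composite key. This Python example assumes that the three preferences are applied lexicographically (each one breaks ties left by the previous one); the `Candidate` class and its fields are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical, simplified preemption candidate."""
    name: str
    queue_is_borrowing: bool  # its ClusterQueue is above nominal quota
    priority: int
    admitted_at: int          # admission time, seconds since epoch


def sort_candidates(candidates):
    """Order candidates so that the most preferred preemptee comes first."""
    return sorted(
        candidates,
        key=lambda c: (
            not c.queue_is_borrowing,  # borrowing queues first
            c.priority,                # then lowest priority
            -c.admitted_at,            # then most recently admitted
        ),
    )


candidates = [
    Candidate("a", queue_is_borrowing=False, priority=0, admitted_at=100),
    Candidate("b", queue_is_borrowing=True, priority=5, admitted_at=100),
    Candidate("c", queue_is_borrowing=True, priority=5, admitted_at=200),
    Candidate("d", queue_is_borrowing=True, priority=1, admitted_at=50),
]
print([c.name for c in sort_candidates(candidates)])  # ['d', 'c', 'b', 'a']
```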

### Targets

The Classic Preemption algorithm qualifies the candidates as preemption targets using the heuristics
below:

1. If all candidates belong to the target queue, then Kueue greedily
qualifies candidates until the preemptor Workload can fit, allowing the usage of
the ClusterQueue to be above the nominal quota, up to the `borrowingLimit`.
This is referred to as "borrowing" in the points below.

2. If `borrowWithinCohort` is enabled, then Kueue greedily qualifies
candidates (respecting the `borrowWithinCohort.maxPriorityThreshold` threshold),
until the preemptor Workload can fit, allowing for borrowing.

3. If the current usage of the target queue is below nominal quota, then
Kueue greedily qualifies the candidates, until the preemptor Workload can fit,
disallowing for borrowing.

4. If the Workload didn't fit by using the previous heuristics, Kueue greedily
qualifies only the candidates which belong to the preempting Cluster Queue,
until the preemptor Workload can fit, allowing for borrowing.

The last step of the algorithm is to minimize the set of targets. For this
purpose, Kueue greedily traverses the list of initial targets in reverse and
removes a Workload from the list of targets if the preemptor Workload still can be
admitted when accounting back the quota usage of the target Workload.
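
The minimization step can be sketched as follows. This is a simplified Python illustration that assumes a single resource and represents each target as a `(name, usage)` pair; the function name `minimize_targets` and its parameters are hypothetical.

```python
def minimize_targets(targets, needed, free):
    """Drop targets that the preemptor does not actually need.

    targets: list of (name, usage) pairs in the order they were selected.
    needed: quota the preemptor Workload requires.
    free: quota available before any preemption.
    """
    kept = list(targets)
    reclaimed = sum(usage for _, usage in kept)
    # Traverse in reverse selection order and give quota back where possible.
    for target in reversed(targets):
        _, usage = target
        if free + reclaimed - usage >= needed:
            kept.remove(target)
            reclaimed -= usage
    return kept


targets = [("a", 1), ("b", 2), ("c", 4)]
print(minimize_targets(targets, needed=5, free=0))  # [('a', 1), ('c', 4)]
```

Here Workload `b` is removed from the targets because preempting `a` and `c` alone already frees the 5 units that the preemptor needs.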

## Fair Sharing

{{% alert title="Note" color="primary" %}}
Available in Kueue v0.7.0 and newer
{{% /alert %}}

To enable fair sharing, [use a Kueue Configuration](/docs/installation#install-a-custom-configured-release-version) similar to the following:

```yaml
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
fairSharing:
  enable: true
  preemptionStrategies: [LessThanOrEqualToFinalShare, LessThanInitialShare]
```

The attributes in this Kueue Configuration are described in the following sections.

### ClusterQueue share value

When you enable fair sharing, Kueue assigns a numeric share value to each ClusterQueue to summarize
the usage of borrowed resources in a ClusterQueue, in comparison to others in the same cohort.
The share value is weighted by the `.spec.fairSharing.weight` defined in a ClusterQueue.

During admission, Kueue prefers to admit Workloads from ClusterQueues that have the lowest share value first.
During preemption, Kueue prefers to preempt Workloads from ClusterQueues that have the highest share value first.

You can obtain the share value of a ClusterQueue from the `.status.fairSharing.weightedShare` field or by querying
the [`kueue_cluster_queue_weighted_share` metric](/docs/reference/metrics#optional-metrics).
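
As a small illustration of the two preferences described above, the following Python sketch orders ClusterQueues by share value. The share values are taken as given (for example, as read from `.status.fairSharing.weightedShare`); the function names, the sample numbers, and the tie-breaking by name are assumptions made for this example.

```python
def admission_order(shares):
    """ClusterQueue names, lowest share value first (ties broken by name)."""
    return sorted(shares, key=lambda name: (shares[name], name))


def preemption_order(shares):
    """ClusterQueue names, highest share value first (ties broken by name)."""
    return sorted(shares, key=lambda name: (-shares[name], name))


shares = {"team-a": 0, "team-b": 250, "team-c": 100}
print(admission_order(shares))   # ['team-a', 'team-c', 'team-b']
print(preemption_order(shares))  # ['team-b', 'team-c', 'team-a']
```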

### Preemption strategies

The `preemptionStrategies` field in the Kueue Configuration indicates which constraints a
preemption should satisfy, with regard to the share values of the target and preempting ClusterQueues,
before and after preempting a particular Workload.

Different `preemptionStrategies` can lead to fewer or more preemptions under specific scenarios.
These are the factors you should consider when configuring `preemptionStrategies`:
- Tolerance to disruptions, in particular when single Workloads use a significant amount of the borrowable resources.
- Speed of convergence, in other words, how important it is to reach a steady fair state as soon as possible.
- Overall utilization, because certain strategies might reduce the utilization of the cluster in the pursuit of
fairness.

When you define multiple `preemptionStrategies`, the preemption algorithm will only use the next
strategy in the list if there aren't any more Workloads that are candidates for preemption that
satisfy the current strategy and the preemptor still doesn't fit.

The values you can put in the `preemptionStrategies` list are:
- `LessThanOrEqualToFinalShare`: Only preempt a Workload if the share of the preempting ClusterQueue
with the preemptor Workload is less than or equal to the share of the target ClusterQueue
without the preempted Workload.
This strategy might favor preemption of smaller workloads in the target ClusterQueue,
regardless of priority or start time, in an effort to keep the share of the ClusterQueue
as high as possible.
- `LessThanInitialShare`: Only preempt a Workload if the share of the preempting ClusterQueue
with the preemptor Workload is strictly less than the share of the target ClusterQueue.
Note that this strategy doesn't depend on the share usage of the Workload being preempted.
As a result, the strategy chooses to first preempt workloads with the lowest priority and
newest start time within the target ClusterQueue.

The default strategy is `[LessThanOrEqualToFinalShare, LessThanInitialShare]`.
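
The two strategies can be expressed as comparisons between share values. The Python sketch below is a minimal illustration with hypothetical function and parameter names; it takes the share values as inputs and does not show how Kueue computes them.

```python
def less_than_or_equal_to_final_share(preemptor_share_with_w, target_share_without_u):
    """Compare against the target's share after removing the preempted Workload."""
    return preemptor_share_with_w <= target_share_without_u


def less_than_initial_share(preemptor_share_with_w, target_share_now):
    """Compare against the target's current share, before any removal."""
    return preemptor_share_with_w < target_share_now


# Hypothetical numbers: the preempting ClusterQueue would have a share of 300
# with the preemptor; the target ClusterQueue has a share of 400 now and 250
# without the candidate Workload.
print(less_than_or_equal_to_final_share(300, 250))  # False
print(less_than_initial_share(300, 400))            # True
```

With these numbers the candidate does not qualify under `LessThanOrEqualToFinalShare`, but it does qualify under `LessThanInitialShare`, so it would only become a target once the algorithm moves on to the second strategy in the list.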

### Algorithm overview

The initial step of the algorithm is to identify the [Workloads that are candidates for preemption](#candidates),
with the same criteria and ordering as the classic preemption, and grouped by ClusterQueue.

Next, the above candidates are qualified as preemption targets,
following an algorithm that can be summarized as follows:

```
FindFairPreemptionTargets(X ClusterQueue, W Workload)
  For each preemption strategy:
    While W does not fit and there are workloads that are preemption candidates:
      Find the ClusterQueue Y with the highest share value.
      For each admitted Workload U in ClusterQueue Y:
        If Workload U satisfies the preemption strategy:
          Add workload U to the list of targets
  In the reverse order of the list of targets:
    Attempt to remove a Workload from the targets, while W still fits.
```
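
The pseudocode above can be turned into a small runnable sketch. The Python version below is one possible interpretation under strong simplifying assumptions, not Kueue's implementation: there is a single resource, every ClusterQueue has weight 1, the share value is modeled as the usage above nominal quota, candidates come only from other ClusterQueues, one Workload is selected per iteration, and a ClusterQueue with no qualifying Workload is skipped for the current strategy. All names are hypothetical.

```python
def find_fair_preemption_targets(queues, x, w_request, strategies):
    """Return [(queue, workload)] to preempt so that W fits, or [] if impossible.

    queues: {name: {"nominal": int, "workloads": [(name, usage), ...]}},
            with workloads already in candidate order.
    x: name of the preempting ClusterQueue; w_request: quota that W needs.
    """
    capacity = sum(q["nominal"] for q in queues.values())
    usage = {n: sum(u for _, u in q["workloads"]) for n, q in queues.items()}
    remaining = {n: list(q["workloads"]) for n, q in queues.items() if n != x}
    targets = []

    def share(name, delta=0):
        # Assumed share model: usage above nominal quota (weight 1).
        return max(usage[name] + delta - queues[name]["nominal"], 0)

    def fits():
        return sum(usage.values()) + w_request <= capacity

    def satisfies(strategy, y, u_usage):
        x_share = share(x, w_request)
        if strategy == "LessThanOrEqualToFinalShare":
            return x_share <= share(y, -u_usage)
        return x_share < share(y)  # LessThanInitialShare

    for strategy in strategies:
        open_queues = {n for n, ws in remaining.items() if ws}
        while not fits() and open_queues:
            y = max(sorted(open_queues), key=share)  # highest share value
            chosen = next(
                (wl for wl in remaining[y] if satisfies(strategy, y, wl[1])), None
            )
            if chosen is None:
                open_queues.discard(y)  # nothing in Y qualifies for this strategy
                continue
            remaining[y].remove(chosen)
            usage[y] -= chosen[1]
            targets.append((y, chosen))
            if not remaining[y]:
                open_queues.discard(y)

    if not fits():
        return []  # W cannot be accommodated, so nothing is preempted

    # Minimize: in reverse order, restore every target that W does not need.
    for y, wl in reversed(list(targets)):
        usage[y] += wl[1]
        if fits():
            targets.remove((y, wl))
        else:
            usage[y] -= wl[1]
    return [(y, wl[0]) for y, wl in targets]


queues = {
    "team-a": {"nominal": 2, "workloads": [("a1", 2)]},
    "team-b": {"nominal": 4, "workloads": [("b1", 2), ("b2", 2), ("b3", 3)]},
    "team-c": {"nominal": 4, "workloads": []},
}
both = ["LessThanOrEqualToFinalShare", "LessThanInitialShare"]
print(find_fair_preemption_targets(queues, "team-a", 2, both))  # [('team-b', 'b1')]
```

In this example, `team-a` would have a share of 2 with the new Workload, while `team-b` has a share of 3 that drops to 1 or 0 after any single preemption. No Workload qualifies under `LessThanOrEqualToFinalShare`, so the sketch falls back to `LessThanInitialShare` and selects `b1`.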
3 changes: 1 addition & 2 deletions site/content/en/docs/installation/_index.md
@@ -108,8 +108,7 @@ To install a custom-configured released version of Kueue in your cluster, execut
2. With an editor of your preference, open `manifests.yaml`.
3. In the `kueue-manager-config` ConfigMap manifest, edit the
`controller_manager_config.yaml` data entry. The entry represents
the default Kueue Configuration
struct ([v1beta1@main](https://pkg.go.dev/sigs.k8s.io/kueue@main/apis/config/v1beta1#Configuration)).
the default [KueueConfiguration](/docs/reference/kueue-config.v1beta1).
The contents of the ConfigMap are similar to the following:

```yaml