diff --git a/site/content/en/docs/concepts/_index.md b/site/content/en/docs/concepts/_index.md
index fc6108e2ad..bc99090153 100644
--- a/site/content/en/docs/concepts/_index.md
+++ b/site/content/en/docs/concepts/_index.md
@@ -50,22 +50,30 @@ A mechanism allowing internal or external components to influence the timing of
 
 ### Quota Reservation
 
-Sometimes referred to as _workload scheduling_ or _job scheduling_
-(not to be confused with [pod scheduling](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/)).
-Is the process during which the kueue scheduler locks the resources needed by a workload within the targeted [ClusterQueues ResourceGroups](/docs/concepts/cluster_queue/#resource-groups)
+_Quota reservation_ is the process through which the kueue scheduler locks the resources needed by a workload within the targeted
+[ClusterQueues ResourceGroups](/docs/concepts/cluster_queue/#resource-groups).
+
+Quota reservation is sometimes referred to as _workload scheduling_ or _job scheduling_,
+but it should not be confused with [pod scheduling](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/).
 
 ### Admission
 
-The process of admitting a Workload to start (Pods to be created). A Workload
+_Admission_ is the process of allowing a Workload to start (Pods to be created). A Workload
 is admitted when it has a Quota Reservation and all its [AdmissionCheckStates](/docs/concepts/admission_check)
 are `Ready`.
 
 ### [Cohort](/docs/concepts/cluster_queue#cohort)
 
-A group of ClusterQueues that can borrow unused quota from each other.
+A _cohort_ is a group of ClusterQueues that can borrow unused quota from each other.
 
 ### Queueing
 
-The time between a Workload is created until it is admitted by a ClusterQueue.
+_Queueing_ is the state of a Workload from the time it is created until Kueue admits it on a ClusterQueue.
 Typically, the Workload will compete with other Workloads for available
 quota based on the fair sharing rules of the ClusterQueue.
+
+### [Preemption](/docs/concepts/preemption)
+
+_Preemption_ is the process of evicting one or more admitted Workloads to accommodate another Workload.
+The Workload being evicted might be of a lower priority or might be borrowing
+resources that are now required by the owning ClusterQueue.
diff --git a/site/content/en/docs/concepts/cluster_queue.md b/site/content/en/docs/concepts/cluster_queue.md
index 334057d79a..8fe4728251 100644
--- a/site/content/en/docs/concepts/cluster_queue.md
+++ b/site/content/en/docs/concepts/cluster_queue.md
@@ -56,6 +56,13 @@ For each resource, you can define quotas for multiple _flavors_.
 Flavors represent different variations of a resource (for example, different GPU
 models). You can define a flavor using a [ResourceFlavor object](/docs/concepts/resource_flavor).
 
+When defining quotas for a ClusterQueue, you can set the following values:
+- `nominalQuota` is the quantity of this resource that is available for a ClusterQueue at a specific time.
+- `borrowingLimit` is the maximum amount of quota that this ClusterQueue is allowed to borrow from the unused
+  nominal quota of other ClusterQueues in the same [cohort](#cohort).
+- `lendingLimit` is the maximum amount of quota that this ClusterQueue allows other
+  ClusterQueues in the cohort to borrow when this ClusterQueue is not using its nominal quota.
+
 In a process called [admission](/docs/concepts#admission), Kueue assigns to the
 [Workload pod sets](/docs/concepts/workload#pod-sets) a flavor for each resource
 the pod set requests.
@@ -453,54 +460,8 @@ The fields above do the following:
 Note that an incoming Workload can preempt Workloads both within the
 ClusterQueue and the cohort.
 
-Kueue implements heuristics to preempt as few Workloads as possible.
-Below we present a more detailed description of the algorithm.
-
-### Preemption Algorithm overview
-
-An incoming Workload, which does not fit within the unused quota, is eligible
-to issue preemptions when one of the following
-is true:
-- the requests of the Workload are below the flavor's nominal quota, or
-- `borrowWithinCohort` is enabled.
-
-#### Candidates
-
-The list of preemption candidates is compiled from Workloads within the Cluster
-Queue satisfying the `withinClusterQueue` policy, and Workloads within the
-cohort which satisfy the `reclaimWithinCohort` policy.
-
-The list of candidates is sorted based on the following preference checks for
-tie-breaking:
-- Workloads from borrowing queues in the cohort,
-- Workloads with the lowest priority,
-- Workloads which got admitted the most recently.
-
-#### Targets
-
-The algorithm qualifies the candidates as preemption targets using the heuristics
-below:
-
-1. If all candidates belong to the target queue, then Kueue greedily
-qualifies candidates until the incoming Workload can fit, allowing the usage of
-the ClusterQueue to be above the nominal quota, up to the `borrowingLimit`.
-This is referred as "borrowing" in the points below.
-
-2. If `borrowWithinCohort` is enabled, then Kueue greedily qualifies
-candidates (respecting the `borrowWithinCohort.maxPriorityThreshold` threshold),
-until the incoming Workload can fit, allowing for borrowing.
-
-3. If the current usage of the target queue is below nominal quota, then
-Kueue greedily qualifies the candidates, until the incoming workload can fit,
-disallowing for borrowing.
-
-4. Kueue tries to greedily qualifies a subset of candidates which belong to the
-target Cluster Queue, until the incoming Workload can fit, allowing for borrowing.
-
-The last step of the algorithm is to optimize the set of targets. For this
-purpose Kueue greedily traverses the list of initial targets in reverse and
-removes them from the list of targets if the incoming Workload still can be
-admitted when they are accounted back for quota usage.
+Read [Preemption](/docs/concepts/preemption) to learn more about
+the heuristics that Kueue implements to preempt as few Workloads as possible.
 
 ## FlavorFungibility
 
diff --git a/site/content/en/docs/concepts/multikueue.md b/site/content/en/docs/concepts/multikueue.md
index d45a51f3b0..6d4cc50539 100644
--- a/site/content/en/docs/concepts/multikueue.md
+++ b/site/content/en/docs/concepts/multikueue.md
@@ -1,7 +1,7 @@
 ---
 title: "MultiKueue"
 date: 2024-02-26
-weight: 7
+weight: 8
 description: >
   Kueue multi cluster job dispatching.
 ---
diff --git a/site/content/en/docs/concepts/preemption.md b/site/content/en/docs/concepts/preemption.md
new file mode 100644
index 0000000000..609c77282e
--- /dev/null
+++ b/site/content/en/docs/concepts/preemption.md
@@ -0,0 +1,192 @@
+---
+title: "Preemption"
+date: 2024-05-28
+weight: 7
+description: >
+  Preemption is the process of evicting one or more admitted Workloads to accommodate another Workload.
+---
+
+In a preemption, the following terms are relevant:
+- **Preemptees**: The preempted Workloads.
+- **Target ClusterQueues**: The ClusterQueues to which the preemptees belong.
+- **Preemptor**: The Workload being accommodated.
+- **Preempting ClusterQueue**: The ClusterQueue to which the preemptor belongs.
+
+## Reasons for preemption
+
+A Workload can preempt one or more Workloads if it is admitted in a [ClusterQueue with preemption enabled](/docs/concepts/cluster_queue/#preemption)
+and any of the following events happen:
+- The preemptee belongs to the same [ClusterQueue](/docs/concepts/cluster_queue) as the preemptor and the preemptee has a lower priority.
+- The preemptee belongs to the same [cohort](/docs/concepts/cluster_queue#cohort) as the preemptor and the preemptee's ClusterQueue has a usage above
+  the [nominal quota](/docs/concepts/cluster_queue#resources) for at least one resource that the preemptee and preemptor require.
+
+The configured settings for preemption in the [Kueue Configuration](/docs/reference/kueue-config.v1beta1#FairSharing)
+and in the [ClusterQueue](/docs/concepts/cluster_queue#preemption) can limit whether a Workload can preempt others, in addition
+to the criteria above.
+
+When preempting a Workload, Kueue adds entries in the `.status.conditions` field of the preempted Workload
+that are similar to the following:
+
+```yaml
+status:
+  conditions:
+  - lastTransitionTime: "2024-05-31T18:42:33Z"
+    message: 'Preempted to accommodate a workload (UID: 5515f7da-d2ea-4851-9e9c-6b8b3333734d)
+      in the ClusterQueue'
+    observedGeneration: 1
+    reason: Preempted
+    status: "True"
+    type: Evicted
+  - lastTransitionTime: "2024-05-31T18:42:33Z"
+    message: 'Preempted to accommodate a workload (UID: 5515f7da-d2ea-4851-9e9c-6b8b3333734d)
+      in the ClusterQueue'
+    reason: InClusterQueue
+    status: "True"
+    type: Preempted
+```
+
+The `Evicted` condition indicates that the Workload was evicted with a reason `Preempted`,
+whereas the `Preempted` condition gives more details about the preemption reason.
+
+## Preemption algorithms
+
+Kueue offers two preemption algorithms. The main difference between them is the criteria to allow
+preemptions from a ClusterQueue to others in the Cohort, when the usage of the preempting ClusterQueue is
+already above the nominal quota. The algorithms are:
+
+- **[Classic Preemption](#classic-preemption)**: Preemption in the cohort only happens when the usage of the preempting ClusterQueue
+  will be under the nominal quota after the ongoing admission process, or when all the candidates for preemption belong to
+  the same ClusterQueue as the preempting Workload. In other words, ClusterQueues
+  can only borrow quota from others in the cohort if they do not preempt admitted Workloads from
+  other ClusterQueues. ClusterQueues in a cohort borrow resources in a first-come first-served fashion.
+  This algorithm is the most lightweight of the two.
+- **[Fair sharing](#fair-sharing)**: ClusterQueues with pending Workloads can preempt other Workloads in their cohort
+  until the preempting ClusterQueue obtains an equal or weighted share of the borrowable resources.
+  The borrowable resources are the unused nominal quota of all the ClusterQueues in the cohort.
+
+## Classic Preemption
+
+An incoming Workload, which does not fit within the unused quota, is eligible
+to issue preemptions when one of the following is true:
+- the requests of the Workload are below the flavor's nominal quota, or
+- `borrowWithinCohort` is enabled.
+
+### Candidates
+
+The list of preemption candidates is compiled from Workloads within the Cluster
+Queue satisfying the `withinClusterQueue` policy, and Workloads within the
+cohort which satisfy the `reclaimWithinCohort` policy.
+
+The list of candidates is sorted based on the following preference checks for
+tie-breaking:
+- Workloads from borrowing queues in the cohort
+- Workloads with the lowest priority
+- Workloads which got admitted the most recently
+
+### Targets
+
+The Classic Preemption algorithm qualifies the candidates as preemption targets using the heuristics
+below:
+
+1. If all candidates belong to the target queue, then Kueue greedily
+qualifies candidates until the preemptor Workload can fit, allowing the usage of
+the ClusterQueue to be above the nominal quota, up to the `borrowingLimit`.
+This is referred to as "borrowing" in the points below.
+
+2. If `borrowWithinCohort` is enabled, then Kueue greedily qualifies
+candidates (respecting the `borrowWithinCohort.maxPriorityThreshold` threshold),
+until the preemptor Workload can fit, allowing for borrowing.
+
+3. If the current usage of the target queue is below nominal quota, then
+Kueue greedily qualifies the candidates, until the preemptor Workload can fit,
+disallowing for borrowing.
+
+4. If the Workload didn't fit by using the previous heuristics, Kueue greedily
+qualifies only the candidates which belong to the preempting Cluster Queue,
+until the preemptor Workload can fit, allowing for borrowing.
+
+The last step of the algorithm is to minimize the set of targets. For this
+purpose, Kueue greedily traverses the list of initial targets in reverse and
+removes a Workload from the list of targets if the preemptor Workload still can be
+admitted when accounting back the quota usage of the target Workload.
+
+## Fair Sharing
+
+{{% alert title="Note" color="primary" %}}
+Available in Kueue v0.7.0 and newer
+{{% /alert %}}
+
+To enable fair sharing, [use a Kueue Configuration](/docs/installation#install-a-custom-configured-release-version) similar to the following:
+
+```yaml
+apiVersion: config.kueue.x-k8s.io/v1beta1
+kind: Configuration
+fairSharing:
+  enable: true
+  preemptionStrategies: [LessThanOrEqualToFinalShare, LessThanInitialShare]
+```
+
+The attributes in this Kueue Configuration are described in the following sections.
+
+### ClusterQueue share value
+
+When you enable fair sharing, Kueue assigns a numeric share value to each ClusterQueue to summarize
+the usage of borrowed resources in a ClusterQueue, in comparison to others in the same cohort.
+The share value is weighted by the `.spec.fairSharing.weight` defined in a ClusterQueue.
+
+During admission, Kueue prefers to admit Workloads from ClusterQueues that have the lowest share value first.
+During preemption, Kueue prefers to preempt Workloads from ClusterQueues that have the highest share value first.
+
+You can obtain the share value of a ClusterQueue in the `.status.fairSharing.weightedShare` field or by querying
+the [`kueue_cluster_queue_weighted_share` metric](/docs/reference/metrics#optional-metrics).
+
+### Preemption strategies
+
+The `preemptionStrategies` field in the Kueue Configuration indicates which constraints a
+preemption should satisfy, with regard to the share values of the target and preempting ClusterQueues,
+before and after preempting a particular Workload.
+
+Different `preemptionStrategies` can lead to fewer or more preemptions under specific scenarios.
+These are the factors you should consider when configuring `preemptionStrategies`:
+- Tolerance to disruptions, in particular when single Workloads use a significant amount of the borrowable resources.
+- Speed of convergence, in other words, how important it is to reach a steady fair state as soon as possible.
+- Overall utilization, because certain strategies might reduce the utilization of the cluster in the pursuit of
+  fairness.
+
+When you define multiple `preemptionStrategies`, the preemption algorithm will only use the next
+strategy in the list if there aren't any more Workloads that are candidates for preemption that
+satisfy the current strategy and the preemptor still doesn't fit.
+
+The values you can put in the `preemptionStrategies` list are:
+- `LessThanOrEqualToFinalShare`: Only preempt a Workload if the share of the preempting ClusterQueue
+  with the preemptor Workload is less than or equal to the share of the target ClusterQueue
+  without the preempted Workload.
+  This strategy might favor preemption of smaller workloads in the target ClusterQueue,
+  regardless of priority or start time, in an effort to keep the share of the ClusterQueue
+  as high as possible.
+- `LessThanInitialShare`: Only preempt a Workload if the share of the preempting ClusterQueue
+  with the preemptor Workload is strictly less than the share of the target ClusterQueue.
+  Note that this strategy doesn't depend on the share usage of the Workload being preempted.
+  As a result, the strategy chooses to first preempt workloads with the lowest priority and
+  newest start time within the target ClusterQueue.
+The default strategy is `[LessThanOrEqualToFinalShare, LessThanInitialShare]`.
+
+### Algorithm overview
+
+The initial step of the algorithm is to identify the [Workloads that are candidates for preemption](#candidates),
+with the same criteria and ordering as the classic preemption, and grouped by ClusterQueue.
+
+Next, the above candidates are qualified as preemption targets,
+following an algorithm that can be summarized as follows:
+
+```
+FindFairPreemptionTargets(X ClusterQueue, W Workload)
+  For each preemption strategy:
+    While W does not fit and there are workloads that are preemption candidates:
+      Find the ClusterQueue Y with the highest share value.
+      For each admitted Workload U in ClusterQueue Y:
+        If Workload U satisfies the preemption strategy:
+          Add workload U to the list of targets
+  In the reverse order of the list of targets:
+    Attempt to remove a Workload from the targets, while W still fits.
+```
diff --git a/site/content/en/docs/installation/_index.md b/site/content/en/docs/installation/_index.md
index 0334558fac..a4064c5979 100644
--- a/site/content/en/docs/installation/_index.md
+++ b/site/content/en/docs/installation/_index.md
@@ -108,8 +108,7 @@ To install a custom-configured released version of Kueue in your cluster, execut
 2. With an editor of your preference, open `manifests.yaml`.
 3. In the `kueue-manager-config` ConfigMap manifest, edit the
 `controller_manager_config.yaml` data entry. The entry represents
-the default Kueue Configuration
-struct ([v1beta1@main](https://pkg.go.dev/sigs.k8s.io/kueue@main/apis/config/v1beta1#Configuration)).
+the default [KueueConfiguration](/docs/reference/kueue-config.v1beta1).
 The contents of the ConfigMap are similar to the following:
 
 ```yaml