KEP update: Allow replacement pods in groups of pods (kubernetes-sigs#1338)

* KEP: Simpler algorithm for admitting groups of Pods

Change-Id: I894785325d44ff3cc3287f9717e96add499a2b48

* Clarify that workload is automatically cleaned up

Change-Id: I5b07d1ad71c99a23bdb612f1662df7dc8b991bec

* Last attempt annotation and reclaimable quota

Change-Id: Ie0281d6ebd6a022abc85ae245a22b3282dd23881

* Simplify design

Change-Id: Ia36c32fdf6862881fe0002d3b565c11da5978a54

* fix note about failed pods

Change-Id: If5cf5dc60dfbf1f7afb45beee974e92bffa979b8

* Clarify that some fields will be excluded

Change-Id: I34fbfdac3e056f559167795938c3c2cab1f41977

* Add dynamic reclaim for non retriable groups

Change-Id: I971e0e87fb09ad9fba8a3805a61d2cc267b29bda
alculquicondor authored and kannon92 committed Nov 19, 2024
1 parent 2f7d1fe commit 6ca7e79
Showing 1 changed file with 63 additions and 43 deletions.
106 changes: 63 additions & 43 deletions keps/976-plain-pods/README.md
Expand Up @@ -24,6 +24,8 @@
- [Groups of Pods created beforehand](#groups-of-pods-created-beforehand)
- [Groups of pods where driver generates workers](#groups-of-pods-where-driver-generates-workers)
- [Tracking admitted and finished Pods](#tracking-admitted-and-finished-pods)
- [Retrying Failed Pods](#retrying-failed-pods)
- [Dynamically reclaiming Quota](#dynamically-reclaiming-quota)
- [Metrics](#metrics)
- [Test Plan](#test-plan)
- [Prerequisite testing updates](#prerequisite-testing-updates)
Expand Down Expand Up @@ -66,14 +68,11 @@ use of this API to implement queuing semantics for Pods.
- Support queueing of individual Pods.
- Support queueing of groups of Pods of fixed size, identified by a common label.
- Opt-in or opt-out Pods from specific namespaces from queueing.
- Support for [dynamic reclaiming of quota](https://kueue.sigs.k8s.io/docs/concepts/workload/#dynamic-reclaim)
for succeeded Pods.

### Non-Goals

- Support for [dynamic reclaiming quota](https://github.com/kubernetes-sigs/kueue/issues/78)

This feature is incompatible with supporting Pod replacements without knowing the behavior of a
parent controller for the Pods.

- Support for [partial-admission](https://github.com/kubernetes-sigs/kueue/issues/420).

Since all pods are already created, an implementation of partial admission would imply the
Expand Down Expand Up @@ -372,12 +371,11 @@ matchExpressions:
Once the webhook has marked Pods subject to queuing with the `kueue.x-k8s.io/managed: true` label,
the Pod reconciler can create the corresponding Workload object to feed the Kueue admission logic.

Note that the Workload cannot be owned by the Pod. Otherwise any cascade deletion of the Pod would
delete the Workload object, even before the Pod terminates (if it has a grace period).
This means that the controller needs to manually delete the Workload object once it has the
Finished condition (after we have determined that all pods have finished).
The Workload will be owned by all Pods. Once all the Pods that own the workload are deleted (and
their finalizers are removed), the Workload will be automatically cleaned up.

A possible extension here is to add a TTL, which we can consider based on user feedback.
If individual Pods in the group fail and a replacement Pod comes in, the replacement Pod will be
added as an owner of the Workload as well.
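The ownership scheme above can be sketched with simplified stand-ins for the API objects (the
class and method names are hypothetical, not Kueue code); it mirrors how Kubernetes garbage
collection removes an object once its last owner is gone.

```python
# Simplified sketch: the Workload is cleaned up only once every owning
# Pod is gone, and replacement Pods are appended as additional owners.
class Workload:
    def __init__(self, name):
        self.name = name
        self.owner_uids = set()  # UIDs of Pods that own this Workload

    def add_owner(self, pod_uid):
        self.owner_uids.add(pod_uid)

    def remove_owner(self, pod_uid):
        self.owner_uids.discard(pod_uid)

    def eligible_for_cleanup(self):
        # An object with no remaining owners is garbage-collected.
        return not self.owner_uids

wl = Workload("pod-group-a")
wl.add_owner("pod-1")
wl.add_owner("pod-2")

# pod-2 fails and is replaced; the replacement also becomes an owner.
wl.remove_owner("pod-2")
wl.add_owner("pod-2-replacement")
assert not wl.eligible_for_cleanup()

# Only when all owning Pods are deleted is the Workload cleaned up.
wl.remove_owner("pod-1")
wl.remove_owner("pod-2-replacement")
assert wl.eligible_for_cleanup()
```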

#### Single Pods

Expand Down Expand Up @@ -433,9 +431,12 @@ scheduling. The list of fields to keep are:
- `preemptionPolicy`
- `topologySpreadConstraints`
- `overhead`
- `volumes`
- `resourceClaims`

Note that fields like `env` and `command` can vary among the Pods of a group without influencing
scheduling, so they are safe to skip. `volumes` can influence scheduling, but they can be
parameterized, as in StatefulSets, so we will ignore them for now.

A sha256 of the remaining Pod spec will be used as the name of a Workload podSet. The count for the
podSet will be the number of Pods that match the same sha256. The hash will be calculated by the
webhook and stored as an annotation: `kueue.x-k8s.io/role-hash`.
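The hashing scheme above can be sketched as follows. This is an illustrative sketch only, not the
webhook implementation: the `role_hash` helper and the exact field set are assumptions, and the
real code would canonicalize the Kubernetes PodSpec rather than a plain dict.

```python
import hashlib
import json

# Only the spec fields that influence scheduling take part in the hash,
# so Pods that differ just in fields like `env` or `command` map to the
# same podSet. The field names here are illustrative placeholders.
SCHEDULING_FIELDS = {
    "resources", "nodeSelector", "affinity", "tolerations",
    "runtimeClassName", "priority", "preemptionPolicy",
    "topologySpreadConstraints", "overhead",
}

def role_hash(pod_spec: dict) -> str:
    relevant = {k: v for k, v in pod_spec.items() if k in SCHEDULING_FIELDS}
    # json.dumps with sort_keys gives a canonical serialization to hash.
    payload = json.dumps(relevant, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

driver = {"nodeSelector": {"pool": "cpu"}, "command": ["run-driver"]}
worker = {"nodeSelector": {"pool": "cpu"}, "command": ["run-worker"]}
assert role_hash(driver) == role_hash(worker)  # command is ignored
```

Pods with the same hash would then be counted into a single podSet, and the hash stored in the
`kueue.x-k8s.io/role-hash` annotation.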
Expand Down Expand Up @@ -538,39 +539,54 @@ spec:
Pods need to have finalizers so that we can reliably track how many of them run to completion and be
able to determine when the Workload is Finished.

A Pod-group reconciler would keep track of the groups of Pods and their respective Workload object,
based on the `jobframework.Reconciler`.
After a Workload is admitted, as the Pod-group reconciler ungates pods, it would keep an in-memory
cache of expected ungated pods: the number of ungated pods that are not reflected in the informers
yet, per Pod-group. This number decreases as the event handler observes Pods transition from being
gated to ungated.
The Pod reconciler will run in a "composable" mode: a mode where a Workload is composed of multiple
objects. The `jobframework.Reconciler` will be reworked to accommodate this.

In the Pod event handler, we decrement the counter when we see a transition from having
the scheduling gate `kueue.x-k8s.io/admission` to not having it.
After a Workload is admitted, each Pod that owns the workload enters the reconciliation loop.
The reconciliation loop collects all the Pods that are not Failed and constructs an in-memory
Workload. If there is an existing Workload in the cache with smaller Pod counters than the
in-memory Workload, the stored Workload no longer matches the group and is evicted.
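The counter comparison can be sketched as follows; `build_inmemory_counts` and `must_evict` are
assumed helper names for illustration, not the actual Kueue code.

```python
from collections import Counter

# Each Pod carries a role hash; the in-memory Workload is just the
# multiset of role hashes of all non-Failed Pods in the group.
def build_inmemory_counts(pods):
    return Counter(p["role_hash"] for p in pods if p["phase"] != "Failed")

def must_evict(stored_counts, inmemory_counts):
    # The stored Workload no longer matches if any podSet now needs more
    # Pods than the stored Workload accounts for.
    return any(inmemory_counts[h] > stored_counts.get(h, 0)
               for h in inmemory_counts)

pods = [
    {"role_hash": "driver", "phase": "Running"},
    {"role_hash": "worker", "phase": "Running"},
    {"role_hash": "worker", "phase": "Failed"},  # excluded from counts
]
stored = {"driver": 1, "worker": 2}
assert not must_evict(stored, build_inmemory_counts(pods))

# Extra workers appear beyond what the stored Workload admits:
pods.append({"role_hash": "worker", "phase": "Pending"})
pods.append({"role_hash": "worker", "phase": "Pending"})
assert must_evict(stored, build_inmemory_counts(pods))
```

Excluding Failed Pods from the count is what allows a replacement Pod to slot into the existing
Workload without triggering an eviction.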

In the Pod-group reconciler:
1. If the Pod is not terminated,
create a Workload for the pod group if one does not exist.
2. If the Pod is terminated,
   - If the Workload doesn't exist or the workload is finished, remove the finalizer.
3. ungated_pods_in_client: the number of non-terminated pods in the client that are admitted.
We only look at non-terminated pods to allow for terminated pods to be replaced.
4. ungated_pods = ungated_pods_in_client + expected_ungated_pods. Note that this might temporarily
lead to double counting.
5. For gated pods:
   - If ungated_pods < admission.count, remove the gate, set nodeSelector, and increase
     expected_ungated_pods.
   - Else,
     - If ungated_pods_in_informer < admission.count, we can't admit this Pod now to prevent
       overbooking, but requeue this Pod for retry.
     - Else, remove the finalizer and delete the Pod, as it's beyond the allowed admission.
6. If the number of terminated pods with a finalizer is greater than or equal to the admission
   count, and there are no non-terminated Pods, mark the Workload as Finished and remove the
   finalizers from the Pods.

Note that we are only removing Pod finalizers once the Workload is finished. This is a simple way
of managing finalizers, but it might lead to too many Pods lingering in etcd for a long time after
terminated. In a future version, we can consider a better scheme similar to [Pod tracking in Jobs](https://kubernetes.io/blog/2022/12/29/scalable-job-tracking-ga/).
1. If the Pod is not terminated and doesn't have a deletionTimestamp,
create a Workload for the pod group if one does not exist.
2. Remove Pod finalizers if the Pod is terminated and the Workload doesn't exist, has a deletion
   timestamp, or is finished.
3. Build the in-memory Workload. If its podSet counters are greater than those of the stored
   Workload, evict the Workload.
4. For gated pods:
- remove the gate, set nodeSelector
5. If the number of succeeded pods is equal to the admission count, mark the Workload as Finished
and remove the finalizers from the Pods.
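The steps above can be sketched as a single decision function. This is a minimal illustration
under assumed names (`reconcile` and the dict fields are not the Kueue implementation), and the
podSet-counter eviction of step 3 is omitted for brevity:

```python
# Illustrative sketch of the per-Pod reconciliation decision.
def reconcile(pod, workload, group):
    terminated = pod["phase"] in ("Succeeded", "Failed")
    deleting = pod.get("deletionTimestamp") is not None

    # Step 1: a live Pod ensures the group Workload exists.
    if not terminated and not deleting and workload is None:
        return "create-workload"

    # Step 2: a terminated Pod loses its finalizer once the Workload is
    # gone, being deleted, or finished.
    if terminated and (workload is None or workload.get("deleting")
                       or workload.get("finished")):
        return "remove-finalizer"

    # Step 4: gated Pods of an admitted Workload are ungated (and would
    # get the nodeSelector from the admission).
    if pod.get("gated") and workload and workload.get("admitted"):
        return "ungate"

    # Step 5: the Workload finishes when every admitted slot succeeded.
    succeeded = sum(1 for p in group if p["phase"] == "Succeeded")
    if workload and succeeded == workload.get("admission_count"):
        return "finish-workload"
    return "no-op"

assert reconcile({"phase": "Running"}, None, []) == "create-workload"
assert reconcile({"phase": "Succeeded"}, {"finished": True}, []) == "remove-finalizer"
```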

Note that we are only removing Pod finalizers once the Workload is finished. This is a simple way of
managing finalizers, but it might lead to too many Pods lingering in etcd for a long time after
terminated.

### Retrying Failed Pods

The Pod group will generally only be considered finished if all the Pods finish with a Succeeded
phase.
This allows the user to send replacement Pods when a Pod in the group fails or if the group is
preempted. The replacement Pods can have any name, but they must point to the same pod group.

To declare that a group is failed, a user can execute one of the following actions:
1. Issue a Delete for the Workload object. The controller would terminate all running Pods and
clean up Pod finalizers.
2. Add an annotation to any Pod in the group `kueue.x-k8s.io/retriable-in-group: false`.
The annotation can be added to an existing Pod or added on creation.

Kueue will consider a group finished if there are no running or pending Pods, and at
least one terminated Pod (Failed or Succeeded) has the `retriable-in-group: false` annotation.
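The group-finished rule can be sketched as below; the field names are illustrative assumptions,
and only the annotation key comes from this KEP.

```python
RETRIABLE_ANNOTATION = "kueue.x-k8s.io/retriable-in-group"

# A group finishes either on the normal success path (all Pods
# Succeeded) or when no Pods remain active and some terminated Pod
# opted out of retries via the annotation.
def group_finished(pods):
    if any(p["phase"] in ("Pending", "Running") for p in pods):
        return False
    terminated = [p for p in pods if p["phase"] in ("Succeeded", "Failed")]
    if terminated and all(p["phase"] == "Succeeded" for p in terminated):
        return True
    return any(p.get("annotations", {}).get(RETRIABLE_ANNOTATION) == "false"
               for p in terminated)

# A Failed Pod alone does not finish the group: a replacement may arrive.
assert not group_finished([{"phase": "Failed"}])

# With the opt-out annotation on a terminated Pod, the group is finished.
assert group_finished([
    {"phase": "Failed", "annotations": {RETRIABLE_ANNOTATION: "false"}},
    {"phase": "Succeeded"},
])
```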

### Dynamically reclaiming Quota

Succeeded Pods will not be considered replaceable. In other words, the quota
from Succeeded Pods will be released by filling [reclaimablePods](https://kueue.sigs.k8s.io/docs/concepts/workload/#dynamic-reclaim)
in the Workload status.

The quota from Failed Pods will also be released when there is a pod with
the annotation `kueue.x-k8s.io/retriable-in-group` set to false.
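A sketch of how the reclaimable counts could be derived (illustrative only; see the Kueue
documentation on dynamic reclaim for the real `reclaimablePods` API — the helper name and fields
here are assumptions):

```python
# Quota for Succeeded Pods is always reclaimable; quota for Failed Pods
# becomes reclaimable once the group is marked non-retriable.
def reclaimable_counts(pods, group_retriable=True):
    counts = {}
    for p in pods:
        if p["phase"] == "Succeeded" or (p["phase"] == "Failed"
                                         and not group_retriable):
            counts[p["role_hash"]] = counts.get(p["role_hash"], 0) + 1
    return counts

pods = [{"role_hash": "worker", "phase": "Succeeded"},
        {"role_hash": "worker", "phase": "Failed"},
        {"role_hash": "worker", "phase": "Running"}]
assert reclaimable_counts(pods) == {"worker": 1}
assert reclaimable_counts(pods, group_retriable=False) == {"worker": 2}
```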

### Metrics

Expand Down Expand Up @@ -618,10 +634,12 @@ The integration tests should cover the following scenarios:

- Basic webhook test
- Single Pod queued, admitted and finished.
- Multiple Pods with one shape:
- Multiple Pods created beforehand:
- queued and admitted
- failed pods recreated can use the same quota
- Workload finished when all pods finish (failed or succeeded)
- Group finished when all pods finish successfully
- Group finished when a Pod with `retriable-in-group: false` annotation finishes.
- Group preempted and resumed.
- Driver Pod creates workers:
- queued and admitted.
- worker pods beyond the count are rejected (deleted)
Expand Down Expand Up @@ -653,6 +671,8 @@ Major milestones might include:
- when the KEP was retired or superseded
-->

- Sep 29th: Implemented single Pod support (story 1) [#1103](https://github.com/kubernetes-sigs/kueue/pulls/1103).

## Drawbacks

The proposed labels and annotations for groups of pods can be complex to build manually.
Expand Down
