KEP-1287: Add back container status allocatedResources
tallclair committed Jan 28, 2025
1 parent cc9982c commit 8fc805d
Showing 1 changed file with 23 additions and 53 deletions: keps/sig-node/1287-in-place-update-pod-resources/README.md
@@ -28,11 +28,12 @@
- [Notes](#notes)
- [Lifecycle Nuances](#lifecycle-nuances)
- [Atomic Resizes](#atomic-resizes)
- [Edge-triggered Resizes](#edge-triggered-resizes)
- [Memory Limit Decreases](#memory-limit-decreases)
- [Sidecars](#sidecars)
- [QOS Class](#qos-class)
- [Resource Quota](#resource-quota)
- [Affected Components](#affected-components)
- [Instrumentation](#instrumentation)
- [Static CPU & Memory Policy](#static-cpu--memory-policy)
- [Future Enhancements](#future-enhancements)
- [Mutable QOS Class "Shape"](#mutable-qos-class-shape)
@@ -64,7 +65,7 @@
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Allocated Resources](#allocated-resources-1)
- [Allocated Resource Limits](#allocated-resource-limits)
<!-- /toc -->

## Release Signoff Checklist
@@ -216,8 +217,7 @@ PodStatus is extended to show the resources applied to the Pod and its Containers
* Pod.Status.ContainerStatuses[i].Resources (new field, type
v1.ResourceRequirements) shows the **actual** resources held by the Pod and
its Containers for running containers, and the allocated resources for non-running containers.
* Pod.Status.AllocatedResources (new field) reports the aggregate pod-level allocated resources,
computed from the container-level allocated resources.
* Pod.Status.ContainerStatuses[i].AllocatedResources (new field) reports the allocated resource requests.
* Pod.Status.Resize (new field, type map[string]string) explains what is
happening for a given resource on a given container.

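As an illustration of how the status fields above fit together, a hypothetical (abbreviated) pod status captured mid-resize might look like the following; the container name and quantities are invented for this sketch:

```yaml
status:
  resize:
    cpu: InProgress           # a CPU resize has been admitted but not yet actuated
  containerStatuses:
  - name: app                 # hypothetical container
    allocatedResources:       # requests the kubelet has admitted and checkpointed
      cpu: 1500m
    resources:                # what the runtime is actually enforcing right now
      requests:
        cpu: "1"
      limits:
        cpu: "2"
```

Here `allocatedResources` has moved ahead of `resources.requests`, reflecting an admitted resize that the container runtime has not yet applied.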
@@ -234,43 +234,13 @@ Additionally, a new `Pod.Spec.Containers[i].ResizePolicy[]` field (type

When the Kubelet admits a pod initially or admits a resize, all resource requirements from the spec
are cached and checkpointed locally. When a container is (re)started, these are the requests and
limits used. The allocated resources are only reported in the API at the pod-level, through the
`Pod.Status.AllocatedResources` field.
limits used. Only the allocated requests are reported in the API, through the
`Pod.Status.ContainerStatuses[i].AllocatedResources` field.

```
type PodStatus struct {
// ...
// AllocatedResources is the pod-level allocated resources. Only allocated requests are included.
// +optional
AllocatedResources *PodAllocatedResources `json:"allocatedResources,omitempty"`
}
// PodAllocatedResources is used for reporting pod-level allocated resources.
type PodAllocatedResources struct {
// Requests is the pod-level allocated resource requests, either directly
// from the pod-level resource requirements if specified, or computed from
// the total container allocated requests.
// +optional
Requests v1.ResourceList
}
```

The alpha implementation of In-Place Pod Vertical Scaling included `AllocatedResources` in the
container status, but only included requests. This field will remain in alpha, guarded by the
separate `InPlacePodVerticalScalingAllocatedStatus` feature gate, and is a candidate for future
removal. With the allocated status feature gate enabled, Kubelet will continue to populate the field
with the allocated requests from the checkpoint.

The scheduler uses `max(spec...resources, status.allocatedResources, status...resources)` for fit
The scheduler uses `max(spec...resources, status...allocatedResources, status...resources)` for fit
decisions, but since the actual resources are only relevant and reported for running containers, the
Kubelet sets `status...resources` equal to the allocated resources for non-running containers.

See [`Alternatives: Allocated Resources`](#allocated-resources-1) for alternative APIs considered.

The allocated resources API should be reevaluated prior to GA.

#### Subresource

Resource changes can only be made via the new `/resize` subresource, which accepts Update and Patch
@@ -498,7 +468,7 @@ To compute the Node resources allocated to Pods, pending resizes must be factored in.
The scheduler will use the maximum of:
1. Desired resources, computed from container requests in the pod spec, unless the resize is marked as `Infeasible`
1. Actual resources, computed from the `.status.containerStatuses[i].resources.requests`
1. Allocated resources, reported in `.status.allocatedResources.requests`
1. Allocated resources, reported in `.status.containerStatuses[i].allocatedResources`
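The max-of-three rule above can be sketched as a small helper. This is a simplified model, not the actual scheduler code: resource quantities are represented as plain millivalue integers rather than `resource.Quantity`, and the function name and signature are invented for illustration.

```go
package main

import "fmt"

// requestsFor returns, per resource, the maximum of desired (spec) requests,
// actual requests (status.containerStatuses[i].resources.requests), and
// allocated requests, mirroring the fit rule described above. Desired values
// are ignored when the pending resize is marked Infeasible.
func requestsFor(desired, actual, allocated map[string]int64, infeasible bool) map[string]int64 {
	out := map[string]int64{}
	sources := []map[string]int64{actual, allocated}
	if !infeasible {
		sources = append(sources, desired)
	}
	for _, src := range sources {
		for name, v := range src {
			if v > out[name] {
				out[name] = v
			}
		}
	}
	return out
}

func main() {
	// Mid-resize: spec asks for 2 CPUs, allocated is 1.5, actual is still 1.
	fit := requestsFor(
		map[string]int64{"cpu": 2000}, // desired (spec)
		map[string]int64{"cpu": 1000}, // actual (status resources)
		map[string]int64{"cpu": 1500}, // allocated
		false,
	)
	fmt.Println(fit["cpu"]) // prints 2000: the scheduler reserves the maximum
}
```

Using the maximum keeps the node from being overcommitted regardless of whether the resize is still pending, partially actuated, or being rolled back.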

### Flow Control

@@ -518,7 +488,7 @@ This is intentionally hitting various edge-cases for demonstration.
1. kubelet runs the pod and updates the API
- `spec.containers[0].resources.requests[cpu]` = 1
- `status.resize` = unset
- `status.allocatedResources.requests[cpu]` = 1
- `status.containerStatuses[0].allocatedResources[cpu]` = 1
- `status.containerStatuses[0].resources.requests[cpu]` = 1
- actual CPU shares = 1024

@@ -542,67 +512,67 @@
- apiserver validates the request and accepts the operation
- `spec.containers[0].resources.requests[cpu]` = 2
- `status.resize` = `"InProgress"`
- `status.allocatedResources.requests[cpu]` = 1.5
- `status.containerStatuses[0].allocatedResources[cpu]` = 1.5
- `status.containerStatuses[0].resources.requests[cpu]` = 1
- actual CPU shares = 1024

1. Container runtime applied cpu=1.5
- `spec.containers[0].resources.requests[cpu]` = 2
- `status.resize` = `"InProgress"`
- `status.allocatedResources.requests[cpu]` = 1.5
- `status.containerStatuses[0].allocatedResources[cpu]` = 1.5
- `status.containerStatuses[0].resources.requests[cpu]` = 1
- actual CPU shares = 1536

1. kubelet syncs the pod, and sees resize #2 (cpu = 2)
- kubelet decides this is feasible, but currently insufficient available resources
- `spec.containers[0].resources.requests[cpu]` = 2
- `status.resize[cpu]` = `"Deferred"`
- `status.allocatedResources.requests[cpu]` = 1.5
- `status.containerStatuses[0].allocatedResources[cpu]` = 1.5
- `status.containerStatuses[0].resources.requests[cpu]` = 1.5
- actual CPU shares = 1536

1. Resize #3: cpu = 1.6
- apiserver validates the request and accepts the operation
- `spec.containers[0].resources.requests[cpu]` = 1.6
- `status.resize[cpu]` = `"Deferred"`
- `status.allocatedResources.requests[cpu]` = 1.5
- `status.containerStatuses[0].allocatedResources[cpu]` = 1.5
- `status.containerStatuses[0].resources.requests[cpu]` = 1.5
- actual CPU shares = 1536

1. Kubelet syncs the pod, and sees resize #3 and admits it
- `spec.containers[0].resources.requests[cpu]` = 1.6
- `status.resize[cpu]` = `"InProgress"`
- `status.allocatedResources.requests[cpu]` = 1.6
- `status.containerStatuses[0].allocatedResources[cpu]` = 1.6
- `status.containerStatuses[0].resources.requests[cpu]` = 1.5
- actual CPU shares = 1536

1. Container runtime applied cpu=1.6
- `spec.containers[0].resources.requests[cpu]` = 1.6
- `status.resize[cpu]` = `"InProgress"`
- `status.allocatedResources.requests[cpu]` = 1.6
- `status.containerStatuses[0].allocatedResources[cpu]` = 1.6
- `status.containerStatuses[0].resources.requests[cpu]` = 1.5
- actual CPU shares = 1638

1. Kubelet syncs the pod
- `spec.containers[0].resources.requests[cpu]` = 1.6
- `status.resize[cpu]` = unset
- `status.allocatedResources.requests[cpu]` = 1.6
- `status.containerStatuses[0].allocatedResources[cpu]` = 1.6
- `status.containerStatuses[0].resources.requests[cpu]` = 1.6
- actual CPU shares = 1638

1. Resize #4: cpu = 100
- apiserver validates the request and accepts the operation
- `spec.containers[0].resources.requests[cpu]` = 100
- `status.resize[cpu]` = unset
- `status.allocatedResources.requests[cpu]` = 1.6
- `status.containerStatuses[0].allocatedResources[cpu]` = 1.6
- `status.containerStatuses[0].resources.requests[cpu]` = 1.6
- actual CPU shares = 1638

1. Kubelet syncs the pod, and sees resize #4
- this node does not have 100 CPUs, so kubelet cannot admit it
- `spec.containers[0].resources.requests[cpu]` = 100
- `status.resize[cpu]` = `"Infeasible"`
- `status.allocatedResources.requests[cpu]` = 1.6
- `status.containerStatuses[0].allocatedResources[cpu]` = 1.6
- `status.containerStatuses[0].resources.requests[cpu]` = 1.6
- actual CPU shares = 1638
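The kubelet admission outcomes exercised in the walkthrough above can be sketched as a tiny decision helper. This is a hypothetical simplification, not kubelet code: it considers a single resource in millicores, and it omits that `status.resize` is cleared (unset) once the runtime has applied an admitted resize.

```go
package main

import "fmt"

// resizeStatus mirrors the walkthrough's outcomes: a request exceeding node
// capacity is Infeasible, one exceeding currently free resources is Deferred,
// and an admitted resize is InProgress until actuated.
func resizeStatus(desiredMilli, capacityMilli, freeMilli int64) string {
	switch {
	case desiredMilli > capacityMilli:
		return "Infeasible"
	case desiredMilli > freeMilli:
		return "Deferred"
	default:
		return "InProgress"
	}
}

func main() {
	fmt.Println(resizeStatus(100000, 8000, 2000)) // resize #4: node has no 100 CPUs -> Infeasible
	fmt.Println(resizeStatus(2000, 8000, 1800))   // resize #2: feasible but not currently free -> Deferred
	fmt.Println(resizeStatus(1600, 8000, 1800))   // resize #3: admitted -> InProgress
}
```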

@@ -795,7 +765,7 @@
With InPlacePodVerticalScaling enabled, resource quota needs to consider pending resizes. Similar
to how this is handled by scheduling, resource quota will use the maximum of:
1. Desired resources, computed from container requests in the pod spec, unless the resize is marked as `Infeasible`
1. Actual resources, computed from the `.status.containerStatuses[i].resources.requests`
1. Allocated resources, reported in `.status.allocatedResources.requests`
1. Allocated resources, reported in `.status.containerStatuses[i].allocatedResources`

To properly handle scale-down, resource quota controller now needs to evaluate
pod updates where `.status...resources` changed.
@@ -1107,7 +1077,7 @@ Setup a guaranteed class Pod with two containers (c1 & c2).
#### Backward Compatibility and Negative Tests

1. Verify that Node is allowed to update only a Pod's AllocatedResources field.
1. Verify that only Node account is allowed to udate AllocatedResources field.
1. Verify that only Node account is allowed to update AllocatedResources field.
1. Verify that updating Pod Resources in workload template spec retains current
behavior:
- Updating Pod Resources in Job template is not allowed.
@@ -1478,7 +1448,7 @@ _This section must be completed when targeting beta graduation to a release._
- Improve memory limit downsize handling
- Rename ResizeRestartPolicy `NotRequired` to `PreferNoRestart`,
and update CRI `UpdateContainerResources` contract
- Add pod-level `AllocatedResources`
- Add back `AllocatedResources` field to resolve a scheduler corner case
- Switch to edge-triggered resize actuation

## Drawbacks
@@ -1500,9 +1470,9 @@ information to express the idea and why it was not acceptable.
We considered having scheduler approve the resize. We also considered PodSpec as
the location to checkpoint allocated resources.

### Allocated Resources
### Allocated Resource Limits

If we need allocated resources & limits in the pod status API, the following options have been
If we need allocated limits in the pod status API, the following options have been
considered:

**Option 1: New field "AcceptedResources"**
