CA: DRA integration MVP #7530

Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: towca. The full list of commands accepted by this bot can be found here. The pull request process is described here.
@@ -0,0 +1,75 @@

```go
package utils
```
before this line, you have to add the following comment:

```go
/*
Copyright 2020 The Kubernetes Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
```
/wg device-management
I only looked at the PR description so far. I am not sure whether I am qualified to review the code 😰
Agreed, this can come later. https://pkg.go.dev/k8s.io/[email protected]/resourceclaim might be the right package for that. If "resourceclaim" works as a package name, then you could already use it now, and later only the import path would need to change once that new code is upstream.

One thing that I don't see mentioned is node readiness: directly after provisioning a new node, the DRA driver pod(s) have to be started on the node and publish their ResourceSlices before the new node is really ready to support pods using claims. I know that there are checks for device plugins to treat such a node as "not ready" while plugins are starting. Is there something similar for DRA drivers? This might not be needed for MVP.
These utils will be used by various parts of the DRA logic in the following commits.
…state of DRA objects

The Snapshot can hold all DRA objects in the cluster, and expose them to the scheduler framework via the SharedDRAManager interface. The state of the objects can be modified during autoscaling simulations using the provided methods.
The Provider uses DRA object listers to create a Snapshot of the DRA objects.
The logic is very basic and will likely need to be revised, but it's something for initial testing. Utilization of a given Pool is calculated as the number of allocated devices in the pool divided by the number of all devices in the pool. For scale-down purposes, the max utilization of all Node-local Pools is used.

The new logic is mostly behind the DRA flag guard, so this should be a no-op if the flag is disabled. The only difference is that FilterOutUnremovable now marks a Node as unremovable if calculating utilization fails. Not sure why this wasn't the case before, but I think we need it for DRA - if CA sees an incomplete picture of a resource pool, we probably don't want to scale the Node down.
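Expressed as code, the provisional calculation reads roughly like this (a sketch; the function names are illustrative, not the PR's actual helpers):

```go
package drautils

// poolUtilization: allocated devices divided by all devices in the pool.
func poolUtilization(allocatedCount, totalCount int) float64 {
	if totalCount == 0 {
		return 0
	}
	return float64(allocatedCount) / float64(totalCount)
}

// nodeDRAUtilization is the scale-down view described above: the maximum
// utilization across all node-local pools.
func nodeDRAUtilization(poolUtils []float64) float64 {
	highest := 0.0
	for _, u := range poolUtils {
		if u > highest {
			highest = u
		}
	}
	return highest
}
```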
Yeah unfortunately it's quite a lot, IMO we should split reviewing responsibilities here:
/assign @BigDarkClown
I was wondering about that part, it's hard to test without an actual driver (the test driver from k/k e2e tests doesn't run into this). So a new Node can be Ready but not yet have its ResourceSlices exposed, is that right? If so then we indeed need to reimplement the device plugin logic. How can CA know that a Node is supposed to have any slices exposed, though? For device plugins we depend on a specific label on the Node (if the label is on the Node but there is no allocatable resource -> fake the node as not ready).

@jackfrancis could you verify how this behaves in practice during your manual testing? Are we seeing double scale-ups?
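For reference, the device-plugin heuristic mentioned above looks roughly like this (a sketch using a hypothetical label and resource name; CA's actual GPU logic differs in detail):

```go
package readiness

import apiv1 "k8s.io/api/core/v1"

// gpuLabel is a hypothetical label marking nodes that should expose GPUs.
const gpuLabel = "example.com/accelerator"

// nodeReadyForGPUs: a node carrying the GPU label but not yet reporting an
// allocatable GPU resource is faked as "not ready", so CA doesn't treat the
// still-starting node as schedulable and trigger a second scale-up.
func nodeReadyForGPUs(node *apiv1.Node) bool {
	if _, isGPUNode := node.Labels[gpuLabel]; !isGPUNode {
		return true
	}
	allocatable, ok := node.Status.Allocatable[apiv1.ResourceName("example.com/gpu")]
	return ok && !allocatable.IsZero()
}
```

An analogous check for DRA would presumably compare the ResourceSlices a node is expected to publish against the slices actually present, rather than an allocatable resource quantity.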
All added logic is behind the DRA flag guard, this should be a no-op if the flag is disabled.
…s DRA snapshot

The new logic is flag-guarded; it should be a no-op if DRA is disabled.
By default CA is built with DeltaSnapshotStore, which isn't integrated with DRA yet.
This is needed so that the scheduler code correctly includes and executes the DRA plugin. We could just use the feature gate instead of the DRA flag in CA (the feature gates flag is already there, just not really used), but I guess there could be use-cases for having DRA enabled in the cluster but not in CA (e.g. DRA being tested in the cluster, CA only operating on non-DRA nodes/pods).
@pohly @towca in my testing (A100 nodes using the nvidia k8s-dra-driver) I didn't see any undesirable functional side-effects of the node bootstrapping process. It's definitely true that before a pending pod+resourceclaim can be fulfilled there's a lot more waiting than normal:
During that period of time, I observed that cluster-autoscaler sanely (1) notices that the new node is there but that (2) the resourceclaim is not yet able to be fulfilled. I didn't notice any TTL expire like "I'm tired of waiting for this new node, I'm going to try and build a new one". In practice, my experiments produced a ~10 minute wait between node provisioning and resourceslice population. That's a long time, admittedly! Is there a particular configuration we could set to try to reproduce any "double scale out" symptoms in my environment?

For full disclosure, this is the workload I used to test:

```yaml
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test
  name: mig-devices
spec:
  spec:
    devices:
      requests:
      - name: mig-1g-10gb
        deviceClassName: mig.nvidia.com
        selectors:
        - cel:
            expression: "device.attributes['gpu.nvidia.com'].profile == '1g.10gb'"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: gpu-test
  name: gpu-test
  labels:
    app: gpu-test-pod
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
    spec:
      runtimeClassName: nvidia
      resourceClaims:
      - name: mig-device-gpu-test
        resourceClaimTemplateName: mig-devices
      containers:
      - name: mig-device-container
        image: ubuntu:22.04
        command: ["bash", "-c"]
        args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
        resources:
          claims:
          - name: mig-device-gpu-test
            request: mig-1g-10gb
```
Isn't that what needs to be configured for a node pool? Without information about what devices are going to be made available together with a new node, CA cannot scale up. This readiness check could compare that assumption against the actual devices.
I don't remember what the exact problem is, i.e. whether it is "not scaling up when it should" or "scaling down when it shouldn't". The fact that it works in practice suggests that it is not needed for an MVP.
ResourceClaim utils (1st commit) - is the logic correct there?
Logic is correct. I would do a few things differently myself, but that is not a blocker for the PR.
```go
// ClaimOwningPod returns the name and UID of the Pod owner of the provided claim. If the claim isn't
// owned by a Pod, empty strings are returned.
func ClaimOwningPod(claim *resourceapi.ResourceClaim) (string, types.UID) {
```
No need to change it now, but elsewhere I suggested that these helpers might go into the k8s.io/dynamic-resource-allocation/resourceclaim package. This and other functions then can drop the Claim prefix because the call becomes resourceclaim.OwningPod.
```go
// ClaimAllocated returns whether the provided claim is allocated.
func ClaimAllocated(claim *resourceapi.ResourceClaim) bool {
```
Could also be called resourceclaim.IsAllocated. Not important right now.
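The body is presumably just the allocation-status check (sketch):

```go
func ClaimAllocated(claim *resourceapi.ResourceClaim) bool {
	return claim.Status.Allocation != nil
}
```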
```go
// ClaimReservedForPod returns whether the provided claim is currently reserved for the provided pod.
func ClaimReservedForPod(claim *resourceapi.ResourceClaim, pod *apiv1.Pod) bool {
```
Once it is in a staging repo, kubelet should get updated to call this. It must have a similar check.
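The check in question matches the pod by UID against the claim's ReservedFor list, along these lines (sketch):

```go
func ClaimReservedForPod(claim *resourceapi.ResourceClaim, pod *apiv1.Pod) bool {
	for _, consumer := range claim.Status.ReservedFor {
		// ReservedFor entries identify consumers by resource type and UID.
		if consumer.Resource == "pods" && consumer.UID == pod.UID {
			return true
		}
	}
	return false
}
```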
```go
// DeallocateClaimInPlace clears the allocation of the provided claim.
func DeallocateClaimInPlace(claim *resourceapi.ResourceClaim) {
	claim.Status.Allocation = nil
```
This is a bit short for my taste and might not really make the callsite easier to understand. But it's not wrong.
```go
newReservedFor := make([]resourceapi.ResourceClaimConsumerReference, 0, len(claim.Status.ReservedFor))
```
Nit: I would find the right index, then use slices.Delete to create the new slice minus that one entry.
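Sketched out, the suggestion combines slices.IndexFunc and slices.Delete from the standard library (the surrounding helper name is illustrative):

```go
import "slices"

// withoutPodReservation removes the single ReservedFor entry for the given
// pod UID: find the right index, then delete just that entry.
func withoutPodReservation(reservedFor []resourceapi.ResourceClaimConsumerReference, podUID types.UID) []resourceapi.ResourceClaimConsumerReference {
	i := slices.IndexFunc(reservedFor, func(c resourceapi.ResourceClaimConsumerReference) bool {
		return c.UID == podUID
	})
	if i < 0 {
		return reservedFor
	}
	// Clone first: slices.Delete shifts elements within the backing array.
	return slices.Delete(slices.Clone(reservedFor), i, i+1)
}
```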
```go
func TestClaimOwningPod(t *testing.T) {
	truePtr := true
	for _, tc := range []struct {
		testName string
```
Personally I prefer map[string]struct{...} because it ensures at compile time that I do not reuse the same test name twice.

Also, better avoid spaces in test names. They get replaced with underscores by go test, and then the tests listed in the output can't be found easily in the source code. go test -run=... also is more complicated to get right.
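Concretely, the map-keyed table looks like this (a generic sketch; testClaimOwnedByPod is a hypothetical fixture):

```go
func TestClaimOwningPod(t *testing.T) {
	// A duplicated key in a map literal is a compile-time error, so test
	// names can't silently collide; underscores keep them greppable.
	for name, tc := range map[string]struct {
		claim    *resourceapi.ResourceClaim
		wantName string
	}{
		"no_owner":  {claim: &resourceapi.ResourceClaim{}},
		"pod_owner": {claim: testClaimOwnedByPod(), wantName: "owner-pod"},
	} {
		t.Run(name, func(t *testing.T) {
			gotName, _ := ClaimOwningPod(tc.claim)
			if gotName != tc.wantName {
				t.Errorf("ClaimOwningPod() = %q, want %q", gotName, tc.wantName)
			}
		})
	}
}
```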
Left some comments, I am not the expert on DRA so I might have missed something. Overall this LGTM for the CA side, but please get an additional review for DRA handling.
```go
		})
		podInfo.NeededResourceClaims = append(podInfo.NeededResourceClaims, testClaim(fmt.Sprintf("%s-claim-%d", pod.Name, i)))
	}
	for i := range 3 {
```
nit: Any reason why we need two loops?
```go
	for id, claim := range s.resourceClaimsById {
		result.resourceClaimsById[id] = claim.DeepCopy()
	}
	// Only the claims are mutable, we don't have to deep-copy anything else.
```
Please split the comment between before copying resource claims and after, this is weirdly placed right now.
```go
// AddClaims adds additional ResourceClaims to the Snapshot. It can be used e.g. if we need to duplicate a Pod that
// owns ResourceClaims.
```
Please add info to the comment that the method returns an error when the claim already exists in the snapshot - it is very important.
```go
	for _, podClaimRef := range pod.Spec.ResourceClaims {
		claimName := claimRefToName(pod, podClaimRef)
		if claimName == "" {
			continue
```
Why not error? It seems to indicate some issue with snapshot state.
```go
		claimId := ResourceClaimId{Name: claimName, Namespace: pod.Namespace}
		claim, found := s.resourceClaimsById[claimId]
		if !found {
			continue
```
Same comment, why not error?
```go
	// TODO(DRA): Figure out how to calculate Node utilization for DRA resources properly. The current logic is provisional, it should work well for Nodes with a single Pool of
	// expensive Devices but is probably not flexible enough for other scenarios.
	resourceName, highestUtil, err := drautils.HighestDynamicResourceUtilization(nodeInfo)
	if err != nil {
```
Should we error in this case? I am a bit hesitant, as this calculation seems to be experimental. I think a better option would be to fall back to cpu + mem.
```go
			wantUtilInfo: Info{CpuUtil: 0.5, Utilization: 0.5, ResourceName: apiv1.ResourceCPU},
		},
		{
			testName: "DRA slices present, but no claims, DRA enabled -> DRA util returned despite being lower than CPU",
```
I'm not 100% sure this is the correct behavior. I think a better way might be to calculate DRA util alongside cpu and mem, but let's keep that for future improvements.
@@ -101,6 +170,30 @@ func (s *PredicateSnapshot) SchedulePodOnAnyNodeMatching(pod *apiv1.Pod, anyNode

```go
// UnschedulePod removes the given Pod from the given Node inside the snapshot.
func (s *PredicateSnapshot) UnschedulePod(namespace string, podName string, nodeName string) error {
	if s.draEnabled {
```
As DraSnapshot is in ClusterSnapshotStore, shouldn't PredicateSnapshot delegate this logic to ClusterSnapshotStore? I don't think it applies to all methods, but here it stood up the most for me.
@@ -33,6 +33,8 @@ import (

```go
	"github.com/stretchr/testify/assert"
)

// TODO(DRA): Add DRA-specific test cases.
```
Oh please do.
```go
	var snapshotStore clustersnapshot.ClusterSnapshotStore = store.NewDeltaSnapshotStore()
	if autoscalingOptions.DynamicResourceAllocationEnabled {
		// TODO(DRA): Remove this once DeltaSnapshotStore is integrated with DRA.
		snapshotStore = store.NewBasicSnapshotStore()
```
Please add some high level log/warning here, this might not be expected for the end-user.
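For example (assuming CA's usual k8s.io/klog/v2 logging; the wording is illustrative):

```go
klog.Warningf("DynamicResourceAllocation is enabled: using BasicSnapshotStore instead of the faster DeltaSnapshotStore until DRA support is added there")
```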
Reviewed some parts of " CA: sanitize and propagate DRA objects through NodeInfos in node_info utils".
```go
// leave them as-is when duplicating). Returns a map of all pool names (without the suffix) that can be used with SanitizePodResourceClaims().
// Returns an error if any of the slices isn't node-local.
func SanitizeNodeResourceSlices(nodeLocalSlices []*resourceapi.ResourceSlice, newNodeName, nameSuffix string) (newSlices []*resourceapi.ResourceSlice, oldPoolNames map[string]bool, err error) {
	oldPoolNames = map[string]bool{}
```
Perhaps use https://pkg.go.dev/k8s.io/utils/set#Set instead?
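With k8s.io/utils/set the signature and initialization would change roughly like so (a sketch of the suggested change; the sanitizing body is elided):

```go
import "k8s.io/utils/set"

func SanitizeNodeResourceSlices(nodeLocalSlices []*resourceapi.ResourceSlice, newNodeName, nameSuffix string) (newSlices []*resourceapi.ResourceSlice, oldPoolNames set.Set[string], err error) {
	oldPoolNames = set.New[string]()
	for _, slice := range nodeLocalSlices {
		oldPoolNames.Insert(slice.Spec.Pool.Name)
		// ... sanitize the slice copy and append it to newSlices, as before ...
	}
	return newSlices, oldPoolNames, nil
}
```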
```go
		newSlices = append(newSlices, sliceCopy)
	}
	return newSlices, oldPoolNames, nil
}
```
Function looks okay to me. Normally one cannot just append something to a name to create a new name because of length limits, but here this isn't an issue because it's all in-memory.
```go
// SanitizePodResourceClaims can be used to duplicate ResourceClaims needed by a Pod, when duplicating the Pod.
//   - ResourceClaims owned by oldOwner are duplicated and sanitized, to be owned by a duplicate pod - newOwner.
//   - ResourceClaims not owned by oldOwner are returned unchanged in the result. They are shared claims not bound to the
//     lifecycle of the duplicated pod, so they shouldn't be duplicated.
```
What about updating the ReservedFor so that the new pod is listed if (and only if) the old one was? Or is that something that will be covered elsewhere?
@@ -41,6 +42,9 @@ type nodeGroupTemplateNodeInfoGetter interface {

```go
// SanitizedTemplateNodeInfoFromNodeGroup returns a template NodeInfo object based on NodeGroup.TemplateNodeInfo(). The template is sanitized, and only
// contains the pods that should appear on a new Node from the same node group (e.g. DaemonSet pods).
func SanitizedTemplateNodeInfoFromNodeGroup(nodeGroup nodeGroupTemplateNodeInfoGetter, daemonsets []*appsv1.DaemonSet, taintConfig taints.TaintConfig) (*framework.NodeInfo, errors.AutoscalerError) {
	// TODO(DRA): Figure out how to handle TemplateNodeInfo() returning DaemonSet Pods using DRA. Currently, things only work correctly if such pods are
	// already allocated by TemplateNodeInfo(). It might be better for TemplateNodeInfo() to return unallocated claims, and to run scheduler predicates and
	// compute the allocations here.
```
"Sanitizing" ResourceClaims (4th commit). During the simulations, CA duplicates Nodes and their DaemonSet pods. Duplicated Nodes/Pods go through "sanitization" so that they aren't actually identical to the originals. Doing this for DS pods using DRA has some corner cases (described in the comments), WDYT about this? Do we anticipate DS pods using DRA?
A use case for a DS with claims is for monitoring, in which the claims will use "admin mode", meaning that they don't block allocating the monitored devices for some other pod. Admin mode is under a feature gate in Kubernetes and alpha, so supporting this in CA is not a priority.
```go
	for _, podInfo := range nodeInfo.Pods() {
		freshPod := sanitizePod(podInfo.Pod, freshNode.Name, namesSuffix)
		result.AddPod(framework.NewPodInfo(freshPod, nil))
		freshResourceClaims, err := drautils.SanitizePodResourceClaims(freshPod, podInfo.Pod, podInfo.NeededResourceClaims, namesSuffix, nodeInfo.Node().Name, freshNodeName, oldPoolNames)
```
It's not immediately obvious that drautils.SanitizePodResourceClaims doesn't change something in place. Perhaps rename it to drautils.SanitizePodResourceClaimsDeepCopy for consistency with this NodeInfoSanitizedDeepCopy here?

Or is it a convention that sanitize<something> returns a deep copy? sanitizePod and sanitizeNode below do that. Then why the DeepCopy suffix in NodeInfoSanitizedDeepCopy?
@@ -130,6 +145,9 @@ func podsExpectedOnFreshNode(sanitizedExampleNodeInfo *framework.NodeInfo, daemo

```go
		}
	}
	// Add all pending DS pods if force scheduling DS
	// TODO(DRA): Figure out how to make this work for DS pods using DRA. Currently such pods would get force-added to the
	// ClusterSnapshot, but the ResourceClaims reflecting their DRA usage on the Node wouldn't. So CA would be overestimating
	// available DRA resources on the Node.
```
This should be okay in practice for the "admin mode" monitoring use case because all devices on the new node would be available to pods, despite also having such a DS.
```go
		poolDevices := getAllDevices(currentSlices)
		allocatedDeviceNames := allocatedDevices[driverName][poolName]
		unallocated, allocated := splitDevicesByAllocation(poolDevices, allocatedDeviceNames)
		result[driverName][poolName] = calculatePoolUtil(unallocated, allocated)
```
Calculating utilization (5th commit + the PR description) - does my initial "provisional" idea make sense? Is it implemented correctly?
This looks good to me for now.
It might have to become a bit more complex once partitionable devices get added (targeting 1.33 with that), because then a single GPU will be represented by different "logical" devices in a slice, and the "size" of those matters. For example, on one node one device representing an entire GPU might be allocated (one device out of many allocated), while on some other node multiple smaller subsets of a similar GPU might be allocated (several devices out of many allocated). Then simply counting "allocated devices" would give the second node a higher utilization although the GPU is less utilized.
I'm not sure yet how this can be handled in a generic way. Perhaps we need to standardize a special attribute of devices which names a quantity that can serve as stand-in for the size of a device? For GPUs, that could be the memory size.
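If such a standardized "size" attribute existed, utilization could weight each device by it instead of counting every device as 1 (an entirely hypothetical sketch; nothing like this is standardized today):

```go
// weightedPoolUtil: allocated capacity / total capacity, where each device
// contributes a size weight (e.g. GPU memory) rather than a flat count of 1.
func weightedPoolUtil(deviceSize map[string]float64, allocated map[string]bool) float64 {
	var total, used float64
	for device, size := range deviceSize {
		total += size
		if allocated[device] {
			used += size
		}
	}
	if total == 0 {
		return 0
	}
	return used / total
}
```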
```go
	if len(foundPod.Spec.ResourceClaims) > 0 {
		if err := s.ClusterSnapshotStore.DraSnapshot().UnreservePodClaims(foundPod); err != nil {
			return err
		}
```
Modifying the state of DRA objects during scheduling simulations (6th commit). This is tied to the ResourceClaim utils, as a lot of them are used here. Is the logic correct? One question that comes to mind - if we remove the last pod reserving a shared claim, should it be deallocated?
Yes, it should be treated as deallocated. Otherwise you cannot move such a pod from one node to another during a simulated scale down. It would be forced to stay on the original node because that is where the claim is allocated.
The rest of the logic looks okay to me.
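Put together with the utils discussed above, the agreed behavior could look roughly like this (a sketch; PodClaims and removePodReservationInPlace are assumed helper names, not the PR's actual API):

```go
func (s *Snapshot) UnreservePodClaims(pod *apiv1.Pod) error {
	claims, err := s.PodClaims(pod) // assumed accessor for the pod's claims
	if err != nil {
		return err
	}
	for _, claim := range claims {
		removePodReservationInPlace(claim, pod) // assumed helper
		owner, _ := ClaimOwningPod(claim)
		if owner == "" && len(claim.Status.ReservedFor) == 0 {
			// The last consumer of a shared claim is gone: treat the claim
			// as deallocated so the pod can move to another node in the
			// scale-down simulation.
			DeallocateClaimInPlace(claim)
		}
	}
	return nil
}
```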
```go
				{nodeName: node3GpuA1slice.name + "-1", reason: simulator.NotUnderutilized},
				{nodeName: node3GpuA1slice.name + "-2", reason: simulator.NoPlaceToMovePods},
			},
		},
```
Integration tests (7th commit) - do the scenarios listed there make sense and behave as you would expect them to? Are some major scenarios missing?
A scale down scenario where a pod using a claim gets moved successfully to some other node would be good. As discussed in #7530, this probably doesn't work yet unless deallocation of the claim gets added. That scenario then also can cover utilization calculation (the "least utilized" node is meant to get removed).
A monitoring DS with admin mode might also be useful, once support for that gets added.
Specific scenarios aside, a general comment about testing: I see that DRA support is behind a flag. Would it be possible to run all existing test scenarios with and without that flag? Nothing should change, but one cannot be sure unless it gets tested...
The inverse is also true: what happens in a cluster where DRA is used and the flag is off? The outcome is probably different from the "flag is on" case, but is it still sane?
What type of PR is this?
/kind feature
What this PR does / why we need it:
This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler.
This PR implements an MVP of DRA integration in Cluster Autoscaler. Not all CA features work with DRA yet (see list below), but most logic is implemented and DRA autoscaling can be tested in a real cluster.
Changes summary:

- Introduce `dynamicresources.Provider`, which retrieves and snapshots all DRA objects.
- Introduce `dynamicresources.Snapshot`, which allows modifying the DRA objects snapshot obtained from `Provider`, and exposes the DRA objects to the scheduler framework.
- Keep the DRA object state (the `dynamicresources.Snapshot` inside `ClusterSnapshotStore`) up to date whenever `ClusterSnapshot` is modified (in `PredicateSnapshot` methods).
- Add `StaticAutoscaler` integration-like unit tests covering DRA scenarios.

The following features don't work with DRA yet, and will be tackled post-MVP:

- Pods waiting for preemption: whenever CA sees a Pod with `nominatedNodeName` set, it adds the pod to the `nominatedNodeName` Node in the snapshot without checking predicates, or even removing the preempted pod (so the Node can be effectively "overscheduled"). If such a Pod uses DRA, we'll have to run scheduler predicates to actually obtain the necessary ResourceClaim allocations. If we just force-add such a Pod to the snapshot without modifying the claims, CA doesn't see the Node's ResourceSlices as used and can just schedule another Pod to use them in the simulations.
- Scale-ups for which `TemplateNodeInfo()` has to be called (since scheduler predicates aren't currently run there). Forcing "missing" DS pods onto template Nodes also won't work if the DS pods use DRA (we'll have to duplicate their resource claims and start running scheduler predicates there as well).
- Only the `BasicSnapshotStore` (old `BasicClusterSnapshot`) implementation is integrated with `dynamicresources.Snapshot`. `DeltaSnapshotStore` (old `DeltaClusterSnapshot`) isn't yet. We'll need to add some delta-like capability to `dynamicresources.Snapshot` for that, in addition to just `Clone()`.

Additionally, the following points will have to be tackled post-MVP:

- Unschedulable Pods are obtained in `RunOnce`, then processed by `PodListProcessors` and passed to `ScaleUp`. The Pods' DRA objects, on the other hand, live in the `dynamicresources.Snapshot` inside `ClusterSnapshot`. This leaves us with a risk of the two data sources diverging quite easily (e.g. a `PodListProcessor` injecting a "fake" unschedulable Pod to the list, but not injecting the Pod's ResourceClaims to the `ClusterSnapshot`). To make this better, in a follow-up PR the unschedulable pods will be fully moved inside `ClusterSnapshot`, and `PodListProcessors` will interact with them via `ClusterSnapshot` methods instead of getting them by argument and returning.

Which issue(s) this PR fixes:
Fixes kubernetes/kubernetes#118612
Special notes for your reviewer:
The PR is split into meaningful commits intended to be reviewed sequentially.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: