
CA: DRA integration MVP #7530

Open

towca wants to merge 10 commits into master from jtuznik/dra-actual
Conversation

@towca (Collaborator) commented Nov 25, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler.

This PR implements an MVP of DRA integration in Cluster Autoscaler. Not all CA features work with DRA yet (see list below), but most logic is implemented and DRA autoscaling can be tested in a real cluster.

Changes summary:

  • Implement some utils for interacting with ResourceClaims. These would probably ideally be upstream, but IMO we can revisit this later.
  • Introduce dynamicresources.Provider which retrieves and snapshots all DRA objects.
  • Introduce dynamicresources.Snapshot which allows modifying the DRA objects snapshot obtained from Provider, and exposes the DRA objects to the scheduler framework.
  • Implement a very rudimentary utilization calculation for device pools, so that scale-down has something to act on.
  • Start sanitizing the DRA objects when duplicating Nodes and their DS-like pods in the snapshot.
  • Modify all relevant DRA objects (tracked in dynamicresources.Snapshot inside ClusterSnapshotStore) whenever ClusterSnapshot is modified (in PredicateSnapshot methods).
  • Add StaticAutoscaler integration-like unit tests covering DRA scenarios.

The following features don't work with DRA yet, and will be tackled post-MVP:

  • Priority-based preemption for pods using DRA. If CA sees an unschedulable pod waiting for scheduler preemption (with nominatedNodeName set), it adds the pod to that Node in the snapshot without checking predicates, or even removing the preempted pod (so the Node can effectively be "overscheduled"). If such a Pod uses DRA, we'll have to run scheduler predicates to actually obtain the necessary ResourceClaim allocations. If we just force-add such a Pod to the snapshot without modifying the claims, CA doesn't see the Node's ResourceSlices as used and can just schedule another Pod to use them in the simulations.
  • DaemonSet/static pods using DRA (in some cases). Similarly to the first point, DS/static pods using DRA won't work correctly during scale-from-0-nodes scenarios where TemplateNodeInfo() has to be called (since scheduler predicates aren't currently run there). Forcing "missing" DS pods onto template Nodes also won't work if the DS pods use DRA (we'll have to duplicate their resource claims and start running scheduler predicates there as well).
  • DeltaSnapshotStore. Only the BasicSnapshotStore (old BasicClusterSnapshot) implementation is integrated with dynamicresources.Snapshot. DeltaSnapshotStore (old DeltaClusterSnapshot) isn't yet. We'll need to add some delta-like capability to dynamicresources.Snapshot for that in addition to just Clone().

Additionally, the following points will have to be tackled post-MVP:

  • Unschedulable pods state. As pointed out by @MaciekPytel during the initial review of WIP: Implement DRA support in Cluster Autoscaler #7350, the state for unschedulable pods is kept in two separate places after this PR. The pod objects themselves are just a list obtained from listers in RunOnce, then processed by PodListProcessors and passed to ScaleUp. The Pods' DRA objects, on the other hand, live in the dynamicresources.Snapshot inside ClusterSnapshot. This leaves us with a risk of the two data sources diverging quite easily (e.g. a PodListProcessor injecting a "fake" unschedulable Pod into the list, but not injecting the Pod's ResourceClaims into the ClusterSnapshot). To make this better, in a follow-up PR the unschedulable pods will be fully moved inside ClusterSnapshot, and PodListProcessors will interact with them via ClusterSnapshot methods instead of receiving them as arguments and returning them.
  • Calculating Node utilization for DRA resources.
    • Cluster Autoscaler scale-down logic calculates a utilization value for each Node in the cluster. Only Nodes with utilization below a configured threshold are considered candidates for scale-down. Utilization is computed separately for every resource, by dividing the sum of Pods' requests by the Node allocatable value (it's not looking at "real" usage, just requests). If a Node has a GPU, only the GPU utilization is taken into account - CPU and memory are ignored. If a Node doesn't have a GPU, its utilization is the max of CPU and memory utilization values. It's not immediately clear how to extend this model for DRA resources.
    • This PR extends the utilization model in the following way. All Node-local Devices exposed in ResourceSlices for a given Node are grouped by their Driver and Pool. If ResourceSlices don't present a complete view of a Pool, utilization is not calculated and the Node is not scaled down until that changes. Allocated Node-local devices from the scheduled Pods' ResourceClaims are also grouped by their Driver and Pool. Utilization is calculated separately for every <Driver, Pool> pair by dividing the number of allocated devices by the number of total devices for a given Driver and Pool. Utilization of the Node is the max of these <Driver, Pool> utilization values. Similarly to how GPUs are treated, if DRA resources are exposed for a Node only DRA utilization is taken into account - CPU and memory are ignored.
    • The logic above should work pretty well for Nodes with one local Pool with identical, expensive (compared to CPU/memory) Devices - basically mimicking the current GPU approach. For other scenarios, it will behave predictably but it doesn't seem nearly flexible enough to be usable in practice (I might be wrong here though). What if a given DRA Device is not more expensive than CPU/memory and shouldn't be prioritized? What if it's some fake, "free" Device that shouldn't actually be taken into account when calculating utilization? Etc.
    • IMO solving this properly will require some discussion and design, so this PR only implements the "provisional" utilization calculation logic described above, so that scale-down can be tested. A rough sketch of that calculation follows below.
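For illustration, a rough sketch of the provisional calculation (simplified types, not the PR's actual code - device counts are assumed to be already grouped by driver and pool):

```go
package sketch

import "fmt"

// poolKey identifies a <Driver, Pool> pair for a Node-local device pool.
type poolKey struct {
	driver string
	pool   string
}

// highestPoolUtilization mirrors the provisional model: per-pool utilization is
// allocated/total devices, and the Node's DRA utilization is the max over all pools.
func highestPoolUtilization(total, allocated map[poolKey]int) (float64, error) {
	highest := 0.0
	for key, totalCount := range total {
		if totalCount == 0 {
			return 0, fmt.Errorf("pool %s/%s exposes no devices", key.driver, key.pool)
		}
		util := float64(allocated[key]) / float64(totalCount)
		if util > highest {
			highest = util
		}
	}
	return highest, nil
}
```

As with GPUs today, if any such pools are present for a Node, this value is used instead of the CPU/memory utilization.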

Which issue(s) this PR fixes:

Fixes kubernetes/kubernetes#118612

Special notes for your reviewer:

The PR is split into meaningful commits intended to be reviewed sequentially.

Does this PR introduce a user-facing change?

Experimental support for DRA autoscaling is implemented, disabled by default. To enable it, set the `--enable-dynamic-resource-allocation` flag in Cluster Autoscaler, and the `DynamicResourceAllocation` feature gate in the cluster. Additionally, RBAC configuration must allow Cluster Autoscaler to list the following new objects: `resource.k8s.io/ResourceClaim`, `resource.k8s.io/ResourceSlice`, `resource.k8s.io/DeviceClass`. The support is experimental and not yet intended for production use. In particular, details of how Cluster Autoscaler reacts to DRA resources might change in future releases. Most autoscaling scenarios should work, but with potentially reduced performance. A list of missing features can be found in https://github.com/kubernetes/autoscaler/pull/7530.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

https://github.com/kubernetes/enhancements/blob/9de7f62e16fc5c1ea3bd40689487c9edc7fa5057/keps/sig-node/4381-dra-structured-parameters/README.md

@k8s-ci-robot added the kind/feature label Nov 25, 2024
@k8s-ci-robot added the cncf-cla: yes label Nov 25, 2024
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: towca

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the area/cluster-autoscaler, approved, and size/XXL labels Nov 25, 2024
@Shubham82 (Contributor) commented:

Hi @towca, the Tests / test-and-verify check failed, and it is required to merge this PR.
The error is related to the boilerplate header. Here is the error:

The boilerplate header is wrong for /cluster-autoscaler/simulator/dynamicresources/utils/utilization.go

@@ -0,0 +1,75 @@
package utils

Before this line, you have to add the following comment:

/*
Copyright 2020 The Kubernetes Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

@towca (Collaborator, Author) commented:

The first commit in the PR is just a squash of #7479, #7497, and #7529, and it shouldn't be a part of this review. The PR will be rebased on top of master after the others are merged.

As mentioned in the PR description, I am putting this PR on hold to avoid an accidental merge.

/hold

@k8s-ci-robot added the do-not-merge/hold and needs-rebase labels Nov 26, 2024
@towca force-pushed the jtuznik/dra-actual branch from 3ad1daa to cd3ed99 on November 28, 2024
@k8s-ci-robot removed the needs-rebase label Nov 28, 2024
@towca force-pushed the jtuznik/dra-actual branch 2 times, most recently from 9847f70 to 33afea0, on November 29, 2024
@towca force-pushed the jtuznik/dra-actual branch 7 times, most recently from e3cee18 to 3629698, on December 10, 2024
@pohly (Contributor) commented Dec 11, 2024

/wg device-management

@k8s-ci-robot added the wg/device-management label Dec 11, 2024
@pohly (Contributor) commented Dec 11, 2024

I only looked at the PR description so far. I am not sure whether I am qualified to review the code 😰

Implement some utils for interacting with ResourceClaims. These would probably ideally be upstream, but IMO we can revisit this later.

Agreed, this can come later. https://pkg.go.dev/k8s.io/dynamic-resource-allocation/resourceclaim might be the right package for that. If "resourceclaim" works as a package name, then you could already use it now, and later only the import path would need to be changed once that new code is upstream.

One thing that I don't see mentioned is node readiness: directly after provisioning a new node, the DRA driver pod(s) have to be started on the node and publish their ResourceSlices before the new node is really ready to support pods using claims. I know that there are checks for device plugins to treat such a node as "not ready" while plugins are starting. Is there something similar for DRA drivers? This might not be needed for MVP.

These utils will be used by various parts of the DRA logic in the
following commits.
… state of DRA objects

The Snapshot can hold all DRA objects in the cluster, and expose them
to the scheduler framework via the SharedDRAManager interface.

The state of the objects can be modified during autoscaling simulations
using the provided methods.
The Provider uses DRA object listers to create a Snapshot of the
DRA objects.
The logic is very basic and will likely need to be revised, but it's
something for initial testing. Utilization of a given Pool is calculated
as the number of allocated devices in the pool divided by the number of
all devices in the pool. For scale-down purposes, the max utilization
of all Node-local Pools is used.

The new logic is mostly behind the DRA flag guard, so this should be a no-op
if the flag is disabled. The only difference should be that FilterOutUnremovable
marks a Node as unremovable if calculating utilization fails. Not sure
why this wasn't the case before, but I think we need it for DRA - if CA sees an
incomplete picture of a resource pool, we probably don't want to scale
the Node down.
@towca force-pushed the jtuznik/dra-actual branch from 29b9543 to 2daeb9b on December 11, 2024
@towca (Collaborator, Author) commented Dec 11, 2024

I only looked at the PR description so far. I am not sure whether I am qualified to review the code 😰

Yeah unfortunately it's quite a lot, IMO we should split reviewing responsibilities here:

  • @BigDarkClown you're probably the most familiar with core CA code, could you focus your review on how the DRA logic fits into the existing CA logic? In particular:
    • Could you verify that the behavior doesn't change if the DRA flag is disabled?
    • Could you review the error handling?
    • Could you check if I missed any places that need to take DRA into account (beside the parts with TODO(DRA))?
    • Could you review the code structure etc.?
  • @pohly you're definitely the most familiar with the DRA logic/assumptions, could you focus your review on the DRA parts and if the assumptions match? In particular:
    • ResourceClaim utils (1st commit) - is the logic correct there?
    • "Sanitizing" ResourceClaims (4th commit). During the simulations, CA duplicates Nodes and their DaemonSet pods. Duplicated Nodes/Pods go through "sanitization" so that they aren't actually identical to the originals. Doing this for DS pods using DRA has some corner cases (described in the comments), WDYT about this? Do we anticipate DS pods using DRA?
    • Calculating utilization (5th commit + the PR description) - does my initial "provisional" idea make sense? Is it implemented correctly?
    • Modifying the state of DRA objects during scheduling simulations (6th commit). This is tied to the ResourceClaim utils, as a lot of them are used here. Is the logic correct? One question that comes to mind - if we remove the last pod reserving a shared claim, should it be deallocated?
    • Integration tests (7th commit) - do the scenarios listed there make sense and behave as you would expect them to? Are some major scenarios missing?
  • @jackfrancis AFAIU you're familiar with both the DRA and CA parts, could you focus your review somewhere "between" the two parts above?

/assign @BigDarkClown
/assign @pohly
/assign @jackfrancis

One thing that I don't see mentioned is node readiness: directly after provisioning a new node, the DRA driver pod(s) have to be started on the node and publish their ResourceSlices before the new node is really ready to support pods using claims. I know that there are checks for device plugins to treat such a node as "not ready" while plugins are starting. Is there something similar for DRA drivers? This might not be needed for MVP.

I was wondering about that part; it's hard to test without an actual driver (the test driver from k/k e2e tests doesn't run into this). So a new Node can be Ready but not yet have its ResourceSlices exposed, is that right? If so, then we indeed need to reimplement the device plugin logic. How can CA know that a Node is supposed to have any slices exposed, though? For device plugins we depend on a specific label on the Node (if the label is on the Node but there's no allocatable resource -> fake the node as not ready).

@jackfrancis could you verify how this behaves in practice during your manual testing? Are we seeing double scale-ups?

All added logic is behind the DRA flag guard, this should be a no-op
if the flag is disabled.
…s DRA snapshot

The new logic is flag-guarded, it should be a no-op if DRA is disabled.
By default CA is built with DeltaSnapshotStore, which isn't integrated
with DRA yet.
This is needed so that the scheduler code correctly includes and
executes the DRA plugin.

We could just use the feature gate instead of the DRA flag in CA
(the feature gates flag is already there, just not really used),
but I guess there could be use-cases for having DRA enabled in the
cluster but not in CA (e.g. DRA being tested in the cluster, CA only
operating on non-DRA nodes/pods).
@towca force-pushed the jtuznik/dra-actual branch from 2daeb9b to fec30d3 on December 11, 2024
@jackfrancis (Contributor) commented:

@pohly @towca in my testing (A100 nodes using the nvidia k8s-dra-driver) I didn't see any undesirable functional side-effects of the node bootstrapping process. It's definitely true that before a pending pod+resourceclaim can be fulfilled there's a lot more waiting than normal:

  1. new node comes online
  2. dra driver daemonset pods land on new node and perform host OS level driver installation
  3. gpu operator carries out mig device configuration
  4. node is rebooted
  5. resourceslice is populated

During that period of time, I observed that cluster-autoscaler sanely (1) notices that the new node is there but that (2) the resourceclaim is not yet able to be fulfilled. I didn't notice any TTL expire like "I'm tired of waiting for this new node, I'm going to try and build a new one".

In practice, my experiments produced a ~10 minute wait between node provisioning and resourceslice population. That's a long time, admittedly! Is there a particular configuration we could set to try to reproduce any "double scale out" symptoms in my environment?

For full disclosure, this is the workload I used to test:

apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test
  name: mig-devices
spec:
  spec:
    devices:
      requests:
      - name: mig-1g-10gb
        deviceClassName: mig.nvidia.com
        selectors:
        - cel:
            expression: "device.attributes['gpu.nvidia.com'].profile == '1g.10gb'"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: gpu-test
  name: gpu-test
  labels:
    app: gpu-test-pod
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
    spec:
      runtimeClassName: nvidia
      resourceClaims:
      - name: mig-device-gpu-test
        resourceClaimTemplateName: mig-devices
      containers:
      - name: mig-device-container
        image: ubuntu:22.04
        command: ["bash", "-c"]
        args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
        resources:
          claims:
          - name: mig-device-gpu-test
            request: mig-1g-10gb
  1. create cluster w/ 1 A100 node that exposes 7 * 10GB mig devices
      nvidia.com/mig.capable: "true"
      nvidia.com/mig.config: all-1g.10gb
      nvidia.com/mig.config.state: success
  2. scale deployment replicas out to 8, which creates 1 pending pod + resourceclaim
  3. wait for CA to add a new node and observe (eventually) all 8 pods running + all resourceclaims allocated,reserved

@pohly (Contributor) commented Dec 12, 2024

How can CA know that a Node is supposed to have any slices exposed though?

Isn't that what needs to be configured for a node pool? Without information about what devices are going to be made available together with a new node, CA cannot scale up. This readiness check could compare that assumption against the actual devices.

During that period of time, I observed that cluster-autoscaler sanely (1) notices that the new node is there but that (2) the resourceclaim is not yet able to be fulfilled.

I don't remember what the exact problem is, i.e. whether it is "not scaling up when it should" or "scaling down when it shouldn't". I guess the fact that it works in practice indicates that it is not needed for an MVP.

@pohly (Contributor) left a comment:

ResourceClaim utils (1st commit) - is the logic correct there?

Logic is correct. I would do a few things differently myself, but that is not a blocker for the PR.


// ClaimOwningPod returns the name and UID of the Pod owner of the provided claim. If the claim isn't
// owned by a Pod, empty strings are returned.
func ClaimOwningPod(claim *resourceapi.ResourceClaim) (string, types.UID) {

No need to change it now, but elsewhere I suggested that these helpers might go into the k8s.io/dynamic-resource-allocation/resourceclaim package.

These and other functions can then drop the Claim prefix, because the call becomes resourceclaim.OwningPod.
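For context, a minimal sketch of what such a helper can look like (assumed import paths; it assumes the claim's Pod owner is recorded as a controller OwnerReference of kind Pod):

```go
package sketch

import (
	resourceapi "k8s.io/api/resource/v1beta1"
	"k8s.io/apimachinery/pkg/types"
)

// ClaimOwningPod returns the name and UID of the Pod owning the claim, or empty values
// if the claim isn't controlled by a Pod.
func ClaimOwningPod(claim *resourceapi.ResourceClaim) (string, types.UID) {
	for _, owner := range claim.OwnerReferences {
		if owner.Kind == "Pod" && owner.APIVersion == "v1" && owner.Controller != nil && *owner.Controller {
			return owner.Name, owner.UID
		}
	}
	return "", ""
}
```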

}

// ClaimAllocated returns whether the provided claim is allocated.
func ClaimAllocated(claim *resourceapi.ResourceClaim) bool {

Could also be called resourceclaim.IsAllocated. Not important right now.

}

// ClaimReservedForPod returns whether the provided claim is currently reserved for the provided pod.
func ClaimReservedForPod(claim *resourceapi.ResourceClaim, pod *apiv1.Pod) bool {

Once it is in a staging repo, kubelet should get updated to call this. It must have a similar check.
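As a rough sketch of the check being discussed (assumed types and import paths; a reservation is an entry in status.reservedFor referencing the pod by resource, name, and UID):

```go
package sketch

import (
	apiv1 "k8s.io/api/core/v1"
	resourceapi "k8s.io/api/resource/v1beta1"
)

// ClaimReservedForPod reports whether the claim currently holds a reservation for the pod.
func ClaimReservedForPod(claim *resourceapi.ResourceClaim, pod *apiv1.Pod) bool {
	for _, consumer := range claim.Status.ReservedFor {
		if consumer.APIGroup == "" && consumer.Resource == "pods" && consumer.Name == pod.Name && consumer.UID == pod.UID {
			return true
		}
	}
	return false
}
```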


// DeallocateClaimInPlace clears the allocation of the provided claim.
func DeallocateClaimInPlace(claim *resourceapi.ResourceClaim) {
claim.Status.Allocation = nil

This is a bit short for my taste and might not really make the callsite easier to understand. But it's not wrong.

return
}

newReservedFor := make([]resourceapi.ResourceClaimConsumerReference, 0, len(claim.Status.ReservedFor))

Nit: I would find the right index, then use slices.Delete to create the new slice minus that one entry.
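Something like this (a sketch with a hypothetical helper name; it assumes the claim has already been deep-copied, since slices.Delete modifies the slice in place):

```go
package sketch

import (
	"slices"

	apiv1 "k8s.io/api/core/v1"
	resourceapi "k8s.io/api/resource/v1beta1"
)

// dropPodReservation removes the pod's entry from the claim's ReservedFor list, if present.
func dropPodReservation(claim *resourceapi.ResourceClaim, pod *apiv1.Pod) {
	idx := slices.IndexFunc(claim.Status.ReservedFor, func(ref resourceapi.ResourceClaimConsumerReference) bool {
		return ref.Resource == "pods" && ref.Name == pod.Name && ref.UID == pod.UID
	})
	if idx >= 0 {
		claim.Status.ReservedFor = slices.Delete(claim.Status.ReservedFor, idx, idx+1)
	}
}
```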

func TestClaimOwningPod(t *testing.T) {
truePtr := true
for _, tc := range []struct {
testName string

Personally I prefer map[]struct because it ensures at compile time that I do not reuse the same test name twice.

Also, better avoid spaces in test names. They get replaced with underscores by go test and then the tests listed in the output can't be found easily in the source code. go test -run=... also is more complicated to get right.
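For illustration, a small sketch of that style (hypothetical test body): duplicate keys in the map literal fail to compile, and underscore-only names round-trip cleanly through go test -run.

```go
package sketch

import "testing"

func TestClaimOwningPod(t *testing.T) {
	for name, tc := range map[string]struct {
		wantName string
	}{
		"claim_owned_by_pod":  {wantName: "owner-pod"},
		"claim_without_owner": {wantName: ""},
	} {
		t.Run(name, func(t *testing.T) {
			_ = tc.wantName // call the helper under test here and compare the result
		})
	}
}
```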

@BigDarkClown (Contributor) left a comment:

Left some comments, I am not the expert on DRA so I might have missed something. Overall this LGTM for the CA side, but please get an additional review for DRA handling.

})
podInfo.NeededResourceClaims = append(podInfo.NeededResourceClaims, testClaim(fmt.Sprintf("%s-claim-%d", pod.Name, i)))
}
for i := range 3 {

nit: Any reason why we need two loops?

for id, claim := range s.resourceClaimsById {
result.resourceClaimsById[id] = claim.DeepCopy()
}
// Only the claims are mutable, we don't have to deep-copy anything else.

Please split the comment between before copying resource claims and after, this is weirdly placed right now.

}

// AddClaims adds additional ResourceClaims to the Snapshot. It can be used e.g. if we need to duplicate a Pod that
// owns ResourceClaims.

Please add info to the comment that the method returns an error when the claim already exists in the snapshot - it is very important.

for _, podClaimRef := range pod.Spec.ResourceClaims {
claimName := claimRefToName(pod, podClaimRef)
if claimName == "" {
continue

Why not error? It seems to indicate some issue with snapshot state.

claimId := ResourceClaimId{Name: claimName, Namespace: pod.Namespace}
claim, found := s.resourceClaimsById[claimId]
if !found {
continue

Same comment, why not error?

// TODO(DRA): Figure out how to calculate Node utilization for DRA resources properly. The current logic is provisional, it should work well for Nodes with a single Pool of
// expensive Devices but is probably not flexible enough for other scenarios.
resourceName, highestUtil, err := drautils.HighestDynamicResourceUtilization(nodeInfo)
if err != nil {

Should we error in this case? I am a bit hesitant, as this calculation seems to be experimental. I think a better option would be to continue to cpu + mem.

wantUtilInfo: Info{CpuUtil: 0.5, Utilization: 0.5, ResourceName: apiv1.ResourceCPU},
},
{
testName: "DRA slices present, but no claims, DRA enabled -> DRA util returned despite being lower than CPU",

I am not 100% sure whether this is correct behavior. I think a better way might be to calculate DRA util alongside cpu and mem, but let's keep that for future improvements.

@@ -101,6 +170,30 @@ func (s *PredicateSnapshot) SchedulePodOnAnyNodeMatching(pod *apiv1.Pod, anyNode

// UnschedulePod removes the given Pod from the given Node inside the snapshot.
func (s *PredicateSnapshot) UnschedulePod(namespace string, podName string, nodeName string) error {
if s.draEnabled {

As DraSnapshot is in ClusterSnapshotStore, shouldn't PredicateSnapshot delegate this logic to ClusterSnapshotStore? I don't think it applies to all methods, but here it stood out the most to me.

@@ -33,6 +33,8 @@ import (
"github.com/stretchr/testify/assert"
)

// TODO(DRA): Add DRA-specific test cases.

Oh please do.

var snapshotStore clustersnapshot.ClusterSnapshotStore = store.NewDeltaSnapshotStore()
if autoscalingOptions.DynamicResourceAllocationEnabled {
// TODO(DRA): Remove this once DeltaSnapshotStore is integrated with DRA.
snapshotStore = store.NewBasicSnapshotStore()

Please add some high level log/warning here, this might not be expected for the end-user.

@pohly (Contributor) left a comment:

Reviewed some parts of "CA: sanitize and propagate DRA objects through NodeInfos in node_info utils".

// leave them as-is when duplicating). Returns a map of all pool names (without the suffix) that can be used with SanitizePodResourceClaims().
// Returns an error if any of the slices isn't node-local.
func SanitizeNodeResourceSlices(nodeLocalSlices []*resourceapi.ResourceSlice, newNodeName, nameSuffix string) (newSlices []*resourceapi.ResourceSlice, oldPoolNames map[string]bool, err error) {
oldPoolNames = map[string]bool{}

newSlices = append(newSlices, sliceCopy)
}
return newSlices, oldPoolNames, nil
}

Function looks okay to me. Normally one cannot just append something to a name to create a new name because of length limits, but here this isn't an issue because it's all in-memory.
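For context, a rough sketch of the per-slice renaming being reviewed (not the PR's exact code; assumes the resource.k8s.io API shape where the node and pool names live under Spec.NodeName and Spec.Pool.Name):

```go
package sketch

import (
	"fmt"

	resourceapi "k8s.io/api/resource/v1beta1"
)

// sanitizeSlice duplicates a node-local ResourceSlice for a duplicated Node, suffixing
// the names that must stay unique. Name length limits don't matter here because the
// copy is only ever kept in memory.
func sanitizeSlice(slice *resourceapi.ResourceSlice, newNodeName, nameSuffix string) (*resourceapi.ResourceSlice, string, error) {
	if slice.Spec.NodeName == "" {
		return nil, "", fmt.Errorf("ResourceSlice %s is not node-local", slice.Name)
	}
	oldPoolName := slice.Spec.Pool.Name
	sliceCopy := slice.DeepCopy()
	sliceCopy.Name = slice.Name + "-" + nameSuffix
	sliceCopy.Spec.NodeName = newNodeName
	sliceCopy.Spec.Pool.Name = oldPoolName + "-" + nameSuffix
	return sliceCopy, oldPoolName, nil
}
```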

// SanitizePodResourceClaims can be used to duplicate ResourceClaims needed by a Pod, when duplicating the Pod.
// - ResourceClaims owned by oldOwner are duplicated and sanitized, to be owned by a duplicate pod - newOwner.
// - ResourceClaims not owned by oldOwner are returned unchanged in the result. They are shared claims not bound to the
// lifecycle of the duplicated pod, so they shouldn't be duplicated.

What about updating the ReservedFor so that the new pod is listed if (and only if) the old one was? Or is that something that will be covered elsewhere?

@@ -41,6 +42,9 @@ type nodeGroupTemplateNodeInfoGetter interface {
// SanitizedTemplateNodeInfoFromNodeGroup returns a template NodeInfo object based on NodeGroup.TemplateNodeInfo(). The template is sanitized, and only
// contains the pods that should appear on a new Node from the same node group (e.g. DaemonSet pods).
func SanitizedTemplateNodeInfoFromNodeGroup(nodeGroup nodeGroupTemplateNodeInfoGetter, daemonsets []*appsv1.DaemonSet, taintConfig taints.TaintConfig) (*framework.NodeInfo, errors.AutoscalerError) {
// TODO(DRA): Figure out how to handle TemplateNodeInfo() returning DaemonSet Pods using DRA. Currently, things only work correctly if such pods are
// already allocated by TemplateNodeInfo(). It might be better for TemplateNodeInfo() to return unallocated claims, and to run scheduler predicates and
// compute the allocations here.

#7530:

"Sanitizing" ResourceClaims (4th commit). During the simulations, CA duplicates Nodes and their DaemonSet pods. Duplicated Nodes/Pods go through "sanitization" so that they aren't actually identical to the originals. Doing this for DS pods using DRA has some corner cases (described in the comments), WDYT about this? Do we anticipate DS pods using DRA?

A use case for a DS with claims is for monitoring, in which the claims will use "admin mode", meaning that they don't block allocating the monitored devices for some other pod. Admin mode is under a feature gate in Kubernetes and alpha, so supporting this in CA is not a priority.


for _, podInfo := range nodeInfo.Pods() {
freshPod := sanitizePod(podInfo.Pod, freshNode.Name, namesSuffix)
result.AddPod(framework.NewPodInfo(freshPod, nil))
freshResourceClaims, err := drautils.SanitizePodResourceClaims(freshPod, podInfo.Pod, podInfo.NeededResourceClaims, namesSuffix, nodeInfo.Node().Name, freshNodeName, oldPoolNames)

It's not immediately obvious that drautils.SanitizePodResourceClaims doesn't change something in place. Perhaps rename it to drautils.SanitizePodResourceClaimsDeepCopy for consistency with this NodeInfoSanitizedDeepCopy here?

Or is it a convention that sanitize<something> returns a deep copy? sanitizePod and sanitizeNode below do that. Then why the DeepCopy suffix in NodeInfoSanitizedDeepCopy?

@@ -130,6 +145,9 @@ func podsExpectedOnFreshNode(sanitizedExampleNodeInfo *framework.NodeInfo, daemo
}
}
// Add all pending DS pods if force scheduling DS
// TODO(DRA): Figure out how to make this work for DS pods using DRA. Currently such pods would get force-added to the
// ClusterSnapshot, but the ResourceClaims reflecting their DRA usage on the Node wouldn't. So CA would be overestimating
// available DRA resources on the Node.

This should be okay in practice for the "admin mode" monitoring use case because all devices on the new node would be available to pods, despite also having such a DS.

poolDevices := getAllDevices(currentSlices)
allocatedDeviceNames := allocatedDevices[driverName][poolName]
unallocated, allocated := splitDevicesByAllocation(poolDevices, allocatedDeviceNames)
result[driverName][poolName] = calculatePoolUtil(unallocated, allocated)

#7530:

Calculating utilization (5th commit + the PR description) - does my initial "provisional" idea make sense? Is it implemented correctly?

This looks good to me for now.

It might have to become a bit more complex once partitionable devices get added (targeting 1.33 with that), because then a single GPU will be represented by different "logical" devices in a slice and the "size" of those matters. For example, on one node one device representing an entire GPU might be allocated (one device out of many allocated), while on some other node, multiple smaller subsets of a similar GPU might be allocated (several devices out of many allocated). Then simply counting "allocated devices" would give the second node a higher utilization although the GPU is less utilized.

I'm not sure yet how this can be handled in a generic way. Perhaps we need to standardize a special attribute of devices which names a quantity that can serve as stand-in for the size of a device? For GPUs, that could be the memory size.

/cc @klueska @mortent @johnbelamaric

if len(foundPod.Spec.ResourceClaims) > 0 {
if err := s.ClusterSnapshotStore.DraSnapshot().UnreservePodClaims(foundPod); err != nil {
return err
}

#7530

Modifying the state of DRA objects during scheduling simulations (6th commit). This is tied to the ResourceClaim utils, as a lot of them are used here. Is the logic correct? One question that comes to mind - if we remove the last pod reserving a shared claim, should it be deallocated?

Yes, it should be treated as deallocated. Otherwise you cannot move such a pod from one node to another during a simulated scale down. It would be forced to stay on the original node because that is where the claim is allocated.

The rest of the logic looks okay to me.
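A sketch of the agreed behavior (hypothetical helper, not the PR's exact code): drop the pod's reservation, and once the last consumer is gone, clear the allocation so the simulation can re-allocate the claim on another node.

```go
package sketch

import (
	"slices"

	apiv1 "k8s.io/api/core/v1"
	resourceapi "k8s.io/api/resource/v1beta1"
)

// unreservePodClaim removes the pod's reservation and deallocates the claim if nothing
// else is reserving it anymore.
func unreservePodClaim(claim *resourceapi.ResourceClaim, pod *apiv1.Pod) {
	claim.Status.ReservedFor = slices.DeleteFunc(claim.Status.ReservedFor, func(ref resourceapi.ResourceClaimConsumerReference) bool {
		return ref.Resource == "pods" && ref.Name == pod.Name && ref.UID == pod.UID
	})
	if len(claim.Status.ReservedFor) == 0 {
		claim.Status.Allocation = nil
	}
}
```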

{nodeName: node3GpuA1slice.name + "-1", reason: simulator.NotUnderutilized},
{nodeName: node3GpuA1slice.name + "-2", reason: simulator.NoPlaceToMovePods},
},
},

#7530

Integration tests (7th commit) - do the scenarios listed there make sense and behave as you would expect them to? Are some major scenarios missing?

A scale down scenario where a pod using a claim gets moved successfully to some other node would be good. As discussed in #7530, this probably doesn't work yet unless deallocation of the claim gets added. That scenario then also can cover utilization calculation (the "least utilized" node is meant to get removed).

A monitoring DS with admin mode might also be useful, once support for that gets added.

Specific scenarios aside, a general comment about testing: I see that DRA support is behind a flag. Would it be possible to run all existing test scenarios with and without that flag? Nothing should change, but one cannot be sure unless it gets tested...

The inverse is also true: what happens in a cluster where DRA is used and the flag is off? The outcome is probably different from the "flag is on" case, but is it still sane?
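One cheap way to get that coverage, sketched below (hypothetical test scaffolding; assumes the existing scenarios can take the DRA toggle as a parameter):

```go
package sketch

import (
	"fmt"
	"testing"
)

// runScenario stands in for an existing test scenario that accepts the DRA toggle.
func runScenario(t *testing.T, draEnabled bool) {
	_ = draEnabled // wire this into the autoscaler options under test
}

func TestScenariosWithAndWithoutDRA(t *testing.T) {
	for _, draEnabled := range []bool{false, true} {
		t.Run(fmt.Sprintf("draEnabled=%v", draEnabled), func(t *testing.T) {
			runScenario(t, draEnabled)
		})
	}
}
```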
