
CA: DRA integration MVP #7530

Open

towca wants to merge 10 commits into master from jtuznik/dra-actual
Conversation

@towca (Collaborator) commented Nov 25, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler.

This PR implements an MVP of DRA integration in Cluster Autoscaler. Not all CA features work with DRA yet (see list below), but most logic is implemented and DRA autoscaling can be tested in a real cluster.

Changes summary:

  • Implement some utils for interacting with ResourceClaims. These would probably ideally be upstream, but IMO we can revisit this later.
  • Introduce dynamicresources.Provider which retrieves and snapshots all DRA objects.
  • Introduce dynamicresources.Snapshot which allows modifying the DRA objects snapshot obtained from Provider, and exposes the DRA objects to the scheduler framework.
  • Implement a very rudimentary utilization calculation for device pools, so that scale-down has something to act on.
  • Start sanitizing the DRA objects when duplicating Nodes and their DS-like pods in the snapshot.
  • Modify all relevant DRA objects (tracked in dynamicresources.Snapshot inside ClusterSnapshotStore) whenever ClusterSnapshot is modified (in PredicateSnapshot methods).
  • Add StaticAutoscaler integration-like unit tests covering DRA scenarios.

The following features don't work with DRA yet, and will be tackled post-MVP:

  • Priority-based preemption for pods using DRA. If CA sees an unschedulable pod waiting for scheduler preemption (with nominatedNodeName set), it adds the pod to that Node in the snapshot without checking predicates, or even removing the preempted pod (so the Node can effectively be "overscheduled"). If such a Pod uses DRA, we'll have to run scheduler predicates to actually obtain the necessary ResourceClaim allocations. If we just force-add such a Pod to the snapshot without modifying the claims, CA doesn't see the Node's ResourceSlices as used and can just schedule another Pod to use them in the simulations.
  • DaemonSet/static pods using DRA (in some cases). Similarly to the first point, DS/static pods using DRA won't work correctly during scale-from-0-nodes scenarios where TemplateNodeInfo() has to be called (since scheduler predicates aren't currently run there). Forcing "missing" DS pods onto template Nodes also won't work if the DS pods use DRA (we'll have to duplicate their resource claims and start running scheduler predicates there as well).
  • DeltaSnapshotStore. Only the BasicSnapshotStore (old BasicClusterSnapshot) implementation is integrated with dynamicresources.Snapshot. DeltaSnapshotStore (old DeltaClusterSnapshot) isn't yet. We'll need to add some delta-like capability to dynamicresources.Snapshot for that in addition to just Clone().

Additionally, the following points will have to be tackled post-MVP:

  • Unschedulable pods state. As pointed out by @MaciekPytel during the initial review of WIP: Implement DRA support in Cluster Autoscaler #7350, the state for unschedulable pods is kept in two separate places after this PR. The pod objects themselves are just a list obtained from listers in RunOnce, then processed by PodListProcessors and passed to ScaleUp. The Pods' DRA objects, on the other hand, live in the dynamicresources.Snapshot inside ClusterSnapshot. This leaves us with a risk of the two data sources diverging quite easily (e.g. a PodListProcessor injecting a "fake" unschedulable Pod into the list, but not injecting the Pod's ResourceClaims into the ClusterSnapshot). To make this better, in a follow-up PR the unschedulable pods will be fully moved inside ClusterSnapshot, and PodListProcessors will interact with them via ClusterSnapshot methods instead of receiving them as arguments and returning them.
  • Calculating Node utilization for DRA resources.
    • Cluster Autoscaler scale-down logic calculates a utilization value for each Node in the cluster. Only Nodes with utilization below a configured threshold are considered candidates for scale-down. Utilization is computed separately for every resource, by dividing the sum of Pods' requests by the Node allocatable value (it's not looking at "real" usage, just requests). If a Node has a GPU, only the GPU utilization is taken into account - CPU and memory are ignored. If a Node doesn't have a GPU, its utilization is the max of CPU and memory utilization values. It's not immediately clear how to extend this model for DRA resources.
    • This PR extends the utilization model in the following way. All Node-local Devices exposed in ResourceSlices for a given Node are grouped by their Driver and Pool. If ResourceSlices don't present a complete view of a Pool, utilization is not calculated and the Node is not scaled down until that changes. Allocated Node-local devices from the scheduled Pods' ResourceClaims are also grouped by their Driver and Pool. Utilization is calculated separately for every <Driver, Pool> pair by dividing the number of allocated devices by the number of total devices for a given Driver and Pool. Utilization of the Node is the max of these <Driver, Pool> utilization values. Similarly to how GPUs are treated, if DRA resources are exposed for a Node only DRA utilization is taken into account - CPU and memory are ignored.
    • The logic above should work pretty well for Nodes with one local Pool with identical, expensive (compared to CPU/memory) Devices - basically mimicking the current GPU approach. For other scenarios, it will behave predictably but it doesn't seem nearly flexible enough to be usable in practice (I might be wrong here though). What if a given DRA Device is not more expensive than CPU/memory and shouldn't be prioritized? What if it's some fake, "free" Device that shouldn't actually be taken into account when calculating utilization? Etc.
    • IMO solving this properly will require some discussion and design, so this PR only implements the "provisional" utilization calculation logic described above, so that scale-down can be tested. A rough sketch of that calculation follows below.
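For illustration, a rough sketch of the provisional calculation (simplified types, not the PR's actual code - device counts are assumed to be already grouped by driver and pool):

```go
package sketch

import "fmt"

// poolKey identifies a <Driver, Pool> pair for a Node-local device pool.
type poolKey struct {
	driver string
	pool   string
}

// highestPoolUtilization mirrors the provisional model: per-pool utilization is
// allocated/total devices, and the Node's DRA utilization is the max over all pools.
func highestPoolUtilization(total, allocated map[poolKey]int) (float64, error) {
	highest := 0.0
	for key, totalCount := range total {
		if totalCount == 0 {
			return 0, fmt.Errorf("pool %s/%s exposes no devices", key.driver, key.pool)
		}
		util := float64(allocated[key]) / float64(totalCount)
		if util > highest {
			highest = util
		}
	}
	return highest, nil
}
```

As with GPUs today, if any such pools are present for a Node, this value is used instead of the CPU/memory utilization.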

Which issue(s) this PR fixes:

Fixes kubernetes/kubernetes#118612

Special notes for your reviewer:

The PR is split into meaningful commits intended to be reviewed sequentially.

Does this PR introduce a user-facing change?

Experimental support for DRA autoscaling is implemented, disabled by default. To enable it, set the `--enable-dynamic-resource-allocation` flag in Cluster Autoscaler, and the `DynamicResourceAllocation` feature gate in the cluster. Additionally, RBAC configuration must allow Cluster Autoscaler to list the following new objects: `resource.k8s.io/ResourceClaim`, `resource.k8s.io/ResourceSlice`, `resource.k8s.io/DeviceClass`. The support is experimental and not yet intended for production use. In particular, details of how Cluster Autoscaler reacts to DRA resources might change in future releases. Most autoscaling scenarios should work, but with potentially reduced performance. A list of missing features can be found in https://github.com/kubernetes/autoscaler/pull/7530.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

https://github.com/kubernetes/enhancements/blob/9de7f62e16fc5c1ea3bd40689487c9edc7fa5057/keps/sig-node/4381-dra-structured-parameters/README.md

@k8s-ci-robot added the kind/feature label Nov 25, 2024
@k8s-ci-robot added the cncf-cla: yes label Nov 25, 2024
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: towca

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the area/cluster-autoscaler, approved, and size/XXL labels Nov 25, 2024
@Shubham82 (Contributor) commented:

Hi @towca, the Tests / test-and-verify check failed, and it is required to merge this PR.
The error is related to the boilerplate header. Here is the error:

The boilerplate header is wrong for /cluster-autoscaler/simulator/dynamicresources/utils/utilization.go

@@ -0,0 +1,75 @@
package utils

Before this line, you have to add the following comment:

/*
Copyright 2020 The Kubernetes Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

@towca (Collaborator, Author) commented:

The first commit in the PR is just a squash of #7479, #7497, and #7529, and it shouldn't be a part of this review. The PR will be rebased on top of master after the others are merged.

As mentioned in the PR description, I am putting this PR on hold to avoid an accidental merge.

/hold

@k8s-ci-robot added the do-not-merge/hold and needs-rebase labels Nov 26, 2024
@towca force-pushed the jtuznik/dra-actual branch from 3ad1daa to cd3ed99 on November 28, 2024
@k8s-ci-robot removed the needs-rebase label Nov 28, 2024
@towca force-pushed the jtuznik/dra-actual branch 2 times, most recently from 9847f70 to 33afea0, on November 29, 2024
@towca force-pushed the jtuznik/dra-actual branch 7 times, most recently from e3cee18 to 3629698, on December 10, 2024
@pohly (Contributor) commented Dec 11, 2024

/wg device-management

@k8s-ci-robot added the wg/device-management label Dec 11, 2024
@pohly (Contributor) commented Dec 11, 2024

I only looked at the PR description so far. I am not sure whether I am qualified to review the code 😰

Implement some utils for interacting with ResourceClaims. These would probably ideally be upstream, but IMO we can revisit this later.

Agreed, this can come later. https://pkg.go.dev/k8s.io/dynamic-resource-allocation/resourceclaim might be the right package for that. If "resourceclaim" works as a package name, then you could already use it now, and later only the import path would need to be changed once that new code is upstream.

One thing that I don't see mentioned is node readiness: directly after provisioning a new node, the DRA driver pod(s) have to be started on the node and publish their ResourceSlices before the new node is really ready to support pods using claims. I know that there are checks for device plugins to treat such a node as "not ready" while plugins are starting. Is there something similar for DRA drivers? This might not be needed for MVP.

These utils will be used by various parts of the DRA logic in the
following commits.
… state of DRA objects

The Snapshot can hold all DRA objects in the cluster, and expose them
to the scheduler framework via the SharedDRAManager interface.

The state of the objects can be modified during autoscaling simulations
using the provided methods.
The Provider uses DRA object listers to create a Snapshot of the
DRA objects.
The logic is very basic and will likely need to be revised, but it's
something for initial testing. Utilization of a given Pool is calculated
as the number of allocated devices in the pool divided by the number of
all devices in the pool. For scale-down purposes, the max utilization
of all Node-local Pools is used.

The new logic is mostly behind the DRA flag guard, so this should be a no-op
if the flag is disabled. The only difference should be that FilterOutUnremovable
marks a Node as unremovable if calculating utilization fails. Not sure
why this wasn't the case before, but I think we need it for DRA - if CA sees an
incomplete picture of a resource pool, we probably don't want to scale
the Node down.
@towca force-pushed the jtuznik/dra-actual branch from 29b9543 to 2daeb9b on December 11, 2024
@towca (Collaborator, Author) commented Dec 11, 2024

I only looked at the PR description so far. I am not sure whether I am qualified to review the code 😰

Yeah unfortunately it's quite a lot, IMO we should split reviewing responsibilities here:

  • @BigDarkClown you're probably the most familiar with core CA code, could you focus your review on how the DRA logic fits into the existing CA logic? In particular:
    • Could you verify that the behavior doesn't change if the DRA flag is disabled?
    • Could you review the error handling?
    • Could you check if I missed any places that need to take DRA into account (beside the parts with TODO(DRA))?
    • Could you review the code structure etc.?
  • @pohly you're definitely the most familiar with the DRA logic/assumptions, could you focus your review on the DRA parts and if the assumptions match? In particular:
    • ResourceClaim utils (1st commit) - is the logic correct there?
    • "Sanitizing" ResourceClaims (4th commit). During the simulations, CA duplicates Nodes and their DaemonSet pods. Duplicated Nodes/Pods go through "sanitization" so that they aren't actually identical to the originals. Doing this for DS pods using DRA has some corner cases (described in the comments), WDYT about this? Do we anticipate DS pods using DRA?
    • Calculating utilization (5th commit + the PR description) - does my initial "provisional" idea make sense? Is it implemented correctly?
    • Modifying the state of DRA objects during scheduling simulations (6th commit). This is tied to the ResourceClaim utils, as a lot of them are used here. Is the logic correct? One question that comes to mind - if we remove the last pod reserving a shared claim, should it be deallocated?
    • Integration tests (7th commit) - do the scenarios listed there make sense and behave as you would expect them to? Are some major scenarios missing?
  • @jackfrancis AFAIU you're familiar with both the DRA and CA parts, could you focus your review somewhere "between" the two parts above?

/assign @BigDarkClown
/assign @pohly
/assign @jackfrancis

One thing that I don't see mentioned is node readiness: directly after provisioning a new node, the DRA driver pod(s) have to be started on the node and publish their ResourceSlices before the new node is really ready to support pods using claims. I know that there are checks for device plugins to treat such a node as "not ready" while plugins are starting. Is there something similar for DRA drivers? This might not be needed for MVP.

I was wondering about that part; it's hard to test without an actual driver (the test driver from k/k e2e tests doesn't run into this). So a new Node can be Ready but not yet have its ResourceSlices exposed, is that right? If so, then we indeed need to reimplement the device plugin logic. How can CA know that a Node is supposed to have any slices exposed, though? For device plugins we depend on a specific label on the Node (if the label is on the Node but there's no allocatable resource -> fake the node as not ready).

@jackfrancis could you verify how this behaves in practice during your manual testing? Are we seeing double scale-ups?

All added logic is behind the DRA flag guard, this should be a no-op
if the flag is disabled.
…s DRA snapshot

The new logic is flag-guarded, it should be a no-op if DRA is disabled.
By default CA is built with DeltaSnapshotStore, which isn't integrated
with DRA yet.
This is needed so that the scheduler code correctly includes and
executes the DRA plugin.

We could just use the feature gate instead of the DRA flag in CA
(the feature gates flag is already there, just not really used),
but I guess there could be use-cases for having DRA enabled in the
cluster but not in CA (e.g. DRA being tested in the cluster, CA only
operating on non-DRA nodes/pods).
@towca force-pushed the jtuznik/dra-actual branch from 2daeb9b to fec30d3 on December 11, 2024
@jackfrancis (Contributor) commented:

@pohly @towca in my testing (A100 nodes using the nvidia k8s-dra-driver) I didn't see any undesirable functional side-effects of the node bootstrapping process. It's definitely true that before a pending pod+resourceclaim can be fulfilled there's a lot more waiting than normal:

  1. new node comes online
  2. dra driver daemonset pods land on new node and perform host OS level driver installation
  3. gpu operator carries out mig device configuration
  4. node is rebooted
  5. resourceslice is populated

During that period of time, I observed that cluster-autoscaler sanely (1) notices that the new node is there but that (2) the resourceclaim is not yet able to be fulfilled. I didn't notice any TTL expire like "I'm tired of waiting for this new node, I'm going to try and build a new one".

In practice, my experiments produced a ~10 minute wait between node provisioning and resourceslice population. That's a long time, admittedly! Is there a particular configuration we could set to try to reproduce any "double scale out" symptoms in my environment?

For full disclosure, this is the workload I used to test:

apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test
  name: mig-devices
spec:
  spec:
    devices:
      requests:
      - name: mig-1g-10gb
        deviceClassName: mig.nvidia.com
        selectors:
        - cel:
            expression: "device.attributes['gpu.nvidia.com'].profile == '1g.10gb'"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: gpu-test
  name: gpu-test
  labels:
    app: gpu-test-pod
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
    spec:
      runtimeClassName: nvidia
      resourceClaims:
      - name: mig-device-gpu-test
        resourceClaimTemplateName: mig-devices
      containers:
      - name: mig-device-container
        image: ubuntu:22.04
        command: ["bash", "-c"]
        args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
        resources:
          claims:
          - name: mig-device-gpu-test
            request: mig-1g-10gb
  1. create cluster w/ 1 A100 node that exposes 7 * 10GB mig devices
      nvidia.com/mig.capable: "true"
      nvidia.com/mig.config: all-1g.10gb
      nvidia.com/mig.config.state: success
  2. scale deployment replicas out to 8, which creates 1 pending pod + resourceclaim
  3. wait for CA to add a new node and observe (eventually) all 8 pods running + all resourceclaims allocated,reserved

@pohly (Contributor) commented Dec 12, 2024

How can CA know that a Node is supposed to have any slices exposed though?

Isn't that what needs to be configured for a node pool? Without information about what devices are going to be made available together with a new node, CA cannot scale up. This readiness check could compare that assumption against the actual devices.

During that period of time, I observed that cluster-autoscaler sanely (1) notices that the new node is there but that (2) the resourceclaim is not yet able to be fulfilled.

I don't remember what the exact problem is, i.e. whether it is "not scaling up when it should" or "scaling down when it shouldn't". I guess the fact that it works in practice indicates that it is not needed for an MVP.

@pohly (Contributor) left a comment:

ResourceClaim utils (1st commit) - is the logic correct there?

Logic is correct. I would do a few things differently myself, but that is not a blocker for the PR.


// ClaimOwningPod returns the name and UID of the Pod owner of the provided claim. If the claim isn't
// owned by a Pod, empty strings are returned.
func ClaimOwningPod(claim *resourceapi.ResourceClaim) (string, types.UID) {

No need to change it now, but elsewhere I suggested that these helpers might go into the k8s.io/dynamic-resource-allocation/resourceclaim package.

These and other functions can then drop the Claim prefix, because the call becomes resourceclaim.OwningPod.
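For context, a minimal sketch of what such a helper can look like (assumed import paths; it assumes the claim's Pod owner is recorded as a controller OwnerReference of kind Pod):

```go
package sketch

import (
	resourceapi "k8s.io/api/resource/v1beta1"
	"k8s.io/apimachinery/pkg/types"
)

// ClaimOwningPod returns the name and UID of the Pod owning the claim, or empty values
// if the claim isn't controlled by a Pod.
func ClaimOwningPod(claim *resourceapi.ResourceClaim) (string, types.UID) {
	for _, owner := range claim.OwnerReferences {
		if owner.Kind == "Pod" && owner.APIVersion == "v1" && owner.Controller != nil && *owner.Controller {
			return owner.Name, owner.UID
		}
	}
	return "", ""
}
```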

}

// ClaimAllocated returns whether the provided claim is allocated.
func ClaimAllocated(claim *resourceapi.ResourceClaim) bool {

Could also be called resourceclaim.IsAllocated. Not important right now.

}

// ClaimReservedForPod returns whether the provided claim is currently reserved for the provided pod.
func ClaimReservedForPod(claim *resourceapi.ResourceClaim, pod *apiv1.Pod) bool {

Once it is in a staging repo, kubelet should get updated to call this. It must have a similar check.
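As a rough sketch of the check being discussed (assumed types and import paths; a reservation is an entry in status.reservedFor referencing the pod by resource, name, and UID):

```go
package sketch

import (
	apiv1 "k8s.io/api/core/v1"
	resourceapi "k8s.io/api/resource/v1beta1"
)

// ClaimReservedForPod reports whether the claim currently holds a reservation for the pod.
func ClaimReservedForPod(claim *resourceapi.ResourceClaim, pod *apiv1.Pod) bool {
	for _, consumer := range claim.Status.ReservedFor {
		if consumer.APIGroup == "" && consumer.Resource == "pods" && consumer.Name == pod.Name && consumer.UID == pod.UID {
			return true
		}
	}
	return false
}
```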


// DeallocateClaimInPlace clears the allocation of the provided claim.
func DeallocateClaimInPlace(claim *resourceapi.ResourceClaim) {
claim.Status.Allocation = nil

This is a bit short for my taste and might not really make the callsite easier to understand. But it's not wrong.

return
}

newReservedFor := make([]resourceapi.ResourceClaimConsumerReference, 0, len(claim.Status.ReservedFor))

Nit: I would find the right index, then use slices.Delete to create the new slice minus that one entry.
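Something like this (a sketch with a hypothetical helper name; it assumes the claim has already been deep-copied, since slices.Delete modifies the slice in place):

```go
package sketch

import (
	"slices"

	apiv1 "k8s.io/api/core/v1"
	resourceapi "k8s.io/api/resource/v1beta1"
)

// dropPodReservation removes the pod's entry from the claim's ReservedFor list, if present.
func dropPodReservation(claim *resourceapi.ResourceClaim, pod *apiv1.Pod) {
	idx := slices.IndexFunc(claim.Status.ReservedFor, func(ref resourceapi.ResourceClaimConsumerReference) bool {
		return ref.Resource == "pods" && ref.Name == pod.Name && ref.UID == pod.UID
	})
	if idx >= 0 {
		claim.Status.ReservedFor = slices.Delete(claim.Status.ReservedFor, idx, idx+1)
	}
}
```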

func TestClaimOwningPod(t *testing.T) {
truePtr := true
for _, tc := range []struct {
testName string

Personally I prefer map[]struct because it ensures at compile time that I do not reuse the same test name twice.

Also, better avoid spaces in test names. They get replaced with underscores by go test and then the tests listed in the output can't be found easily in the source code. go test -run=... also is more complicated to get right.
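For illustration, a small sketch of that style (hypothetical test body): duplicate keys in the map literal fail to compile, and underscore-only names round-trip cleanly through go test -run.

```go
package sketch

import "testing"

func TestClaimOwningPod(t *testing.T) {
	for name, tc := range map[string]struct {
		wantName string
	}{
		"claim_owned_by_pod":  {wantName: "owner-pod"},
		"claim_without_owner": {wantName: ""},
	} {
		t.Run(name, func(t *testing.T) {
			_ = tc.wantName // call the helper under test here and compare the result
		})
	}
}
```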

@BigDarkClown (Contributor) left a comment:

Left some comments, I am not the expert on DRA so I might have missed something. Overall this LGTM for the CA side, but please get an additional review for DRA handling.

})
podInfo.NeededResourceClaims = append(podInfo.NeededResourceClaims, testClaim(fmt.Sprintf("%s-claim-%d", pod.Name, i)))
}
for i := range 3 {

nit: Any reason why we need two loops?

for id, claim := range s.resourceClaimsById {
result.resourceClaimsById[id] = claim.DeepCopy()
}
// Only the claims are mutable, we don't have to deep-copy anything else.

Please split the comment between before copying resource claims and after, this is weirdly placed right now.

}

// AddClaims adds additional ResourceClaims to the Snapshot. It can be used e.g. if we need to duplicate a Pod that
// owns ResourceClaims.

Please add info to the comment that the method returns an error when the claim already exists in the snapshot - it is very important.

for _, podClaimRef := range pod.Spec.ResourceClaims {
claimName := claimRefToName(pod, podClaimRef)
if claimName == "" {
continue

Why not error? It seems to indicate some issue with snapshot state.

claimId := ResourceClaimId{Name: claimName, Namespace: pod.Namespace}
claim, found := s.resourceClaimsById[claimId]
if !found {
continue

Same comment, why not error?

// TODO(DRA): Figure out how to calculate Node utilization for DRA resources properly. The current logic is provisional, it should work well for Nodes with a single Pool of
// expensive Devices but is probably not flexible enough for other scenarios.
resourceName, highestUtil, err := drautils.HighestDynamicResourceUtilization(nodeInfo)
if err != nil {

Should we error in this case? I am a bit hesitant, as this calculation seems to be experimental. I think a better option would be to continue to cpu + mem.

wantUtilInfo: Info{CpuUtil: 0.5, Utilization: 0.5, ResourceName: apiv1.ResourceCPU},
},
{
testName: "DRA slices present, but no claims, DRA enabled -> DRA util returned despite being lower than CPU",

I am not 100% sure whether this is correct behavior. I think a better way might be to calculate DRA util alongside cpu and mem, but let's keep that for future improvements.

@@ -101,6 +170,30 @@ func (s *PredicateSnapshot) SchedulePodOnAnyNodeMatching(pod *apiv1.Pod, anyNode

// UnschedulePod removes the given Pod from the given Node inside the snapshot.
func (s *PredicateSnapshot) UnschedulePod(namespace string, podName string, nodeName string) error {
if s.draEnabled {

As DraSnapshot is in ClusterSnapshotStore, shouldn't PredicateSnapshot delegate this logic to ClusterSnapshotStore? I don't think it applies to all methods, but here it stood out the most to me.

@@ -33,6 +33,8 @@ import (
"github.com/stretchr/testify/assert"
)

// TODO(DRA): Add DRA-specific test cases.

Oh please do.

var snapshotStore clustersnapshot.ClusterSnapshotStore = store.NewDeltaSnapshotStore()
if autoscalingOptions.DynamicResourceAllocationEnabled {
// TODO(DRA): Remove this once DeltaSnapshotStore is integrated with DRA.
snapshotStore = store.NewBasicSnapshotStore()

Please add some high level log/warning here, this might not be expected for the end-user.

@pohly (Contributor) left a comment:

Reviewed some parts of "CA: sanitize and propagate DRA objects through NodeInfos in node_info utils".

// leave them as-is when duplicating). Returns a map of all pool names (without the suffix) that can be used with SanitizePodResourceClaims().
// Returns an error if any of the slices isn't node-local.
func SanitizeNodeResourceSlices(nodeLocalSlices []*resourceapi.ResourceSlice, newNodeName, nameSuffix string) (newSlices []*resourceapi.ResourceSlice, oldPoolNames map[string]bool, err error) {
oldPoolNames = map[string]bool{}

newSlices = append(newSlices, sliceCopy)
}
return newSlices, oldPoolNames, nil
}

Function looks okay to me. Normally one cannot just append something to a name to create a new name because of length limits, but here this isn't an issue because it's all in-memory.
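For context, a rough sketch of the per-slice renaming being reviewed (not the PR's exact code; assumes the resource.k8s.io API shape where the node and pool names live under Spec.NodeName and Spec.Pool.Name):

```go
package sketch

import (
	"fmt"

	resourceapi "k8s.io/api/resource/v1beta1"
)

// sanitizeSlice duplicates a node-local ResourceSlice for a duplicated Node, suffixing
// the names that must stay unique. Name length limits don't matter here because the
// copy is only ever kept in memory.
func sanitizeSlice(slice *resourceapi.ResourceSlice, newNodeName, nameSuffix string) (*resourceapi.ResourceSlice, string, error) {
	if slice.Spec.NodeName == "" {
		return nil, "", fmt.Errorf("ResourceSlice %s is not node-local", slice.Name)
	}
	oldPoolName := slice.Spec.Pool.Name
	sliceCopy := slice.DeepCopy()
	sliceCopy.Name = slice.Name + "-" + nameSuffix
	sliceCopy.Spec.NodeName = newNodeName
	sliceCopy.Spec.Pool.Name = oldPoolName + "-" + nameSuffix
	return sliceCopy, oldPoolName, nil
}
```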

// SanitizePodResourceClaims can be used to duplicate ResourceClaims needed by a Pod, when duplicating the Pod.
// - ResourceClaims owned by oldOwner are duplicated and sanitized, to be owned by a duplicate pod - newOwner.
// - ResourceClaims not owned by oldOwner are returned unchanged in the result. They are shared claims not bound to the
// lifecycle of the duplicated pod, so they shouldn't be duplicated.

What about updating the ReservedFor so that the new pod is listed if (and only if) the old one was? Or is that something that will be covered elsewhere?

@@ -41,6 +42,9 @@ type nodeGroupTemplateNodeInfoGetter interface {
// SanitizedTemplateNodeInfoFromNodeGroup returns a template NodeInfo object based on NodeGroup.TemplateNodeInfo(). The template is sanitized, and only
// contains the pods that should appear on a new Node from the same node group (e.g. DaemonSet pods).
func SanitizedTemplateNodeInfoFromNodeGroup(nodeGroup nodeGroupTemplateNodeInfoGetter, daemonsets []*appsv1.DaemonSet, taintConfig taints.TaintConfig) (*framework.NodeInfo, errors.AutoscalerError) {
// TODO(DRA): Figure out how to handle TemplateNodeInfo() returning DaemonSet Pods using DRA. Currently, things only work correctly if such pods are
// already allocated by TemplateNodeInfo(). It might be better for TemplateNodeInfo() to return unallocated claims, and to run scheduler predicates and
// compute the allocations here.

#7530:

"Sanitizing" ResourceClaims (4th commit). During the simulations, CA duplicates Nodes and their DaemonSet pods. Duplicated Nodes/Pods go through "sanitization" so that they aren't actually identical to the originals. Doing this for DS pods using DRA has some corner cases (described in the comments), WDYT about this? Do we anticipate DS pods using DRA?

A use case for a DS with claims is for monitoring, in which the claims will use "admin mode", meaning that they don't block allocating the monitored devices for some other pod. Admin mode is under a feature gate in Kubernetes and alpha, so supporting this in CA is not a priority.


for _, podInfo := range nodeInfo.Pods() {
freshPod := sanitizePod(podInfo.Pod, freshNode.Name, namesSuffix)
result.AddPod(framework.NewPodInfo(freshPod, nil))
freshResourceClaims, err := drautils.SanitizePodResourceClaims(freshPod, podInfo.Pod, podInfo.NeededResourceClaims, namesSuffix, nodeInfo.Node().Name, freshNodeName, oldPoolNames)

It's not immediately obvious that drautils.SanitizePodResourceClaims doesn't change something in place. Perhaps rename it to drautils.SanitizePodResourceClaimsDeepCopy for consistency with this NodeInfoSanitizedDeepCopy here?

Or is it a convention that sanitize<something> returns a deep copy? sanitizePod and sanitizeNode below do that. Then why the DeepCopy suffix in NodeInfoSanitizedDeepCopy?

@@ -130,6 +145,9 @@ func podsExpectedOnFreshNode(sanitizedExampleNodeInfo *framework.NodeInfo, daemo
}
}
// Add all pending DS pods if force scheduling DS
// TODO(DRA): Figure out how to make this work for DS pods using DRA. Currently such pods would get force-added to the
// ClusterSnapshot, but the ResourceClaims reflecting their DRA usage on the Node wouldn't. So CA would be overestimating
// available DRA resources on the Node.

This should be okay in practice for the "admin mode" monitoring use case because all devices on the new node would be available to pods, despite also having such a DS.

poolDevices := getAllDevices(currentSlices)
allocatedDeviceNames := allocatedDevices[driverName][poolName]
unallocated, allocated := splitDevicesByAllocation(poolDevices, allocatedDeviceNames)
result[driverName][poolName] = calculatePoolUtil(unallocated, allocated)

#7530:

Calculating utilization (5th commit + the PR description) - does my initial "provisional" idea make sense? Is it implemented correctly?

This looks good to me for now.

It might have to become a bit more complex once partitionable devices get added (targeting 1.33 with that), because then a single GPU will be represented by different "logical" devices in a slice and the "size" of those matters. For example, on one node one device representing an entire GPU might be allocated (one device out of many allocated), while on some other node, multiple smaller subsets of a similar GPU might be allocated (several devices out of many allocated). Then simply counting "allocated devices" would give the second node a higher utilization although the GPU is less utilized.

I'm not sure yet how this can be handled in a generic way. Perhaps we need to standardize a special attribute of devices which names a quantity that can serve as stand-in for the size of a device? For GPUs, that could be the memory size.

/cc @klueska @mortent @johnbelamaric

if len(foundPod.Spec.ResourceClaims) > 0 {
if err := s.ClusterSnapshotStore.DraSnapshot().UnreservePodClaims(foundPod); err != nil {
return err
}

#7530

Modifying the state of DRA objects during scheduling simulations (6th commit). This is tied to the ResourceClaim utils, as a lot of them are used here. Is the logic correct? One question that comes to mind - if we remove the last pod reserving a shared claim, should it be deallocated?

Yes, it should be treated as deallocated. Otherwise you cannot move such a pod from one node to another during a simulated scale down. It would be forced to stay on the original node because that is where the claim is allocated.

The rest of the logic looks okay to me.
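A sketch of the agreed behavior (hypothetical helper, not the PR's exact code): drop the pod's reservation, and once the last consumer is gone, clear the allocation so the simulation can re-allocate the claim on another node.

```go
package sketch

import (
	"slices"

	apiv1 "k8s.io/api/core/v1"
	resourceapi "k8s.io/api/resource/v1beta1"
)

// unreservePodClaim removes the pod's reservation and deallocates the claim if nothing
// else is reserving it anymore.
func unreservePodClaim(claim *resourceapi.ResourceClaim, pod *apiv1.Pod) {
	claim.Status.ReservedFor = slices.DeleteFunc(claim.Status.ReservedFor, func(ref resourceapi.ResourceClaimConsumerReference) bool {
		return ref.Resource == "pods" && ref.Name == pod.Name && ref.UID == pod.UID
	})
	if len(claim.Status.ReservedFor) == 0 {
		claim.Status.Allocation = nil
	}
}
```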

{nodeName: node3GpuA1slice.name + "-1", reason: simulator.NotUnderutilized},
{nodeName: node3GpuA1slice.name + "-2", reason: simulator.NoPlaceToMovePods},
},
},

#7530

Integration tests (7th commit) - do the scenarios listed there make sense and behave as you would expect them to? Are some major scenarios missing?

A scale down scenario where a pod using a claim gets moved successfully to some other node would be good. As discussed in #7530, this probably doesn't work yet unless deallocation of the claim gets added. That scenario then also can cover utilization calculation (the "least utilized" node is meant to get removed).

A monitoring DS with admin mode might also be useful, once support for that gets added.

Specific scenarios aside, a general comment about testing: I see that DRA support is behind a flag. Would it be possible to run all existing test scenarios with and without that flag? Nothing should change, but one cannot be sure unless it gets tested...

The inverse is also true: what happens in a cluster where DRA is used and the flag is off? The outcome is probably different from the "flag is on" case, but is it still sane?
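One cheap way to get that coverage, sketched below (hypothetical test scaffolding; assumes the existing scenarios can take the DRA toggle as a parameter):

```go
package sketch

import (
	"fmt"
	"testing"
)

// runScenario stands in for an existing test scenario that accepts the DRA toggle.
func runScenario(t *testing.T, draEnabled bool) {
	_ = draEnabled // wire this into the autoscaler options under test
}

func TestScenariosWithAndWithoutDRA(t *testing.T) {
	for _, draEnabled := range []bool{false, true} {
		t.Run(fmt.Sprintf("draEnabled=%v", draEnabled), func(t *testing.T) {
			runScenario(t, draEnabled)
		})
	}
}
```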
