
WIP: Implement DRA support in Cluster Autoscaler #7350

Open
towca wants to merge 21 commits into master from jtuznik/dra-final
Conversation

towca
Collaborator

@towca towca commented Oct 4, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR implements support for Dynamic Resource Allocation (DRA) in Cluster Autoscaler.

Which issue(s) this PR fixes:

The CA/DRA integration is tracked in kubernetes/kubernetes#118612. The integration requires changes in CA and kube-scheduler - this is the CA part. The kube-scheduler part will be sent out shortly.

Special notes for your reviewer:

The PR is not complete yet, missing parts are labeled with TODO(DRA):

  • A bunch of unit tests need to be updated with DRA-specific test cases.
  • More integration tests need to be added to static_autoscaler_dra_test.go.
  • The integration test scenarios have to be tested in a real cluster.
  • Only BasicClusterSnapshot was adapted to work with DRA, the same needs to be done for DeltaClusterSnapshot.
  • Some fine details around "expendable pods" have to be figured out.

The rest of the implementation should be stable and reviewable right now.

I'm not sure what the best way to review such a large change would be. The PR is split into 20 meaningful commits that should be reviewed in sequence. It should be safe to submit a prefix of the commits as they are approved, but I have no idea how to facilitate something like this on GitHub.

Everything before the "DRA: grab a snapshot of DRA objects and plumb to ClusterSnapshot" commit should be a semantic no-op refactor. Later commits were designed to hide the new DRA logic behind a feature flag, but not everything could be easily hidden without a huge readability hit.

Does this PR introduce a user-facing change?

TODO

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/blob/9de7f62e16fc5c1ea3bd40689487c9edc7fa5057/keps/sig-node/4381-dra-structured-parameters/README.md

@towca towca added area/cluster-autoscaler area/core-autoscaler Denotes an issue that is related to the core autoscaler and is not specific to any provider. labels Oct 4, 2024
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. area/provider/alicloud Issues or PRs related to the AliCloud cloud provider implementation labels Oct 4, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: towca

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/provider/aws Issues or PRs related to aws provider approved Indicates a PR has been approved by an approver from all required OWNERS files. area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 4, 2024
@k8s-ci-robot k8s-ci-robot added area/provider/cluster-api Issues or PRs related to Cluster API provider area/provider/digitalocean Issues or PRs related to digitalocean provider area/provider/equinixmetal Issues or PRs related to the Equinix Metal cloud provider for Cluster Autoscaler area/provider/externalgrpc Issues or PRs related to the External gRPC provider area/provider/gce area/provider/hetzner Issues or PRs related to Hetzner provider area/provider/ionoscloud area/provider/kwok Issues or PRs related to the kwok cloud provider for Cluster Autoscaler area/provider/linode Issues or PRs related to linode provider area/provider/magnum Issues or PRs related to the Magnum cloud provider for Cluster Autoscaler area/provider/oci Issues or PRs related to oci provider area/provider/rancher size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Oct 4, 2024
@towca
Collaborator Author

towca commented Oct 7, 2024

/assign @MaciekPytel

towca added 2 commits October 7, 2024 19:26
…eChecker

This allows other components to interact with the Framework, which will
be needed for DRA support later.
Methods to interact with the new internal types are added to
ClusterSnapshot. Cluster Autoscaler code will be migrated to only
use these methods and work on the internal types instead of directly
using the framework types.

The new types are designed so that they can be used exactly like the
framework types, which should make the migration manageable.

This allows easily adding additional data to the Nodes and Pods tracked
in ClusterSnapshot, without having to change the scheduler framework.
This will be needed to support DRA, as we'll need to track
ResourceSlices and ResourceClaims.
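For illustration, the wrapper types could look roughly like the sketch below. All type and field names are placeholders standing in for the scheduler framework and resource.k8s.io types, not the PR's actual definitions.

```go
package sketch

// SchedulerNodeInfo and Pod stand in for the scheduler framework's NodeInfo
// and the core Pod type; ResourceSlice and ResourceClaim stand in for the
// resource.k8s.io API objects. All of this is illustrative only.
type SchedulerNodeInfo struct{ Name string }
type Pod struct{ Namespace, Name string }
type ResourceSlice struct{ Pool string }
type ResourceClaim struct{ Namespace, Name string }

// NodeInfo embeds the framework NodeInfo so it can be used exactly like it,
// while also carrying the node-local DRA objects the autoscaler must track.
type NodeInfo struct {
	*SchedulerNodeInfo
	LocalResourceSlices []*ResourceSlice
	Pods                []*PodInfo
}

// PodInfo pairs a pod with the ResourceClaims it references.
type PodInfo struct {
	*Pod
	NeededResourceClaims []*ResourceClaim
}
```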
towca added 10 commits October 7, 2024 19:38
AddNodes() is redundant - it was intended for batch-adding nodes,
probably with batch-specific optimizations in mind. However, it
has always been implemented as just iterating over AddNode(), and
it is only used in test code.

Most of the uses in the test code were initialization - they are
replaced with Initialize(), which will later be needed for handling
DRA anyway. The other uses are replaced with inline loops over
AddNode().
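For illustration, the inline replacement amounts to something like the following sketch (the Node type and the nodeAdder interface are simplified stand-ins, not the real ClusterSnapshot API):

```go
package sketch

// Node stands in for *apiv1.Node; nodeAdder stands in for the slice of the
// ClusterSnapshot interface that's relevant here.
type Node struct{ Name string }

type nodeAdder interface {
	AddNode(node *Node) error
}

// addAllNodes shows how a batch AddNodes(nodes) call reduces to a plain loop
// over AddNode, which is how AddNodes was implemented anyway.
func addAllNodes(snapshot nodeAdder, nodes []*Node) error {
	for _, node := range nodes {
		if err := snapshot.AddNode(node); err != nil {
			return err
		}
	}
	return nil
}
```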
The method is already accessible via StorageInfos(), so it's
redundant.
AddNodeInfo already provides the same functionality, and has to be used
in production code in order to propagate DRA objects correctly.

Uses in production are replaced with Initialize(), which will later
take DRA objects into account. Uses in the test code are replaced with
AddNodeInfo().
simulator.BuildNodeInfoForNode, core_utils.GetNodeInfoFromTemplate,
and scheduler_utils.DeepCopyTemplateNode all had very similar logic
for sanitizing and copying NodeInfos. They're all consolidated into
one file in simulator, sharing common logic.

MixedTemplateNodeInfoProvider now correctly uses ClusterSnapshot to
correlate Nodes to scheduled pods, instead of using a live Pod lister.
This means that the snapshot now has to be properly initialized in a
bunch of tests.
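A rough sketch of what a consolidated sanitize-and-copy helper might look like, assuming "sanitizing" means deep-copying a template NodeInfo and giving the copy a fresh, unique node name (the types, the suffix scheme, and the hostname-label handling are placeholders, not the PR's actual code):

```go
package sketch

import "fmt"

// NodeInfo is a simplified stand-in for the simulator's NodeInfo type.
type NodeInfo struct {
	Name   string
	Labels map[string]string
}

// sanitizedCopy deep-copies a template NodeInfo and renames it so that many
// simulated nodes can be created from one template without name collisions.
func sanitizedCopy(template *NodeInfo, suffix string) *NodeInfo {
	out := &NodeInfo{
		Name:   fmt.Sprintf("%s-%s", template.Name, suffix),
		Labels: make(map[string]string, len(template.Labels)),
	}
	for k, v := range template.Labels {
		out.Labels[k] = v
	}
	// Keep the hostname label consistent with the new name so that scheduler
	// predicates relying on it see a coherent node.
	out.Labels["kubernetes.io/hostname"] = out.Name
	return out
}
```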
…terSnapshot implementations

The implementations will need to interact with the scheduler framework
when placing Pods on Nodes, in order to simulate DRA ResourceClaim allocation.

Tests are refactored so that ClusterSnapshot and PredicateChecker objects
get the same framework handle.
This will be needed to track changes to the DRA objects while making
scheduling simulations.
…ialize

Having a second snapshot object inside ClusterSnapshot isn't ideal from
a readability perspective, but the DRA objects can't just be tracked
inside the NodeInfos/PodInfos.

ResourceClaims can be shared between multiple Pods, so we need some global
location for them anyway. There are also ResourceSlices that aren't
node-local, which the snapshot still needs to pass to the DRA scheduler
plugin to ensure correct results.

Out of multiple options I tried prototyping, having a single
source-of-truth snapshot of all DRA objects that is modified during
ClusterSnapshot operations seems the cleanest. Trying to model it in a
different way always resulted in something being really confusing, or
having to synchronize a lot of state.

The Basic ClusterSnapshot can just clone the DRA snapshot on Fork(). The
Delta implementation will need something more sophisticated, but leaving
that for the end.
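For illustration, the single source-of-truth DRA snapshot and its clone-on-Fork() behavior could look roughly like this (all types here are simplified placeholders for the resource.k8s.io objects, not the PR's actual code):

```go
package sketch

// ResourceClaim and ResourceSlice stand in for the resource.k8s.io objects.
type ResourceClaim struct{ Namespace, Name string }
type ResourceSlice struct{ NodeName, Pool string }

// draSnapshot is a hypothetical single source of truth for all DRA objects,
// modified as pods are scheduled/unscheduled in the ClusterSnapshot.
type draSnapshot struct {
	// claims is global because a ResourceClaim can be shared by many pods.
	claims map[string]*ResourceClaim // keyed by "namespace/name"
	// slices holds both node-local and non-node-local ResourceSlices, since
	// the non-local ones still have to be passed to the DRA scheduler plugin.
	slices []*ResourceSlice
}

// clone deep-copies the snapshot, which is what the Basic ClusterSnapshot
// can do on Fork() so that changes can later be reverted.
func (s *draSnapshot) clone() *draSnapshot {
	out := &draSnapshot{
		claims: make(map[string]*ResourceClaim, len(s.claims)),
		slices: make([]*ResourceSlice, 0, len(s.slices)),
	}
	for key, claim := range s.claims {
		copied := *claim
		out.claims[key] = &copied
	}
	for _, slice := range s.slices {
		copied := *slice
		out.slices = append(out.slices, &copied)
	}
	return out
}
```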
AddPod is renamed to SchedulePod, and RemovePod to UnschedulePod. This makes
more sense in the DRA world: for DRA we're not only adding/removing
the pod, but also modifying its ResourceClaims - without adding/removing
them (the ResourceClaims need to be tracked even for pods that aren't
scheduled).

RemoveNode is renamed to RemoveNodeInfo for consistency with other
NodeInfo methods.
…rom PredicateChecker

SchedulePod takes an additional parameter. If reserveState is passed,
the Reserve() phase of the scheduling cycle will be run, so that the DRA
scheduler plugin can allocate ResourceClaims in the DRA snapshot if needed.
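A hypothetical shape of the resulting interface, just to make the flow concrete; the real signatures and types (including how reserveState is produced by PredicateChecker) live in the PR:

```go
package sketch

// Pod and ReserveState are placeholders for the real types; ReserveState
// stands for whatever the predicate check returns so that Reserve() can be
// run later for the same pod/node pair.
type Pod struct{ Namespace, Name string }
type ReserveState struct{}

// ClusterSnapshot sketches only the scheduling-related part of the interface.
type ClusterSnapshot interface {
	// SchedulePod places the pod on the node. If reserveState is non-nil,
	// the Reserve() phase of the scheduling cycle is also run, letting the
	// DRA plugin allocate the pod's ResourceClaims in the DRA snapshot.
	SchedulePod(pod *Pod, nodeName string, reserveState *ReserveState) error
	// UnschedulePod reverts a SchedulePod call without dropping the pod's
	// ResourceClaims from tracking.
	UnschedulePod(namespace, name, nodeName string) error
}
```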
@towca towca force-pushed the jtuznik/dra-final branch 2 times, most recently from 1f39113 to 1d2e0e0 Compare October 7, 2024 18:00
towca added 7 commits October 9, 2024 18:24
The logic is very basic and will likely need to be revised, but it's
something for initial testing. Utilization of a given Pool is calculated
as the number of allocated devices in the pool divided by the number of
all devices in the pool. For scale-down purposes, the max utilization
of all Node-local Pools is used.
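As a minimal sketch of the rule described above (the pool type is a placeholder; in the real code the counts would come from ResourceSlices and allocated ResourceClaims):

```go
package sketch

// pool is a simplified stand-in for a DRA device pool on a node.
type pool struct {
	devices   int // total devices advertised by the pool's ResourceSlices
	allocated int // devices currently allocated to ResourceClaims
}

// poolUtilization is allocated devices divided by all devices in the pool.
func poolUtilization(p pool) float64 {
	if p.devices == 0 {
		return 0
	}
	return float64(p.allocated) / float64(p.devices)
}

// nodeDRAUtilization returns the max utilization over the node-local pools,
// which is what the scale-down logic described above uses.
func nodeDRAUtilization(pools []pool) float64 {
	maxUtil := 0.0
	for _, p := range pools {
		if u := poolUtilization(p); u > maxUtil {
			maxUtil = u
		}
	}
	return maxUtil
}
```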
CA ignores Pods with priority below a cutoff, and pretends
they aren't in the cluster. If the pods have allocated ResourceClaims,
they would still block resources on a Node. So ResourceClaims owned by
expendable pods are removed from the DRA snapshot.

Predicates are now run when scheduling Pods waiting for preemption
to their nominatedNodeName. I'm not sure how this works if the preempted
pod is still on the Node - I suspect the filters would fail. This needs
to be tested; I left a TODO.
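To make the expendable-pod handling concrete, here is a sketch of the claim-filtering step (all types are simplified placeholders; the real code works on apiv1 Pods and resource.k8s.io ResourceClaims):

```go
package sketch

// pod and claim are simplified placeholders for illustration only.
type pod struct {
	uid      string
	priority int32
}

type claim struct {
	name     string
	ownerUID string // UID of the owning pod, if any
}

// dropExpendableOwnedClaims removes claims owned by pods below the priority
// cutoff, so that those claims no longer block devices in the DRA snapshot.
func dropExpendableOwnedClaims(claims []claim, pods []pod, cutoff int32) []claim {
	expendable := make(map[string]bool)
	for _, p := range pods {
		if p.priority < cutoff {
			expendable[p.uid] = true
		}
	}
	var kept []claim
	for _, c := range claims {
		if c.ownerUID != "" && expendable[c.ownerUID] {
			continue
		}
		kept = append(kept, c)
	}
	return kept
}
```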
DRA integration in CA needs changes in the scheduler framework. The
changes are currently in review in
kubernetes/kubernetes#127904. This commit
pulls these changes to vendor/ so that the PR can be tested and
iterated on.

Note that this also bumps all of CA's k8s dependencies, and there
was a breaking change in the scheduler framework - it seems that
InitMetrics() needs to be called before calling NewFramework() now.

DO NOT SUBMIT - instead, CA k8s deps should be synced after k/k#127904
is submitted (and the breaking change handled).
@towca towca force-pushed the jtuznik/dra-final branch from 1d2e0e0 to 2e7eeea Compare October 9, 2024 16:51
@@ -301,6 +301,8 @@ type AutoscalingOptions struct {
	ProvisioningRequestMaxBackoffTime time.Duration
	// ProvisioningRequestMaxCacheSize is the max size for ProvisioningRequest cache that is stored for retry backoff.
	ProvisioningRequestMaxBackoffCacheSize int
	// EnableDynamicResources configures whether logic for handling DRA objects is enabled.
	EnableDynamicResources bool
Contributor

Would calling this EnableDynamicResourceAllocation maybe be a bit more clear?

@@ -44,7 +45,8 @@ type AutoscalingContext struct {
	AutoscalingKubeClients
	// CloudProvider used in CA.
	CloudProvider cloudprovider.CloudProvider
	// TODO(kgolab) - move away too as it's not config
Contributor

Is this comment no longer necessary?

func TestFrameworkHandleOrDie(t *testing.T) *Handle {
	handle, err := TestFrameworkHandle()
	if err != nil {
		t.Error(err)
Contributor

FWIW there is a t.Fatal that will stop execution of the test, where t.Error will keep going. So t.Fatal might more closely fulfill the OrDie part of this function name if that's significant.

limitations under the License.
*/

package dynamicresources
Contributor

If it's not too unbearably long, dynamicresourceallocation might be a clearer name for this package.

@@ -276,6 +276,7 @@ var (
asyncNodeGroupsEnabled = flag.Bool("async-node-groups", false, "Whether clusterautoscaler creates and deletes node groups asynchronously. Experimental: requires cloud provider supporting async node group operations, enable at your own risk.")
proactiveScaleupEnabled = flag.Bool("enable-proactive-scaleup", false, "Whether to enable/disable proactive scale-ups, defaults to false")
podInjectionLimit = flag.Int("pod-injection-limit", 5000, "Limits total number of pods while injecting fake pods. If unschedulable pods already exceeds the limit, pod injection is disabled but pods are not truncated.")
enableDynamicResources = flag.Bool("enable-dynamic-resources", false, "Whether logic for handling DRA objects is enabled.")
Contributor

I think this flag name might be another place worth using the full enable-dynamic-resource-allocation to more closely associate this with DRA.

@@ -47,7 +49,7 @@ type Info struct {
// memory) or gpu utilization based on if the node has GPU or not. Per resource
// utilization is the sum of requests for it divided by allocatable. It also
// returns the individual cpu, memory and gpu utilization.
Contributor

Is it worth also mentioning DRA in this comment alongside CPU/GPU/memory?

if ctx.EnableDynamicResources && dynamicresources.PodNeedsResourceClaims(p) {
	state, err := ctx.PredicateChecker.CheckPredicates(ctx.ClusterSnapshot, p, p.Status.NominatedNodeName)
	if err != nil {
		klog.Warningf("Tried to running Filters for preempting pod %s/%s on nominatedNodeName, but they failed - ignoring the pod. Error: %v", p.Namespace, p.Name, err)
Contributor

This should probably be "Tried to run" or "Tried running."

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 16, 2024
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
