WIP: Implement DRA support in Cluster Autoscaler #7350
base: master
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: towca. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/assign @MaciekPytel
…eChecker This allows other components to interact with the Framework, which will be needed for DRA support later.
Methods to interact with the new internal types are added to ClusterSnapshot. Cluster Autoscaler code will be migrated to only use these methods and work on the internal types instead of directly using the framework types. The new types are designed so that they can be used exactly like the framework types, which should make the migration manageable. This allows easily adding additional data to the Nodes and Pods tracked in ClusterSnapshot, without having to change the scheduler framework. This will be needed to support DRA, as we'll need to track ResourceSlices and ResourceClaims.
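To make the migration concrete, here is a rough sketch of what such wrapper types could look like (type and field names are illustrative, not the actual CA code; the resource API version is an assumption):

```go
package framework

import (
	apiv1 "k8s.io/api/core/v1"
	resourceapi "k8s.io/api/resource/v1alpha3" // assumed DRA API version
	schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
)

// NodeInfo wraps the scheduler framework's NodeInfo, so existing call sites
// keep working, and adds a place for node-local DRA objects.
type NodeInfo struct {
	*schedulerframework.NodeInfo

	// LocalResourceSlices are the ResourceSlices exposed by DRA drivers
	// running on this Node.
	LocalResourceSlices []*resourceapi.ResourceSlice
}

// PodInfo wraps a Pod together with the ResourceClaims it references.
type PodInfo struct {
	*apiv1.Pod

	// NeededResourceClaims are the claims the Pod needs allocated and
	// reserved before it can run.
	NeededResourceClaims []*resourceapi.ResourceClaim
}
```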
AddNodes() is redundant - it was intended for batch-adding nodes, probably with batch-specific optimizations in mind. However, it has always been implemented as just iterating over AddNode(), and is only used in test code. Most of the uses in the test code were initialization - they are replaced with Initialize(), which will later be needed for handling DRA anyway. The other uses are replaced with inline loops over AddNode().
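For illustration, the inline replacement in tests boils down to something like this (a sketch; the AddNode signature and import path are assumptions):

```go
package example

import (
	"testing"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/autoscaler/cluster-autoscaler/simulator/clustersnapshot" // assumed import path
)

// addNodes mirrors what the removed AddNodes() batch helper did: just loop.
func addNodes(t *testing.T, snapshot clustersnapshot.ClusterSnapshot, nodes []*apiv1.Node) {
	t.Helper()
	for _, node := range nodes {
		if err := snapshot.AddNode(node); err != nil {
			t.Fatalf("AddNode(%q) failed: %v", node.Name, err)
		}
	}
}
```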
The method is redundant - it's already accessible via StorageInfos().
AddNodeInfo already provides the same functionality, and has to be used in production code in order to propagate DRA objects correctly. Uses in production are replaced with Initialize(), which will later take DRA objects into account. Uses in the test code are replaced with AddNodeInfo().
simulator.BuildNodeInfoForNode, core_utils.GetNodeInfoFromTemplate, and scheduler_utils.DeepCopyTemplateNode all had very similar logic for sanitizing and copying NodeInfos. They're all consolidated to one file in simulator, sharing common logic. MixedTemplateNodeInfoProvider now correctly uses ClusterSnapshot to correlate Nodes to scheduled pods, instead of using a live Pod lister. This means that the snapshot now has to be properly initialized in a bunch of tests.
…terSnapshot implementations The implementations will need to interact with the scheduler framework when placing Pods on Nodes, in order to simulate DRA ResourceClaim allocation. Tests are refactored so that ClusterSnapshot and PredicateChecker objects get the same framework handle.
This will be needed to track changes to the DRA objects while making scheduling simulations.
…ialize Having a second snapshot object inside ClusterSnapshot isn't ideal from a readability perspective, but the DRA objects can't just be tracked inside the NodeInfos/PodInfos. ResourceClaims can be shared between multiple Pods, so we need some global location for them anyway. There are ResourceSlices that aren't node-local that the snapshot still needs to pass to the DRA scheduler plugin to ensure correct results. Out of the multiple options I tried prototyping, having a single source-of-truth snapshot of all DRA objects that is modified during ClusterSnapshot operations seems the cleanest. Trying to model it in a different way always resulted in something being really confusing, or having to synchronize a lot of state. The Basic ClusterSnapshot can just clone the DRA snapshot on Fork(). The Delta implementation will need something more sophisticated, but leaving that for the end.
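Below is a minimal, self-contained sketch of the Fork()/Revert() idea for the Basic implementation (all names are illustrative, not the actual CA types):

```go
// draState stands in for the snapshot of all DRA objects (ResourceClaims,
// ResourceSlices, DeviceClasses); its content is simplified to a single map.
type draState struct {
	claimsByUID map[string]string
}

func (d *draState) clone() *draState {
	c := &draState{claimsByUID: make(map[string]string, len(d.claimsByUID))}
	for k, v := range d.claimsByUID {
		c.claimsByUID[k] = v
	}
	return c
}

type basicSnapshot struct {
	// One entry per Fork() level; the last element is the active state.
	draStack []*draState
}

func (s *basicSnapshot) current() *draState { return s.draStack[len(s.draStack)-1] }

// Fork clones the whole DRA state so simulated scheduling can mutate it freely.
func (s *basicSnapshot) Fork() { s.draStack = append(s.draStack, s.current().clone()) }

// Revert drops the mutated clone, restoring the pre-Fork state.
func (s *basicSnapshot) Revert() { s.draStack = s.draStack[:len(s.draStack)-1] }
```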
AddPod is renamed to SchedulePod, RemovePod to UnschedulePod. This makes more sense in the DRA world as for DRA we're not only adding/removing the pod, but also modifying its ResourceClaims - but not adding/removing them (the ResourceClaims need to be tracked even for pods that aren't scheduled). RemoveNode is renamed to RemoveNodeInfo for consistency with other NodeInfo methods.
…rom PredicateChecker SchedulePod takes an additional parameter. If reserveState is passed, the Reserve() phase of the scheduling cycle will be run, so that the DRA scheduler plugin can allocate ResourceClaims in the DRA snapshot if needed.
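As a sketch of the intended flow (the struct and field names here are assumptions, not the actual CA code; the framework call is the upstream RunReservePluginsReserve):

```go
package example

import (
	"context"
	"fmt"

	apiv1 "k8s.io/api/core/v1"
	schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
)

// snapshot is a stand-in for a ClusterSnapshot implementation that holds a
// scheduler framework handle.
type snapshot struct {
	fwk schedulerframework.Framework
}

// SchedulePod places the pod on the node; if reserveState (produced by the
// preceding Filter run) is non-nil, the Reserve phase is run first so plugins
// like dynamicresources can record ResourceClaim allocations.
func (s *snapshot) SchedulePod(pod *apiv1.Pod, nodeName string, reserveState *schedulerframework.CycleState) error {
	if reserveState != nil {
		status := s.fwk.RunReservePluginsReserve(context.Background(), reserveState, pod, nodeName)
		if !status.IsSuccess() {
			return fmt.Errorf("reserving pod %s/%s on %s failed: %v", pod.Namespace, pod.Name, nodeName, status.Message())
		}
	}
	// ...then record the pod as scheduled on nodeName in the in-memory snapshot.
	return nil
}
```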
Force-pushed from 1f39113 to 1d2e0e0.
The logic is very basic and will likely need to be revised, but it's something for initial testing. Utilization of a given Pool is calculated as the number of allocated devices in the pool divided by the number of all devices in the pool. For scale-down purposes, the max utilization of all Node-local Pools is used.
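As a concrete sketch of that calculation (simplified types, not the actual CA code):

```go
// pool summarizes one DRA resource Pool for utilization purposes.
type pool struct {
	name      string
	devices   int // all devices advertised by the pool's ResourceSlices
	allocated int // devices currently allocated to ResourceClaims
}

func poolUtilization(p pool) float64 {
	if p.devices == 0 {
		return 0
	}
	return float64(p.allocated) / float64(p.devices)
}

// nodeDRAUtilization is the max utilization over all node-local pools, which
// is what scale-down compares against its utilization threshold.
func nodeDRAUtilization(pools []pool) float64 {
	max := 0.0
	for _, p := range pools {
		if u := poolUtilization(p); u > max {
			max = u
		}
	}
	return max
}
```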
CA ignores Pods with priority below a cutoff, and pretends they aren't in the cluster. If these pods have allocated ResourceClaims, they would still block resources on a Node. So ResourceClaims owned by expendable pods are removed from the DRA snapshot. Predicates are now run when scheduling Pods waiting for preemption to their nominatedNodeName. Not sure how this works if the preempted pod is still on the Node; I suspect the filters would fail. This needs to be tested - left a TODO.
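A sketch of the pruning step (names are illustrative; the resource API version is an assumption):

```go
package example

import (
	resourceapi "k8s.io/api/resource/v1alpha3" // assumed DRA API version
	"k8s.io/apimachinery/pkg/types"
)

// pruneExpendableClaims drops ResourceClaims owned by expendable pods so they
// don't keep devices blocked on Nodes during scheduling simulations.
func pruneExpendableClaims(claims []*resourceapi.ResourceClaim, expendablePodUIDs map[types.UID]bool) []*resourceapi.ResourceClaim {
	var kept []*resourceapi.ResourceClaim
	for _, claim := range claims {
		ownedByExpendable := false
		for _, ref := range claim.OwnerReferences {
			if expendablePodUIDs[ref.UID] {
				ownedByExpendable = true
				break
			}
		}
		if !ownedByExpendable {
			kept = append(kept, claim)
		}
	}
	return kept
}
```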
DRA integration in CA needs changes in the scheduler framework. The changes are currently in review in kubernetes/kubernetes#127904. This commit pulls these changes to vendor/ so that the PR can be tested and iterated on. Note that this also bumps all of CA's k8s dependencies, and there was a breaking change in the scheduler framework - it seems that InitMetrics() needs to be called before calling NewFramework() now. DO NOT SUBMIT - instead, CA k8s deps should be synced after k/k#127904 is submitted (and the breaking change handled).
Force-pushed from 1d2e0e0 to 2e7eeea.
@@ -301,6 +301,8 @@ type AutoscalingOptions struct {
    ProvisioningRequestMaxBackoffTime time.Duration
    // ProvisioningRequestMaxCacheSize is the max size for ProvisioningRequest cache that is stored for retry backoff.
    ProvisioningRequestMaxBackoffCacheSize int
    // EnableDynamicResources configures whether logic for handling DRA objects is enabled.
    EnableDynamicResources bool
Would calling this EnableDynamicResourceAllocation maybe be a bit more clear?
@@ -44,7 +45,8 @@ type AutoscalingContext struct {
    AutoscalingKubeClients
    // CloudProvider used in CA.
    CloudProvider cloudprovider.CloudProvider
    // TODO(kgolab) - move away too as it's not config
Is this comment no longer necessary?
func TestFrameworkHandleOrDie(t *testing.T) *Handle {
    handle, err := TestFrameworkHandle()
    if err != nil {
        t.Error(err)
FWIW there is a t.Fatal that will stop execution of the test, where t.Error will keep going. So t.Fatal might more closely fulfill the OrDie part of this function name if that's significant.
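e.g., a sketch of the suggested change to the function above:

```go
func TestFrameworkHandleOrDie(t *testing.T) *Handle {
	handle, err := TestFrameworkHandle()
	if err != nil {
		t.Fatalf("TestFrameworkHandle() failed: %v", err)
	}
	return handle
}
```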
limitations under the License.
*/

package dynamicresources
If it's not too unbearably long, dynamicresourceallocation might be a clearer name for this package.
@@ -276,6 +276,7 @@ var (
    asyncNodeGroupsEnabled = flag.Bool("async-node-groups", false, "Whether clusterautoscaler creates and deletes node groups asynchronously. Experimental: requires cloud provider supporting async node group operations, enable at your own risk.")
    proactiveScaleupEnabled = flag.Bool("enable-proactive-scaleup", false, "Whether to enable/disable proactive scale-ups, defaults to false")
    podInjectionLimit = flag.Int("pod-injection-limit", 5000, "Limits total number of pods while injecting fake pods. If unschedulable pods already exceeds the limit, pod injection is disabled but pods are not truncated.")
    enableDynamicResources = flag.Bool("enable-dynamic-resources", false, "Whether logic for handling DRA objects is enabled.")
I think this flag name might be another place worth using the full enable-dynamic-resource-allocation to more closely associate this with DRA.
@@ -47,7 +49,7 @@ type Info struct {
// memory) or gpu utilization based on if the node has GPU or not. Per resource
// utilization is the sum of requests for it divided by allocatable. It also
// returns the individual cpu, memory and gpu utilization.
Is it worth also mentioning DRA in this comment alongside CPU/GPU/memory?
if ctx.EnableDynamicResources && dynamicresources.PodNeedsResourceClaims(p) {
    state, err := ctx.PredicateChecker.CheckPredicates(ctx.ClusterSnapshot, p, p.Status.NominatedNodeName)
    if err != nil {
        klog.Warningf("Tried to running Filters for preempting pod %s/%s on nominatedNodeName, but they failed - ignoring the pod. Error: %v", p.Namespace, p.Name, err)
This should probably be "Tried to run" or "Tried running."
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR implements support for Dynamic Resource Allocation (DRA) in Cluster Autoscaler.
Which issue(s) this PR fixes:
The CA/DRA integration is tracked in kubernetes/kubernetes#118612. The integration requires changes in CA and kube-scheduler - this is the CA part. The kube-scheduler part will be sent out shortly.
Special notes for your reviewer:
The PR is not complete yet, missing parts are labeled with TODO(DRA):
- static_autoscaler_dra_test.go
- BasicClusterSnapshot was adapted to work with DRA; the same needs to be done for DeltaClusterSnapshot.

The rest of the implementation should be stable and reviewable right now.
I'm not sure what the best way to review such a large change would be. The PR is split into 20 meaningful commits that should be reviewed in sequence. It should be safe to submit a prefix of the commits as they are approved, but I have no idea how to facilitate something like this on GitHub.
Everything before the "DRA: grab a snapshot of DRA objects and plumb to ClusterSnapshot" commit should be a semantic no-op refactor. Later commits were designed to hide the new DRA logic behind a feature flag, but not everything could be easily hidden without a huge readability hit.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: