
WIP: Implement DRA support in Cluster Autoscaler #7350

Open
towca wants to merge 21 commits into master from jtuznik/dra-final
Conversation

towca
Collaborator

@towca towca commented Oct 4, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR implements support for Dynamic Resource Allocation (DRA) in Cluster Autoscaler.

Which issue(s) this PR fixes:

The CA/DRA integration is tracked in kubernetes/kubernetes#118612. The integration requires changes in CA and kube-scheduler - this is the CA part. The kube-scheduler part will be sent out shortly.

Special notes for your reviewer:

The PR is not complete yet, missing parts are labeled with TODO(DRA):

  • A bunch of unit tests need to be updated with DRA-specific test cases.
  • More integration tests need to be added to static_autoscaler_dra_test.go.
  • The integration test scenarios have to be tested in a real cluster.
  • Only BasicClusterSnapshot was adapted to work with DRA, the same needs to be done for DeltaClusterSnapshot.
  • Some fine details around "expendable pods" have to be figured out.

The rest of the implementation should be stable and reviewable right now.

I'm not sure what the best way to review such a large change would be. The PR is split into 20 meaningful commits that should be reviewed in sequence. It should be safe to submit a prefix of the commits as they are approved, but I have no idea how to facilitate something like this on GitHub.

Everything before the "DRA: grab a snapshot of DRA objects and plumb to ClusterSnapshot" commit should be a semantic no-op refactor. Later commits were designed to hide the new DRA logic behind a feature flag, but not everything could be easily hidden without a huge readability hit.

Does this PR introduce a user-facing change?

TODO

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/blob/9de7f62e16fc5c1ea3bd40689487c9edc7fa5057/keps/sig-node/4381-dra-structured-parameters/README.md

@towca towca added area/cluster-autoscaler area/core-autoscaler Denotes an issue that is related to the core autoscaler and is not specific to any provider. labels Oct 4, 2024
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. area/provider/alicloud Issues or PRs related to the AliCloud cloud provider implementation labels Oct 4, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: towca

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/provider/aws Issues or PRs related to aws provider approved Indicates a PR has been approved by an approver from all required OWNERS files. area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 4, 2024
@k8s-ci-robot k8s-ci-robot added area/provider/cluster-api Issues or PRs related to Cluster API provider area/provider/digitalocean Issues or PRs related to digitalocean provider area/provider/equinixmetal Issues or PRs related to the Equinix Metal cloud provider for Cluster Autoscaler area/provider/externalgrpc Issues or PRs related to the External gRPC provider area/provider/gce area/provider/hetzner Issues or PRs related to Hetzner provider area/provider/ionoscloud area/provider/kwok Issues or PRs related to the kwok cloud provider for Cluster Autoscaler area/provider/linode Issues or PRs related to linode provider area/provider/magnum Issues or PRs related to the Magnum cloud provider for Cluster Autoscaler area/provider/oci Issues or PRs related to oci provider area/provider/rancher size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Oct 4, 2024
@towca
Collaborator Author

towca commented Oct 7, 2024

/assign @MaciekPytel

towca added 2 commits October 7, 2024 19:26
…eChecker

This allows other components to interact with the Framework, which will
be needed for DRA support later.
Methods to interact with the new internal types are added to
ClusterSnapshot. Cluster Autoscaler code will be migrated to only
use these methods and work on the internal types instead of directly
using the framework types.

The new types are designed so that they can be used exactly like the
framework types, which should make the migration manageable.

This allows easily adding additional data to the Nodes and Pods tracked
in ClusterSnapshot, without having to change the scheduler framework.
This will be needed to support DRA, as we'll need to track
ResourceSlices and ResourceClaims.
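For illustration, the wrapper types could look roughly like the sketch below. All type and field names are placeholders standing in for the scheduler framework and resource.k8s.io types, not the PR's actual definitions.

```go
package sketch

// SchedulerNodeInfo and Pod stand in for the scheduler framework's NodeInfo
// and the core Pod type; ResourceSlice and ResourceClaim stand in for the
// resource.k8s.io API objects. All of this is illustrative only.
type SchedulerNodeInfo struct{ Name string }
type Pod struct{ Namespace, Name string }
type ResourceSlice struct{ Pool string }
type ResourceClaim struct{ Namespace, Name string }

// NodeInfo embeds the framework NodeInfo so it can be used exactly like it,
// while also carrying the node-local DRA objects the autoscaler must track.
type NodeInfo struct {
	*SchedulerNodeInfo
	LocalResourceSlices []*ResourceSlice
	Pods                []*PodInfo
}

// PodInfo pairs a pod with the ResourceClaims it references.
type PodInfo struct {
	*Pod
	NeededResourceClaims []*ResourceClaim
}
```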
towca added 10 commits October 7, 2024 19:38
AddNodes() is redundant - it was intended for batch-adding nodes,
probably with batch-specific optimizations in mind. However, it
has always been implemented as just iterating over AddNode(), and
it is only used in test code.

Most of the uses in the test code were initialization - they are
replaced with Initialize(), which will later be needed for handling
DRA anyway. The other uses are replaced with inline loops over
AddNode().
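For illustration, the inline replacement amounts to something like the following sketch (the Node type and the nodeAdder interface are simplified stand-ins, not the real ClusterSnapshot API):

```go
package sketch

// Node stands in for *apiv1.Node; nodeAdder stands in for the slice of the
// ClusterSnapshot interface that's relevant here.
type Node struct{ Name string }

type nodeAdder interface {
	AddNode(node *Node) error
}

// addAllNodes shows how a batch AddNodes(nodes) call reduces to a plain loop
// over AddNode, which is how AddNodes was implemented anyway.
func addAllNodes(snapshot nodeAdder, nodes []*Node) error {
	for _, node := range nodes {
		if err := snapshot.AddNode(node); err != nil {
			return err
		}
	}
	return nil
}
```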
The method is already accessible via StorageInfos(), so it's
redundant.
AddNodeInfo already provides the same functionality, and has to be used
in production code in order to propagate DRA objects correctly.

Uses in production are replaced with Initialize(), which will later
take DRA objects into account. Uses in the test code are replaced with
AddNodeInfo().
simulator.BuildNodeInfoForNode, core_utils.GetNodeInfoFromTemplate,
and scheduler_utils.DeepCopyTemplateNode all had very similar logic
for sanitizing and copying NodeInfos. They're all consolidated into
one file in simulator, sharing common logic.

MixedTemplateNodeInfoProvider now correctly uses ClusterSnapshot to
correlate Nodes to scheduled pods, instead of using a live Pod lister.
This means that the snapshot now has to be properly initialized in a
bunch of tests.
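A rough sketch of what a consolidated sanitize-and-copy helper might look like, assuming "sanitizing" means deep-copying a template NodeInfo and giving the copy a fresh, unique node name (the types, the suffix scheme, and the hostname-label handling are placeholders, not the PR's actual code):

```go
package sketch

import "fmt"

// NodeInfo is a simplified stand-in for the simulator's NodeInfo type.
type NodeInfo struct {
	Name   string
	Labels map[string]string
}

// sanitizedCopy deep-copies a template NodeInfo and renames it so that many
// simulated nodes can be created from one template without name collisions.
func sanitizedCopy(template *NodeInfo, suffix string) *NodeInfo {
	out := &NodeInfo{
		Name:   fmt.Sprintf("%s-%s", template.Name, suffix),
		Labels: make(map[string]string, len(template.Labels)),
	}
	for k, v := range template.Labels {
		out.Labels[k] = v
	}
	// Keep the hostname label consistent with the new name so that scheduler
	// predicates relying on it see a coherent node.
	out.Labels["kubernetes.io/hostname"] = out.Name
	return out
}
```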
…terSnapshot implementations

The implementations will need to interact with the scheduler framework
when placing Pods on Nodes, in order to simulate DRA ResourceClaim allocation.

Tests are refactored so that ClusterSnapshot and PredicateChecker objects
get the same framework handle.
This will be needed to track changes to the DRA objects while making
scheduling simulations.
…ialize

Having a second snapshot object inside ClusterSnapshot isn't ideal from
a readability perspective, but the DRA objects can't just be tracked
inside the NodeInfos/PodInfos.

ResourceClaims can be shared between multiple Pods, so we need some global
location for them anyway. There are also ResourceSlices that aren't
node-local, which the snapshot still needs to pass to the DRA scheduler
plugin to ensure correct results.

Out of multiple options I tried prototyping, having a single
source-of-truth snapshot of all DRA objects that is modified during
ClusterSnapshot operations seems the cleanest. Trying to model it in a
different way always resulted in something being really confusing, or
having to synchronize a lot of state.

The Basic ClusterSnapshot can just clone the DRA snapshot on Fork(). The
Delta implementation will need something more sophisticated, but leaving
that for the end.
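For illustration, the single source-of-truth DRA snapshot and its clone-on-Fork() behavior could look roughly like this (all types here are simplified placeholders for the resource.k8s.io objects, not the PR's actual code):

```go
package sketch

// ResourceClaim and ResourceSlice stand in for the resource.k8s.io objects.
type ResourceClaim struct{ Namespace, Name string }
type ResourceSlice struct{ NodeName, Pool string }

// draSnapshot is a hypothetical single source of truth for all DRA objects,
// modified as pods are scheduled/unscheduled in the ClusterSnapshot.
type draSnapshot struct {
	// claims is global because a ResourceClaim can be shared by many pods.
	claims map[string]*ResourceClaim // keyed by "namespace/name"
	// slices holds both node-local and non-node-local ResourceSlices, since
	// the non-local ones still have to be passed to the DRA scheduler plugin.
	slices []*ResourceSlice
}

// clone deep-copies the snapshot, which is what the Basic ClusterSnapshot
// can do on Fork() so that changes can later be reverted.
func (s *draSnapshot) clone() *draSnapshot {
	out := &draSnapshot{
		claims: make(map[string]*ResourceClaim, len(s.claims)),
		slices: make([]*ResourceSlice, 0, len(s.slices)),
	}
	for key, claim := range s.claims {
		copied := *claim
		out.claims[key] = &copied
	}
	for _, slice := range s.slices {
		copied := *slice
		out.slices = append(out.slices, &copied)
	}
	return out
}
```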
AddPod is renamed to SchedulePod, and RemovePod to UnschedulePod. This makes
more sense in the DRA world: for DRA we're not only adding/removing
the pod, but also modifying its ResourceClaims - without adding/removing
them (the ResourceClaims need to be tracked even for pods that aren't
scheduled).

RemoveNode is renamed to RemoveNodeInfo for consistency with other
NodeInfo methods.
…rom PredicateChecker

SchedulePod takes an additional parameter. If reserveState is passed,
the Reserve() phase of the scheduling cycle will be run, so that the DRA
scheduler plugin can allocate ResourceClaims in the DRA snapshot if needed.
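A hypothetical shape of the resulting interface, just to make the flow concrete; the real signatures and types (including how reserveState is produced by PredicateChecker) live in the PR:

```go
package sketch

// Pod and ReserveState are placeholders for the real types; ReserveState
// stands for whatever the predicate check returns so that Reserve() can be
// run later for the same pod/node pair.
type Pod struct{ Namespace, Name string }
type ReserveState struct{}

// ClusterSnapshot sketches only the scheduling-related part of the interface.
type ClusterSnapshot interface {
	// SchedulePod places the pod on the node. If reserveState is non-nil,
	// the Reserve() phase of the scheduling cycle is also run, letting the
	// DRA plugin allocate the pod's ResourceClaims in the DRA snapshot.
	SchedulePod(pod *Pod, nodeName string, reserveState *ReserveState) error
	// UnschedulePod reverts a SchedulePod call without dropping the pod's
	// ResourceClaims from tracking.
	UnschedulePod(namespace, name, nodeName string) error
}
```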
@towca towca force-pushed the jtuznik/dra-final branch 2 times, most recently from 1f39113 to 1d2e0e0 Compare October 7, 2024 18:00
towca added 7 commits October 9, 2024 18:24
The logic is very basic and will likely need to be revised, but it's
something for initial testing. Utilization of a given Pool is calculated
as the number of allocated devices in the pool divided by the number of
all devices in the pool. For scale-down purposes, the max utilization
of all Node-local Pools is used.
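As a minimal sketch of the rule described above (the pool type is a placeholder; in the real code the counts would come from ResourceSlices and allocated ResourceClaims):

```go
package sketch

// pool is a simplified stand-in for a DRA device pool on a node.
type pool struct {
	devices   int // total devices advertised by the pool's ResourceSlices
	allocated int // devices currently allocated to ResourceClaims
}

// poolUtilization is allocated devices divided by all devices in the pool.
func poolUtilization(p pool) float64 {
	if p.devices == 0 {
		return 0
	}
	return float64(p.allocated) / float64(p.devices)
}

// nodeDRAUtilization returns the max utilization over the node-local pools,
// which is what the scale-down logic described above uses.
func nodeDRAUtilization(pools []pool) float64 {
	maxUtil := 0.0
	for _, p := range pools {
		if u := poolUtilization(p); u > maxUtil {
			maxUtil = u
		}
	}
	return maxUtil
}
```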
CA ignores Pods with priority below a cutoff, and pretends
they aren't in the cluster. If the pods have allocated ResourceClaims,
they would still block resources on a Node. So ResourceClaims owned by
expendable pods are removed from the DRA snapshot.

Predicates are now run when scheduling Pods waiting for preemption
to their nominatedNodeName. I'm not sure how this works if the preempted
pod is still on the Node - I suspect the filters would fail. This needs
to be tested; I left a TODO.
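To make the expendable-pod handling concrete, here is a sketch of the claim-filtering step (all types are simplified placeholders; the real code works on apiv1 Pods and resource.k8s.io ResourceClaims):

```go
package sketch

// pod and claim are simplified placeholders for illustration only.
type pod struct {
	uid      string
	priority int32
}

type claim struct {
	name     string
	ownerUID string // UID of the owning pod, if any
}

// dropExpendableOwnedClaims removes claims owned by pods below the priority
// cutoff, so that those claims no longer block devices in the DRA snapshot.
func dropExpendableOwnedClaims(claims []claim, pods []pod, cutoff int32) []claim {
	expendable := make(map[string]bool)
	for _, p := range pods {
		if p.priority < cutoff {
			expendable[p.uid] = true
		}
	}
	var kept []claim
	for _, c := range claims {
		if c.ownerUID != "" && expendable[c.ownerUID] {
			continue
		}
		kept = append(kept, c)
	}
	return kept
}
```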
DRA integration in CA needs changes in the scheduler framework. The
changes are currently in review in
kubernetes/kubernetes#127904. This commit
pulls these changes to vendor/ so that the PR can be tested and
iterated on.

Note that this also bumps all of CA's k8s dependencies, and there
was a breaking change in the scheduler framework - it seems that
InitMetrics() needs to be called before calling NewFramework() now.

DO NOT SUBMIT - instead, CA k8s deps should be synced after k/k#127904
is submitted (and the breaking change handled).
@towca towca force-pushed the jtuznik/dra-final branch from 1d2e0e0 to 2e7eeea Compare October 9, 2024 16:51
@@ -301,6 +301,8 @@ type AutoscalingOptions struct {
	ProvisioningRequestMaxBackoffTime time.Duration
	// ProvisioningRequestMaxCacheSize is the max size for ProvisioningRequest cache that is stored for retry backoff.
	ProvisioningRequestMaxBackoffCacheSize int
	// EnableDynamicResources configures whether logic for handling DRA objects is enabled.
	EnableDynamicResources bool
Contributor

Would calling this EnableDynamicResourceAllocation maybe be a bit more clear?

@@ -44,7 +45,8 @@ type AutoscalingContext struct {
	AutoscalingKubeClients
	// CloudProvider used in CA.
	CloudProvider cloudprovider.CloudProvider
	// TODO(kgolab) - move away too as it's not config
Contributor

Is this comment no longer necessary?

func TestFrameworkHandleOrDie(t *testing.T) *Handle {
	handle, err := TestFrameworkHandle()
	if err != nil {
		t.Error(err)
Contributor

FWIW there is a t.Fatal that will stop execution of the test, where t.Error will keep going. So t.Fatal might more closely fulfill the OrDie part of this function name if that's significant.

limitations under the License.
*/

package dynamicresources
Contributor

If it's not too unbearably long, dynamicresourceallocation might be a clearer name for this package.

@@ -276,6 +276,7 @@ var (
asyncNodeGroupsEnabled = flag.Bool("async-node-groups", false, "Whether clusterautoscaler creates and deletes node groups asynchronously. Experimental: requires cloud provider supporting async node group operations, enable at your own risk.")
proactiveScaleupEnabled = flag.Bool("enable-proactive-scaleup", false, "Whether to enable/disable proactive scale-ups, defaults to false")
podInjectionLimit = flag.Int("pod-injection-limit", 5000, "Limits total number of pods while injecting fake pods. If unschedulable pods already exceeds the limit, pod injection is disabled but pods are not truncated.")
enableDynamicResources = flag.Bool("enable-dynamic-resources", false, "Whether logic for handling DRA objects is enabled.")
Contributor

I think this flag name might be another place worth using the full enable-dynamic-resource-allocation to more closely associate this with DRA.

@@ -47,7 +49,7 @@ type Info struct {
// memory) or gpu utilization based on if the node has GPU or not. Per resource
// utilization is the sum of requests for it divided by allocatable. It also
// returns the individual cpu, memory and gpu utilization.
Contributor

Is it worth also mentioning DRA in this comment alongside CPU/GPU/memory?

if ctx.EnableDynamicResources && dynamicresources.PodNeedsResourceClaims(p) {
	state, err := ctx.PredicateChecker.CheckPredicates(ctx.ClusterSnapshot, p, p.Status.NominatedNodeName)
	if err != nil {
		klog.Warningf("Tried to running Filters for preempting pod %s/%s on nominatedNodeName, but they failed - ignoring the pod. Error: %v", p.Namespace, p.Name, err)
Contributor

This should probably be "Tried to run" or "Tried running."

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 16, 2024
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
