Don't cache NodeInfo for recently Ready nodes #4641
Conversation
/assign @yaroslava-serdiuk

@x13n: GitHub didn't allow me to assign the following users: yaroslava-serdiuk. Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Overall lgtm, left a few minor comments.
@@ -56,6 +59,7 @@ func (p *MixedTemplateNodeInfoProvider) Process(ctx *context.AutoscalingContext,
  // TODO(mwielgus): Review error policy - sometimes we may continue with partial errors.
  result := make(map[string]*schedulerframework.NodeInfo)
  seenGroups := make(map[string]bool)
+ now := time.Now()
nit: This is called early enough in the CA loop that it probably doesn't matter much, but maybe passing currentTime from RunOnce() would be more consistent? Since the CA loop operates on a snapshot, we generally use the timestamp of the loop start as 'now' for this type of check.
Done.
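A minimal sketch of the pattern being suggested here, assuming a simplified Process-style method; the type and parameter names below are illustrative stand-ins, not the real autoscaler signatures:

```go
package main

import (
	"fmt"
	"time"
)

// templateProvider is a stand-in for MixedTemplateNodeInfoProvider.
type templateProvider struct{}

// Process takes the loop-start timestamp as an argument instead of calling
// time.Now() itself, so every check within one autoscaler iteration uses the
// same notion of "now".
func (p *templateProvider) Process(nodeNames []string, now time.Time) {
	for _, name := range nodeNames {
		fmt.Printf("evaluating %s against %s\n", name, now.Format(time.RFC3339))
	}
}

func main() {
	// RunOnce-style caller: capture "now" once per loop and pass it down.
	currentTime := time.Now()
	(&templateProvider{}).Process([]string{"node-1", "node-2"}, currentTime)
}
```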
@@ -90,7 +94,7 @@ func (p *MixedTemplateNodeInfoProvider) Process(ctx *context.AutoscalingContext,
  for _, node := range nodes {
    // Broken nodes might have some stuff missing. Skipping.
-   if !kube_util.IsNodeReadyAndSchedulable(node) {
+   if !isNodeGoodForCaching(node, now) {
You're using this check to see if a node can be used as a template node for this loop (the processNode call that you're skipping adds the node to the result), not just if it can be cached. That may actually be the right thing to do (a node without DS pods doesn't make for a good template), but the function name is misleading. I'd either rename the function or, if we think this check should only apply to caching, move the check to L104 where we actually handle caching.
Done.
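For context, a hedged sketch of what a recently-Ready check of this shape could look like. The function name comes from the diff above, but the body and the one-minute constant are assumptions based on the PR description, not the merged implementation:

```go
package main

import (
	"fmt"
	"time"

	apiv1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// readyGracePeriod is an assumed constant; the PR description says one minute
// should be more than enough.
const readyGracePeriod = 1 * time.Minute

// isNodeGoodForCaching rejects nodes whose Ready condition turned True less
// than readyGracePeriod ago, giving DaemonSet pods time to be scheduled before
// the node's NodeInfo is used as a template for future nodes.
func isNodeGoodForCaching(node *apiv1.Node, now time.Time) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == apiv1.NodeReady {
			return cond.Status == apiv1.ConditionTrue &&
				now.Sub(cond.LastTransitionTime.Time) >= readyGracePeriod
		}
	}
	// No Ready condition at all: not a good caching candidate.
	return false
}

func main() {
	node := &apiv1.Node{
		Status: apiv1.NodeStatus{
			Conditions: []apiv1.NodeCondition{{
				Type:               apiv1.NodeReady,
				Status:             apiv1.ConditionTrue,
				LastTransitionTime: metav1.NewTime(time.Now().Add(-30 * time.Second)),
			}},
		},
	}
	// Ready for only 30s, so the node is still skipped.
	fmt.Println(isNodeGoodForCaching(node, time.Now()))
}
```

The real check presumably also keeps the existing readiness-and-schedulability test it replaces; this sketch only shows the grace-period part.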
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MaciekPytel, x13n

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Which component does this PR apply to?
cluster-autoscaler
What type of PR is this?
/kind bug
What this PR does / why we need it:
There's a race condition between DaemonSet pods getting scheduled to a
new node and Cluster Autoscaler caching that node for the sake of
predicting future nodes in a given node group. We can reduce the risk of
missing some DaemonSet pods by providing a grace period before accepting nodes
into the cache. One minute should be more than enough, except for some
pathological edge cases.
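To make the race concrete, here is a toy illustration (assumed numbers, plain Go, not autoscaler code) of how a template cached before the DaemonSet pod lands under-counts the resources every future node in the group will actually reserve:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	allocatable := resource.MustParse("2000m")
	dsPodCPU := resource.MustParse("200m")

	// Template cached seconds after the node became Ready: the DaemonSet pod
	// hasn't been scheduled yet, so nothing is subtracted.
	earlyTemplateFree := allocatable.DeepCopy()

	// Template cached after the grace period: the DaemonSet pod's requests
	// are accounted for.
	lateTemplateFree := allocatable.DeepCopy()
	lateTemplateFree.Sub(dsPodCPU)

	fmt.Printf("free CPU assumed for future nodes (early cache): %s\n", earlyTemplateFree.String())
	fmt.Printf("free CPU assumed for future nodes (after grace period): %s\n", lateTemplateFree.String())
}
```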
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
/assign @MaciekPytel