
"Failed to check cloud provider" flooding logs #6096

Closed
grosser opened this issue Sep 7, 2023 · 10 comments · Fixed by #6265
Labels
area/cluster-autoscaler, kind/bug

Comments

@grosser (Contributor) commented Sep 7, 2023

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

1.27.3 with patch #5887

What k8s version are you using (kubectl version)?:

v1.27.2

What environment is this in?:

aws

What did you expect to happen?:

no log spam

What happened instead?:

log spam

{"level":"WARN","message":"Failed to check cloud provider has instance for ip-xxx.us-west-2.compute.internal: node is not present in aws: could not find instance {aws:///us-west-2b/i-xyz i-xyz}"}

seems like 1 line for every instance in use

How to reproduce it (as minimally and precisely as possible):

unclear

Anything else we need to know?:

seems very similar to #5842 but the patch for it did not solve our issue

grosser added the kind/bug label on Sep 7, 2023
@mro-amboss

I'm running CA 1.27.2 and seeing the same error.
Kubernetes version: 1.24
Env: AWS EKS

@brydoncheyney (Contributor) commented Oct 5, 2023

Observing the same issue. The CA is logging each non-ASG node every control loop. We're provisioning "platform" nodes on Managed Node Group instances, and use karpenter to provision "worker" nodes, which should be excluded from the check.

CA: 1.27.2 / 1.28.0
Kubernetes version: 1.28
Env: AWS EKS

@shapirus (Contributor)

Same issue here with CA 1.28.0 / k8s 1.28.2.

The ASGs are managed by kops.

I1017 10:21:10.920315       1 cluster.go:175] node i-06de46c41ffb0aaaa is not suitable for removal: can reschedule only 0 out of 1 pods
[at this point all is working fine]
...
I1017 10:25:01.973574       1 auto_scaling_groups.go:393] Regenerating instance to ASG map for ASG names: [ list-of-all-ASGs ]
I1017 10:25:02.099392       1 auto_scaling_groups.go:400] Regenerating instance to ASG map for ASG tags: map[]
I1017 10:25:02.100362       1 auto_scaling_groups.go:142] Updating ASG <ASG-that-contains-the-node-in-question>
...
W1017 10:27:32.735121       1 clusterstate.go:1033] Failed to check cloud provider has instance for i-06de46c41ffb0aaaa: node is not present in aws: could not find instance {aws:///eu-central-1c/i-06de46c41ffb0aaaa i-06de46c41ffb0aaaa}
...
I1017 10:27:32.735701       1 pre_filtering_processor.go:57] Node i-06de46c41ffb0aaaa should not be processed by cluster autoscaler (no node group config)

What does no node group config mean?

The instance, contrary to what CA says, exists and can be viewed in the AWS console. CA seems to lose it after the "Updating ASG" message.

@brydoncheyney (Contributor)

> What does no node group config mean?
>
> The instance, contrary to what CA says, exists and can be viewed in the AWS console. CA seems to lose it after the "Updating ASG" message.

From what I understand, on each control loop the operator scans the cluster and reconciles the node group ASG cache with the current cluster node instances. As part of this reconciliation it compares the current cluster nodes to the cache in order to update the cache state when cluster nodes have been deleted. This means it checks all cluster nodes against the (smaller) set of nodes managed by the node group ASGs, which is why the "Failed to check" error is reported: the operator "could not find instance" in the ASG cache because those instances are not managed by any ASG. It's not that the node can't be found in the cloud provider; rather, the node isn't managed by a node group ASG.
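A minimal sketch of the check described above, under the assumption of a simplified instance-to-ASG cache (the names and types here are illustrative, not the real clusterstate or cloud provider code): every cluster node is looked up in the ASG-backed cache, and nodes that are not in any managed ASG trigger the warning even though the EC2 instance itself exists.

```go
package main

import "log"

// asgCache stands in for the autoscaler's instance-to-ASG map, which only
// contains instances belonging to ASGs the autoscaler manages.
type asgCache map[string]string // providerID -> ASG name

func (c asgCache) hasInstance(providerID string) bool {
	_, ok := c[providerID]
	return ok
}

func main() {
	cache := asgCache{
		"aws:///us-west-2b/i-managed1": "workers-asg",
	}
	// Cluster nodes include instances the autoscaler does not manage
	// (karpenter workers, fargate nodes, control-plane nodes, ...).
	clusterNodes := []string{
		"aws:///us-west-2b/i-managed1",
		"aws:///us-west-2b/i-karpenter1",
	}
	for _, id := range clusterNodes {
		if !cache.hasInstance(id) {
			// This is the source of the log spam: the instance exists in AWS,
			// it just is not in the (smaller) ASG cache.
			log.Printf("Failed to check cloud provider has instance for %s: node is not present in aws", id)
		}
	}
}
```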

Similarly, when processing nodes that may be eligible for scale-down, nodes that are not managed by the autoscaler ASG are reported with the (no node group config) warning.

There was a recent related issue where fargate-provisioned nodes were also flooding the logs; it was resolved by simply stripping out those nodes based on a name-prefix convention.
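For reference, a rough sketch of that name-prefix approach; the "fargate-" prefix is an assumption about how those nodes are identified, not a quote of the actual patch in #5887.

```go
package main

import (
	"fmt"
	"strings"
)

// shouldCheckCloudProvider skips nodes whose names follow a known naming
// convention before running the cloud provider HasInstance check.
func shouldCheckCloudProvider(nodeName string) bool {
	return !strings.HasPrefix(nodeName, "fargate-")
}

func main() {
	for _, n := range []string{"fargate-ip-10-0-0-1", "ip-10-0-0-2"} {
		fmt.Printf("%s -> check: %v\n", n, shouldCheckCloudProvider(n))
	}
}
```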

The issue here appears to be HasInstance operating in the context of the ASG cache rather than the node instances cache. Essentially, when there are self-managed nodes in the cluster (say, provisioned by karpenter) that are not managed by an ASG, they will be reported on each reconciliation as "Failed to check... could not find instance" and "should not be processed (no node group config)". Even if these messages were just contextualised a little better and could be suppressed with a change to log levels, that would be an improvement.

@brydoncheyney (Contributor)

in fact... this was summarised far more succinctly at the time in the k8s autoscaling sig agenda

fix: CA on fargate causing log flood

  • fix: CA on fargate causing log flood #5887
    • Scenario: EKS cluster running on fargate but uses CA to scale some non-fargate nodegroups in the cluster
    • Problem: CA reads all K8s nodes including fargate ones and checks if the instance is present in AWS (by calling HasInstance) by checking if the instance is part of some ASG. This leads to error (because fargate nodes are not part of ASGs) and causes log flooding.

@shapirus (Contributor) commented Oct 17, 2023

Well, in my case, the node that CA complains about in the log:

  • exists in the cluster
  • exists in the list of running EC2 instances
  • is managed by the respective ASG

I have a suspicion that it may be because of non-unique instance group names (other clusters in the same region can have IGs with the same names) and hence non-unique k8s.io/cluster-autoscaler/node-template/label/kops.k8s.io/instancegroup tag values.

As an experiment, I have now renamed the instance groups to make their names unique. Will watch the logs to see if the issue persists.

update: renaming the IGs did not help. However, all of the nodes that CA reports as node is not present in aws and should not be processed by cluster autoscaler (no node group config) are currently control-plane nodes. The one I initially mentioned was a regular node, but I can't say for certain that it wasn't about to be shut down at that time. I will post another update if I notice CA saying this again about a node that isn't a control-plane node and is definitely not being deleted.

@grosser (Contributor, Author) commented Nov 9, 2023

I can confirm that the bug is triggered by non-autoscaling node groups: I checked our logs and the only affected nodes are API servers and etcd members, for which we have autoscaling disabled.

@grosser (Contributor, Author) commented Nov 9, 2023

Solution 1: Make the autoscaler keep all ASGs in its aws.awsManager.asgCache; not sure what the side effects would be.
Solution 2: Have users annotate every such node with "leave me alone", which is not great but makes the warning useful again ... maybe CA can automate that annotation 🤷 see #6265
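A minimal sketch of what Solution 2 could look like, assuming a hypothetical opt-out annotation; the key below is illustrative only, and #6265 defines the actual mechanism.

```go
package main

import "fmt"

// hypotheticalSkipAnnotation is an illustrative key, not one used by CA.
const hypotheticalSkipAnnotation = "example.io/cluster-autoscaler-ignore"

type node struct {
	Name        string
	Annotations map[string]string
}

// shouldCheck returns false for nodes that have opted out, so they are
// skipped before the cloud provider check and never produce the warning.
func shouldCheck(n node) bool {
	return n.Annotations[hypotheticalSkipAnnotation] != "true"
}

func main() {
	nodes := []node{
		{Name: "worker-1", Annotations: map[string]string{}},
		{Name: "etcd-1", Annotations: map[string]string{hypotheticalSkipAnnotation: "true"}},
	}
	for _, n := range nodes {
		fmt.Printf("%s -> check: %v\n", n.Name, shouldCheck(n))
	}
}
```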

@yongzhang

I don't think hacking on Kubernetes resources is a good solution; shouldn't CA focus on ASG tags for a final solution?

@james-callahan

Sadly an annotation isn't a great way to do this (a label might be better?), as you can't pass node annotations to kubelet when running it (see kubernetes/kubernetes#108046).
