"Failed to check cloud provider" flooding logs #6096
Comments
I have CA 1.27.2 and am having the same error.
Observing the same issue. The CA logs each non-ASG node on every control loop. We provision "platform" nodes on Managed Node Group instances and use Karpenter to provision "worker" nodes, which should be excluded from the check. CA: 1.27.2 / 1.28.0
Same issue here with CA 1.28.0 / k8s 1.28.2. The ASGs are managed by kops.
The instance, contrary to what CA says, exists and can be viewed in the AWS console. CA seems to lose it after the "Updating ASG" message.
From what I understand, on each control loop the operator scans the cluster and reconciles the node group ASG cache against the current cluster node instances. As part of this reconciliation it compares the current cluster nodes to the cache, so that the cache state can be updated when cluster nodes have been deleted. This means it checks all cluster nodes against the (smaller) set of nodes managed by the node group ASGs, which is why the "Failed to check" error is reported: the operator could not find the instance in the ASG cache because those instances are not managed by any ASG. It's not that it doesn't find the node in the cloud provider, but rather that the node isn't managed by a node group ASG. Similarly, when processing nodes that may be eligible for scale-down, nodes that are not managed by an autoscaler ASG are reported with the "(no node group config)" warning.

There was a recent related issue raised with Fargate-provisioned nodes also flooding the logs; it was resolved by simply stripping out those nodes based on a prefixed name convention. The issue here appears to be HasInstance within the context of the ASG cache vs. the node instances cache. Essentially, when there are self-managed nodes in the cluster (say, provisioned by Karpenter) that are not managed by an ASG, they will be reported on each reconciliation as "Failed to check... could not find instance" and "should not be processed (no node group config)"; a rough sketch of this check follows below. Even if these messages were contextualised a little better and could be suppressed with a change to log levels, that would be an improvement.
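To illustrate the mechanism described above, here is a minimal Go sketch (not the actual cluster-autoscaler code; the type and function names are invented for illustration) of why every node outside the ASG cache produces a log line on each loop:

```go
package main

import (
	"fmt"
	"log"
)

// asgCache is a stand-in for the autoscaler's cache of instances that belong
// to the managed ASGs; the real cache is keyed by provider ID.
type asgCache struct {
	instances map[string]bool // provider ID -> managed by an ASG
}

// HasInstance reports whether the given provider ID is tracked in the ASG cache.
func (c *asgCache) HasInstance(providerID string) bool {
	return c.instances[providerID]
}

// checkNodes mirrors the reconciliation step described above: every cluster
// node is checked against the (smaller) ASG cache, and any node that is not
// in the cache produces a "could not find instance" style log line, even
// though the instance does exist in the cloud provider.
func checkNodes(cache *asgCache, clusterNodes []string) {
	for _, providerID := range clusterNodes {
		if !cache.HasInstance(providerID) {
			log.Printf("Failed to check cloud provider has instance for node %s: could not find instance", providerID)
		}
	}
}

func main() {
	cache := &asgCache{instances: map[string]bool{
		"aws:///us-east-1a/i-managed-by-asg": true,
	}}
	// One node from a managed ASG, one self-managed (e.g. Karpenter) node.
	nodes := []string{
		"aws:///us-east-1a/i-managed-by-asg",
		"aws:///us-east-1b/i-karpenter-node",
	}
	checkNodes(cache, nodes)
	fmt.Println("only the non-ASG node was reported")
}
```

Because the check runs on every control loop, each self-managed node generates one such line per loop, which matches the log volume people are reporting here.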
In fact, this was summarised far more succinctly at the time in the k8s autoscaling SIG agenda.
Well, in my case, the node that CA complains about in the log:
I have a suspicion that it may be because of non-unique instance group names (other clusters in the same region can have IGs with the same names) and hence non-unique …

As an experiment, I have now renamed the instance groups to make their names unique, and will watch the logs to see if the issue persists.

Update: renaming the IGs did not help. However, all of the nodes that are reported by CA as …
Can confirm that the bug is triggered by non-autoscaling nodegroups: we checked our logs and it's only API servers and etcd members, for which we have autoscaling disabled.
Solution 1: Make the autoscaler have all ASGs in its …
I don't think hacking on Kubernetes resources is a good solution; should CA focus on ASG tags for a final solution?
Sadly an annotation isn't a great way to do this (a label might be better?), as you can't pass node annotations to the kubelet when running it (see kubernetes/kubernetes#108046); a rough label-based filter is sketched below.
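As a thought experiment on the label suggestion above, here is a small Go sketch (purely illustrative; the label name is hypothetical and is not something cluster-autoscaler defines today) of skipping marked nodes before the ASG-cache check:

```go
package main

import "fmt"

// node is a minimal stand-in for a Kubernetes Node object; only the labels
// matter for this sketch.
type node struct {
	name   string
	labels map[string]string
}

// excludeLabel is a hypothetical label name used only for this sketch.
// A label (rather than an annotation) is interesting here because kubelet
// can register a node with labels via --node-labels, but not with annotations.
const excludeLabel = "example.com/cluster-autoscaler-exclude"

// shouldCheck returns false for nodes explicitly marked as not managed by
// any autoscaler node group, so a reconciliation loop could skip them
// instead of logging "could not find instance" on every pass.
func shouldCheck(n node) bool {
	return n.labels[excludeLabel] != "true"
}

func main() {
	nodes := []node{
		{name: "asg-node-1", labels: map[string]string{}},
		{name: "karpenter-node-1", labels: map[string]string{excludeLabel: "true"}},
	}
	for _, n := range nodes {
		if shouldCheck(n) {
			fmt.Printf("checking %s against the ASG cache\n", n.name)
		} else {
			fmt.Printf("skipping %s (marked as not autoscaler-managed)\n", n.name)
		}
	}
}
```

The trade-off discussed above is that labels can be set at registration time via the kubelet's --node-labels flag, whereas annotations cannot, while ASG tags would keep the exclusion configuration entirely on the AWS side.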
Which component are you using?:
cluster-autoscaler
What version of the component are you using?:
1.27.3 with patch #5887
What k8s version are you using (kubectl version)?:
v1.27.2
What environment is this in?:
aws
What did you expect to happen?:
no log spam
What happened instead?:
log spam
seems like 1 line for every instance in use
How to reproduce it (as minimally and precisely as possible):
unclear
Anything else we need to know?:
seems very similar to #5842 but the patch for it did not solve our issue