nil pointer dereference while fetching node instances for aws asg #2161
MixedInstanceGroup is a feature we recently added to cluster autoscaler, so we decided not to cherry-pick it back to old branches. Another reason is that CA depends on the AWS SDK vendored from Kubernetes; it's hard to update to the latest AWS SDK with this feature in older k8s branches. Let me run a test with this combination and target the issues.
@xrl I do see this error on 1.15.0. The difference is that my CA doesn't crash and keeps working.
When the cluster scales up, the cache invalidates the record for that node group and refetches the node group info. This is a log error and has been fixed here. I don't see any panic in 1.15.0. Could you try the latest 1.15 CA versions and see if you still encounter it?
/sig aws
I am on CA 1.15.1 and I'm still seeing the nil pointer exception:
I see the exact same error as @xrl. In my particular case, I aggressively deleted many of my deployments, causing a massive scale-down in the autoscaler. Then I encountered this at some point:
I am using v1.15.1
Thanks. I can reproduce the issue. It happens for me a while after a node scaled down. I am trying to fix the issue.
Figured out the problem. When a node is scaled down, the ASG deletes the node. After a while (depending on when the ASG cache refreshes; by default we use 1 min): autoscaler/cluster-autoscaler/clusterstate/clusterstate.go Lines 295 to 299 in c88f014
When CA updates ReadinessStats: autoscaler/cluster-autoscaler/clusterstate/clusterstate.go Lines 571 to 576 in c88f014
it will call: autoscaler/cluster-autoscaler/cloudprovider/aws/aws_cloud_provider.go Lines 111 to 115 in c88f014
I will submit a PR against the AWS provider to fix that issue.
All releases after this PR have been impacted. @MaciekPytel the AWS implementation definitely has some issues. Do you think we can also improve the caching part? Currently, csr only invalidates the cache entry on scale-up; should scale-down be considered?
Looks like Azure also has this issue; see #2214.
This was indeed fixed, I'm looking at a long uptime!
Thanks so much, this is awesome!
See #2345 for what I think is a related consequence of fixing this problem. |
It's not clear to me why that was the right fix. Looking at the call site at autoscaler/cluster-autoscaler/clusterstate/clusterstate.go Lines 583 to 591 in bc73ded
It returns <nil, nil> in this case and clusterstate.go only checks the error.
I'm confused. Where does this panic occur? We won't reach line 592 in this loop.
@seh It seems the code has changed since we fixed this issue. It fails on the last line; see the permalink I attached.
I figured out the problem. When I made the patch, I checked the 1.15 code that @xrl had the issue with. The permalink is on that branch.
I am seeing this error:
I have been aggressively scaling up my workloads in this new kube cluster. I am using the autoscaler helm chart with these values:
and I am running autoscaler 1.15.0 (I know that's mixing versions with EKS 1.13.7, but it's the only way to work with MixedInstanceGroup ASGs).
One of my autoscaling groups' CloudFormation templates: