nil pointer dereference while fetching node instances for aws asg #2161

Closed
xrl opened this issue Jul 3, 2019 · 18 comments · Fixed by #2247
Labels: area/provider/aws (Issues or PRs related to aws provider)

xrl commented Jul 3, 2019

I am seeing this error:

I0703 17:25:41.236830       1 event.go:258] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"4e7022af-9c7f-11e9-a1d4-06346ba69dd2", APIVersion:"v1", ResourceVersion:"5074894", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O size to 3
I0703 17:25:41.281175       1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"staging", Name:"merckgroup-dj-misc-9c55878cc-vj7nr", UID:"91b85597-9db7-11e9-a1d4-06346ba69dd2", APIVersion:"v1", ResourceVersion:"5074967", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O 2->3 (max: 7)}]
I0703 17:25:41.281214       1 event.go:258] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"4e7022af-9c7f-11e9-a1d4-06346ba69dd2", APIVersion:"v1", ResourceVersion:"5074894", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: group K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O size set to 3
I0703 17:25:41.281226       1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"datascience", Name:"proxy-6678dc85f6-q2cdc", UID:"8fa7bd40-9db7-11e9-a1d4-06346ba69dd2", APIVersion:"v1", ResourceVersion:"5074922", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O 2->3 (max: 7)}]
I0703 17:25:41.281272       1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"staging", Name:"merckgroup-dj-replication-dc7596ccf-6294t", UID:"930a91d7-9db7-11e9-a1d4-06346ba69dd2", APIVersion:"v1", ResourceVersion:"5074999", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O 2->3 (max: 7)}]
I0703 17:25:41.281303       1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"staging", Name:"merckgroup-dj-salesforce-6fd96cfddd-zqlhj", UID:"9420fd2c-9db7-11e9-a1d4-06346ba69dd2", APIVersion:"v1", ResourceVersion:"5075027", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O 2->3 (max: 7)}]
I0703 17:25:41.281380       1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"datascience", Name:"hub-5fb7c84ccb-85ns8", UID:"8ea23011-9db7-11e9-a1d4-06346ba69dd2", APIVersion:"v1", ResourceVersion:"5074887", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O 2->3 (max: 7)}]
I0703 17:25:43.503908       1 node_instances_cache.go:155] Start refreshing cloud provider node instances cache
I0703 17:25:43.503948       1 node_instances_cache.go:167] Refresh cloud provider node instances cache finished, refresh took 16.505µs
I0703 17:25:51.293772       1 static_autoscaler.go:187] Starting main loop
I0703 17:25:51.336472       1 auto_scaling_groups.go:320] Regenerating instance to ASG map for ASGs: [K3-EKS-mixedspotinstm52xlargeasgsubnet02af43b02922e710f-QR5SWRK0N1Q0 K3-EKS-mixedspotinstm52xlargeasgsubnet09df9044a965c5907-78SRQM8MXOR6 K3-EKS-mixedspotinstm52xlargeasgsubnet0d22e2495433092d1-7ZH6F3I2Y90L K3-EKS-mixedspotinstt32xlargeasgsubnet02af43b02922e710f-10SJB4NDVE9VX K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O K3-EKS-mixedspotinstt32xlargeasgsubnet0d22e2495433092d1-8EG7GH4K6BDQ K3-EKS-ondemandasgsubnet02af43b02922e710f-YFGY7WRUEUQ1]
I0703 17:25:51.394755       1 aws_manager.go:255] Refreshed ASG list, next refresh after 2019-07-03 17:26:51.394747295 +0000 UTC m=+788.208984766
E0703 17:25:51.455558       1 node_instances_cache.go:106] Failed to fetch cloud provider node instances for K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O, error <nil>
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0x2222355]

goroutine 65 [running]:
k8s.io/autoscaler/cluster-autoscaler/clusterstate.(*ClusterStateRegistry).updateReadinessStats(0xc001796160, 0xbf3f5727d1822665, 0xa9869e13b3, 0x4cb2be0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/clusterstate/clusterstate.go:576 +0x9a5
k8s.io/autoscaler/cluster-autoscaler/clusterstate.(*ClusterStateRegistry).UpdateNodes(0xc001796160, 0xc001ca2e40, 0x7, 0x8, 0xc000e32090, 0xbf3f5727d1822665, 0xa9869e13b3, 0x4cb2be0, 0x0, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/clusterstate/clusterstate.go:310 +0x227
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).updateClusterState(0xc000c28820, 0xc001ca2e40, 0x7, 0x8, 0xc000e32090, 0xbf3f5727d1822665, 0xa9869e13b3, 0x4cb2be0, 0xc000e36cc0, 0x4)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:568 +0x94
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).RunOnce(0xc000c28820, 0xbf3f5727d1822665, 0xa9869e13b3, 0x4cb2be0, 0x0, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:217 +0x5e6
main.run(0xc0000da000)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:331 +0x296
main.main.func2(0x2ff4fc0, 0xc0004be180)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:403 +0x2a
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:200 +0xec

I have been aggressively scaling up my workloads in this new kube cluster. I am using the autoscaler helm chart with these values:

autoDiscovery:
  clusterName: K3
image:
  repository: k8s.gcr.io/cluster-autoscaler
  tag: v1.15.0
rbac:
  create: true
# must override: https://github.com/kubernetes/autoscaler/issues/2139
sslCertHostPath: /etc/ssl/certs/ca-bundle.crt
podAnnotations:
  iam.amazonaws.com/role: k8s-autoscaler
extraArgs:
  v: 4
  stderrthreshold: info
  logtostderr: true
  # write-status-configmap: true
  # leader-elect: true
  skip-nodes-with-local-storage: "false"
  expander: most-pods
  # scale-down-enabled: true
  # balance-similar-node-groups: true
  # min-replica-count: 2
  scale-down-utilization-threshold: 0.75
  # scale-down-non-empty-candidates-count: 5
  # max-node-provision-time: 15m0s
  # scan-interval: 10s
  # scale-down-delay: 10m
  # scale-down-unneeded-time: 10m
  skip-nodes-with-local-storage: "false"
  skip-nodes-with-system-pods: "false"
awsRegion: eu-central-1

and I am running autoscaler 1.15.0 (I know that's mixing versions with EKS 1.13.7, but it's the only way to work with MixedInstanceGroup ASGs).

I0703 17:25:52.680391       1 flags.go:52] FLAG: --address=":8085"
I0703 17:25:52.681144       1 flags.go:52] FLAG: --alsologtostderr="false"
I0703 17:25:52.681163       1 flags.go:52] FLAG: --balance-similar-node-groups="false"
I0703 17:25:52.681169       1 flags.go:52] FLAG: --cloud-config=""
I0703 17:25:52.681174       1 flags.go:52] FLAG: --cloud-provider="aws"
I0703 17:25:52.681187       1 flags.go:52] FLAG: --cloud-provider-gce-lb-src-cidrs="130.211.0.0/22,209.85.152.0/22,209.85.204.0/22,35.191.0.0/16"
I0703 17:25:52.681197       1 flags.go:52] FLAG: --cluster-name=""
I0703 17:25:52.681202       1 flags.go:52] FLAG: --cores-total="0:320000"
I0703 17:25:52.681211       1 flags.go:52] FLAG: --estimator="binpacking"
I0703 17:25:52.681217       1 flags.go:52] FLAG: --expander="most-pods"
I0703 17:25:52.681229       1 flags.go:52] FLAG: --expendable-pods-priority-cutoff="-10"
I0703 17:25:52.681234       1 flags.go:52] FLAG: --filter-out-schedulable-pods-uses-packing="true"
I0703 17:25:52.681240       1 flags.go:52] FLAG: --gpu-total="[]"
I0703 17:25:52.681246       1 flags.go:52] FLAG: --ignore-daemonsets-utilization="false"
I0703 17:25:52.681251       1 flags.go:52] FLAG: --ignore-mirror-pods-utilization="false"
I0703 17:25:52.681261       1 flags.go:52] FLAG: --ignore-taint="[]"
I0703 17:25:52.681270       1 flags.go:52] FLAG: --kubeconfig=""
I0703 17:25:52.681275       1 flags.go:52] FLAG: --kubernetes=""
I0703 17:25:52.681279       1 flags.go:52] FLAG: --leader-elect="true"
I0703 17:25:52.681288       1 flags.go:52] FLAG: --leader-elect-lease-duration="15s"
I0703 17:25:52.681297       1 flags.go:52] FLAG: --leader-elect-renew-deadline="10s"
I0703 17:25:52.681307       1 flags.go:52] FLAG: --leader-elect-resource-lock="endpoints"
I0703 17:25:52.681314       1 flags.go:52] FLAG: --leader-elect-retry-period="2s"
I0703 17:25:52.681319       1 flags.go:52] FLAG: --log-backtrace-at=":0"
I0703 17:25:52.681326       1 flags.go:52] FLAG: --log-dir=""
I0703 17:25:52.681331       1 flags.go:52] FLAG: --log-file=""
I0703 17:25:52.681342       1 flags.go:52] FLAG: --log-file-max-size="1800"
I0703 17:25:52.681347       1 flags.go:52] FLAG: --logtostderr="true"
I0703 17:25:52.681352       1 flags.go:52] FLAG: --max-autoprovisioned-node-group-count="15"
I0703 17:25:52.681357       1 flags.go:52] FLAG: --max-bulk-soft-taint-count="10"
I0703 17:25:52.681362       1 flags.go:52] FLAG: --max-bulk-soft-taint-time="3s"
I0703 17:25:52.681367       1 flags.go:52] FLAG: --max-empty-bulk-delete="10"
I0703 17:25:52.681377       1 flags.go:52] FLAG: --max-failing-time="15m0s"
I0703 17:25:52.681386       1 flags.go:52] FLAG: --max-graceful-termination-sec="600"
I0703 17:25:52.681391       1 flags.go:52] FLAG: --max-inactivity="10m0s"
I0703 17:25:52.681396       1 flags.go:52] FLAG: --max-node-provision-time="15m0s"
I0703 17:25:52.681401       1 flags.go:52] FLAG: --max-nodes-total="0"
I0703 17:25:52.681405       1 flags.go:52] FLAG: --max-total-unready-percentage="45"
I0703 17:25:52.681433       1 flags.go:52] FLAG: --memory-total="0:6400000"
I0703 17:25:52.681439       1 flags.go:52] FLAG: --min-replica-count="0"
I0703 17:25:52.681443       1 flags.go:52] FLAG: --namespace="kube-system"
I0703 17:25:52.681448       1 flags.go:52] FLAG: --new-pod-scale-up-delay="0s"
I0703 17:25:52.681453       1 flags.go:52] FLAG: --node-autoprovisioning-enabled="false"
I0703 17:25:52.681457       1 flags.go:52] FLAG: --node-deletion-delay-timeout="2m0s"
I0703 17:25:52.681466       1 flags.go:52] FLAG: --node-group-auto-discovery="[asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/K3]"
I0703 17:25:52.681479       1 flags.go:52] FLAG: --nodes="[]"
I0703 17:25:52.681484       1 flags.go:52] FLAG: --ok-total-unready-count="3"
I0703 17:25:52.681489       1 flags.go:52] FLAG: --regional="false"
I0703 17:25:52.681494       1 flags.go:52] FLAG: --scale-down-candidates-pool-min-count="50"
I0703 17:25:52.681608       1 flags.go:52] FLAG: --scale-down-candidates-pool-ratio="0.1"
I0703 17:25:52.681638       1 flags.go:52] FLAG: --scale-down-delay-after-add="10m0s"
I0703 17:25:52.681645       1 flags.go:52] FLAG: --scale-down-delay-after-delete="0s"
I0703 17:25:52.681649       1 flags.go:52] FLAG: --scale-down-delay-after-failure="3m0s"
I0703 17:25:52.681653       1 flags.go:52] FLAG: --scale-down-enabled="true"
I0703 17:25:52.681657       1 flags.go:52] FLAG: --scale-down-gpu-utilization-threshold="0.5"
I0703 17:25:52.681665       1 flags.go:52] FLAG: --scale-down-non-empty-candidates-count="30"
I0703 17:25:52.681669       1 flags.go:52] FLAG: --scale-down-unneeded-time="10m0s"
I0703 17:25:52.681672       1 flags.go:52] FLAG: --scale-down-unready-time="20m0s"
I0703 17:25:52.681676       1 flags.go:52] FLAG: --scale-down-utilization-threshold="0.75"
I0703 17:25:52.681679       1 flags.go:52] FLAG: --scan-interval="10s"
I0703 17:25:52.681684       1 flags.go:52] FLAG: --skip-headers="false"
I0703 17:25:52.681694       1 flags.go:52] FLAG: --skip-log-headers="false"
I0703 17:25:52.681699       1 flags.go:52] FLAG: --skip-nodes-with-local-storage="false"
I0703 17:25:52.681704       1 flags.go:52] FLAG: --skip-nodes-with-system-pods="false"
I0703 17:25:52.681709       1 flags.go:52] FLAG: --stderrthreshold="0"
I0703 17:25:52.681713       1 flags.go:52] FLAG: --test.bench=""
I0703 17:25:52.681717       1 flags.go:52] FLAG: --test.benchmem="false"
I0703 17:25:52.681731       1 flags.go:52] FLAG: --test.benchtime="1s"
I0703 17:25:52.681735       1 flags.go:52] FLAG: --test.blockprofile=""
I0703 17:25:52.681738       1 flags.go:52] FLAG: --test.blockprofilerate="1"
I0703 17:25:52.681742       1 flags.go:52] FLAG: --test.count="1"
I0703 17:25:52.681746       1 flags.go:52] FLAG: --test.coverprofile=""
I0703 17:25:52.681752       1 flags.go:52] FLAG: --test.cpu=""
I0703 17:25:52.681758       1 flags.go:52] FLAG: --test.cpuprofile=""
I0703 17:25:52.681762       1 flags.go:52] FLAG: --test.failfast="false"
I0703 17:25:52.681765       1 flags.go:52] FLAG: --test.list=""
I0703 17:25:52.681770       1 flags.go:52] FLAG: --test.memprofile=""
I0703 17:25:52.681776       1 flags.go:52] FLAG: --test.memprofilerate="0"
I0703 17:25:52.681786       1 flags.go:52] FLAG: --test.mutexprofile=""
I0703 17:25:52.681790       1 flags.go:52] FLAG: --test.mutexprofilefraction="1"
I0703 17:25:52.681796       1 flags.go:52] FLAG: --test.outputdir=""
I0703 17:25:52.681809       1 flags.go:52] FLAG: --test.parallel="8"
I0703 17:25:52.681812       1 flags.go:52] FLAG: --test.run=""
I0703 17:25:52.681816       1 flags.go:52] FLAG: --test.short="false"
I0703 17:25:52.681822       1 flags.go:52] FLAG: --test.testlogfile=""
I0703 17:25:52.681825       1 flags.go:52] FLAG: --test.timeout="0s"
I0703 17:25:52.681829       1 flags.go:52] FLAG: --test.trace=""
I0703 17:25:52.681832       1 flags.go:52] FLAG: --test.v="false"
I0703 17:25:52.681835       1 flags.go:52] FLAG: --unremovable-node-recheck-timeout="5m0s"
I0703 17:25:52.681839       1 flags.go:52] FLAG: --v="4"
I0703 17:25:52.681845       1 flags.go:52] FLAG: --vmodule=""
I0703 17:25:52.681849       1 flags.go:52] FLAG: --write-status-configmap="true"
I0703 17:25:52.681869       1 main.go:354] Cluster Autoscaler 1.15.0

one of my autoscaling group CloudFormation templates:

  mixedspotinstm52xlargeasgsubnet02af43b02922e710f:
    DependsOn:
      - K3Cluster
      - mixedspotinstm52xlargesubnet02af43b02922e710fLaunchTemplate
    Properties:
      MaxSize: 7
      MinSize: 0
      MixedInstancesPolicy:
        InstancesDistribution:
          OnDemandBaseCapacity: 0
          OnDemandPercentageAboveBaseCapacity: 0
          SpotAllocationStrategy: lowest-price
          SpotInstancePools: 2
        LaunchTemplate:
          LaunchTemplateSpecification:
            LaunchTemplateId: !Ref 'mixedspotinstm52xlargesubnet02af43b02922e710fLaunchTemplate'
            Version: '1'
          Overrides:
            - InstanceType: t3.2xlarge
            - InstanceType: m5.2xlarge
      Tags:
        - Key: Name
          PropagateAtLaunch: true
          Value: K3 Cluster Node
        - Key: kubernetes.io/cluster/K3
          PropagateAtLaunch: true
          Value: owned
        - Key: k8s.io/cluster-autoscaler/enabled
          PropagateAtLaunch: true
          Value: you know it
        - Key: k8s.io/cluster-autoscaler/K3
          PropagateAtLaunch: true
          Value: you know it
      VPCZoneIdentifier:
        - subnet-02af43b02922e710f
    Type: AWS::AutoScaling::AutoScalingGroup
    UpdatePolicy: {}
  mixedspotinstm52xlargesubnet02af43b02922e710fLaunchTemplate:
    Properties:
      LaunchTemplateData:
        BlockDeviceMappings:
          - DeviceName: /dev/sda1
            Ebs:
              VolumeSize: 200
              VolumeType: gp2
        EbsOptimized: 'true'
        IamInstanceProfile:
          Arn: !GetAtt 'k3instanceprofile.Arn'
        ImageId: ami-02d5e7ca7bc498ef9
        InstanceType: t3.2xlarge
        SecurityGroupIds:
          - !GetAtt 'k3NodeSecurityGroup.GroupId'
        UserData: !Base64
          Fn::Sub: "#!/bin/bash\n      set -o xtrace\n      echo 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCo/nsu1N9kZ9oYC7bkuj4wz0CnmLhif+zv8Tuf7kPq/hzN4QBAgWT7XUMqG2sJlx+6SHS2zz1kQsklm9SAX+bNC25A2Iqe9J5TS8uiOdaZd2dKXQSeQsAt9TB1bqg2ZbjcJHol/qvfb1KkceMk1Kvhi7jimztbEwZaWyHRQRBUbl0AWnYUjwBo1RPaXu9mejkYSP8OOoYIjOhHAeL3pmJ+58dSLCN3kgXQuBfdb9Ap4R9YjxvVpXDmh6E1KmyeFXLq8Vm7GUePjHYPsie98oHArrvic7wsm3xAHg6IT0l0CRY53yb3gnVJZmUicEkmUj01xxpE2uI0H8kaezP89t/5'\
            \ >> /home/ec2-user/.ssh/authorized_keys\n      /etc/eks/bootstrap.sh\
            \ K3 --kubelet-extra-args '--node-labels=asg=mixedspotinstm52xlarge'\n\
            \      /opt/aws/bin/cfn-signal --exit-code $?                --stack ${AWS::StackName}\
            \                --resource mixedspotinstm52xlargeasgsubnet02af43b02922e710f\
            \                 --region ${AWS::Region}\n      "
      LaunchTemplateName: mixedspotinstm52xlargesubnet02af43b02922e710fkubenode
    Type: AWS::EC2::LaunchTemplate

xrl commented Jul 3, 2019

$ k -n kube-system get pods | grep autoscaler
cluster-autoscaler-aws-cluster-autoscaler-6c699845b4-nnxv6   1/1     Running   16         38h


Jeffwan commented Jul 16, 2019

MixedInstanceGroup is a feature we recently added to cluster autoscaler, so we decided not to cherry-pick it back to older branches. Another reason is that CA depends on the AWS SDK vendored via Kubernetes, and it's hard to bring the newer AWS SDK this feature needs into older k8s branches.

Let me run a test with this combination and track down the issue.


Jeffwan commented Aug 1, 2019

@xrl I do see this error on 1.15.0. The difference is that my CA doesn't crash and keeps working.
I will do some debugging to figure out the root cause.

E0801 21:55:19.022150       1 node_instances_cache.go:106] Failed to fetch cloud provider node instances for eksctl-ca-testing-nodegroup-spot-ng-NodeGroup-U5QMXV7HKKQ5, error <nil>


Jeffwan commented Aug 1, 2019

When the cluster scales up, the cache invalidates the record for that node group and refetches the node group info. This is a logging-only error and has been fixed here. I don't see any panic in 1.15.0. Could you try the latest 1.15 CA version and see if you still hit it?

e31575a
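
For readers unfamiliar with that code path, here is a minimal, self-contained sketch of the invalidate-then-refetch pattern being described. The type and function names below are illustrative only, not the real cluster-autoscaler API.

package main

import (
    "fmt"
    "sync"
)

// nodeInstancesCache caches instance IDs per node group and refetches an
// entry from the cloud provider after it has been invalidated.
type nodeInstancesCache struct {
    mu        sync.Mutex
    instances map[string][]string                    // node group ID -> instance IDs
    fetch     func(groupID string) ([]string, error) // cloud provider lookup
}

// InvalidateCacheEntry drops the cached instances for a node group, e.g.
// right after a scale-up, so the next read goes back to the cloud provider.
func (c *nodeInstancesCache) InvalidateCacheEntry(groupID string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    delete(c.instances, groupID)
}

// GetInstances returns cached instances, refetching on a cache miss.
func (c *nodeInstancesCache) GetInstances(groupID string) ([]string, error) {
    c.mu.Lock()
    defer c.mu.Unlock()
    if cached, ok := c.instances[groupID]; ok {
        return cached, nil
    }
    fresh, err := c.fetch(groupID)
    if err != nil {
        return nil, fmt.Errorf("failed to fetch node instances for %s: %v", groupID, err)
    }
    c.instances[groupID] = fresh
    return fresh, nil
}

func main() {
    c := &nodeInstancesCache{
        instances: map[string][]string{"my-asg": {"i-0abc"}},
        fetch:     func(string) ([]string, error) { return []string{"i-0abc", "i-0def"}, nil },
    }
    c.InvalidateCacheEntry("my-asg") // simulate a scale-up invalidating the entry
    ids, _ := c.GetInstances("my-asg")
    fmt.Println(ids) // [i-0abc i-0def]
}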


Jeffwan commented Aug 1, 2019

/sig aws


xrl commented Aug 3, 2019

I am on CA 1.15.1 and I'm still seeing the nil pointer dereference:

I0803 00:43:47.191887       1 static_autoscaler.go:187] Starting main loop
I0803 00:43:47.269999       1 auto_scaling_groups.go:320] Regenerating instance to ASG map for ASGs: [K3-EKS-ondemandr51aasgsubnet09df9044a965c5907-1T3QC9839XEQ5 K3-EKS-ondemandr51basgsubnet0d22e2495433092d1-1ICV405Z1VA7P K3-EKS-ondemandr51casgsubnet02af43b02922e710f-H5KHP9ZNKUU9 K3-EKS-spotc5xlasgsubnet02af43b02922e710f-1IM25A0SBNE9J K3-EKS-spotc5xlasgsubnet09df9044a965c5907-1QF1ZB8O03CSV K3-EKS-spotc5xlasgsubnet0d22e2495433092d1-TISDPU6HDQKN K3-EKS-spotr5xlasgsubnet02af43b02922e710f-10QH9H0C8PG7O K3-EKS-spotr5xlasgsubnet09df9044a965c5907-RVE356WBLUL0 K3-EKS-spotr5xlasgsubnet0d22e2495433092d1-17688IETW4IJP]
I0803 00:43:47.352781       1 aws_manager.go:255] Refreshed ASG list, next refresh after 2019-08-03 00:44:47.352774196 +0000 UTC m=+121.482117782
I0803 00:43:47.354065       1 utils.go:464] Removing autoscaler soft taint when creating template from node ip-172-30-51-116.eu-central-1.compute.internal
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0x2223f75]

goroutine 83 [running]:
k8s.io/autoscaler/cluster-autoscaler/clusterstate.(*ClusterStateRegistry).updateReadinessStats(0xc000fd46e0, 0xbf495410cb6f9530, 0xe4707579a, 0x4cb5be0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/clusterstate/clusterstate.go:576 +0x9a5
k8s.io/autoscaler/cluster-autoscaler/clusterstate.(*ClusterStateRegistry).UpdateNodes(0xc000fd46e0, 0xc009ae6a00, 0x2c, 0x40, 0xc00eff1a40, 0xbf495410cb6f9530, 0xe4707579a, 0x4cb5be0, 0x0, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/clusterstate/clusterstate.go:310 +0x227
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).updateClusterState(0xc0018b8140, 0xc009ae6a00, 0x2c, 0x40, 0xc00eff1a40, 0xbf495410cb6f9530, 0xe4707579a, 0x4cb5be0, 0xc009a73b60, 0x4)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:568 +0x94
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).RunOnce(0xc0018b8140, 0xbf495410cb6f9530, 0xe4707579a, 0x4cb5be0, 0x0, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:217 +0x5e6
main.run(0xc0002b4000)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:331 +0x296
main.main.func2(0x2ff6580, 0xc0006080c0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:403 +0x2a
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:200 +0xec

k8s-ci-robot added the area/provider/aws label (Issues or PRs related to aws provider) and removed the sig/aws label on Aug 6, 2019

mplaul commented Aug 6, 2019

I see the exact same error as @xrl. In my particular case, I aggressively deleted many of my deployments, causing a massive scale-down. Then I encountered this at some point:

I0806 13:06:50.592524       1 static_autoscaler.go:187] Starting main loop
I0806 13:06:50.592571       1 auto_scaling_groups.go:320] Regenerating instance to ASG map for ASGs: [<nodegrp1> <nodegrp2>]
I0806 13:06:50.748320       1 aws_manager.go:255] Refreshed ASG list, next refresh after 2019-08-06 13:07:50.748313672 +0000 UTC m=+2970.490662812
I0806 13:06:50.748531       1 utils.go:464] Removing autoscaler soft taint when creating template from node ip-<some-ip>.ec2.internal
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0x2223f75]

goroutine 58 [running]:
k8s.io/autoscaler/cluster-autoscaler/clusterstate.(*ClusterStateRegistry).updateReadinessStats(0xc000fab8c0, 0xbf4a7cbaa350dd49, 0x2a59d799885, 0x4cb5be0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/clusterstate/clusterstate.go:576 +0x9a5
k8s.io/autoscaler/cluster-autoscaler/clusterstate.(*ClusterStateRegistry).UpdateNodes(0xc000fab8c0, 0xc00079e160, 0x4, 0x4, 0xc000aac9c0, 0xbf4a7cbaa350dd49, 0x2a59d799885, 0x4cb5be0, 0x0, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/clusterstate/clusterstate.go:310 +0x227
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).updateClusterState(0xc00136c000, 0xc00079e160, 0x4, 0x4, 0xc000aac9c0, 0xbf4a7cbaa350dd49, 0x2a59d799885, 0x4cb5be0, 0xc00079e1a0, 0x4)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:568 +0x94
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).RunOnce(0xc00136c000, 0xbf4a7cbaa350dd49, 0x2a59d799885, 0x4cb5be0, 0x0, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:217 +0x5e6
main.run(0xc0004dde00)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:331 +0x296
main.main.func2(0x2ff6580, 0xc000355a40)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:403 +0x2a
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:200 +0xec

I am using v1.15.1
Any idea what is happening in the background? Missing permissions, ...? Any help appreciated.


Jeffwan commented Aug 9, 2019

Thanks. I can reproduce the issue. It happens for me a while after a node is scaled down.
It's similar to #2214.

I am trying to fix the issue.


Jeffwan commented Aug 10, 2019

I figured out the problem. When a node is scaled down, the ASG deletes the node. After a while (depending on when the ASG cache refreshes; by default we use 1 min), this runs:

cloudProviderNodeInstances, err := csr.getCloudProviderNodeInstances()
if err != nil {
    return err
}
notRegistered := getNotRegisteredNodes(nodes, cloudProviderNodeInstances, currentTime)

  1. Listers have been updated to the existing nodes (the deleted node is gone).
  2. The cluster state registry still has that node's instance in its cache, so it shows up as one unregistered node (see the sketch below).
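
To make step 2 concrete, here is a minimal, runnable sketch (made-up names, not the real getNotRegisteredNodes signature) of how a cached cloud provider instance with no matching Kubernetes node surfaces as an unregistered node:

package main

import "fmt"

// notRegistered returns the cloud provider instances that have no matching
// registered Kubernetes node, which is how a just-deleted node lingers as
// "unregistered" while the provider-side cache is still stale.
func notRegistered(cloudInstances, registeredNodes []string) []string {
    registered := make(map[string]bool, len(registeredNodes))
    for _, n := range registeredNodes {
        registered[n] = true
    }
    var unregistered []string
    for _, inst := range cloudInstances {
        if !registered[inst] {
            unregistered = append(unregistered, inst)
        }
    }
    return unregistered
}

func main() {
    // The cached ASG view still lists i-0dead, but its node object is gone.
    cloud := []string{"i-0alive", "i-0dead"}
    nodes := []string{"i-0alive"}
    fmt.Println(notRegistered(cloud, nodes)) // [i-0dead]
}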

When CA updates the readiness stats in updateReadinessStats:

nodeGroup, errNg := csr.cloudProvider.NodeGroupForNode(unregistered.Node)
if errNg != nil {
    klog.Warningf("Failed to get nodegroup for %s: %v", unregistered.Node.Name, errNg)
    continue
}
perNgCopy := perNodeGroup[nodeGroup.Id()]

It calls cloudprovider.NodeGroupForNode() to get the node group info. The AWS manager doesn't handle this case well: it should return an error rather than nil, nil.

asg := aws.awsManager.GetAsgForInstance(*ref)
if asg == nil {
    return nil, nil
}

I will submit a PR against the AWS provider code to fix that issue.
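
For illustration, here is a standalone sketch of the direction described above (return an error instead of nil, nil when no ASG is known). The types and names are made up for the example; this is not necessarily the exact change that the eventual PR will make.

package main

import (
    "errors"
    "fmt"
)

type asg struct{ name string }

type manager struct {
    asgForInstance map[string]*asg // instance ID -> ASG
}

// nodeGroupForInstance mirrors the shape of NodeGroupForNode, but returns an
// error instead of nil, nil when no ASG is known, so callers can log and skip
// the instance instead of dereferencing a nil node group.
func (m *manager) nodeGroupForInstance(instanceID string) (*asg, error) {
    group, ok := m.asgForInstance[instanceID]
    if !ok || group == nil {
        return nil, errors.New("unable to find ASG for instance " + instanceID)
    }
    return group, nil
}

func main() {
    m := &manager{asgForInstance: map[string]*asg{"i-0alive": {name: "my-asg"}}}
    if _, err := m.nodeGroupForInstance("i-0gone"); err != nil {
        fmt.Println("skipping:", err) // the caller continues instead of panicking
    }
}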


Jeffwan commented Aug 10, 2019

All releases after this PR are impacted.

4831d76

@MaciekPytel The AWS implementation definitely has an issue. Do you think we can also improve the caching part? Currently, csr only invalidates the cache entry on scale-up; should scale-down be considered too?


Jeffwan commented Aug 10, 2019

Looks like Azure also has this issue; see #2214.


xrl commented Aug 23, 2019

This was indeed fixed; I'm looking at a long uptime!

cluster-autoscaler-aws-cluster-autoscaler-5f75d89dcf-mhp7v   1/1     Running   0          20h

Thanks so much, this is awesome!


seh commented Sep 17, 2019

See #2345 for what I think is a related consequence of fixing this problem.


seh commented Sep 17, 2019

It calls cloudprovider.NodeGroupForNode() to get the node group info. The AWS manager doesn't handle this case well: it should return an error rather than nil, nil.

asg := aws.awsManager.GetAsgForInstance(*ref)
if asg == nil {
    return nil, nil
}

It's not clear to me why that was the right fix. Looking at the call site in (*clusterstate.ClusterStateRegistry).updateReadinessStats, it already handles NodeGroupForNode returning nil for the cloudprovider.NodeGroup value, albeit with a warning printed:

nodeGroup, errNg := csr.cloudProvider.NodeGroupForNode(unregistered.Node)
if errNg != nil {
    klog.Warningf("Failed to get nodegroup for %s: %v", unregistered.Node.Name, errNg)
    continue
}
if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
    klog.Warningf("Nodegroup is nil for %s", unregistered.Node.Name)
    continue
}


Jeffwan commented Sep 17, 2019

it already handles NodeGroupForNode returning nil for the cloudprovider.NodeGroup value, albeit with a warning printed:

It returns <nil, nil> in this case, and clusterstate.go only checks errNg, so it doesn't print a warning for this case and goes on to the next step, which is where the panic happens.


seh commented Sep 17, 2019

I'm confused. If NodeGroupForNode returns nil and nil, then we pass line 584 above, and on line 588 we find nodeGroup is nil, log a warning, and continue on.

Where does this panic occur? We won't reach line 592 in this loop.


Jeffwan commented Sep 18, 2019

@seh It seems the code has changed since we fixed this issue. At that time, it failed on the last line. See the permalink I attached.

nodeGroup, errNg := csr.cloudProvider.NodeGroupForNode(unregistered.Node)
if errNg != nil {
    klog.Warningf("Failed to get nodegroup for %s: %v", unregistered.Node.Name, errNg)
    continue
}
perNgCopy := perNodeGroup[nodeGroup.Id()]
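
To illustrate why that last line panics, here is a small runnable example with toy types (not the real cluster-autoscaler ones): the lookup returns nil, nil, only the error is checked, and calling Id() on the nil group is exactly the dereference in the stack traces above. The reflect-based guard mirrors the newer caller-side check seh quoted.

package main

import (
    "fmt"
    "reflect"
)

type nodeGroup interface{ Id() string }

type asgGroup struct{ id string }

func (g *asgGroup) Id() string { return g.id }

// lookupGroup mimics the old AWS behaviour: unknown instance => nil, nil.
func lookupGroup(known bool) (nodeGroup, error) {
    if !known {
        return nil, nil
    }
    return &asgGroup{id: "my-asg"}, nil
}

func main() {
    group, err := lookupGroup(false)
    if err != nil { // the 1.15 code only performed this check
        fmt.Println("error:", err)
        return
    }
    // Without this guard, group.Id() below panics with "invalid memory
    // address or nil pointer dereference", matching the reported crash.
    if group == nil || reflect.ValueOf(group).IsNil() {
        fmt.Println("nodegroup is nil, skipping")
        return
    }
    fmt.Println("group:", group.Id())
}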


Jeffwan commented Sep 18, 2019

I figured out the discrepancy. When I made the patch, I was looking at the 1.15 code that @xrl had the issue with; the permalink is to that branch.
