nil pointer dereference while fetching node instances for aws asg #2161

Closed
xrl opened this issue Jul 3, 2019 · 18 comments · Fixed by #2247
Labels: area/provider/aws (Issues or PRs related to aws provider)

xrl commented Jul 3, 2019

I am seeing this error:

I0703 17:25:41.236830       1 event.go:258] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"4e7022af-9c7f-11e9-a1d4-06346ba69dd2", APIVersion:"v1", ResourceVersion:"5074894", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O size to 3
I0703 17:25:41.281175       1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"staging", Name:"merckgroup-dj-misc-9c55878cc-vj7nr", UID:"91b85597-9db7-11e9-a1d4-06346ba69dd2", APIVersion:"v1", ResourceVersion:"5074967", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O 2->3 (max: 7)}]
I0703 17:25:41.281214       1 event.go:258] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"4e7022af-9c7f-11e9-a1d4-06346ba69dd2", APIVersion:"v1", ResourceVersion:"5074894", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: group K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O size set to 3
I0703 17:25:41.281226       1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"datascience", Name:"proxy-6678dc85f6-q2cdc", UID:"8fa7bd40-9db7-11e9-a1d4-06346ba69dd2", APIVersion:"v1", ResourceVersion:"5074922", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O 2->3 (max: 7)}]
I0703 17:25:41.281272       1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"staging", Name:"merckgroup-dj-replication-dc7596ccf-6294t", UID:"930a91d7-9db7-11e9-a1d4-06346ba69dd2", APIVersion:"v1", ResourceVersion:"5074999", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O 2->3 (max: 7)}]
I0703 17:25:41.281303       1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"staging", Name:"merckgroup-dj-salesforce-6fd96cfddd-zqlhj", UID:"9420fd2c-9db7-11e9-a1d4-06346ba69dd2", APIVersion:"v1", ResourceVersion:"5075027", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O 2->3 (max: 7)}]
I0703 17:25:41.281380       1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"datascience", Name:"hub-5fb7c84ccb-85ns8", UID:"8ea23011-9db7-11e9-a1d4-06346ba69dd2", APIVersion:"v1", ResourceVersion:"5074887", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O 2->3 (max: 7)}]
I0703 17:25:43.503908       1 node_instances_cache.go:155] Start refreshing cloud provider node instances cache
I0703 17:25:43.503948       1 node_instances_cache.go:167] Refresh cloud provider node instances cache finished, refresh took 16.505µs
I0703 17:25:51.293772       1 static_autoscaler.go:187] Starting main loop
I0703 17:25:51.336472       1 auto_scaling_groups.go:320] Regenerating instance to ASG map for ASGs: [K3-EKS-mixedspotinstm52xlargeasgsubnet02af43b02922e710f-QR5SWRK0N1Q0 K3-EKS-mixedspotinstm52xlargeasgsubnet09df9044a965c5907-78SRQM8MXOR6 K3-EKS-mixedspotinstm52xlargeasgsubnet0d22e2495433092d1-7ZH6F3I2Y90L K3-EKS-mixedspotinstt32xlargeasgsubnet02af43b02922e710f-10SJB4NDVE9VX K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O K3-EKS-mixedspotinstt32xlargeasgsubnet0d22e2495433092d1-8EG7GH4K6BDQ K3-EKS-ondemandasgsubnet02af43b02922e710f-YFGY7WRUEUQ1]
I0703 17:25:51.394755       1 aws_manager.go:255] Refreshed ASG list, next refresh after 2019-07-03 17:26:51.394747295 +0000 UTC m=+788.208984766
E0703 17:25:51.455558       1 node_instances_cache.go:106] Failed to fetch cloud provider node instances for K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O, error <nil>
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0x2222355]

goroutine 65 [running]:
k8s.io/autoscaler/cluster-autoscaler/clusterstate.(*ClusterStateRegistry).updateReadinessStats(0xc001796160, 0xbf3f5727d1822665, 0xa9869e13b3, 0x4cb2be0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/clusterstate/clusterstate.go:576 +0x9a5
k8s.io/autoscaler/cluster-autoscaler/clusterstate.(*ClusterStateRegistry).UpdateNodes(0xc001796160, 0xc001ca2e40, 0x7, 0x8, 0xc000e32090, 0xbf3f5727d1822665, 0xa9869e13b3, 0x4cb2be0, 0x0, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/clusterstate/clusterstate.go:310 +0x227
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).updateClusterState(0xc000c28820, 0xc001ca2e40, 0x7, 0x8, 0xc000e32090, 0xbf3f5727d1822665, 0xa9869e13b3, 0x4cb2be0, 0xc000e36cc0, 0x4)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:568 +0x94
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).RunOnce(0xc000c28820, 0xbf3f5727d1822665, 0xa9869e13b3, 0x4cb2be0, 0x0, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:217 +0x5e6
main.run(0xc0000da000)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:331 +0x296
main.main.func2(0x2ff4fc0, 0xc0004be180)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:403 +0x2a
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:200 +0xec

I have been aggressively scaling up my workloads in this new kube cluster. I am using the autoscaler helm chart with these values:

autoDiscovery:
  clusterName: K3
image:
  repository: k8s.gcr.io/cluster-autoscaler
  tag: v1.15.0
rbac:
  create: true
# must override: https://github.com/kubernetes/autoscaler/issues/2139
sslCertHostPath: /etc/ssl/certs/ca-bundle.crt
podAnnotations:
  iam.amazonaws.com/role: k8s-autoscaler
extraArgs:
  v: 4
  stderrthreshold: info
  logtostderr: true
  # write-status-configmap: true
  # leader-elect: true
  skip-nodes-with-local-storage: "false"
  expander: most-pods
  # scale-down-enabled: true
  # balance-similar-node-groups: true
  # min-replica-count: 2
  scale-down-utilization-threshold: 0.75
  # scale-down-non-empty-candidates-count: 5
  # max-node-provision-time: 15m0s
  # scan-interval: 10s
  # scale-down-delay: 10m
  # scale-down-unneeded-time: 10m
  skip-nodes-with-local-storage: "false"
  skip-nodes-with-system-pods: "false"
awsRegion: eu-central-1

and I am running autoscaler 1.15.0 (I know that's mixing versions with EKS 1.13.7, but it's the only way to work with MixedInstanceGroup ASGs).

I0703 17:25:52.680391       1 flags.go:52] FLAG: --address=":8085"
I0703 17:25:52.681144       1 flags.go:52] FLAG: --alsologtostderr="false"
I0703 17:25:52.681163       1 flags.go:52] FLAG: --balance-similar-node-groups="false"
I0703 17:25:52.681169       1 flags.go:52] FLAG: --cloud-config=""
I0703 17:25:52.681174       1 flags.go:52] FLAG: --cloud-provider="aws"
I0703 17:25:52.681187       1 flags.go:52] FLAG: --cloud-provider-gce-lb-src-cidrs="130.211.0.0/22,209.85.152.0/22,209.85.204.0/22,35.191.0.0/16"
I0703 17:25:52.681197       1 flags.go:52] FLAG: --cluster-name=""
I0703 17:25:52.681202       1 flags.go:52] FLAG: --cores-total="0:320000"
I0703 17:25:52.681211       1 flags.go:52] FLAG: --estimator="binpacking"
I0703 17:25:52.681217       1 flags.go:52] FLAG: --expander="most-pods"
I0703 17:25:52.681229       1 flags.go:52] FLAG: --expendable-pods-priority-cutoff="-10"
I0703 17:25:52.681234       1 flags.go:52] FLAG: --filter-out-schedulable-pods-uses-packing="true"
I0703 17:25:52.681240       1 flags.go:52] FLAG: --gpu-total="[]"
I0703 17:25:52.681246       1 flags.go:52] FLAG: --ignore-daemonsets-utilization="false"
I0703 17:25:52.681251       1 flags.go:52] FLAG: --ignore-mirror-pods-utilization="false"
I0703 17:25:52.681261       1 flags.go:52] FLAG: --ignore-taint="[]"
I0703 17:25:52.681270       1 flags.go:52] FLAG: --kubeconfig=""
I0703 17:25:52.681275       1 flags.go:52] FLAG: --kubernetes=""
I0703 17:25:52.681279       1 flags.go:52] FLAG: --leader-elect="true"
I0703 17:25:52.681288       1 flags.go:52] FLAG: --leader-elect-lease-duration="15s"
I0703 17:25:52.681297       1 flags.go:52] FLAG: --leader-elect-renew-deadline="10s"
I0703 17:25:52.681307       1 flags.go:52] FLAG: --leader-elect-resource-lock="endpoints"
I0703 17:25:52.681314       1 flags.go:52] FLAG: --leader-elect-retry-period="2s"
I0703 17:25:52.681319       1 flags.go:52] FLAG: --log-backtrace-at=":0"
I0703 17:25:52.681326       1 flags.go:52] FLAG: --log-dir=""
I0703 17:25:52.681331       1 flags.go:52] FLAG: --log-file=""
I0703 17:25:52.681342       1 flags.go:52] FLAG: --log-file-max-size="1800"
I0703 17:25:52.681347       1 flags.go:52] FLAG: --logtostderr="true"
I0703 17:25:52.681352       1 flags.go:52] FLAG: --max-autoprovisioned-node-group-count="15"
I0703 17:25:52.681357       1 flags.go:52] FLAG: --max-bulk-soft-taint-count="10"
I0703 17:25:52.681362       1 flags.go:52] FLAG: --max-bulk-soft-taint-time="3s"
I0703 17:25:52.681367       1 flags.go:52] FLAG: --max-empty-bulk-delete="10"
I0703 17:25:52.681377       1 flags.go:52] FLAG: --max-failing-time="15m0s"
I0703 17:25:52.681386       1 flags.go:52] FLAG: --max-graceful-termination-sec="600"
I0703 17:25:52.681391       1 flags.go:52] FLAG: --max-inactivity="10m0s"
I0703 17:25:52.681396       1 flags.go:52] FLAG: --max-node-provision-time="15m0s"
I0703 17:25:52.681401       1 flags.go:52] FLAG: --max-nodes-total="0"
I0703 17:25:52.681405       1 flags.go:52] FLAG: --max-total-unready-percentage="45"
I0703 17:25:52.681433       1 flags.go:52] FLAG: --memory-total="0:6400000"
I0703 17:25:52.681439       1 flags.go:52] FLAG: --min-replica-count="0"
I0703 17:25:52.681443       1 flags.go:52] FLAG: --namespace="kube-system"
I0703 17:25:52.681448       1 flags.go:52] FLAG: --new-pod-scale-up-delay="0s"
I0703 17:25:52.681453       1 flags.go:52] FLAG: --node-autoprovisioning-enabled="false"
I0703 17:25:52.681457       1 flags.go:52] FLAG: --node-deletion-delay-timeout="2m0s"
I0703 17:25:52.681466       1 flags.go:52] FLAG: --node-group-auto-discovery="[asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/K3]"
I0703 17:25:52.681479       1 flags.go:52] FLAG: --nodes="[]"
I0703 17:25:52.681484       1 flags.go:52] FLAG: --ok-total-unready-count="3"
I0703 17:25:52.681489       1 flags.go:52] FLAG: --regional="false"
I0703 17:25:52.681494       1 flags.go:52] FLAG: --scale-down-candidates-pool-min-count="50"
I0703 17:25:52.681608       1 flags.go:52] FLAG: --scale-down-candidates-pool-ratio="0.1"
I0703 17:25:52.681638       1 flags.go:52] FLAG: --scale-down-delay-after-add="10m0s"
I0703 17:25:52.681645       1 flags.go:52] FLAG: --scale-down-delay-after-delete="0s"
I0703 17:25:52.681649       1 flags.go:52] FLAG: --scale-down-delay-after-failure="3m0s"
I0703 17:25:52.681653       1 flags.go:52] FLAG: --scale-down-enabled="true"
I0703 17:25:52.681657       1 flags.go:52] FLAG: --scale-down-gpu-utilization-threshold="0.5"
I0703 17:25:52.681665       1 flags.go:52] FLAG: --scale-down-non-empty-candidates-count="30"
I0703 17:25:52.681669       1 flags.go:52] FLAG: --scale-down-unneeded-time="10m0s"
I0703 17:25:52.681672       1 flags.go:52] FLAG: --scale-down-unready-time="20m0s"
I0703 17:25:52.681676       1 flags.go:52] FLAG: --scale-down-utilization-threshold="0.75"
I0703 17:25:52.681679       1 flags.go:52] FLAG: --scan-interval="10s"
I0703 17:25:52.681684       1 flags.go:52] FLAG: --skip-headers="false"
I0703 17:25:52.681694       1 flags.go:52] FLAG: --skip-log-headers="false"
I0703 17:25:52.681699       1 flags.go:52] FLAG: --skip-nodes-with-local-storage="false"
I0703 17:25:52.681704       1 flags.go:52] FLAG: --skip-nodes-with-system-pods="false"
I0703 17:25:52.681709       1 flags.go:52] FLAG: --stderrthreshold="0"
I0703 17:25:52.681713       1 flags.go:52] FLAG: --test.bench=""
I0703 17:25:52.681717       1 flags.go:52] FLAG: --test.benchmem="false"
I0703 17:25:52.681731       1 flags.go:52] FLAG: --test.benchtime="1s"
I0703 17:25:52.681735       1 flags.go:52] FLAG: --test.blockprofile=""
I0703 17:25:52.681738       1 flags.go:52] FLAG: --test.blockprofilerate="1"
I0703 17:25:52.681742       1 flags.go:52] FLAG: --test.count="1"
I0703 17:25:52.681746       1 flags.go:52] FLAG: --test.coverprofile=""
I0703 17:25:52.681752       1 flags.go:52] FLAG: --test.cpu=""
I0703 17:25:52.681758       1 flags.go:52] FLAG: --test.cpuprofile=""
I0703 17:25:52.681762       1 flags.go:52] FLAG: --test.failfast="false"
I0703 17:25:52.681765       1 flags.go:52] FLAG: --test.list=""
I0703 17:25:52.681770       1 flags.go:52] FLAG: --test.memprofile=""
I0703 17:25:52.681776       1 flags.go:52] FLAG: --test.memprofilerate="0"
I0703 17:25:52.681786       1 flags.go:52] FLAG: --test.mutexprofile=""
I0703 17:25:52.681790       1 flags.go:52] FLAG: --test.mutexprofilefraction="1"
I0703 17:25:52.681796       1 flags.go:52] FLAG: --test.outputdir=""
I0703 17:25:52.681809       1 flags.go:52] FLAG: --test.parallel="8"
I0703 17:25:52.681812       1 flags.go:52] FLAG: --test.run=""
I0703 17:25:52.681816       1 flags.go:52] FLAG: --test.short="false"
I0703 17:25:52.681822       1 flags.go:52] FLAG: --test.testlogfile=""
I0703 17:25:52.681825       1 flags.go:52] FLAG: --test.timeout="0s"
I0703 17:25:52.681829       1 flags.go:52] FLAG: --test.trace=""
I0703 17:25:52.681832       1 flags.go:52] FLAG: --test.v="false"
I0703 17:25:52.681835       1 flags.go:52] FLAG: --unremovable-node-recheck-timeout="5m0s"
I0703 17:25:52.681839       1 flags.go:52] FLAG: --v="4"
I0703 17:25:52.681845       1 flags.go:52] FLAG: --vmodule=""
I0703 17:25:52.681849       1 flags.go:52] FLAG: --write-status-configmap="true"
I0703 17:25:52.681869       1 main.go:354] Cluster Autoscaler 1.15.0

one of my autoscaling group CloudFormation templates:

  mixedspotinstm52xlargeasgsubnet02af43b02922e710f:
    DependsOn:
      - K3Cluster
      - mixedspotinstm52xlargesubnet02af43b02922e710fLaunchTemplate
    Properties:
      MaxSize: 7
      MinSize: 0
      MixedInstancesPolicy:
        InstancesDistribution:
          OnDemandBaseCapacity: 0
          OnDemandPercentageAboveBaseCapacity: 0
          SpotAllocationStrategy: lowest-price
          SpotInstancePools: 2
        LaunchTemplate:
          LaunchTemplateSpecification:
            LaunchTemplateId: !Ref 'mixedspotinstm52xlargesubnet02af43b02922e710fLaunchTemplate'
            Version: '1'
          Overrides:
            - InstanceType: t3.2xlarge
            - InstanceType: m5.2xlarge
      Tags:
        - Key: Name
          PropagateAtLaunch: true
          Value: K3 Cluster Node
        - Key: kubernetes.io/cluster/K3
          PropagateAtLaunch: true
          Value: owned
        - Key: k8s.io/cluster-autoscaler/enabled
          PropagateAtLaunch: true
          Value: you know it
        - Key: k8s.io/cluster-autoscaler/K3
          PropagateAtLaunch: true
          Value: you know it
      VPCZoneIdentifier:
        - subnet-02af43b02922e710f
    Type: AWS::AutoScaling::AutoScalingGroup
    UpdatePolicy: {}
  mixedspotinstm52xlargesubnet02af43b02922e710fLaunchTemplate:
    Properties:
      LaunchTemplateData:
        BlockDeviceMappings:
          - DeviceName: /dev/sda1
            Ebs:
              VolumeSize: 200
              VolumeType: gp2
        EbsOptimized: 'true'
        IamInstanceProfile:
          Arn: !GetAtt 'k3instanceprofile.Arn'
        ImageId: ami-02d5e7ca7bc498ef9
        InstanceType: t3.2xlarge
        SecurityGroupIds:
          - !GetAtt 'k3NodeSecurityGroup.GroupId'
        UserData: !Base64
          Fn::Sub: "#!/bin/bash\n      set -o xtrace\n      echo 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCo/nsu1N9kZ9oYC7bkuj4wz0CnmLhif+zv8Tuf7kPq/hzN4QBAgWT7XUMqG2sJlx+6SHS2zz1kQsklm9SAX+bNC25A2Iqe9J5TS8uiOdaZd2dKXQSeQsAt9TB1bqg2ZbjcJHol/qvfb1KkceMk1Kvhi7jimztbEwZaWyHRQRBUbl0AWnYUjwBo1RPaXu9mejkYSP8OOoYIjOhHAeL3pmJ+58dSLCN3kgXQuBfdb9Ap4R9YjxvVpXDmh6E1KmyeFXLq8Vm7GUePjHYPsie98oHArrvic7wsm3xAHg6IT0l0CRY53yb3gnVJZmUicEkmUj01xxpE2uI0H8kaezP89t/5'\
            \ >> /home/ec2-user/.ssh/authorized_keys\n      /etc/eks/bootstrap.sh\
            \ K3 --kubelet-extra-args '--node-labels=asg=mixedspotinstm52xlarge'\n\
            \      /opt/aws/bin/cfn-signal --exit-code $?                --stack ${AWS::StackName}\
            \                --resource mixedspotinstm52xlargeasgsubnet02af43b02922e710f\
            \                 --region ${AWS::Region}\n      "
      LaunchTemplateName: mixedspotinstm52xlargesubnet02af43b02922e710fkubenode
    Type: AWS::EC2::LaunchTemplate

xrl commented Jul 3, 2019

$ k -n kube-system get pods | grep autoscaler
cluster-autoscaler-aws-cluster-autoscaler-6c699845b4-nnxv6   1/1     Running   16         38h


Jeffwan commented Jul 16, 2019

MixedInstanceGroup is a feature we recently added to cluster autoscaler, so we decided not to cherry-pick it back to older branches. Another reason is that CA depends on the AWS SDK vendored via Kubernetes, and it's hard to bring the newer AWS SDK this feature needs into older k8s branches.

Let me run a test with this combination and track down the issue.


Jeffwan commented Aug 1, 2019

@xrl I do see this error on 1.15.0. The difference is that my CA doesn't crash and keeps working.
I will do some debugging to figure out the root cause.

E0801 21:55:19.022150       1 node_instances_cache.go:106] Failed to fetch cloud provider node instances for eksctl-ca-testing-nodegroup-spot-ng-NodeGroup-U5QMXV7HKKQ5, error <nil>


Jeffwan commented Aug 1, 2019

When the cluster scales up, the cache invalidates the record for that node group and refetches the node group info. This is a logging-only error and has been fixed here. I don't see any panic in 1.15.0. Could you try the latest 1.15 CA version and see if you still hit it?

e31575a
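
For readers unfamiliar with that code path, here is a minimal, self-contained sketch of the invalidate-then-refetch pattern being described. The type and function names below are illustrative only, not the real cluster-autoscaler API.

package main

import (
    "fmt"
    "sync"
)

// nodeInstancesCache caches instance IDs per node group and refetches an
// entry from the cloud provider after it has been invalidated.
type nodeInstancesCache struct {
    mu        sync.Mutex
    instances map[string][]string                    // node group ID -> instance IDs
    fetch     func(groupID string) ([]string, error) // cloud provider lookup
}

// InvalidateCacheEntry drops the cached instances for a node group, e.g.
// right after a scale-up, so the next read goes back to the cloud provider.
func (c *nodeInstancesCache) InvalidateCacheEntry(groupID string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    delete(c.instances, groupID)
}

// GetInstances returns cached instances, refetching on a cache miss.
func (c *nodeInstancesCache) GetInstances(groupID string) ([]string, error) {
    c.mu.Lock()
    defer c.mu.Unlock()
    if cached, ok := c.instances[groupID]; ok {
        return cached, nil
    }
    fresh, err := c.fetch(groupID)
    if err != nil {
        return nil, fmt.Errorf("failed to fetch node instances for %s: %v", groupID, err)
    }
    c.instances[groupID] = fresh
    return fresh, nil
}

func main() {
    c := &nodeInstancesCache{
        instances: map[string][]string{"my-asg": {"i-0abc"}},
        fetch:     func(string) ([]string, error) { return []string{"i-0abc", "i-0def"}, nil },
    }
    c.InvalidateCacheEntry("my-asg") // simulate a scale-up invalidating the entry
    ids, _ := c.GetInstances("my-asg")
    fmt.Println(ids) // [i-0abc i-0def]
}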


Jeffwan commented Aug 1, 2019

/sig aws


xrl commented Aug 3, 2019

I am on CA 1.15.1 and I'm still seeing the nil pointer dereference:

I0803 00:43:47.191887       1 static_autoscaler.go:187] Starting main loop
I0803 00:43:47.269999       1 auto_scaling_groups.go:320] Regenerating instance to ASG map for ASGs: [K3-EKS-ondemandr51aasgsubnet09df9044a965c5907-1T3QC9839XEQ5 K3-EKS-ondemandr51basgsubnet0d22e2495433092d1-1ICV405Z1VA7P K3-EKS-ondemandr51casgsubnet02af43b02922e710f-H5KHP9ZNKUU9 K3-EKS-spotc5xlasgsubnet02af43b02922e710f-1IM25A0SBNE9J K3-EKS-spotc5xlasgsubnet09df9044a965c5907-1QF1ZB8O03CSV K3-EKS-spotc5xlasgsubnet0d22e2495433092d1-TISDPU6HDQKN K3-EKS-spotr5xlasgsubnet02af43b02922e710f-10QH9H0C8PG7O K3-EKS-spotr5xlasgsubnet09df9044a965c5907-RVE356WBLUL0 K3-EKS-spotr5xlasgsubnet0d22e2495433092d1-17688IETW4IJP]
I0803 00:43:47.352781       1 aws_manager.go:255] Refreshed ASG list, next refresh after 2019-08-03 00:44:47.352774196 +0000 UTC m=+121.482117782
I0803 00:43:47.354065       1 utils.go:464] Removing autoscaler soft taint when creating template from node ip-172-30-51-116.eu-central-1.compute.internal
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0x2223f75]

goroutine 83 [running]:
k8s.io/autoscaler/cluster-autoscaler/clusterstate.(*ClusterStateRegistry).updateReadinessStats(0xc000fd46e0, 0xbf495410cb6f9530, 0xe4707579a, 0x4cb5be0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/clusterstate/clusterstate.go:576 +0x9a5
k8s.io/autoscaler/cluster-autoscaler/clusterstate.(*ClusterStateRegistry).UpdateNodes(0xc000fd46e0, 0xc009ae6a00, 0x2c, 0x40, 0xc00eff1a40, 0xbf495410cb6f9530, 0xe4707579a, 0x4cb5be0, 0x0, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/clusterstate/clusterstate.go:310 +0x227
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).updateClusterState(0xc0018b8140, 0xc009ae6a00, 0x2c, 0x40, 0xc00eff1a40, 0xbf495410cb6f9530, 0xe4707579a, 0x4cb5be0, 0xc009a73b60, 0x4)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:568 +0x94
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).RunOnce(0xc0018b8140, 0xbf495410cb6f9530, 0xe4707579a, 0x4cb5be0, 0x0, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:217 +0x5e6
main.run(0xc0002b4000)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:331 +0x296
main.main.func2(0x2ff6580, 0xc0006080c0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:403 +0x2a
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:200 +0xec

k8s-ci-robot added the area/provider/aws label (Issues or PRs related to aws provider) and removed the sig/aws label on Aug 6, 2019

mplaul commented Aug 6, 2019

I see the exact same error as @xrl. In my particular case, I aggressively deleted many of my deployments, causing a massive scale-down. Then I encountered this at some point:

I0806 13:06:50.592524       1 static_autoscaler.go:187] Starting main loop
I0806 13:06:50.592571       1 auto_scaling_groups.go:320] Regenerating instance to ASG map for ASGs: [<nodegrp1> <nodegrp2>]
I0806 13:06:50.748320       1 aws_manager.go:255] Refreshed ASG list, next refresh after 2019-08-06 13:07:50.748313672 +0000 UTC m=+2970.490662812
I0806 13:06:50.748531       1 utils.go:464] Removing autoscaler soft taint when creating template from node ip-<some-ip>.ec2.internal
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0x2223f75]

goroutine 58 [running]:
k8s.io/autoscaler/cluster-autoscaler/clusterstate.(*ClusterStateRegistry).updateReadinessStats(0xc000fab8c0, 0xbf4a7cbaa350dd49, 0x2a59d799885, 0x4cb5be0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/clusterstate/clusterstate.go:576 +0x9a5
k8s.io/autoscaler/cluster-autoscaler/clusterstate.(*ClusterStateRegistry).UpdateNodes(0xc000fab8c0, 0xc00079e160, 0x4, 0x4, 0xc000aac9c0, 0xbf4a7cbaa350dd49, 0x2a59d799885, 0x4cb5be0, 0x0, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/clusterstate/clusterstate.go:310 +0x227
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).updateClusterState(0xc00136c000, 0xc00079e160, 0x4, 0x4, 0xc000aac9c0, 0xbf4a7cbaa350dd49, 0x2a59d799885, 0x4cb5be0, 0xc00079e1a0, 0x4)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:568 +0x94
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).RunOnce(0xc00136c000, 0xbf4a7cbaa350dd49, 0x2a59d799885, 0x4cb5be0, 0x0, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:217 +0x5e6
main.run(0xc0004dde00)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:331 +0x296
main.main.func2(0x2ff6580, 0xc000355a40)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:403 +0x2a
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:200 +0xec

I am using v1.15.1
Any idea what is happening in the background? Missing permissions, ...? Any help appreciated.


Jeffwan commented Aug 9, 2019

Thanks. I can reproduce the issue. It happens for me a while after a node is scaled down.
It's similar to #2214.

I am trying to fix the issue.


Jeffwan commented Aug 10, 2019

I figured out the problem. When a node is scaled down, the ASG deletes the node. After a while (depending on when the ASG cache refreshes; by default we use 1 min), this runs:

cloudProviderNodeInstances, err := csr.getCloudProviderNodeInstances()
if err != nil {
    return err
}
notRegistered := getNotRegisteredNodes(nodes, cloudProviderNodeInstances, currentTime)

  1. Listers have been updated to the existing nodes (the deleted node is gone).
  2. The cluster state registry still has that node's instance in its cache, so it shows up as one unregistered node (see the sketch below).
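
To make step 2 concrete, here is a minimal, runnable sketch (made-up names, not the real getNotRegisteredNodes signature) of how a cached cloud provider instance with no matching Kubernetes node surfaces as an unregistered node:

package main

import "fmt"

// notRegistered returns the cloud provider instances that have no matching
// registered Kubernetes node, which is how a just-deleted node lingers as
// "unregistered" while the provider-side cache is still stale.
func notRegistered(cloudInstances, registeredNodes []string) []string {
    registered := make(map[string]bool, len(registeredNodes))
    for _, n := range registeredNodes {
        registered[n] = true
    }
    var unregistered []string
    for _, inst := range cloudInstances {
        if !registered[inst] {
            unregistered = append(unregistered, inst)
        }
    }
    return unregistered
}

func main() {
    // The cached ASG view still lists i-0dead, but its node object is gone.
    cloud := []string{"i-0alive", "i-0dead"}
    nodes := []string{"i-0alive"}
    fmt.Println(notRegistered(cloud, nodes)) // [i-0dead]
}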

When CA updates the readiness stats in updateReadinessStats:

nodeGroup, errNg := csr.cloudProvider.NodeGroupForNode(unregistered.Node)
if errNg != nil {
    klog.Warningf("Failed to get nodegroup for %s: %v", unregistered.Node.Name, errNg)
    continue
}
perNgCopy := perNodeGroup[nodeGroup.Id()]

It calls cloudprovider.NodeGroupForNode() to get the node group info. The AWS manager doesn't handle this case well: it should return an error rather than nil, nil.

asg := aws.awsManager.GetAsgForInstance(*ref)
if asg == nil {
    return nil, nil
}

I will submit a PR against the AWS provider code to fix that issue.
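
For illustration, here is a standalone sketch of the direction described above (return an error instead of nil, nil when no ASG is known). The types and names are made up for the example; this is not necessarily the exact change that the eventual PR will make.

package main

import (
    "errors"
    "fmt"
)

type asg struct{ name string }

type manager struct {
    asgForInstance map[string]*asg // instance ID -> ASG
}

// nodeGroupForInstance mirrors the shape of NodeGroupForNode, but returns an
// error instead of nil, nil when no ASG is known, so callers can log and skip
// the instance instead of dereferencing a nil node group.
func (m *manager) nodeGroupForInstance(instanceID string) (*asg, error) {
    group, ok := m.asgForInstance[instanceID]
    if !ok || group == nil {
        return nil, errors.New("unable to find ASG for instance " + instanceID)
    }
    return group, nil
}

func main() {
    m := &manager{asgForInstance: map[string]*asg{"i-0alive": {name: "my-asg"}}}
    if _, err := m.nodeGroupForInstance("i-0gone"); err != nil {
        fmt.Println("skipping:", err) // the caller continues instead of panicking
    }
}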


Jeffwan commented Aug 10, 2019

All releases after this PR are impacted.

4831d76

@MaciekPytel The AWS implementation definitely has an issue. Do you think we can also improve the caching part? Currently, csr only invalidates the cache entry on scale-up; should scale-down be considered too?


Jeffwan commented Aug 10, 2019

Looks like Azure also has this issue; see #2214.


xrl commented Aug 23, 2019

This was indeed fixed; I'm looking at a long uptime!

cluster-autoscaler-aws-cluster-autoscaler-5f75d89dcf-mhp7v   1/1     Running   0          20h

Thanks so much, this is awesome!


seh commented Sep 17, 2019

See #2345 for what I think is a related consequence of fixing this problem.


seh commented Sep 17, 2019

It calls cloudprovider.NodeGroupForNode() to get the node group info. The AWS manager doesn't handle this case well: it should return an error rather than nil, nil.

asg := aws.awsManager.GetAsgForInstance(*ref)
if asg == nil {
    return nil, nil
}

It's not clear to me why that was the right fix. Looking at the call site in (*clusterstate.ClusterStateRegistry).updateReadinessStats, it already handles NodeGroupForNode returning nil for the cloudprovider.NodeGroup value, albeit with a warning printed:

nodeGroup, errNg := csr.cloudProvider.NodeGroupForNode(unregistered.Node)
if errNg != nil {
    klog.Warningf("Failed to get nodegroup for %s: %v", unregistered.Node.Name, errNg)
    continue
}
if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
    klog.Warningf("Nodegroup is nil for %s", unregistered.Node.Name)
    continue
}


Jeffwan commented Sep 17, 2019

it already handles NodeGroupForNode returning nil for the cloudprovider.NodeGroup value, albeit with a warning printed:

It returns <nil, nil> in this case, and clusterstate.go only checks errNg, so it doesn't print a warning for this case and goes on to the next step, which is where the panic happens.


seh commented Sep 17, 2019

I'm confused. If NodeGroupForNode returns nil and nil, then we pass line 584 above, and on line 588 we find nodeGroup is nil, log a warning, and continue on.

Where does this panic occur? We won't reach line 592 in this loop.


Jeffwan commented Sep 18, 2019

@seh It seems the code has changed since we fixed this issue. At that time, it failed on the last line. See the permalink I attached.

nodeGroup, errNg := csr.cloudProvider.NodeGroupForNode(unregistered.Node)
if errNg != nil {
    klog.Warningf("Failed to get nodegroup for %s: %v", unregistered.Node.Name, errNg)
    continue
}
perNgCopy := perNodeGroup[nodeGroup.Id()]
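
To illustrate why that last line panics, here is a small runnable example with toy types (not the real cluster-autoscaler ones): the lookup returns nil, nil, only the error is checked, and calling Id() on the nil group is exactly the dereference in the stack traces above. The reflect-based guard mirrors the newer caller-side check seh quoted.

package main

import (
    "fmt"
    "reflect"
)

type nodeGroup interface{ Id() string }

type asgGroup struct{ id string }

func (g *asgGroup) Id() string { return g.id }

// lookupGroup mimics the old AWS behaviour: unknown instance => nil, nil.
func lookupGroup(known bool) (nodeGroup, error) {
    if !known {
        return nil, nil
    }
    return &asgGroup{id: "my-asg"}, nil
}

func main() {
    group, err := lookupGroup(false)
    if err != nil { // the 1.15 code only performed this check
        fmt.Println("error:", err)
        return
    }
    // Without this guard, group.Id() below panics with "invalid memory
    // address or nil pointer dereference", matching the reported crash.
    if group == nil || reflect.ValueOf(group).IsNil() {
        fmt.Println("nodegroup is nil, skipping")
        return
    }
    fmt.Println("group:", group.Id())
}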


Jeffwan commented Sep 18, 2019

I figured out the discrepancy. When I made the patch, I was looking at the 1.15 code that @xrl had the issue with; the permalink is to that branch.
