Autoscaler not respecting the taint tag in AWS #2434

Closed
dcherniv opened this issue Oct 9, 2019 · 30 comments
Labels
area/provider/aws Issues or PRs related to aws provider lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@dcherniv

dcherniv commented Oct 9, 2019

It seems that the cluster autoscaler ignores the taints in some cases and tries to scale up a tainted node pool in order to schedule a pod that has no toleration for the taint.

Example pod spec:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: eks.privileged
  labels:
    app: dev-pay-stub-services
    pod-template-hash: 64f54bbcd
  name: dev-pay-stub-services-64f54bbcd-d5z2r
  namespace: dev
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: dev-pay-stub-services-64f54bbcd
spec:
  containers:
  - env:
REDACTED
    image: REDACTED/pay-stub-services:0.9.1-5-g3d1358f
    imagePullPolicy: Always
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /
        port: http
        scheme: HTTP
      initialDelaySeconds: 30
      periodSeconds: 3
      successThreshold: 1
      timeoutSeconds: 30
    name: pay-stub-services
    ports:
    - containerPort: 5000
      name: http
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /ready
        port: http
        scheme: HTTP
      initialDelaySeconds: 30
      periodSeconds: 3
      successThreshold: 1
      timeoutSeconds: 30
    resources:
      limits:
        cpu: "2"
        memory: 4Gi
      requests:
        cpu: 1500m
        memory: 3Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
REDACTED
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - configMap:
      defaultMode: 420
      items:
      - key: newrelic.ini
        path: newrelic.ini
      name: dev-pay-stub-services
    name: newrelic-config
  - name: default-token-nj7fb
    secret:
      defaultMode: 420
      secretName: default-token-nj7fb

Autoscaler logs from the time. Strangely, it remembered that gpu-4 had a taint, yet it happily spun up gpu-0 from another ASG in a different AZ.


I1009 04:30:28.048387       1 utils.go:237] Pod dev-pay-stub-services-64f54bbcd-27xzb can't be scheduled on nonproduction01-spot-gpu-4, predicate failed: PodToleratesNodeTaints predicate mismatch, reason: node(s) had taints that the pod didn't tolerate
I1009 04:30:28.048459       1 utils.go:227] Pod dev-pay-stub-services-64f54bbcd-htm6g can't be scheduled on nonproduction01-spot-gpu-4. Used cached predicate check results
I1009 04:30:28.048470       1 scale_up.go:411] No pod can fit to nonproduction01-spot-gpu-4
I1009 04:30:28.048574       1 scale_up.go:338] Skipping node group nonproduction01-spot-gpu-6 - max size reached
I1009 04:30:28.048589       1 utils.go:237] Pod dev-pay-stub-services-64f54bbcd-27xzb can't be scheduled on nonproduction01-spot-gpu-7, predicate failed: PodToleratesNodeTaints predicate mismatch, reason: node(s) had taints that the pod didn't tolerate
I1009 04:30:28.048661       1 utils.go:227] Pod dev-pay-stub-services-64f54bbcd-htm6g can't be scheduled on nonproduction01-spot-gpu-7. Used cached predicate check results
I1009 04:30:28.048674       1 scale_up.go:411] No pod can fit to nonproduction01-spot-gpu-7
I1009 04:30:28.048679       1 scale_up.go:423] Best option to resize: nonproduction01-spot-gpu-0
I1009 04:30:28.048683       1 scale_up.go:427] Estimated 1 nodes needed in nonproduction01-spot-gpu-0
I1009 04:30:28.048922       1 scale_up.go:521] Splitting scale-up between 4 similar node groups: {nonproduction01-spot-gpu-0, nonproduction01-spot-gpu-1, nonproduction01-spot-gpu-2, nonproduction01-spot-gpu-3}
I1009 04:30:28.048933       1 scale_up.go:529] Final scale-up plan: [{nonproduction01-spot-gpu-0 0->1 (max: 1)}]
I1009 04:30:28.048945       1 scale_up.go:694] Scale-up: setting group nonproduction01-spot-gpu-0 size to 1
I1009 04:30:28.048971       1 auto_scaling_groups.go:211] Setting asg nonproduction01-spot-gpu-0 size to 1

Here are the settings for the node pool in AWS:
[screenshot of the node pool settings in the AWS console]

So, a couple of questions. Does the autoscaler only find out about the taints when it tries to spin up an instance from the node pool?
Judging by the logs, it appears there's some caching of the taints/labels going on. How long does it cache them?
Why did it spin up an instance from a tainted node pool?
How can I make sure that pods that don't tolerate the taints won't trigger a scaling action?

@Jeffwan
Contributor

Jeffwan commented Oct 9, 2019

/assign @Jeffwan

@Jeffwan
Contributor

Jeffwan commented Oct 11, 2019

Does the autoscaler only find out about the taints when it tries to spin up an instance from the node pool?

It uses scheduler predicates to check whether the pod can be scheduled on each candidate node group's template node.

Judging by the logs, it appears there's some caching of the taints/labels going on. How long does it cache them?

The results are cached for each run, roughly 10s.

Why did it spin up an instance from a tainted node pool?

Did you scale up from 0? If there is a node, can you describe it? I'd like to know whether you only tagged the ASG or also tainted the node itself. The ASG tag is used for the scale-from-0 case.

How can I make sure that pods that don't tolerate the taints won't trigger a scaling action?

If there's no qualified node group, it won't trigger a scale-up.
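
For illustration, a minimal sketch (not taken from this issue) of how an ASG can be tagged so that the autoscaler's template node carries the taint in the scale-from-0 case; the ASG name, taint key, and value are placeholders:

# Placeholder ASG name and taint key; the tag value uses the "<taint value>:<effect>" format.
aws autoscaling create-or-update-tags --tags \
  "ResourceId=my-tainted-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/taint/dedicated,Value=gpu:NoSchedule,PropagateAtLaunch=true"

Note that this tag only informs the autoscaler; it does not taint the nodes themselves (see the discussion further down).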

@Jeffwan
Contributor

Jeffwan commented Oct 11, 2019

/area aws

@k8s-ci-robot
Contributor

@Jeffwan: The label(s) area/aws cannot be applied. These labels are supported: api-review, community/discussion, community/maintenance, community/question, cuj/build-train-deploy, cuj/multi-user, platform/aws, platform/azure, platform/gcp, platform/minikube, platform/other

In response to this:

/area aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Jeffwan
Contributor

Jeffwan commented Oct 11, 2019

/area provider/aws

@k8s-ci-robot k8s-ci-robot added the area/provider/aws Issues or PRs related to aws provider label Oct 11, 2019
@d-baranowski

I'm having the same issue on v1.13.8 :(

@Jeffwan
Contributor

Jeffwan commented Nov 26, 2019

Em.. @d-baranowski, thanks for reporting. I will take a look.

@d-baranowski

d-baranowski commented Nov 27, 2019

I built a custom autoscaler container based off the v1.13.8 branch and deployed it to our cluster. I added a bunch of print statements trying to make sense of the behaviour. You can find the logs and the ASG definition in the following gist: https://gist.github.com/d-baranowski/0f727f1426df438f69e8c906adacc060

Oddly, the autoscaler appears to be picking up the taints and labels, but neither gets applied to the new nodes as I'd expect. Also, when considering the ASGs for scale-up, it marks them as unsuitable, despite the fact that, according to the label tag, the pod's node selector would match.

@rapuckett

Bump. AWS autoscaler (1.15.4) is still not applying node labels & taints to new instances.

@d-baranowski

@Jeffwan Did you have any luck with this?

@Jeffwan
Contributor

Jeffwan commented Jan 21, 2020

@rapuckett @d-baranowski The autoscaler won't apply labels & taints to the Node object; it fetches them from the ASG tags and constructs a template node with those taints in CA memory. Could you share what your ASG tags look like?

@d-baranowski

d-baranowski commented Jan 21, 2020

        {
            "ResourceType": "auto-scaling-group",
            "ResourceId": "spl-test-asg-monitoring-az2-cluster",
            "PropagateAtLaunch": true,
            "Value": "monitoring-only",
            "Key": "k8s.io/cluster-autoscaler/node-template/label/restrict"
        },
        {
            "ResourceType": "auto-scaling-group",
            "ResourceId": "spl-test-asg-monitoring-az2-cluster",
            "PropagateAtLaunch": true,
            "Value": "true:NoSchedule",
            "Key": "k8s.io/cluster-autoscaler/node-template/taint/monitoring"
        },

@karolinepauls

karolinepauls commented Feb 28, 2020

@Jeffwan

The autoscaler won't apply labels & taints to the Node object; it fetches them from the ASG tags and constructs a template node with those taints in CA memory.

So, to be 100% clear: if an autoscaling group is tagged (e.g. "Key": "k8s.io/cluster-autoscaler/node-template/taint/spot", "Value": "true:NoSchedule", "PropagateAtLaunch": true), do we expect cluster-autoscaler to understand that spinning up a new instance in that group will result in the instance being tainted with the same key/value, and therefore not to do so when it needs to fit a pod that doesn't tolerate the taint in question?

However, when a node from that group is spun up for a different reason (e.g. when other pods with matching affinity and tolerations need to be accommodated), the taints have to be applied to those nodes in some other way, for example as arguments to the kubelet?

In other words, the k8s.io/cluster-autoscaler/node-template/taint/* tags only tell the autoscaler what it should expect nodes to look like when they are spun up?

If the above is correct, it should be documented - I can make a PR to the AWS FAQ document (not sure about others). If not, the real behaviour still needs to be documented but maybe someone else should do it.

@seh

seh commented Feb 28, 2020

Yes, @karolinepauls, your understanding is correct: The cluster autoscaler does not establish or change taints on Kubernetes Node objects. The resource tags on the EC2 autoscaling groups are there to tell the autoscaler which taints you as the cluster operator intend to apply to the Nodes for the EC2 instances that these ASGs will create. How you apply those taints is up to you, but yes, setting them in the kubelet's --register-with-taints command-line flag is a common technique.

@oscar60310
Contributor

@seh, @karolinepauls thank you, it's clearer now. I also think we should add a description of the node-template tags to the FAQ.

BTW, @d-baranowski, if you need to add labels/taints to a node when it bootstraps with the amazon-eks-node AMI, you can try adding them via --kubelet-extra-args, as in the sketch below.
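
For example, a sketch of instance user data that calls the EKS AMI's bootstrap script, reusing the label and taint from the ASG tags shown above; the cluster name is a placeholder:

#!/bin/bash
# Placeholder cluster name; the label and taint mirror the node-template ASG tags above.
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--node-labels=restrict=monitoring-only --register-with-taints=monitoring=true:NoSchedule'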

@karolinepauls

karolinepauls commented Mar 23, 2020

EDIT: Actually, it seems that I shouldn't have used CriticalAddonsOnly, because that's a "rescheduler taint" and it's ignored by CA. I have to use my own taint.

I have 3 OnDemand ASGs and 3 Spot ASGs. The OnDemand ASGs are tainted with CriticalAddonsOnly=true:NoSchedule. This should prevent the cluster-autoscaler from scaling up the on-demand instances to allocate pods that don't tolerate the taint.

$ aws autoscaling describe-auto-scaling-groups --query '[AutoScalingGroups[*].AutoScalingGroupName]' --output text | xargs -n1 echo | grep cluster
cluster1-ondemand-eu-west-1a20200312110536898200000017
cluster1-ondemand-eu-west-1b2020031211053694470000001b
cluster1-ondemand-eu-west-1c20200312110536895300000016
cluster1-spot-eu-west-1a20200312110536924700000018
cluster1-spot-eu-west-1b20200312110536925200000019
cluster1-spot-eu-west-1c2020031211053694290000001a
$ aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names cluster1-ondemand-eu-west-1a20200312110536898200000017 cluster1-ondemand-eu-west-1b2020031211053694470000001b cluster1-ondemand-eu-west-1c20200312110536895300000016 --query 'AutoScalingGroups[*].[AutoScalingGroupName,Tags[*].[Key,Value,PropagateAtLaunch]]' --output text | grep -e cluster | column -t
cluster1-ondemand-eu-west-1a20200312110536898200000017
Name                                                                   kube-cluster1    True
k8s.io/cluster-autoscaler/cluster1                                     cluster1         False
k8s.io/cluster-autoscaler/enabled                                      true             False
k8s.io/cluster-autoscaler/node-template/label/kubernetes.io/lifecycle  ondemand         True
k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage    100Gi            False
k8s.io/cluster-autoscaler/node-template/taint/CriticalAddonsOnly       true:NoSchedule  True
kubernetes.io/cluster/cluster1                                         owned            True
cluster1-ondemand-eu-west-1b2020031211053694470000001b
Name                                                                   kube-cluster1    True
k8s.io/cluster-autoscaler/cluster1                                     cluster1         False
k8s.io/cluster-autoscaler/enabled                                      true             False
k8s.io/cluster-autoscaler/node-template/label/kubernetes.io/lifecycle  ondemand         True
k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage    100Gi            False
k8s.io/cluster-autoscaler/node-template/taint/CriticalAddonsOnly       true:NoSchedule  True
kubernetes.io/cluster/cluster1                                         owned            True
cluster1-ondemand-eu-west-1c20200312110536895300000016
Name                                                                   kube-cluster1    True
k8s.io/cluster-autoscaler/cluster1                                     cluster1         False
k8s.io/cluster-autoscaler/enabled                                      true             False
k8s.io/cluster-autoscaler/node-template/label/kubernetes.io/lifecycle  ondemand         True
k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage    100Gi            False
k8s.io/cluster-autoscaler/node-template/taint/CriticalAddonsOnly       true:NoSchedule  True
kubernetes.io/cluster/cluster1                                         owned            True
$ aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names cluster1-spot-eu-west-1a20200312110536924700000018 cluster1-spot-eu-west-1b20200312110536925200000019 cluster1-spot-eu-west-1c2020031211053694290000001a --query 'AutoScalingGroups[*].[AutoScalingGroupName,Tags[*].[Key,Value,PropagateAtLaunch]]' --output text | grep -e cluster | column -t
cluster1-spot-eu-west-1a20200312110536924700000018
Name                                                                   kube-cluster1  True
k8s.io/cluster-autoscaler/cluster1                                     cluster1       False
k8s.io/cluster-autoscaler/enabled                                      true           False
k8s.io/cluster-autoscaler/node-template/label/kubernetes.io/lifecycle  spot           True
k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage    100Gi          False
kubernetes.io/cluster/cluster1                                         owned          True
cluster1-spot-eu-west-1b20200312110536925200000019
Name                                                                   kube-cluster1  True
k8s.io/cluster-autoscaler/cluster1                                     cluster1       False
k8s.io/cluster-autoscaler/enabled                                      true           False
k8s.io/cluster-autoscaler/node-template/label/kubernetes.io/lifecycle  spot           True
k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage    100Gi          False
kubernetes.io/cluster/cluster1                                         owned          True
cluster1-spot-eu-west-1c2020031211053694290000001a
Name                                                                   kube-cluster1  True
k8s.io/cluster-autoscaler/cluster1                                     cluster1       False
k8s.io/cluster-autoscaler/enabled                                      true           False
k8s.io/cluster-autoscaler/node-template/label/kubernetes.io/lifecycle  spot           True
k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage    100Gi          False
kubernetes.io/cluster/cluster1                                         owned          True

However, when trying to allocate space for a new pod managed by a StatefulSet, with unallocated storage, cluster-autoscaler seems to be confused about storage and ends up expanding a node group tagged with k8s.io/cluster-autoscaler/node-template/taint/CriticalAddonsOnly=true:NoSchedule, which no pod will tolerate by default.

I0323 14:24:39.755508       1 scale_up.go:263] Pod default/my-release-postgresql-0 is unschedulable
I0323 14:24:39.755557       1 scale_up.go:300] Upcoming 0 nodes
I0323 14:24:39.755586       1 predicates.go:440] PVC default/data-my-release-postgresql-0 is not bound, assuming PVC matches predicate when counting limits
I0323 14:24:39.756172       1 scheduler_binder.go:697] No matching volumes for Pod "default/my-release-postgresql-0", PVC "default/data-my-release-postgresql-0" on node "template-node-for-cluster1-ondemand-eu-west-1a20200312110536898200000017-2945094011860928531"
I0323 14:24:39.756197       1 scheduler_binder.go:752] Provisioning for claims of pod "default/my-release-postgresql-0" that has no matching volumes on node "template-node-for-cluster1-ondemand-eu-west-1a20200312110536898200000017-2945094011860928531" ...
I0323 14:24:39.756219       1 predicates.go:440] PVC default/data-my-release-postgresql-0 is not bound, assuming PVC matches predicate when counting limits
I0323 14:24:39.756233       1 csi_volume_predicate.go:135] Persistent volume had no name for claim default/data-my-release-postgresql-0
I0323 14:24:39.756241       1 predicates.go:440] PVC default/data-my-release-postgresql-0 is not bound, assuming PVC matches predicate when counting limits
I0323 14:24:39.756293       1 predicates.go:440] PVC default/data-my-release-postgresql-0 is not bound, assuming PVC matches predicate when counting limits
I0323 14:24:39.756658       1 scheduler_binder.go:697] No matching volumes for Pod "default/my-release-postgresql-0", PVC "default/data-my-release-postgresql-0" on node "template-node-for-cluster1-ondemand-eu-west-1b2020031211053694470000001b-2172180488295428279"
I0323 14:24:39.756682       1 scheduler_binder.go:752] Provisioning for claims of pod "default/my-release-postgresql-0" that has no matching volumes on node "template-node-for-cluster1-ondemand-eu-west-1b2020031211053694470000001b-2172180488295428279" ...
I0323 14:24:39.756701       1 predicates.go:440] PVC default/data-my-release-postgresql-0 is not bound, assuming PVC matches predicate when counting limits
I0323 14:24:39.756712       1 csi_volume_predicate.go:135] Persistent volume had no name for claim default/data-my-release-postgresql-0
I0323 14:24:39.756718       1 predicates.go:440] PVC default/data-my-release-postgresql-0 is not bound, assuming PVC matches predicate when counting limits
I0323 14:24:39.756753       1 predicates.go:440] PVC default/data-my-release-postgresql-0 is not bound, assuming PVC matches predicate when counting limits
I0323 14:24:39.757126       1 scheduler_binder.go:697] No matching volumes for Pod "default/my-release-postgresql-0", PVC "default/data-my-release-postgresql-0" on node "template-node-for-cluster1-ondemand-eu-west-1c20200312110536895300000016-2920899871041967569"
I0323 14:24:39.757145       1 scheduler_binder.go:752] Provisioning for claims of pod "default/my-release-postgresql-0" that has no matching volumes on node "template-node-for-cluster1-ondemand-eu-west-1c20200312110536895300000016-2920899871041967569" ...
I0323 14:24:39.757163       1 predicates.go:440] PVC default/data-my-release-postgresql-0 is not bound, assuming PVC matches predicate when counting limits
I0323 14:24:39.757176       1 csi_volume_predicate.go:135] Persistent volume had no name for claim default/data-my-release-postgresql-0
I0323 14:24:39.757189       1 predicates.go:440] PVC default/data-my-release-postgresql-0 is not bound, assuming PVC matches predicate when counting limits
I0323 14:24:39.757228       1 predicates.go:440] PVC default/data-my-release-postgresql-0 is not bound, assuming PVC matches predicate when counting limits
I0323 14:24:39.757565       1 scheduler_binder.go:697] No matching volumes for Pod "default/my-release-postgresql-0", PVC "default/data-my-release-postgresql-0" on node "template-node-for-cluster1-spot-eu-west-1a20200312110536924700000018-3126217379181517886"
I0323 14:24:39.757587       1 scheduler_binder.go:752] Provisioning for claims of pod "default/my-release-postgresql-0" that has no matching volumes on node "template-node-for-cluster1-spot-eu-west-1a20200312110536924700000018-3126217379181517886" ...
I0323 14:24:39.757609       1 predicates.go:440] PVC default/data-my-release-postgresql-0 is not bound, assuming PVC matches predicate when counting limits
I0323 14:24:39.757619       1 csi_volume_predicate.go:135] Persistent volume had no name for claim default/data-my-release-postgresql-0
I0323 14:24:39.757624       1 predicates.go:440] PVC default/data-my-release-postgresql-0 is not bound, assuming PVC matches predicate when counting limits
I0323 14:24:39.757662       1 predicates.go:440] PVC default/data-my-release-postgresql-0 is not bound, assuming PVC matches predicate when counting limits
I0323 14:24:39.758025       1 scheduler_binder.go:697] No matching volumes for Pod "default/my-release-postgresql-0", PVC "default/data-my-release-postgresql-0" on node "template-node-for-cluster1-spot-eu-west-1b20200312110536925200000019-4660979458788806340"
I0323 14:24:39.758047       1 scheduler_binder.go:752] Provisioning for claims of pod "default/my-release-postgresql-0" that has no matching volumes on node "template-node-for-cluster1-spot-eu-west-1b20200312110536925200000019-4660979458788806340" ...
I0323 14:24:39.758064       1 predicates.go:440] PVC default/data-my-release-postgresql-0 is not bound, assuming PVC matches predicate when counting limits
I0323 14:24:39.758082       1 csi_volume_predicate.go:135] Persistent volume had no name for claim default/data-my-release-postgresql-0
I0323 14:24:39.758092       1 predicates.go:440] PVC default/data-my-release-postgresql-0 is not bound, assuming PVC matches predicate when counting limits
I0323 14:24:39.758131       1 predicates.go:440] PVC default/data-my-release-postgresql-0 is not bound, assuming PVC matches predicate when counting limits
I0323 14:24:39.758496       1 scheduler_binder.go:697] No matching volumes for Pod "default/my-release-postgresql-0", PVC "default/data-my-release-postgresql-0" on node "template-node-for-cluster1-spot-eu-west-1c2020031211053694290000001a-7420554225850925066"
I0323 14:24:39.758514       1 scheduler_binder.go:752] Provisioning for claims of pod "default/my-release-postgresql-0" that has no matching volumes on node "template-node-for-cluster1-spot-eu-west-1c2020031211053694290000001a-7420554225850925066" ...
I0323 14:24:39.758528       1 predicates.go:440] PVC default/data-my-release-postgresql-0 is not bound, assuming PVC matches predicate when counting limits
I0323 14:24:39.758539       1 csi_volume_predicate.go:135] Persistent volume had no name for claim default/data-my-release-postgresql-0
I0323 14:24:39.758545       1 predicates.go:440] PVC default/data-my-release-postgresql-0 is not bound, assuming PVC matches predicate when counting limits
I0323 14:24:39.758588       1 scale_up.go:423] Best option to resize: cluster1-ondemand-eu-west-1a20200312110536898200000017
I0323 14:24:39.758603       1 scale_up.go:427] Estimated 1 nodes needed in cluster1-ondemand-eu-west-1a20200312110536898200000017
I0323 14:24:39.758833       1 scale_up.go:521] Splitting scale-up between 3 similar node groups: {cluster1-ondemand-eu-west-1a20200312110536898200000017, cluster1-ondemand-eu-west-1b2020031211053694470000001b, cluster1-ondemand-eu-west-1c20200312110536895300000016}
I0323 14:24:39.758855       1 scale_up.go:529] Final scale-up plan: [{cluster1-ondemand-eu-west-1a20200312110536898200000017 1->2 (max: 10)}]
I0323 14:24:39.758871       1 scale_up.go:694] Scale-up: setting group cluster1-ondemand-eu-west-1a20200312110536898200000017 size to 2
I0323 14:24:39.758922       1 auto_scaling_groups.go:221] Setting asg cluster1-ondemand-eu-west-1a20200312110536898200000017 size to 2

It looks like the volume confusion prevents cluster-autoscaler from considering taints?

For reference, the pod fails to schedule after the scale-ups it triggers; it later manages to schedule when something else causes an on-demand instance group to expand.

Events:
  Type     Reason                  Age                From                                                 Message
  ----     ------                  ----               ----                                                 -------
  Normal   TriggeredScaleUp        27m                cluster-autoscaler                                   pod triggered scale-up: [{cluster1-ondemand-eu-west-1b2020031211053694470000001b 1->2 (max: 10)}]
  Warning  FailedScheduling        27m (x4 over 27m)  default-scheduler                                    0/10 nodes are available: 3 node(s) had taints that the pod didn't tolerate, 7 Insufficient cpu.
  Normal   TriggeredScaleUp        27m                cluster-autoscaler                                   pod triggered scale-up: [{cluster1-spot-eu-west-1b20200312110536925200000019 1->2 (max: 20)}]
  Warning  FailedScheduling        26m (x5 over 26m)  default-scheduler                                    0/11 nodes are available: 4 node(s) had taints that the pod didn't tolerate, 7 Insufficient cpu.
  Warning  FailedScheduling        25m (x5 over 25m)  default-scheduler                                    0/12 nodes are available: 5 node(s) had taints that the pod didn't tolerate, 7 Insufficient cpu.
  Normal   Scheduled               25m                default-scheduler                                    Successfully assigned default/my-release-postgresql-0 to ip-10-0-188-213.eu-west-1.compute.internal
  Normal   SuccessfulAttachVolume  25m                attachdetach-controller                              AttachVolume.Attach succeeded for volume "pvc-c6df6fd5-6d33-11ea-8fc7-0218e6b2b19a"
  Normal   Pulling                 25m                kubelet, ip-10-0-188-213.eu-west-1.compute.internal  Pulling image "docker.io/bitnami/minideb:latest"
  Normal   Pulled                  25m                kubelet, ip-10-0-188-213.eu-west-1.compute.internal  Successfully pulled image "docker.io/bitnami/minideb:latest"
  Normal   Created                 25m                kubelet, ip-10-0-188-213.eu-west-1.compute.internal  Created container init-chmod-data
  Normal   Started                 25m                kubelet, ip-10-0-188-213.eu-west-1.compute.internal  Started container init-chmod-data
  Normal   Pulling                 25m                kubelet, ip-10-0-188-213.eu-west-1.compute.internal  Pulling image "docker.io/bitnami/postgresql:11.3.0-debian-9-r38"
  Normal   Pulled                  25m                kubelet, ip-10-0-188-213.eu-west-1.compute.internal  Successfully pulled image "docker.io/bitnami/postgresql:11.3.0-debian-9-r38"
  Normal   Created                 25m                kubelet, ip-10-0-188-213.eu-west-1.compute.internal  Created container my-release-postgresql
  Normal   Started                 25m                kubelet, ip-10-0-188-213.eu-west-1.compute.internal  Started container my-release-postgresql

Cluster Autoscaler 1.14.6
EKS 1.14

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 21, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 21, 2020
@ghost

ghost commented Aug 7, 2020

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 7, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 5, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 5, 2020
@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 5, 2020
@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 16, 2020
@lawliet89

lawliet89 commented Dec 16, 2020

Landed here after observing the same issue. According to eksctl docs, you need to add the Effect to the tag value too.

        {
            "ResourceType": "auto-scaling-group",
            "ResourceId": "spl-test-asg-monitoring-az2-cluster",
            "PropagateAtLaunch": true,
            "Value": "monitoring-only:NoSchedule",
            "Key": "k8s.io/cluster-autoscaler/node-template/label/restrict"
        },

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 16, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 15, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@awoimbee

I know it's a rather old issue, but why is this detail important:

Did you scale up from 0?

If I tell CA that my node group is tainted and that new nodes will be tainted, who cares whether I'm scaling from 0, 1, or 20?

This is preventing me from implementing partial/graceful updates to my cluster, since the autoscaler keeps scaling up the outdated (and tainted) node group.

@sleterrier

sleterrier commented Mar 12, 2023

@awoimbee, from cluster-autoscaler documentation:

When scaling up from 0 nodes, the Cluster Autoscaler reads ASG tags to derive information about the specifications of the nodes, i.e. the labels and taints in that ASG. Note that it does not actually apply these labels or taints - this is done by an AWS-generated user data script. It gives the Cluster Autoscaler information about whether pending pods will be able to be scheduled should a new node be spun up for a particular ASG, with the assumption that the ASG tags accurately reflect the labels/taints actually applied.
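
Given that, one hedged way to sanity-check that the ASG tags and the taints actually applied to nodes stay in sync, assuming a node from that ASG already exists (the node and ASG names below are placeholders):

# Compare the taints actually on the node with the ASG's node-template taint tags.
kubectl get node ip-10-0-0-1.eu-west-1.compute.internal -o jsonpath='{.spec.taints}'
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names my-asg \
  --query 'AutoScalingGroups[0].Tags[?starts_with(Key, `k8s.io/cluster-autoscaler/node-template/taint/`)]'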

@tomthetommy

Same issue. I'm on CA 1.24 / EKS 1.24. My understanding was that the autoscaler can now describe nodegroups in order to view tags/taints. It seems to be viewing these tags (as I can scale from 0 without the template tags on the ASGs), but not the taints.

It'll spin up a node, and then complain that the taint isn't tolerated by the pod.

@abstrask

Same issue. I'm on CA 1.24 / EKS 1.24. My understanding was that the autoscaler can now describe nodegroups in order to view tags/taints. It seems to be viewing these tags (as I can scale from 0 without the template tags on the ASGs), but not the taints.

It'll spin up a node, and then complain that the taint isn't tolerated by the pod.

While that's true, the taints are not being read correctly (see #6481).

My colleague, @wcarlsen, and I believe we have a fix for this (PR #6482).
