
Scale up windows on AWS EKS cluster #3133

Closed
chmielas opened this issue May 14, 2020 · 40 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@chmielas

Hi,
I'm using Kubernetes based on EKS 1.15 with a Windows node group, the VPC controller and webhook,
and cluster-autoscaler v1.15.6.

The problem I have is similar to #2888.
When the ASG needs to be scaled from 0 to 2 instances after a couple of days of inactivity, the autoscaler doesn't trigger a scale-up.

The workaround is to set the minimum size of the ASG to 1. In that case, the autoscaler has no problem scaling up and down.
The problem still occurs after updating to v1.15.6.
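(For illustration, a minimal boto3 sketch of that workaround; the ASG name below is a placeholder, not the actual node group name.)

```python
# Sketch of the "minimum size = 1" workaround: keep one Windows node warm so
# the autoscaler never has to scale the group from zero.
import boto3

autoscaling = boto3.client("autoscaling")
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="eks-windows-nodegroup-asg",  # placeholder ASG name
    MinSize=1,
)
```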

Here is the pod output:

Name:                 job-038erq28k
Namespace:            default
Priority:             10000
Priority Class Name:  low-priority
Node:                 <none>
Labels:               app=my-eks-job
                      platform=WINDOWS
Annotations:          kubernetes.io/psp: eks.privileged
Status:               Pending
IP:
IPs:                  <none>
Controlled By:        Job/job-038e11d2
Init Containers:
  init-container:
    Image:      myimage:latest
    Port:       <none>
    Host Port:  <none>
    Limits:
      cpu:     250m
      memory:  300Mi
    Requests:
      cpu:     250m
      memory:  300Mi
    Mounts:
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from mytoken (ro)
Containers:
  main-container:
    Image:      myimage:latest
    Port:       <none>
    Host Port:  <none>
    Limits:
      cpu:                                   7
      memory:                                15000Mi
      vpc.amazonaws.com/PrivateIPv4Address:  1
    Requests:
      cpu:                                   7
      memory:                                15000Mi
      vpc.amazonaws.com/PrivateIPv4Address:  1
    Mounts:
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from mytoken (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  mytoken:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  mytoken
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  beta.kubernetes.io/os=windows
Tolerations:     dedicated=WINDOWS:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason             Age                     From                Message
  ----     ------             ----                    ----                -------
  Warning  FailedScheduling   41m (x14 over 60m)      default-scheduler   0/31 nodes are available: 21 Insufficient memory, 31 Insufficient cpu, 31 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 31 node(s) didn't match node selector.
  Warning  FailedScheduling   31m (x19 over 65m)      default-scheduler   0/31 nodes are available: 20 Insufficient memory, 30 Insufficient cpu, 31 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 31 node(s) didn't match node selector.
  Normal   NotTriggerScaleUp  16m (x1672 over 13h)    cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 8 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 4 Insufficient cpu, 2 max limit reached
  Normal   NotTriggerScaleUp  6m31s (x1720 over 13h)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 4 Insufficient cpu, 8 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 2 max limit reached
  Normal   NotTriggerScaleUp  89s (x301 over 13h)     cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 max limit reached, 8 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 4 Insufficient cpu
  Warning  FailedScheduling   60s (x22 over 62m)      default-scheduler   0/30 nodes are available: 20 Insufficient memory, 30 Insufficient cpu, 30 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 30 node(s) didn't match node selector.

and some logs from the autoscaler:

I0513 06:49:01.806775       1 utils.go:229] Pod job-038erq28k can't be scheduled on linux-node-asg-20191203023958042900000018, predicate failed: PodFitsResources predicate mismatch, reason: Insufficient cpu, Insufficient vpc.amazonaws.com/PrivateIPv4Address, 
I0513 06:49:01.806973       1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"job-038erq28k", UID:"<removed>", APIVersion:"v1", ResourceVersion:"73451570", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added): 4 Insufficient cpu, 8 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 2 max limit reached
@jjhidalgar commented Jun 16, 2020

Can confirm, on EKS 1.16.
If I manually update the desired instances to 1, then it works. Even after auto-downscaling to 0, it can scale up again.

But if you never had any instances, it doesn't work.

I haven't tried waiting a few days after downscaling to 0; it may stop working again.

@iusergii

Have the same issue with 1.17.

@Jeffwan (Contributor) commented Aug 5, 2020

Did you put the labels in the ASG tags?

@Jeffwan (Contributor) commented Aug 5, 2020

This should be resolved in the last release. See #2888.

@Jeffwan (Contributor) commented Aug 5, 2020

/assign @Jeffwan

@iusergii commented Aug 5, 2020

@Jeffwan yes, it scales up if you already have at least one node up.
external-dns image: 0.7.2-debian-10-r46
EKS: 1.17.6

@Jeffwan (Contributor) commented Aug 5, 2020

@iusergii

Scaling from 0 should work as well. Could you share your ASG tags?

@iusergii

@Jeffwan Here are my tags:

Name: eks-windows-node-1a-Node
alpha.eksctl.io/cluster-name: eks	
alpha.eksctl.io/eksctl-version: 0.24.0
alpha.eksctl.io/nodegroup-name: windows-node-1a	
alpha.eksctl.io/nodegroup-type: unmanaged
eksctl.cluster.k8s.io/v1alpha1/cluster-name: eks	
eksctl.io/v1alpha2/nodegroup-name: windows-node-1a	
k8s.io/cluster-autoscaler/eks: owned	
k8s.io/cluster-autoscaler/enabled: true	
k8s.io/cluster-autoscaler/node-template/label/windows-node: 1a	
k8s.io/cluster-autoscaler/node-template/taint/windows: true:NoSchedule	
kubernetes.io/cluster/eks: owned	

As you can see, I also have taints on these nodes.

@Jeffwan (Contributor) commented Aug 10, 2020

@iusergii

Can you add these tags? Otherwise CA won't know your nodes have ENI and IP address resources. Please check #2888 (comment) for more details.

k8s.io/cluster-autoscaler/node-template/label/beta.kubernetes.io/os windows
k8s.io/cluster-autoscaler/node-template/label/os windows
k8s.io/cluster-autoscaler/node-template/resources/vpc.amazonaws.com/ENI 1
k8s.io/cluster-autoscaler/node-template/resources/vpc.amazonaws.com/PrivateIPv4Address 14
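(For illustration, a boto3 sketch of applying those tags to the ASG; the ASG name is taken from the logs later in this thread and the values mirror the suggestion above, so adjust both to your node group.)

```python
# Sketch: tag the Windows ASG so cluster-autoscaler can build a node template
# (OS labels plus vpc.amazonaws.com resources) when scaling the group from zero.
import boto3

ASG_NAME = "eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6"  # adjust to your ASG

node_template_tags = {
    "k8s.io/cluster-autoscaler/node-template/label/beta.kubernetes.io/os": "windows",
    "k8s.io/cluster-autoscaler/node-template/label/os": "windows",
    "k8s.io/cluster-autoscaler/node-template/resources/vpc.amazonaws.com/ENI": "1",
    "k8s.io/cluster-autoscaler/node-template/resources/vpc.amazonaws.com/PrivateIPv4Address": "14",
}

boto3.client("autoscaling").create_or_update_tags(
    Tags=[
        {
            "ResourceId": ASG_NAME,
            "ResourceType": "auto-scaling-group",
            "Key": key,
            "Value": value,
            # CA reads node-template tags from the ASG itself, so propagating
            # them to instances should not be necessary.
            "PropagateAtLaunch": False,
        }
        for key, value in node_template_tags.items()
    ]
)
```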

@iusergii

@Jeffwan That didn't help.

I0812 07:08:35.605505       1 auto_scaling_groups.go:136] Registering ASG eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6
W0812 07:08:35.606158       1 clusterstate.go:437] Failed to find acceptable ranges for eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6
I0812 07:08:35.606928       1 scale_up.go:271] Pod default/windows-app-574d74c548-sbckq is unschedulable
I0812 07:11:35.789780       1 pod_schedulable.go:165] Pod windows-app-574d74c548-sbckq can't be scheduled on eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6, predicate failed: PodFitsResources predicate mismatch, reason: Insufficient vpc.amazonaws.com/PrivateIPv4Address, 

After I scaled the ASG up manually to one and added more workloads, the autoscaler scaled it up successfully:

I0812 07:23:36.962939       1 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"windows-api-59b496d9d-4h9qm", UID:"bd2ba098-5539-4efc-a706-81ed843eb044", APIVersion:"v1", ResourceVersion:"4625649", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6 1->2 (max: 5)}]

@Jeffwan (Contributor) commented Aug 13, 2020

@iusergii

Did you restart your CA or wait for a while after applying the tag changes?

@iusergii commented Aug 20, 2020

@Jeffwan yes, I did:

  • created the IG
  • restarted CA
  • redeployed the application

The pod is still in the Pending state.

@Jeffwan (Contributor) commented Aug 20, 2020

@iusergii One last thing: what patch version are you using?

@iusergii

@Jeffwan sorry, didn't get you.
CA: k8s.gcr.io/cluster-autoscaler:v1.17.1
API: GitVersion:"v1.17.6-eks-4e7f64"

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Nov 23, 2020.
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Dec 23, 2020.
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dschunack (Contributor)

/reopen

@k8s-ci-robot (Contributor)

@dschunack: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dschunack (Contributor)

Hi,

The problem still exists on EKS 1.17 and 1.18; it is not solved yet.
We see the same behavior described here on our EKS.


@dschunack (Contributor)

Please reopen the issue.

/reopen

@k8s-ci-robot (Contributor)

@dschunack: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

Please reopen the issue.

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@chmielas (Author)

/reopen

@k8s-ci-robot (Contributor)

@chmielas: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot reopened this on Jan 26, 2021.
@chmielas (Author)

@dschunack Issue has been reopened

@dschunack (Contributor)

any news?

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dschunack (Contributor)

/reopen

@k8s-ci-robot (Contributor)

@dschunack: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dschunack (Contributor)

The problem still exists, please reopen the issue.

@jjhidalgar

I commented in June and can confirm that when you scale down to zero and wait some time (not sure how much), it stops working again. The only solution is either setting the minimum to 1 or scaling up manually from zero every time.
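(For illustration, "scaling manually from zero" can look like the boto3 sketch below; the ASG name is a placeholder.)

```python
# Sketch: manually bump the Windows ASG from 0 to 1 so pending pods get a node
# and cluster-autoscaler can take over from there.
import boto3

boto3.client("autoscaling").set_desired_capacity(
    AutoScalingGroupName="eks-windows-nodegroup-asg",  # placeholder ASG name
    DesiredCapacity=1,
    HonorCooldown=False,  # apply immediately instead of waiting for the cooldown
)
```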

@dschunack (Contributor)

This is not really a solution, but I think a fix could be to add the stable APIs as described in my other issue #3802.
The new stable APIs are missing in the AWS manager.

@chmielas (Author)

/reopen

@k8s-ci-robot (Contributor)

@chmielas: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot reopened this on Mar 26, 2021.
@chmielas (Author)

@dschunack issue has been reopened

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vlinevych

For those who are getting Insufficient vpc.amazonaws.com/PrivateIPv4Address for Windows ASGs with 0 nodes,
adding the following ASG tags fixed the issue for me:

Explicitly specify the amount of allocatable resources:

k8s.io/cluster-autoscaler/node-template/resources/vpc.amazonaws.com/ENI 1
k8s.io/cluster-autoscaler/node-template/resources/vpc.amazonaws.com/PrivateIPv4Address 5

Tested with cluster-autoscaler v1.23
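(As a quick sanity check, sketched with boto3, you can confirm the node-template tags are actually on the ASG, since that is where cluster-autoscaler reads them from while the group is at 0; the ASG name is a placeholder.)

```python
# Sketch: list the cluster-autoscaler node-template tags present on the ASG.
import boto3

response = boto3.client("autoscaling").describe_tags(
    Filters=[{"Name": "auto-scaling-group", "Values": ["eks-windows-nodegroup-asg"]}]  # placeholder
)
for tag in response["Tags"]:
    if "node-template" in tag["Key"]:
        print(f'{tag["Key"]} = {tag["Value"]}')
```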
