
Scale up windows on AWS EKS cluster #3133

Closed
chmielas opened this issue May 14, 2020 · 40 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@chmielas

Hi,
I'm using Kubernetes based on EKS 1.15 with a Windows node group, the VPC controller and webhook,
and cluster-autoscaler v1.15.6.

The problem I have is similar to #2888.
When the ASG needs to be scaled from 0 to 2 instances after a couple of days of inactivity, the autoscaler doesn't trigger a scale-up.

The workaround is to set the minimum size of the ASG to 1. In that case, the autoscaler has no problem scaling up and down.
The problem still occurs after updating to v1.15.6.
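(For illustration, a minimal boto3 sketch of that workaround; the ASG name below is a placeholder, not the actual node group name.)

```python
# Sketch of the "minimum size = 1" workaround: keep one Windows node warm so
# the autoscaler never has to scale the group from zero.
import boto3

autoscaling = boto3.client("autoscaling")
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="eks-windows-nodegroup-asg",  # placeholder ASG name
    MinSize=1,
)
```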

Here is the pod output:

Name:                 job-038erq28k
Namespace:            default
Priority:             10000
Priority Class Name:  low-priority
Node:                 <none>
Labels:               app=my-eks-job
                      platform=WINDOWS
Annotations:          kubernetes.io/psp: eks.privileged
Status:               Pending
IP:
IPs:                  <none>
Controlled By:        Job/job-038e11d2
Init Containers:
  init-container:
    Image:      myimage:latest
    Port:       <none>
    Host Port:  <none>
    Limits:
      cpu:     250m
      memory:  300Mi
    Requests:
      cpu:     250m
      memory:  300Mi
    Mounts:
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from mytoken (ro)
Containers:
  main-container:
    Image:      myimage:latest
    Port:       <none>
    Host Port:  <none>
    Limits:
      cpu:                                   7
      memory:                                15000Mi
      vpc.amazonaws.com/PrivateIPv4Address:  1
    Requests:
      cpu:                                   7
      memory:                                15000Mi
      vpc.amazonaws.com/PrivateIPv4Address:  1
    Mounts:
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from mytoken (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  mytoken:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  mytoken
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  beta.kubernetes.io/os=windows
Tolerations:     dedicated=WINDOWS:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason             Age                     From                Message
  ----     ------             ----                    ----                -------
  Warning  FailedScheduling   41m (x14 over 60m)      default-scheduler   0/31 nodes are available: 21 Insufficient memory, 31 Insufficient cpu, 31 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 31 node(s) didn't match node selector.
  Warning  FailedScheduling   31m (x19 over 65m)      default-scheduler   0/31 nodes are available: 20 Insufficient memory, 30 Insufficient cpu, 31 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 31 node(s) didn't match node selector.
  Normal   NotTriggerScaleUp  16m (x1672 over 13h)    cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 8 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 4 Insufficient cpu, 2 max limit reached
  Normal   NotTriggerScaleUp  6m31s (x1720 over 13h)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 4 Insufficient cpu, 8 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 2 max limit reached
  Normal   NotTriggerScaleUp  89s (x301 over 13h)     cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 max limit reached, 8 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 4 Insufficient cpu
  Warning  FailedScheduling   60s (x22 over 62m)      default-scheduler   0/30 nodes are available: 20 Insufficient memory, 30 Insufficient cpu, 30 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 30 node(s) didn't match node selector.

and some logs from the autoscaler:

I0513 06:49:01.806775       1 utils.go:229] Pod job-038erq28k can't be scheduled on linux-node-asg-20191203023958042900000018, predicate failed: PodFitsResources predicate mismatch, reason: Insufficient cpu, Insufficient vpc.amazonaws.com/PrivateIPv4Address, 
I0513 06:49:01.806973       1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"job-038erq28k", UID:"<removed>", APIVersion:"v1", ResourceVersion:"73451570", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added): 4 Insufficient cpu, 8 Insufficient vpc.amazonaws.com/PrivateIPv4Address, 2 max limit reached
@jjhidalgar commented Jun 16, 2020

Can confirm, on EKS 1.16.
If I manually update the desired instances to 1, then it works. Even after auto-downscaling to 0, it can scale up again.

But if you never had any instances, it doesn't work.

I haven't tried waiting a few days after downscaling to 0; it may stop working again.

@iusergii

Have the same issue with 1.17.

@Jeffwan (Contributor) commented Aug 5, 2020

Did you put the labels in the ASG tags?

@Jeffwan (Contributor) commented Aug 5, 2020

This should be resolved in the last release. See #2888.

@Jeffwan (Contributor) commented Aug 5, 2020

/assign @Jeffwan

@iusergii commented Aug 5, 2020

@Jeffwan yes, it scales up if you already have at least one node up.
external-dns image: 0.7.2-debian-10-r46
EKS: 1.17.6

@Jeffwan (Contributor) commented Aug 5, 2020

@iusergii

Scaling from 0 should work as well. Could you share your ASG tags?

@iusergii

@Jeffwan Here are my tags:

Name: eks-windows-node-1a-Node
alpha.eksctl.io/cluster-name: eks	
alpha.eksctl.io/eksctl-version: 0.24.0
alpha.eksctl.io/nodegroup-name: windows-node-1a	
alpha.eksctl.io/nodegroup-type: unmanaged
eksctl.cluster.k8s.io/v1alpha1/cluster-name: eks	
eksctl.io/v1alpha2/nodegroup-name: windows-node-1a	
k8s.io/cluster-autoscaler/eks: owned	
k8s.io/cluster-autoscaler/enabled: true	
k8s.io/cluster-autoscaler/node-template/label/windows-node: 1a	
k8s.io/cluster-autoscaler/node-template/taint/windows: true:NoSchedule	
kubernetes.io/cluster/eks: owned	

As you can see, I also have taints on these nodes.

@Jeffwan (Contributor) commented Aug 10, 2020

@iusergii

Can you add these tags? Otherwise CA won't know your nodes have ENI and IP address resources. Please check #2888 (comment) for more details.

k8s.io/cluster-autoscaler/node-template/label/beta.kubernetes.io/os windows
k8s.io/cluster-autoscaler/node-template/label/os windows
k8s.io/cluster-autoscaler/node-template/resources/vpc.amazonaws.com/ENI 1
k8s.io/cluster-autoscaler/node-template/resources/vpc.amazonaws.com/PrivateIPv4Address 14
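(For illustration, a boto3 sketch of applying those tags to the ASG; the ASG name is taken from the logs later in this thread and the values mirror the suggestion above, so adjust both to your node group.)

```python
# Sketch: tag the Windows ASG so cluster-autoscaler can build a node template
# (OS labels plus vpc.amazonaws.com resources) when scaling the group from zero.
import boto3

ASG_NAME = "eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6"  # adjust to your ASG

node_template_tags = {
    "k8s.io/cluster-autoscaler/node-template/label/beta.kubernetes.io/os": "windows",
    "k8s.io/cluster-autoscaler/node-template/label/os": "windows",
    "k8s.io/cluster-autoscaler/node-template/resources/vpc.amazonaws.com/ENI": "1",
    "k8s.io/cluster-autoscaler/node-template/resources/vpc.amazonaws.com/PrivateIPv4Address": "14",
}

boto3.client("autoscaling").create_or_update_tags(
    Tags=[
        {
            "ResourceId": ASG_NAME,
            "ResourceType": "auto-scaling-group",
            "Key": key,
            "Value": value,
            # CA reads node-template tags from the ASG itself, so propagating
            # them to instances should not be necessary.
            "PropagateAtLaunch": False,
        }
        for key, value in node_template_tags.items()
    ]
)
```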

@iusergii

@Jeffwan That didn't help.

I0812 07:08:35.605505       1 auto_scaling_groups.go:136] Registering ASG eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6
W0812 07:08:35.606158       1 clusterstate.go:437] Failed to find acceptable ranges for eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6
I0812 07:08:35.606928       1 scale_up.go:271] Pod default/windows-app-574d74c548-sbckq is unschedulable
I0812 07:11:35.789780       1 pod_schedulable.go:165] Pod windows-app-574d74c548-sbckq can't be scheduled on eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6, predicate failed: PodFitsResources predicate mismatch, reason: Insufficient vpc.amazonaws.com/PrivateIPv4Address, 

After I scaled the ASG up manually to one and added more workloads, the autoscaler scaled it up successfully:

I0812 07:23:36.962939       1 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"windows-api-59b496d9d-4h9qm", UID:"bd2ba098-5539-4efc-a706-81ed843eb044", APIVersion:"v1", ResourceVersion:"4625649", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{eksctl-eks-nodegroup-windows-node-1a-NodeGroup-1K77Z0OOEBYN6 1->2 (max: 5)}]

@Jeffwan (Contributor) commented Aug 13, 2020

@iusergii

Did you restart your CA or wait for a while after applying the tag changes?

@iusergii commented Aug 20, 2020

@Jeffwan yes, I did:

  • created the IG
  • restarted CA
  • redeployed the application

The pod is still in the Pending state.

@Jeffwan (Contributor) commented Aug 20, 2020

@iusergii One last thing: what patch version are you using?

@iusergii

@Jeffwan sorry, didn't get you.
CA: k8s.gcr.io/cluster-autoscaler:v1.17.1
API: GitVersion:"v1.17.6-eks-4e7f64"

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Nov 23, 2020.
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Dec 23, 2020.
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dschunack (Contributor)

/reopen

@k8s-ci-robot (Contributor)

@dschunack: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dschunack (Contributor)

Hi,

The problem still exists on EKS 1.17 and 1.18; it is not solved yet.
We see the same behavior described here on our EKS.


@dschunack (Contributor)

Please reopen the issue.

/reopen

@k8s-ci-robot (Contributor)

@dschunack: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

Please reopen the issue.

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@chmielas (Author)

/reopen

@k8s-ci-robot (Contributor)

@chmielas: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot reopened this on Jan 26, 2021.
@chmielas (Author)

@dschunack Issue has been reopened

@dschunack (Contributor)

any news?

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dschunack (Contributor)

/reopen

@k8s-ci-robot (Contributor)

@dschunack: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dschunack (Contributor)

The problem still exists, please reopen the issue.

@jjhidalgar

I commented in June and can confirm that when you scale down to zero and wait some time (not sure how much), it stops working again. The only solution is either setting the minimum to 1 or scaling up manually from zero every time.
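(For illustration, "scaling manually from zero" can look like the boto3 sketch below; the ASG name is a placeholder.)

```python
# Sketch: manually bump the Windows ASG from 0 to 1 so pending pods get a node
# and cluster-autoscaler can take over from there.
import boto3

boto3.client("autoscaling").set_desired_capacity(
    AutoScalingGroupName="eks-windows-nodegroup-asg",  # placeholder ASG name
    DesiredCapacity=1,
    HonorCooldown=False,  # apply immediately instead of waiting for the cooldown
)
```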

@dschunack (Contributor)

This is not really a solution, but I think a fix could be to add the stable APIs as described in my other issue #3802.
The new stable APIs are missing in the AWS manager.

@chmielas (Author)

/reopen

@k8s-ci-robot (Contributor)

@chmielas: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot reopened this on Mar 26, 2021.
@chmielas (Author)

@dschunack issue has been reopened

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vlinevych

For those who are getting Insufficient vpc.amazonaws.com/PrivateIPv4Address for Windows ASGs with 0 nodes,
adding the following ASG tags fixed the issue for me:

Explicitly specify the amount of allocatable resources:

k8s.io/cluster-autoscaler/node-template/resources/vpc.amazonaws.com/ENI 1
k8s.io/cluster-autoscaler/node-template/resources/vpc.amazonaws.com/PrivateIPv4Address 5

Tested with cluster-autoscaler v1.23
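(As a quick sanity check, sketched with boto3, you can confirm the node-template tags are actually on the ASG, since that is where cluster-autoscaler reads them from while the group is at 0; the ASG name is a placeholder.)

```python
# Sketch: list the cluster-autoscaler node-template tags present on the ASG.
import boto3

response = boto3.client("autoscaling").describe_tags(
    Filters=[{"Name": "auto-scaling-group", "Values": ["eks-windows-nodegroup-asg"]}]  # placeholder
)
for tag in response["Tags"]:
    if "node-template" in tag["Key"]:
        print(f'{tag["Key"]} = {tag["Value"]}')
```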
