
Scale from zero fails to work on k8s >= 1.18 AWS #83

Closed
prashanth26 opened this issue Jun 7, 2021 · 4 comments
Labels
kind/bug (Bug), platform/aws (Amazon web services platform/infrastructure), platform/azure (Microsoft Azure platform/infrastructure), priority/2 (Priority: lower number equals higher priority), status/in-progress (Issue is in progress/work)

Comments

@prashanth26

What happened:

Moving a pod from one worker group node type to another via a node label selector results in the following autoscaler error when the destination worker group is at a minimum size of 0. This was noticed after we upgraded from Kubernetes 1.17 to 1.18 on AWS.

  Normal   NotTriggerScaleUp  18m (x14 over 33m)    cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 4 Insufficient cpu, 5 Insufficient memory, 1 node(s) had volume node affinity conflict, 4 node(s) didn't match node selector
  • Note, in particular, the volume node affinity conflict.

What you expected to happen:

A new node should be scaled up and the pod assigned to it (the PV and the node are both in the same zone).

How to reproduce it (as minimally and precisely as possible):

  1. A pod with a volume is scheduled onto the aws-r5d-large (small) worker group. The volume binding mode is WaitForFirstConsumer, so the volume is created at the time the pod is scheduled there. The pod is running.
  2. The pod spec is then changed such that the node label selector forces it onto the other worker group node type, aws-r5d-2xlarge (medium); see the sketch after this list.
  3. The pod gets stuck in Pending because no node is available for it to run on. Describing the pod shows the event above repeating.
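
For reference, a minimal sketch of the setup described above, assuming the pool is targeted via the Gardener worker-pool node label (the exact selector label used in the original report is not stated, and the pod/PVC/StorageClass names are placeholders):

```yaml
# StorageClass with delayed binding: the volume is only provisioned once the
# pod is scheduled (step 1).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-wffc                     # placeholder name
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
---
# Pod excerpt after the change in step 2: the nodeSelector now forces the pod
# onto the aws-r5d-2xlarge pool. The label key is an assumption (Gardener sets
# worker.gardener.cloud/pool on its nodes); the PVC is assumed to already exist.
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  nodeSelector:
    worker.gardener.cloud/pool: aws-r5d-2xlarge
  containers:
    - name: app
      image: nginx
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: example-pvc
```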

Anything else we need to know:

Through experimentation we can see:

  1. While the pod is stuck in this Pending state, if we adjust the shoot worker group aws-r5d-2xlarge (medium) to a minimum of 1, the new node is created (to meet the minimum request) and the pod is scheduled and able to run. So there appears to be no real zone conflict preventing the volume from being mounted.
  2. While the pod is stuck in this Pending state, if we schedule another new pod onto an aws-r5d-2xlarge (medium) node, a node is created for that pod and it runs. The autoscaler then also scales up a second node for the originally stuck pod, which is scheduled and able to run.
  3. While the pod is stuck in this Pending state, if we adjust the shoot worker group definition to include two additional zone topology labels, the node is created and the pod is able to run (a shoot spec sketch follows this list). It appears the labels are needed for the scale-up decision but are not visible to the autoscaler when the worker group is at 0.
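
For illustration, the workaround from experiment 3 might look roughly like this in the shoot spec; the exact label keys and zone values used in the experiment are not stated in the report, so these are assumptions:

```yaml
# Shoot spec excerpt (hypothetical values): zone topology labels added to the
# worker pool definition so that they appear on the node template the
# autoscaler builds when the pool is at zero.
spec:
  provider:
    workers:
      - name: aws-r5d-2xlarge
        minimum: 0
        maximum: 3
        zones:
          - eu-west-1a                              # illustrative zone
        labels:
          topology.kubernetes.io/zone: eu-west-1a   # assumed label pair
          topology.ebs.csi.aws.com/zone: eu-west-1a
```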

Environment:
/platform aws
/platform azure

Issue credits: Bryon Hummel

@prashanth26 added the kind/bug label Jun 7, 2021
@gardener-robot added the platform/aws and platform/azure labels Jun 7, 2021
@prashanth26 changed the title from "Scale from zero fails to" to "Scale from zero fails to work on k8s >= 1.18 AWS" Jun 7, 2021
@prashanth26 added the status/in-progress and priority/2 labels Jun 7, 2021
@himanshu-kun

I have tried to reproduce the issue, and the experiments give the results described in the issue. The problem is the missing label topology.ebs.csi.aws.com/zone on the destination node. PVs created by the AWS EBS CSI driver use this label in their node affinity to get attached to a node (see the PV excerpt after the list below). The well-known zone label topology.kubernetes.io/zone is not used because it was not yet well known at the time the CSI driver was released (see kubernetes-sigs/aws-ebs-csi-driver#729 (comment)).
The cluster autoscaler doesn't know about the AWS EBS label, so we can either add the label manually to the worker pools through the shoot spec, or go with one of the following solutions (credits @prashanth26):

  1. Update the cluster autoscaler code to add this label (for AWS and Azure) during scale-up calculations. This would be ugly, though.
  2. Update the extension providers to add this label on worker pool creation.
  3. The community mentions a way to ignore some labels during the calculations. However, I don't fully understand how this would play together; I will try to spend some time over the week to look into it.
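
To illustrate why the missing label matters, here is an excerpt of a PV as provisioned by the AWS EBS CSI driver (names and values are illustrative): the node affinity is keyed on the CSI zone label, which the node template the autoscaler builds for a pool at zero does not carry.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-1234abcd                      # illustrative
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-0123456789abcdef0   # illustrative
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.ebs.csi.aws.com/zone
              operator: In
              values:
                - eu-west-1a
```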

@himanshu-kun

himanshu-kun commented Jun 15, 2021

I have dug into the third solution mentioned in the last comment and referred to kubernetes#3230 (comment). I used v0.10.x of the CSI driver both with and without the flag --balancing-ignore-label=topology.ebs.csi.aws.com/zone, but it didn't solve the problem.

With CSI driver v0.10.x and v1.0.0 (though not v1.1.0), the well-known topology label is now part of the PV's nodeAffinity alongside the AWS EBS label, and according to kubernetes-sigs/aws-ebs-csi-driver#729 (comment) that should solve our problem by prioritizing the well-known label, but it doesn't.

Also, the flag --balancing-ignore-label only ignores labels when comparing whether two node groups are similar or not (see the deployment excerpt below).
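
For completeness, this is roughly how the flag was passed for the experiment (cluster-autoscaler deployment excerpt; the image tag and the other flags shown are illustrative):

```yaml
containers:
  - name: cluster-autoscaler
    image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.20.0     # tag illustrative
    command:
      - ./cluster-autoscaler
      - --v=2
      # The label is ignored only for the node-group similarity comparison,
      # which is why this did not help with scale from zero.
      - --balancing-ignore-label=topology.ebs.csi.aws.com/zone
```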

@prashanth26 (Author)

@amshuman-kr - Which do you think would be the better approach between (1) and (2) above, then?

@prashanth26 (Author)

Fixed with gardener/gardener-extension-provider-aws#365.
/close
