
[EKS] [request]: Managed Nodes scale to 0 #724

Closed · mikestef9 opened this issue Jan 26, 2020 · 218 comments

Labels: EKS Managed Nodes, EKS (Amazon Elastic Kubernetes Service)

Comments

@mikestef9 (Contributor)

Currently, managed node groups have a required minimum of 1 node per node group. This request is to update that behavior to support node groups of size 0, to unlock batch and ML use cases.

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-scale-a-node-group-to-0

mikestef9 added the EKS (Amazon Elastic Kubernetes Service) and Proposed (Community submitted issue) labels on Jan 26, 2020
mikestef9 self-assigned this on Jan 26, 2020
@mathewpower

This feature would be great for me. I'm looking to run GitLab workers on my EKS cluster to run ML training workloads. Typically, these jobs only run for a couple of hours a day (on big instances), so being able to scale down would make things much more cost-effective for us.

Any ideas when this feature might land?

@jzjones-lc

@mathewpower you might want to use a vanilla autoscaling group instead of an EKS managed node group.

This issue pretty much makes EKS managed nodes a non-starter for any ML project, since one node in each group is always running.

@jcampbell05

There are tasks now; perhaps that's the solution for this.

@jzjones-lc

@jcampbell05 can you elaborate? What tasks are you referring to?

@yann-soubeyrand

I guess node taints will have to be managed the way node labels already are, so that the necessary node template tags can be set: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md#scaling-a-node-group-to-0.

@mikestef9 (Contributor, Author)

Hey @yann-soubeyrand, that is correct. Looking for some feedback on that: would you want all labels and taints to automatically propagate to the ASG in the required format for scale to 0, or would you want selective control over which ones propagate?

@dcherman

@mikestef9 If AWS has enough information to propagate the labels/taints to the ASG, then I think it'd be preferable to have it "just work" as much as possible.

I think there will still be scenarios where manual intervention by the consumer is needed, such as setting region/AZ labels for single-AZ node groups so that cluster-autoscaler can make intelligent decisions when a specific AZ is needed; however, we should probably try to minimize that work as much as possible.

@yann-soubeyrand

@mikestef9 in my understanding, all the labels and taints should be propagated to the ASG in the k8s.io/cluster-autoscaler/node-template/[label|taint]/<key> format, since the cluster autoscaler bases its decisions on them. If some taints or labels are missing, this could mislead the cluster autoscaler. Also, I'm not aware of any good reason not to propagate certain labels or taints.

A feature which could be useful, though, is being able to disable the cluster autoscaler for specific node groups (that is, not setting the k8s.io/cluster-autoscaler/enabled tag on those node groups).

@dcherman isn't the AZ case already handled by the cluster autoscaler without specifying label templates?
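
For reference, a minimal sketch of the ASG tag set being discussed, assuming a hypothetical node group whose nodes carry a workload=ml label and a dedicated=ml:NoSchedule taint (the cluster name, label, and taint are made up):

locals {
  # Hypothetical tag set the Cluster Autoscaler expects on the ASG so it can
  # scale this node group up from 0; keys and values are illustrative only.
  ca_scale_from_zero_tags = {
    "k8s.io/cluster-autoscaler/enabled"                       = "true"
    "k8s.io/cluster-autoscaler/my-cluster"                    = "owned"
    "k8s.io/cluster-autoscaler/node-template/label/workload"  = "ml"
    "k8s.io/cluster-autoscaler/node-template/taint/dedicated" = "ml:NoSchedule"
  }
}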

@dcherman

@yann-soubeyrand I think you're right! I just read through the cluster-autoscaler code, and it looks like it discovers which AZs an ASG creates nodes in from the ASG itself; I had always thought it discovered those from the nodes initially created by the ASG.

In that case, we can disregard my earlier comment.

@Ghazgkull

I would like to be able to forcibly scale a managed node group to 0 via the CLI, by setting something like the desired or maximum number of nodes to 0, ignoring things like pod disruption budgets, etc.

I would like this so that developers can have their own clusters which get scaled to 0 outside of working hours. I would use a simple cron to force clusters to size 0 at night, then give them 1 node in the morning and let cluster-autoscaler scale them back up.

mikestef9 added the EKS Managed Nodes label and removed the Proposed (Community submitted issue) label on Jun 11, 2020
@sibendu

sibendu commented Jun 16, 2020

Hi all,
is this feature already available for AWS EKS?
From the following documentation it appears EKS supports it: "From CA 0.6 for GCE/GKE and CA 0.6.1 for AWS, it is possible to scale a node group to 0."
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-scale-a-node-group-to-0
Can someone please confirm?

@yann-soubeyrand


@sibendu it's not supported with managed node groups yet (that's the subject of this issue), but you can achieve it with self-managed node groups (following the documentation you linked).

@cfarrend

cfarrend commented Jun 25, 2020

It would be great to have this. We use cluster autoscaling to bring up GPU nodes on demand on GKE and scale down when there are no requests. Having one node sit idle is definitely not cost-effective for us if we want to use managed nodes on EKS.

@antonosmond

Putting use cases aside (although I have many), autoscaling groups already support a min, max, and desired size of 0. A node group is ultimately just an autoscaling group (and therefore already supports size 0): you can go into the AWS web console, find the ASG created for a node group, set its size to 0, and it works fine. It therefore doesn't make sense that node groups don't support a zero size. As a loyal AWS customer it's frustrating to see things like this; there appears to be no good technical reason for preventing a size of zero, but forcing customers to have at least 1 instance makes AWS more £££. Hmmm... was the decision to prevent a zero size about making it better for the customer, or is Jeff a bit short of cash?

@yann-soubeyrand

@antonosmond there are good technical reasons why you cannot scale from 0 with the current configuration: for the autoscaler to be able to scale from 0, one has to put tags on the ASG indicating the labels and taints the nodes will have. These tags are missing as of now. This is the purpose of this issue.

@antonosmond

@yann-soubeyrand The cluster autoscaler is just one use case, but this issue shouldn't relate specifically to the cluster autoscaler. The issue is that you can't set a size of zero; regardless of use case, and whether or not you run the cluster autoscaler, you should be able to set a size of zero, since autoscaling groups support it.

In addition to the use cases above, other use cases for 0 size include:

  • PoCs and testing (I may want 0 nodes so I can test my config without incurring instance charges)
  • having different node groups for different instance types where I don't necessarily need all instance types running at all times
  • cost saving e.g. scaling to zero overnight / at weekends

@yann-soubeyrand

@antonosmond if you're not using the cluster autoscaler, you're scaling the ASG manually, right? What prevents you from setting the min and desired counts to 0? It seems to work as intended.

@antonosmond

antonosmond commented Jul 1, 2020

@yann-soubeyrand I got to this issue from here.
It's nothing to do with the cluster autoscaler; I simply want to create a node group with an initial size of 0.
I have some Terraform to create a node group, but if I set the size to 0 it fails because the AWS API behind the resource creation validates that the size is greater than zero.
Update - and yes, I can create a node group with a size of 1 and then manually scale it to zero, but I shouldn't need to. The API should allow me to create a node group with a zero size.
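
For context, a minimal sketch of the kind of configuration that was being rejected at the time; the resource and variable names are placeholders, and the point is only the zero-sized scaling_config block:

resource "aws_eks_node_group" "zero_sized" {
  cluster_name    = "my-cluster"
  node_group_name = "zero-sized"
  node_role_arn   = aws_iam_role.node.arn    # assumed to be defined elsewhere
  subnet_ids      = var.private_subnet_ids   # assumed to be defined elsewhere

  scaling_config {
    # The EKS API required a minimum (and desired) size of at least 1 at the
    # time, so a plan like this failed validation.
    min_size     = 0
    desired_size = 0
    max_size     = 5
  }
}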

@yann-soubeyrand

The API should allow me to create a node group with a zero size.

I think we all agree with this ;-)

@MatteoMori

MatteoMori commented Aug 6, 2020

Hey guys,

is there any update on this one?

thanks!

@stevehipwell

If anyone is interested, I can drop a pattern which works with the latest community module and has comments?

@artificial-aidan

@stevehipwell thanks for the suggestion about declaring as locals. Another good way to do it.

I haven't followed the transition to 18 closely; it's been on my to-do list to catch up on. I followed the flurry of activity around the auth configmap changes and at that point decided it was going to be more effort than it was worth at the time.

I will experiment with the local node group definitions; that will likely work for us. I'm just venting some slight annoyance that there isn't a clear, documented, well-supported path for scale to zero. (It's possible, and I currently do it; it's just clunky.) Maybe refactoring into local definitions will clean things up.

@artificial-aidan

If anyone is interested I can drop a pattern which works with the latest community module and has comments?

Yes please.

@stevehipwell

This is the example from above with a bit more context added. It turns out that the module changes don't help, as they only provide the names and not the ID needed to look up the required tags.

locals {
  cluster_name = "my-cluster"

  # Define MNGs here so we can reference them later
  mngs = {
    my-mng = {
      name = "my-mng"
      labels = {
        "my-label" = "foo"
      }
      taints = [{
        key = "my-taint"
        value = "bar"
        effect = "NO_SCHEDULE"
      }]
    }
  }

  # We need to lookup K8s taint effect from the AWS API value
  taint_effects = {
    "NO_SCHEDULE" = "NoSchedule"
    "NO_EXECUTE"  = "NoExecute"
    "PREFER_NO_SCHEDULE" = "PreferNoSchedule"
  }

  # Calculate the tags required by CA based on the MNG inputs
  mng_ca_tags = { for mng_key, mng_value in local.mngs : mng_key => merge({
    "k8s.io/cluster-autoscaler/enabled" = "true"
    "k8s.io/cluster-autoscaler/${local.cluster_name}" = "owned"
    },
    { for label_key, label_value in mng_value.labels : "k8s.io/cluster-autoscaler/node-template/label/${label_key}" => label_value },
    { for taint in mng_value.taints : "k8s.io/cluster-autoscaler/node-template/taint/${taint.key}" => "${taint.value}:${local.taint_effects[taint.effect]}" }
  ) }
}

# Use the module
module "eks" {
  ...
  eks_managed_node_groups = local.mngs
  ...
}

resource "aws_autoscaling_group_tag" "mng_ca" {
  # Create a tuple in a map for each ASG tag combo
  for_each = merge([for mng_key, mng_tags in local.mng_ca_tags : { for tag_key, tag_value in mng_tags : "${mng_key}-${substr(tag_key, 25, -1)}" => { mng = mng_key, key = tag_key, value = tag_value }}]...)

  # Lookup the ASG name for the MNG, erroring if there is more than one
  autoscaling_group_name = one(module.eks.eks_managed_node_groups[each.value.mng].node_group_autoscaling_group_names)

  tag {
    key   = each.value.key
    value = each.value.value

    propagate_at_launch = false
  }
}

@TomasHradecky

Hi all, am I missing something, or what is the reason not to set the minimum capacity for a managed node group to 0?
https://aws.amazon.com/blogs/containers/catching-up-with-managed-node-groups-in-amazon-eks/

@jbg

jbg commented May 9, 2022

You can absolutely set the minimum to zero.

If you want cluster-autoscaler to automatically scale the node group up from zero when a pod is unschedulable on existing nodes but could be scheduled on a new node created in that node group, you need to tag the ASG that the node group creates with k8s.io/cluster-autoscaler/node-template/label/{key} = {value} and/or k8s.io/cluster-autoscaler/node-template/taint/{key} = {value} according to the labels and taints that the created nodes will have.

That's easy to automate using IaC solutions like Terraform, or with ad-hoc scripts, as described above.
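
As a rough Terraform sketch of that automation (the node group, label, and taint below are placeholders, and it assumes exactly one ASG behind the managed node group):

# Tag the ASG behind a managed node group so cluster-autoscaler can scale it
# up from zero; the node group is assumed to label its nodes workload=batch
# and taint them dedicated=batch:NoSchedule.
resource "aws_autoscaling_group_tag" "scale_from_zero" {
  for_each = {
    "k8s.io/cluster-autoscaler/node-template/label/workload"  = "batch"
    "k8s.io/cluster-autoscaler/node-template/taint/dedicated" = "batch:NoSchedule"
  }

  autoscaling_group_name = aws_eks_node_group.batch.resources[0].autoscaling_groups[0].name

  tag {
    key                 = each.key
    value               = each.value
    propagate_at_launch = false
  }
}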

@ArchiFleKs

ArchiFleKs commented May 10, 2022

Hi, here is a snippet I use with Terragrunt and the terraform-aws-eks module. I use it as a symlink in the same folder where my terragrunt.hcl resides (thanks @stevehipwell for the snippet).

This allows setting implicit labels so the group can scale to and from 0 with restricted labels such as:

  • k8s.io/cluster-autoscaler/node-template/label/node.kubernetes.io/instance-type
  • k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone
  • k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone

It also adds restricted labels to the ASG as tags (labels that are forbidden via the EKS API, for example). I think there is room for improvement; let me know what you think.

It also allows mixing taints and labels defined in the defaults and directly in the MNG.

locals {
  mngs         = var.eks_managed_node_groups
  mng_defaults = var.eks_managed_node_group_defaults

  cluster_name = var.cluster_name

  taint_effects = {
    NO_SCHEDULE        = "NoSchedule"
    NO_EXECUTE         = "NoExecute"
    PREFER_NO_SCHEDULE = "PreferNoSchedule"
  }

  mng_ca_tags_defaults = {
    "k8s.io/cluster-autoscaler/enabled"               = "true"
    "k8s.io/cluster-autoscaler/${local.cluster_name}" = "owned"
  }

  mng_ca_tags_taints_defaults = try(local.mng_defaults.taints, []) != [] ? {
    for taint in local.mng_defaults.taints : "k8s.io/cluster-autoscaler/node-template/taint/${taint.key}" => "${taint.value}:${local.taint_effects[taint.effect]}"
  } : {}

  mng_ca_tags_labels_defaults = try(local.mng_defaults.labels, {}) != {} ? {
    for label_key, label_value in local.mng_defaults.labels : "k8s.io/cluster-autoscaler/node-template/label/${label_key}" => label_value
  } : {}

  mng_ca_tags_taints = { for mng_key, mng_value in local.mngs : mng_key => merge(
    { for taint in mng_value.taints : "k8s.io/cluster-autoscaler/node-template/taint/${taint.key}" => "${taint.value}:${local.taint_effects[taint.effect]}" }
    ) if try(mng_value.taints, []) != []
  }

  mng_ca_tags_labels = { for mng_key, mng_value in local.mngs : mng_key => merge(
    { for label_key, label_value in mng_value.labels : "k8s.io/cluster-autoscaler/node-template/label/${label_key}" => label_value },
    ) if try(mng_value.labels, {}) != {}
  }

  mng_ca_tags_restricted_labels = { for mng_key, mng_value in local.mngs : mng_key => merge(
    { for label_key, label_value in mng_value.restricted_labels : "k8s.io/cluster-autoscaler/node-template/label/${label_key}" => label_value },
    ) if try(mng_value.restricted_labels, {}) != {}
  }

  mng_ca_tags_implicit = { for mng_key, mng_value in local.mngs : mng_key => merge(
    length(try(mng_value.instance_types, local.mng_defaults.instance_types)) == 1 ? { "k8s.io/cluster-autoscaler/node-template/label/node.kubernetes.io/instance-type" = one(try(mng_value.instance_types, local.mng_defaults.instance_types)) } : {},
    length(data.aws_autoscaling_group.node_groups[mng_key].availability_zones) == 1 ? { "k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone" = one(data.aws_autoscaling_group.node_groups[mng_key].availability_zones) } : {},
    length(data.aws_autoscaling_group.node_groups[mng_key].availability_zones) == 1 ? { "k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone" = one(data.aws_autoscaling_group.node_groups[mng_key].availability_zones) } : {},
    )
  }

  mng_ca_tags = { for mng_key, mng_value in local.mngs : mng_key => merge(
    local.mng_ca_tags_defaults,
    local.mng_ca_tags_taints_defaults,
    local.mng_ca_tags_labels_defaults,
    try(local.mng_ca_tags_taints[mng_key], {}),
    try(local.mng_ca_tags_labels[mng_key], {}),
    try(local.mng_ca_tags_restricted_labels[mng_key], {}),
    local.mng_ca_tags_implicit[mng_key],
  ) }

  mng_asg_custom_tags = { for mng_key, mng_value in local.mngs : mng_key => merge(var.tags) }
}

data "aws_autoscaling_group" "node_groups" {
  for_each = module.eks_managed_node_group
  name     = each.value.node_group_resources.0.autoscaling_groups.0.name
}

resource "aws_autoscaling_group_tag" "mng_ca" {
  # Create a tuple in a map for each ASG tag combo
  for_each = merge([for mng_key, mng_tags in local.mng_ca_tags : { for tag_key, tag_value in mng_tags : "${mng_key}-${substr(tag_key, 25, -1)}" => { mng = mng_key, key = tag_key, value = tag_value } }]...)

  # Lookup the ASG name for the MNG, erroring if there is more than one
  autoscaling_group_name = one(module.eks_managed_node_group[each.value.mng].node_group_autoscaling_group_names)

  tag {
    key                 = each.value.key
    value               = each.value.value
    propagate_at_launch = false
  }
}

resource "aws_autoscaling_group_tag" "mng_asg_tags" {
  # Create a tuple in a map for each ASG tag combo
  for_each = merge([for mng_key, mng_tags in local.mng_asg_custom_tags : { for tag_key, tag_value in mng_tags : "${mng_key}-${tag_key}" => { mng = mng_key, key = tag_key, value = tag_value } }]...)

  # Lookup the ASG name for the MNG, erroring if there is more than one
  autoscaling_group_name = one(module.eks_managed_node_group[each.value.mng].node_group_autoscaling_group_names)

  tag {
    key                 = each.value.key
    value               = each.value.value
    propagate_at_launch = true
  }
}

This produces, for example, the following output in the console:

{                                                                                                                   
  "c5-xlarge-pub-a" = {                                                                                             
    "k8s.io/cluster-autoscaler/enabled" = "true"                                                                                                   
    "k8s.io/cluster-autoscaler/node-template/label/network" = "public"                          
    "k8s.io/cluster-autoscaler/node-template/label/node.kubernetes.io/instance-type" = "c5.xlarge"
    "k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone" = "eu-west-1a"
    "k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone" = "eu-west-1a"
    "k8s.io/cluster-autoscaler/node-template/taint/dedicated" = "true:NoSchedule"
  }                                                                                                                 
  "c5-xlarge-pub-b" = {                                                                                             
    "k8s.io/cluster-autoscaler/enabled" = "true"                                                                                                
    "k8s.io/cluster-autoscaler/node-template/label/network" = "public"                          
    "k8s.io/cluster-autoscaler/node-template/label/node.kubernetes.io/instance-type" = "c5.xlarge"
    "k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone" = "eu-west-1b"
    "k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone" = "eu-west-1b"
    "k8s.io/cluster-autoscaler/node-template/taint/dedicated" = "true:NoSchedule"
  }                                             
  "c5-xlarge-pub-c" = {                                                                                             
    "k8s.io/cluster-autoscaler/enabled" = "true"                                                                                                
    "k8s.io/cluster-autoscaler/node-template/label/network" = "public"                          
    "k8s.io/cluster-autoscaler/node-template/label/node.kubernetes.io/instance-type" = "c5.xlarge"
    "k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone" = "eu-west-1c"
    "k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone" = "eu-west-1c"                      
    "k8s.io/cluster-autoscaler/node-template/taint/dedicated" = "true:NoSchedule"
  }
}

@TomasHradecky

TomasHradecky commented May 11, 2022

Still not sure if you have to do it this way. My definition:

locals {
  # Note: the corresponding blue-count local referenced by the resource below
  # is defined elsewhere in the full configuration (not shown here).
  rds_db_large_mng_count_green = var.create_rds_db_large_mng_green ? var.rds_db_large_mng_count_blue : 0
}

data "aws_ec2_instance_type" "rds_db_large" {
  instance_type = var.rds_db_large_cluster_instance_type
}

resource "aws_eks_node_group" "rds_instances_db_large" {
  count = local.rds_db_large_mng_count_blue

  launch_template {
    id      = aws_launch_template.cluster_instances.id
    version = aws_launch_template.cluster_instances.latest_version
  }

  cluster_name           = module.eks.cluster_id
  node_group_name_prefix = "eks-rds-db-r6g-large"
  capacity_type          = "ON_DEMAND"
  node_role_arn          = module.eks.worker_iam_role_arn
  subnet_ids             = [data.terraform_remote_state.core.outputs.vpc_private_subnets[count.index]]
  instance_types         = [var.rds_db_large_cluster_instance_type]
  ami_type               = var.custom_bottlerocket_ami_id != "" ? null : "BOTTLEROCKET_ARM_64"
  release_version        = var.rds_db_large_custom_bottlerocket_version != "" ? var.rds_db_large_custom_bottlerocket_version : ""

  scaling_config {
    desired_size = var.rds-db-large_mng_desired_size
    max_size     = var.rds-db-large_mng_max_size
    min_size     = var.rds-db-large_mng_min_size
  }

  labels = {
    "node_group" = "rds_db_large"
    "node_size"  = "r6g_large"
    "blue_group" = "true"
    "aws_vcpus" : data.aws_ec2_instance_type.rds_db_large.default_vcpus
    "aws_memory" : data.aws_ec2_instance_type.rds_db_large.memory_size
  }

  lifecycle {
    create_before_destroy = false
    ignore_changes = [scaling_config[0].desired_size] #max_size, min_size
  }

  depends_on = [module.eks]

  tags = var.create_rds_db_large_mng_green ? {} : merge(var.tags,
    {
      "k8s.io/cluster-autoscaler/${var.cluster_name}" = "owned",
      "k8s.io/cluster-autoscaler/enabled"             = "true",
      "kubernetes.io/cluster/bottlerocket"            = "owned"
    }
  )
}

Values for size

min_size = 0
desired_size = 1
max_size = 10

It is important to set desired_size to 1: the MNG will be created with 1 node, and after some time the cluster autoscaler will scale it down to 0. Now, if you create e.g. a deployment with a nodeSelector targeting a specific MNG that is scaled to 0, the deployment triggers a scale-up and a new node is created with all the labels specified in Terraform.
If you specify desired_size = 0, no new nodes will be created.
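
For illustration, a minimal sketch (using the Terraform kubernetes provider, which is an assumption here, with placeholder names) of a Deployment whose nodeSelector matches the node_group label above and therefore triggers the scale-up from 0:

resource "kubernetes_deployment" "rds_worker" {
  metadata {
    name = "rds-worker"
  }

  spec {
    replicas = 1

    selector {
      match_labels = {
        app = "rds-worker"
      }
    }

    template {
      metadata {
        labels = {
          app = "rds-worker"
        }
      }

      spec {
        # Matches the "node_group" label on the managed node group above, which
        # is what makes the pending pod trigger a scale-up of that group.
        node_selector = {
          node_group = "rds_db_large"
        }

        container {
          name    = "worker"
          image   = "busybox:1.36"
          command = ["sleep", "3600"]
        }
      }
    }
  }
}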

@TomasHradecky

TomasHradecky commented Jun 20, 2022

After some time I've returned to this topic with what I believe is the final fix, at least for our use case.
During my first testing, scaling to 0 and from 0 worked well, but I didn't test the case where managed node groups stay scaled to 0 for longer than a few hours.
What happened now? Our node groups were scaled to 0 for some days, and after that cluster-autoscaler was unable to scale up from 0. The main reason was that we use managed node group labels as the nodeSelector for pods, and the ASG inherits only tags from the node group, not labels.
So, after some time at 0, cluster-autoscaler lost the information about which ASG should be scaled up to satisfy this pod, and a pod requiring a specific nodeSelector was marked as unschedulable.
After adding this selector to the ASG as a tag, cluster-autoscaler scales up from 0 like a charm.
So here are the main findings from my testing:

  • AWS currently supports node group min_size=0 and desired_size=0, and cluster-autoscaler is able to work with it
  • managed node group labels exist only on the node group, even though the cluster autoscaler can scale based on their values
  • managed node group tags are inherited by the ASG
  • the ASG has references to the launch template and instance size, so it is not required to add those as tags
  • if you want to use managed node group labels to scale from 0, it is necessary to set them as ASG tags with the prefix "k8s.io/cluster-autoscaler/node-template/label/" followed by the label key

Hope it helps someone.

@stevehipwell

@TomasHradecky what you're describing is the expected and documented Cluster Autoscaler behaviour, which is easy to configure for unmanaged node groups but slightly trickier to do for AWS managed node groups. This issue has covered a number of ways to "hack" the ASG behind the MNG, but the real issue here is waiting for AWS to support this natively through the MNG API; this is still as far away as it was when the issue was opened and when it was moved to coming soon!

@khteh

khteh commented Jul 7, 2022

+1, cost is a major concern!

@dprateek1991

dprateek1991 commented Oct 12, 2022

Scaling managed node groups from 0 to 1 and back down to 0 works well on EKS. We have just tested it out. The hack is to TAG the Auto Scaling group used by the managed node group with a specific tag, similar to the K8s label added to your node group.

I just had to add the following block of code to my Terraform module for EKS node groups, and after this the Cluster Autoscaler is able to scale up from 0 to 1 and scale back down to 0 as well. This is also mentioned in the Cluster Autoscaler documentation here - https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-scale-a-node-group-to-0

Our K8s label on the managed node group is
group = spot-m5-8xlarge
Therefore, in "k8s.io/cluster-autoscaler/node-template/label/group", the "group" at the end is the KEY of the label.

# ASG Tag
resource "aws_autoscaling_group_tag" "tag" {
  for_each = toset(
    [for asg in flatten(
      [for resources in aws_eks_node_group.ng.resources : resources.autoscaling_groups]
    ) : asg.name]
  )

  autoscaling_group_name = each.value

  tag  {
    key   = "k8s.io/cluster-autoscaler/node-template/label/group"
    value = "${var.node_group_name}"
    propagate_at_launch = true
  }
}

I tested it on our EKS JupyterHub environment, and you can see the scale-up worked well, as per the logs -

2022-10-12T15:16:27Z [Normal] pod triggered scale-up: [{eks-spot-m5-8xlarge-20221005083326499900000005-88c1d3a3-9d15-f436-afd7-52c200e7678c 0->1 (max: 5)}]

Hope it helps others who are facing the same challenges as us.

PS: An example to achieve this is also mentioned here - https://github.com/terraform-aws-modules/terraform-aws-eks/blob/312e4a4d59cb10a762a4045e9944f3f837126933/examples/eks_managed_node_group/main.tf#L673-L712

@khteh

khteh commented Oct 15, 2022

Scaling managed node groups from 0 to 1 and back down to 0 works well on EKS. [...]

How does it impact cold start? Imagine the cluster has 3 AZs and all are allowed to scale to 0; wouldn't it impact the first request that comes in with 0 nodes in ALL AZs?

@jbg

jbg commented Oct 15, 2022

Of course, it takes time (30 seconds to ~3 minutes, depending on node OS, probe settings, etc.) for a node to start, join the cluster, and become Ready. If you are responding to requests that need fast responses, don't scale all your node groups to zero.

@khteh

khteh commented Oct 15, 2022

How do I scale all AZs to zero except one?

@jbg

jbg commented Oct 15, 2022

If you're following best practices, you already have one node group per AZ, so you could set min size for each AZ separately. But it's probably better to handle that higher up in the stack — even if you have min size set to 0 for the node group for all AZs, cluster-autoscaler won't scale a node group down to zero if there are pods scheduled on the node(s) that can't be moved elsewhere.

@dprateek1991

dprateek1991 commented Oct 15, 2022

@khteh - Regarding your question, I would rather suggest designing your workloads so they are submitted to different node groups, and not all node groups have to go to 0. In our use case, we have multiple node groups, for example:

Services NG - We run DE services like Airflow, JupyterHub, MLflow, etc. on this. We can't have this NG at min 0, as the services run 24x7. These are mostly On-Demand EC2.
Workloads NG - We use this to run workloads related to DE and ML. We can set this to min 0 and use Spot EC2 in it.

I would rather design in a way that fits specific use cases than keep all NGs at 0.

@yuvipanda

Would karpenter help here at all? I'd imagine no, but wanted to check.

@dprateek1991

dprateek1991 commented Oct 18, 2022

Would karpenter help here at all? I'd imagine no, but wanted to check.

I don't think so, as that's not the original purpose of Karpenter. The solution of tagging the ASG of the managed NG was provided to us by AWS itself when we asked them about it. This is already done in the EKS Terraform module example here - https://github.com/terraform-aws-modules/terraform-aws-eks/blob/312e4a4d59cb10a762a4045e9944f3f837126933/examples/eks_managed_node_group/main.tf#L673-L712

AFAIK, for now, this is the only solution to support min 0 for managed NGs in EKS, unless someone has tried a better solution :). @yuvipanda - I have tried it with Z2JH (which you help maintain) and it works like a charm. You just have to wait for the autoscaler to kick in, so if there's a timeout on JupyterHub you need to do a couple of retries to spawn the server, but it works well and helps save a lot of cost.

@valorl

valorl commented Oct 22, 2022

Would karpenter help here at all? I'd imagine no, but wanted to check.

I think Karpenter is certainly a viable way to handle the use case of spinning up nodes just-in-time for e.g. batch workloads. Karpenter will simply spin up a node directly when it sees pending pods, with no ASGs involved. The controller itself may run on e.g. a statically scaled ASG, or, for a completely ASG-less cluster, it should be possible to run it on Fargate. For us at least, Karpenter massively simplified running batch workloads.

@mbevc1

mbevc1 commented Oct 23, 2022

Yeah, good candidate and can run on Fargate nodes 👍

@bryantbiggs (Member)

Support for scaling managed node groups to 0 will be available starting with Kubernetes 1.24 - kubernetes/autoscaler#4491 (comment)

Once EKS releases support for Kubernetes 1.24, you should be able to configure this functionality.

@akestner

🚀🚀🚀 Launch Announcement 🚀🚀🚀
Today we launched EKS support for Kubernetes 1.24, which includes a feature we contributed to the upstream Cluster Autoscaler project that simplifies scaling an EKS MNG to/from 0 nodes. When there are no running nodes in the MNG, the Cluster Autoscaler calls the EKS DescribeNodegroup API to get the information it needs about the MNG's resources, labels, and taints.

Read more about this and other new features available on EKS and Kubernetes 1.24 in the launch blog.
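
For completeness, a hedged sketch of what a scale-to-zero managed node group can look like on Kubernetes 1.24+ (names, labels, and taints are placeholders); no ASG tag workaround should be needed, because the Cluster Autoscaler now reads labels and taints via DescribeNodegroup:

resource "aws_eks_node_group" "ml" {
  cluster_name    = "my-cluster"
  node_group_name = "ml-scale-to-zero"
  node_role_arn   = aws_iam_role.node.arn    # assumed to be defined elsewhere
  subnet_ids      = var.private_subnet_ids   # assumed to be defined elsewhere

  scaling_config {
    min_size     = 0
    desired_size = 0
    max_size     = 10
  }

  # On Kubernetes/Cluster Autoscaler 1.24+, these labels and taints are
  # discovered through the EKS DescribeNodegroup API even when the group
  # has no running nodes.
  labels = {
    workload = "ml"
  }

  taint {
    key    = "dedicated"
    value  = "ml"
    effect = "NO_SCHEDULE"
  }
}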
