[EKS] [request]: Managed Nodes scale to 0 #724
This feature would be great for me. I'm looking to run GitLab workers on my EKS cluster to run ML training workloads. Typically, these jobs only run for a couple of hours a day (on big instances), so being able to scale down would make things much more cost effective for us. Any ideas when this feature might land? |
@mathewpower you might want to use a vanilla autoscaling group instead of EKS managed. Pretty much this issue makes EKS managed nodes a nonstarter for any ML projects due to one node in each group always being on |
There are tasks now; perhaps that's the solution for this. |
@jcampbell05 can you elaborate? What tasks are you referring to? |
I guess that node taints will have to be managed like node labels already are in order for the necessary node template to be set: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md#scaling-a-node-group-to-0. |
Hey @yann-soubeyrand that is correct. Looking for some feedback on that, would you want all labels and taints to automatically propagate to the ASG in the required format for scale to 0, or have selective control over which ones propagate? |
@mikestef9 If AWS has enough information to propagate the labels/taints to the ASG, then I think it'd be preferable to have it "just work" as much as possible. There will still be scenarios where manual intervention will be needed by the consumer I think such as setting region/AZ labels for single AZ nodegroups so that cluster-autoscaler can make intelligent decisions if a specific AZ is needed, however we should probably try to minimize that work as much as possible. |
@mikestef9 in my understanding, all the labels and taints should be propagated to the ASG in the required format. A feature which could be useful, though, is to be able to disable cluster autoscaler for specific node groups (that is, not setting the `k8s.io/cluster-autoscaler/enabled` tag). @dcherman isn't the AZ case already managed by cluster autoscaler without specifying label templates? |
@yann-soubeyrand I think you're right! Just read through the cluster-autoscaler code, and it looks like it discovers what AZs the ASG creates nodes in from the ASG itself; I always thought it had discovered those from the nodes initially created by the ASG. In that case, we can disregard my earlier comment. |
I would like to be able to forcibly scale a managed node group to 0 via the CLI, by setting something like desired or maximum number of nodes to 0. Ignoring things like pod disruption budgets, etc. I would like this in order for developers to have their own clusters which get scaled to 0 outside of working hours. I would like to use a simple cron to force clusters to size 0 at night, then give them 1 node in the morning and let cluster-autoscaler scale them back up. |
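One way to implement the nightly forced scale-down described above is with scheduled scaling actions on the node group's underlying ASG. This is only a sketch, matching the Terraform used elsewhere in this thread: the ASG name and schedule are hypothetical, and the managed node group service may reconcile values changed behind its back, so test this carefully before relying on it.

```hcl
# Hypothetical ASG name; in practice, look it up from the node group outputs.
# Scale the ASG to zero every night at 20:00 UTC.
resource "aws_autoscaling_schedule" "scale_down" {
  scheduled_action_name  = "scale-to-zero-nightly"
  autoscaling_group_name = "eks-my-mng-asg"
  recurrence             = "0 20 * * *"
  min_size               = 0
  max_size               = 0
  desired_capacity       = 0
}

# Bring one node back on weekday mornings and let cluster-autoscaler
# scale further up as needed.
resource "aws_autoscaling_schedule" "scale_up" {
  scheduled_action_name  = "scale-up-morning"
  autoscaling_group_name = "eks-my-mng-asg"
  recurrence             = "0 6 * * 1-5"
  min_size               = 0
  max_size               = 10
  desired_capacity       = 1
}
```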
Hi All |
@sibendu it's not supported with managed node groups yet (this is the object of this issue) but you can achieve it with non managed node groups (following the documentation you linked). |
Would be great to have this, we make use of cluster autoscaling in order to demand GPU nodes on GKE and scale down when there are no requests. Having one node idle is definitely not cost effective for us if we want to use managed nodes on EKS |
Putting use cases aside (although I have many), autoscaling groups already support min, max & desired size being 0. A node group is ultimately just an autoscaling group (and therefore already supports size 0). You can go into the AWS web console, find the ASG created for a node group and set the size to 0 and it's fine, therefore it doesn't make sense that node groups are not supporting a zero size. As a loyal AWS customer it's frustrating to see things like this - there appears to be no good technical reason for preventing a size of zero, but forcing customers to have at least 1 instance makes AWS more £££. Hmmm... was the decision to prevent a zero size about making it better for the customer or is Jeff a bit short of cash? |
@antonosmond there are good technical reasons why you cannot scale from 0 with the current configuration: for the autoscaler to be able to scale from 0, one has to put tags on the ASG indicating the labels and taints the nodes will have. These tags are missing as of now. This is the purpose of this issue. |
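For reference, the tags in question follow the cluster-autoscaler node-template convention. A hedged sketch (the ASG name, label, and taint values below are hypothetical placeholders, not from any comment in this thread):

```hcl
# Tell cluster-autoscaler what labels nodes in this (scaled-to-0) group
# would carry, so it can match pending pods against the empty group.
resource "aws_autoscaling_group_tag" "label" {
  autoscaling_group_name = "eks-my-mng-asg" # hypothetical ASG name

  tag {
    key                 = "k8s.io/cluster-autoscaler/node-template/label/my-label"
    value               = "foo"
    propagate_at_launch = false
  }
}

# Same idea for taints, in "value:Effect" form.
resource "aws_autoscaling_group_tag" "taint" {
  autoscaling_group_name = "eks-my-mng-asg"

  tag {
    key                 = "k8s.io/cluster-autoscaler/node-template/taint/my-taint"
    value               = "bar:NoSchedule"
    propagate_at_launch = false
  }
}
```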
@yann-soubeyrand The cluster autoscaler is just one use case, but this issue shouldn't relate specifically to the cluster autoscaler. The issue should be that you can't set a size of zero; regardless of use case or whether or not you run the cluster autoscaler, you should be able to set a size of zero, as this is supported in autoscaling groups. In addition to the use cases above, other use cases for 0 size include:
|
@antonosmond if you're not using cluster autoscaler, you're scaling the ASG manually, right? What prevents you from setting a min and desired count to 0? It seems to work as intended. |
@yann-soubeyrand I got to this issue from here. |
I think we all agree with this ;-) |
Hey guys, is there any update on this one? thanks! |
If anyone is interested I can drop a pattern which works with the latest community module and has comments? |
@stevehipwell thanks for the suggestion about declaring as locals. Another good way to do it. I haven't followed the transition to 18 closely, it's been on my to-do list to catch up on. I followed the flurry of activity around the auth configmap changes and at that point decided it was going to be more effort than it was worth at the time. I will experiment with the local node group definitions, that will likely work for us, just venting some slight annoyance that there isn't a clear documented well supported path for scale to zero. (It's possible, and I currently do it, just clunky). Maybe refactoring into local definitions will clean things up. |
Yes please. |
This is the example from above with a bit more context added; it turns out that the module changes don't help, as they only provide the names and not the ID needed to look up the required tags.

locals {
cluster_name = "my-cluster"
# Define MNGs here so we can reference them later
mngs = {
my-mng = {
name = "my-mng"
labels = {
"my-label" = "foo"
}
taints = [{
key = "my-taint"
value = "bar"
effect = "NO_SCHEDULE"
}]
}
}
  # We need to look up the K8s taint effect from the AWS API value
taint_effects = {
"NO_SCHEDULE" = "NoSchedule"
"NO_EXECUTE" = "NoExecute"
"PREFER_NO_SCHEDULE" = "PreferNoSchedule"
}
# Calculate the tags required by CA based on the MNG inputs
mng_ca_tags = { for mng_key, mng_value in local.mngs : mng_key => merge({
"k8s.io/cluster-autoscaler/enabled" = "true"
"k8s.io/cluster-autoscaler/${local.cluster_name}" = "owned"
},
{ for label_key, label_value in mng_value.labels : "k8s.io/cluster-autoscaler/node-template/label/${label_key}" => label_value },
{ for taint in mng_value.taints : "k8s.io/cluster-autoscaler/node-template/taint/${taint.key}" => "${taint.value}:${local.taint_effects[taint.effect]}" }
)
}
# Use the module
module "eks" {
...
eks_managed_node_groups = local.mngs
...
}
resource "aws_autoscaling_group_tag" "mng_ca" {
# Create a tuple in a map for each ASG tag combo
for_each = merge([for mng_key, mng_tags in local.mng_ca_tags : { for tag_key, tag_value in mng_tags : "${mng_key}-${substr(tag_key, 25, -1)}" => { mng = mng_key, key = tag_key, value = tag_value }}]...)
# Lookup the ASG name for the MNG, erroring if there is more than one
autoscaling_group_name = one(module.eks.eks_managed_node_groups[each.value.mng].node_group_autoscaling_group_names)
tag {
key = each.value.key
value = each.value.value
propagate_at_launch = false
}
} |
hi all, am I missing something, or is there a reason not to set the minimum capacity of a managed node group to 0? |
You can absolutely set the minimum to zero. If you want cluster-autoscaler to automatically scale the node group up from zero when a pod is unschedulable on existing nodes but could be scheduled on a new node created in that node group, you need to tag the ASG that the node group creates with the `k8s.io/cluster-autoscaler/node-template/...` label and taint tags. That's easy to automate using IaC solutions like Terraform, or with ad-hoc scripts as described above. |
Hi, here is a snippet I use with Terragrunt and the `eks-managed-node-group` module. It allows setting implicit labels to be able to scale to and from 0 with restricted labels such as:
And it also adds restricted labels to the ASG as tags (labels that are forbidden via the EKS API, for example). I think there is room for improvement; let me know what you think. It also allows mixing taints and labels defined in the defaults and per node group:

locals {
mngs = var.eks_managed_node_groups
mng_defaults = var.eks_managed_node_group_defaults
cluster_name = var.cluster_name
taint_effects = {
NO_SCHEDULE = "NoSchedule"
NO_EXECUTE = "NoExecute"
PREFER_NO_SCHEDULE = "PreferNoSchedule"
}
mng_ca_tags_defaults = {
"k8s.io/cluster-autoscaler/enabled" = "true"
"k8s.io/cluster-autoscaler/${local.cluster_name}" = "owned"
}
mng_ca_tags_taints_defaults = try(local.mng_defaults.taints, []) != [] ? {
for taint in local.mng_defaults.taints : "k8s.io/cluster-autoscaler/node-template/taint/${taint.key}" => "${taint.value}:${local.taint_effects[taint.effect]}"
} : {}
mng_ca_tags_labels_defaults = try(local.mng_defaults.labels, {}) != {} ? {
for label_key, label_value in local.mng_defaults.labels : "k8s.io/cluster-autoscaler/node-template/label/${label_key}" => label_value
} : {}
mng_ca_tags_taints = { for mng_key, mng_value in local.mngs : mng_key => merge(
{ for taint in mng_value.taints : "k8s.io/cluster-autoscaler/node-template/taint/${taint.key}" => "${taint.value}:${local.taint_effects[taint.effect]}" }
) if try(mng_value.taints, []) != []
}
mng_ca_tags_labels = { for mng_key, mng_value in local.mngs : mng_key => merge(
{ for label_key, label_value in mng_value.labels : "k8s.io/cluster-autoscaler/node-template/label/${label_key}" => label_value },
) if try(mng_value.labels, {}) != {}
}
mng_ca_tags_restricted_labels = { for mng_key, mng_value in local.mngs : mng_key => merge(
{ for label_key, label_value in mng_value.restricted_labels : "k8s.io/cluster-autoscaler/node-template/label/${label_key}" => label_value },
) if try(mng_value.restricted_labels, {}) != {}
}
mng_ca_tags_implicit = { for mng_key, mng_value in local.mngs : mng_key => merge(
length(try(mng_value.instance_types, local.mng_defaults.instance_types)) == 1 ? { "k8s.io/cluster-autoscaler/node-template/label/node.kubernetes.io/instance-type" = one(try(mng_value.instance_types, local.mng_defaults.instance_types)) } : {},
length(data.aws_autoscaling_group.node_groups[mng_key].availability_zones) == 1 ? { "k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone" = one(data.aws_autoscaling_group.node_groups[mng_key].availability_zones) } : {},
length(data.aws_autoscaling_group.node_groups[mng_key].availability_zones) == 1 ? { "k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone" = one(data.aws_autoscaling_group.node_groups[mng_key].availability_zones) } : {},
)
}
mng_ca_tags = { for mng_key, mng_value in local.mngs : mng_key => merge(
local.mng_ca_tags_defaults,
local.mng_ca_tags_taints_defaults,
local.mng_ca_tags_labels_defaults,
try(local.mng_ca_tags_taints[mng_key], {}),
try(local.mng_ca_tags_labels[mng_key], {}),
try(local.mng_ca_tags_restricted_labels[mng_key], {}),
local.mng_ca_tags_implicit[mng_key],
) }
mng_asg_custom_tags = { for mng_key, mng_value in local.mngs : mng_key => merge(var.tags) }
}
data "aws_autoscaling_group" "node_groups" {
for_each = module.eks_managed_node_group
  name = each.value.node_group_resources[0].autoscaling_groups[0].name
}
resource "aws_autoscaling_group_tag" "mng_ca" {
# Create a tuple in a map for each ASG tag combo
for_each = merge([for mng_key, mng_tags in local.mng_ca_tags : { for tag_key, tag_value in mng_tags : "${mng_key}-${substr(tag_key, 25, -1)}" => { mng = mng_key, key = tag_key, value = tag_value } }]...)
# Lookup the ASG name for the MNG, erroring if there is more than one
autoscaling_group_name = one(module.eks_managed_node_group[each.value.mng].node_group_autoscaling_group_names)
tag {
key = each.value.key
value = each.value.value
propagate_at_launch = false
}
}
resource "aws_autoscaling_group_tag" "mng_asg_tags" {
# Create a tuple in a map for each ASG tag combo
for_each = merge([for mng_key, mng_tags in local.mng_asg_custom_tags : { for tag_key, tag_value in mng_tags : "${mng_key}-${tag_key}" => { mng = mng_key, key = tag_key, value = tag_value } }]...)
# Lookup the ASG name for the MNG, erroring if there is more than one
autoscaling_group_name = one(module.eks_managed_node_group[each.value.mng].node_group_autoscaling_group_names)
tag {
key = each.value.key
value = each.value.value
propagate_at_launch = true
}
}

This produces the following output, for example, in the console:

{
"c5-xlarge-pub-a" = {
"k8s.io/cluster-autoscaler/enabled" = "true"
"k8s.io/cluster-autoscaler/node-template/label/network" = "public"
"k8s.io/cluster-autoscaler/node-template/label/node.kubernetes.io/instance-type" = "c5.xlarge"
"k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone" = "eu-west-1a"
"k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone" = "eu-west-1a"
"k8s.io/cluster-autoscaler/node-template/taint/dedicated" = "true:NoSchedule"
}
"c5-xlarge-pub-b" = {
"k8s.io/cluster-autoscaler/enabled" = "true"
"k8s.io/cluster-autoscaler/node-template/label/network" = "public"
"k8s.io/cluster-autoscaler/node-template/label/node.kubernetes.io/instance-type" = "c5.xlarge"
"k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone" = "eu-west-1b"
"k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone" = "eu-west-1b"
"k8s.io/cluster-autoscaler/node-template/taint/dedicated" = "true:NoSchedule"
}
"c5-xlarge-pub-c" = {
"k8s.io/cluster-autoscaler/enabled" = "true"
"k8s.io/cluster-autoscaler/node-template/label/network" = "public"
"k8s.io/cluster-autoscaler/node-template/label/node.kubernetes.io/instance-type" = "c5.xlarge"
"k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone" = "eu-west-1c"
"k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone" = "eu-west-1c"
"k8s.io/cluster-autoscaler/node-template/taint/dedicated" = "true:NoSchedule"
} |
Still not sure if you have to do it this way. My definition:
Values for size
It is important to set desired_size to 1: the MNG will be created with 1 node, and after some time cluster autoscaler will scale it down to 0. Now, if you create e.g. a Deployment with a nodeSelector targeting a specific MNG that is scaled to 0, the Deployment triggers a scale-up and a new node is created with all the labels specified in Terraform. |
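A minimal sketch of such a node group definition (the names, role, and sizes here are hypothetical illustrations, not the commenter's actual config):

```hcl
# Managed node group that can scale down to zero once the ASG carries
# the cluster-autoscaler node-template tags.
resource "aws_eks_node_group" "ml" {
  cluster_name    = "my-cluster"
  node_group_name = "ml-workers"
  node_role_arn   = aws_iam_role.node.arn # assumed to exist elsewhere
  subnet_ids      = var.subnet_ids

  scaling_config {
    min_size     = 0
    max_size     = 5
    desired_size = 1 # start with 1; cluster-autoscaler scales down to 0 later
  }

  # These labels are what the node-template ASG tags must mirror.
  labels = {
    workload = "ml"
  }
}
```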
After some time I've returned to this topic with what I believe is the final fix, at least for our use case.
Hope it helps to someone. |
@TomasHradecky what you're describing is the expected and documented Cluster Autoscaler behaviour, which is easy to configure for unmanaged node groups but slightly trickier for AWS managed node groups. This issue has covered a number of ways to "hack" the ASG behind the MNG, but the real issue here is waiting for AWS to support this natively through the MNG API; that is still as far away as it was when the issue was opened and when it was moved to "coming soon"! |
+1 cost is major concern! |
Scaling Managed Node Groups from 0 to 1 and back down to 0 works well on EKS. We have just tested it out. The hack is to tag the Auto Scaling group used by the Managed Node Group with a specific tag, mirroring the K8s label added to your Node Group. I just had to add the following block of code to my Terraform module for EKS Node Groups, and after this the Cluster Autoscaler is able to scale up from 0 to 1 and scale back down to 0 as well. This is also mentioned in the Cluster Autoscaler document here - https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-scale-a-node-group-to-0 Our K8s label on Managed Node Group is -
I tested it on our EKS Jupyterhub environment and you can see the scale-up worked well as per logs -
Hope it helps others as well, who're facing the same challenges as us. PS: An example to achieve this is also mentioned here - https://github.com/terraform-aws-modules/terraform-aws-eks/blob/312e4a4d59cb10a762a4045e9944f3f837126933/examples/eks_managed_node_group/main.tf#L673-L712 |
How does it impact cold start? Imagine the cluster has 3 AZs and all are allowed to scale to 0; wouldn't it impact the first request that comes in with 0 nodes in ALL AZs? |
Of course, it takes time (30 sec ~ 3 min depending on node OS, probe settings etc) for a node to start, join the cluster and get Ready. If you are responding to requests that need fast responses, don't scale all your node groups to zero. |
How to scale all AZs to zero except one? |
If you're following best practices, you already have one node group per AZ, so you could set min size for each AZ separately. But it's probably better to handle that higher up in the stack — even if you have min size set to 0 for the node group for all AZs, cluster-autoscaler won't scale a node group down to zero if there are pods scheduled on the node(s) that can't be moved elsewhere. |
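The per-AZ layout described above can be sketched in Terraform roughly as follows (assuming native min-size-0 support; the variable names and the subnet lookup are hypothetical):

```hcl
# Hypothetical list of AZs; one node group is created per AZ.
variable "azs" {
  type    = list(string)
  default = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
}

resource "aws_eks_node_group" "per_az" {
  for_each = toset(var.azs)

  cluster_name    = "my-cluster"
  node_group_name = "workers-${each.value}"
  node_role_arn   = aws_iam_role.node.arn            # assumed to exist
  subnet_ids      = [var.subnet_ids_by_az[each.value]] # hypothetical map

  scaling_config {
    # Keep at least one warm node only in the first AZ; the rest can
    # scale all the way to zero.
    min_size     = each.value == var.azs[0] ? 1 : 0
    max_size     = 5
    desired_size = each.value == var.azs[0] ? 1 : 0
  }
}
```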
@khteh - Regarding your question, I would rather suggest designing your workloads to be submitted to different Node Groups; not all Node Groups have to be at 0. In our use case, we have multiple Node Groups, e.g. a Services NG: we run DE services like Airflow, JupyterHub, MLflow etc. on it. We can't have this NG at Min 0, as the services run 24x7; these are mostly On-Demand EC2 instances. I would rather design in a way which fits specific use cases and not keep all NGs at 0. |
Would karpenter help here at all? I'd imagine no, but wanted to check. |
I don't think so, as that's not the original purpose of Karpenter. The solution to tag the ASG of the Managed NG was provided by AWS itself when we checked about it with them. This is something already done in the Terraform module for EKS provided here - https://github.com/terraform-aws-modules/terraform-aws-eks/blob/312e4a4d59cb10a762a4045e9944f3f837126933/examples/eks_managed_node_group/main.tf#L673-L712 AFAIK, for now, this is the only solution to support Min 0 for Managed NGs in EKS. Unless someone has tried a better solution :). @yuvipanda - I have tried it with Z2JH (which you help maintain) and it works like a charm. We just have to wait for the autoscaler to kick in, so if we have a timeout on JHub we need a couple of retries to spawn the server, but it works well and helps save a lot of cost. |
I think Karpenter is certainly a viable way to handle the use-case of spinning up nodes just-in-time for e.g. batch workloads. Karpenter will simply spin up a node directly when it sees pending pods, no ASGs involved. The controller itself may run on e.g. a statically scaled ASG, or for a completely ASG-less cluster, it should be possible to run it on Fargate. For us at least, Karpenter massively simplified running batch workloads. |
Yeah, good candidate and can run on Fargate nodes 👍 |
Support for scaling managed node groups to 0 will be available starting with Kubernetes 1.24 - kubernetes/autoscaler#4491 (comment) Once EKS releases support for Kubernetes 1.24, you should be able to configure this functionality |
🚀🚀🚀 Launch Announcement 🚀🚀🚀 Read more about this and other new features available on EKS and Kubernetes 1.24 in the launch blog. |
Currently, managed node groups have a required minimum of 1 node in a node group. This request is to update that behavior to support node groups of size 0, to unlock batch and ML use cases.
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-scale-a-node-group-to-0