
feat: Propagate managed node group labels and taints to ASG as tags #2457

Closed
wants to merge 9 commits

Conversation

@wcarlsen commented Feb 8, 2023

Description

Propagate managed node group labels and taints to ASG as tags with configurable prefixes.

Motivation and Context

This is highly relevant when using cluster-autoscaler to scale from zero. It should address the following issues: #2448, #2128 and #1558.
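For context (this is the mapping scale-from-zero relies on, not code taken from this PR's diff): cluster-autoscaler discovers node-template labels and taints from ASG tags under the k8s.io/cluster-autoscaler/node-template/ key prefix. A minimal sketch of how node group labels and taints would translate into those tag keys, using made-up label and taint values:

```hcl
# Sketch only: the label/taint values are hypothetical; the tag-key prefixes
# are the ones cluster-autoscaler reads for scale-from-zero.
locals {
  node_group_labels = { workload = "batch" }
  node_group_taints = { dedicated = "batch:NoSchedule" }

  # k8s.io/cluster-autoscaler/node-template/label/<label-key> = <label-value>
  asg_label_tags = {
    for k, v in local.node_group_labels :
    "k8s.io/cluster-autoscaler/node-template/label/${k}" => v
  }

  # k8s.io/cluster-autoscaler/node-template/taint/<taint-key> = <value>:<effect>
  asg_taint_tags = {
    for k, v in local.node_group_taints :
    "k8s.io/cluster-autoscaler/node-template/taint/${k}" => v
  }
}
```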

Breaking Changes

This feature is disabled by default, so no labels or taints are propagated unless explicitly enabled.

How Has This Been Tested?

  • I have updated at least one of the examples/* to demonstrate and validate my change(s)
  • I have tested and validated these changes using one or more of the provided examples/* projects
  • I have executed pre-commit run -a on my pull request

I had a lot of issues actually getting the eks-managed-node-group example to work, but finally managed to.

I've tested this by changing examples/eks_managed_node_group. I've tested the following settings and changes:

  • Applied with default settings and saw no new tags added to the ASGs
  • Added new node groups and saw labels and taints propagated with the configured prefix applied
  • Disabled label and taint propagation on an existing node group and saw the tags being removed

Thanks for the effort and a great module, and thanks to @abstrask for helping out.

@wcarlsen wcarlsen changed the title Propagate managed node group labels and taints to ASG as tags feat: Propagate managed node group labels and taints to ASG as tags Feb 8, 2023
@bryantbiggs (Member)

I am getting the following error on the example with these changes:

│ Error: error waiting for EKS Node Group (ex-eks-managed-node-group:complete-eks-mng-20230208163733128000000043) to create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: 2 errors occurred:

@wcarlsen (Author) commented Feb 8, 2023

If the examples actually worked out of the box, which they didn't, it would probably be easier to contribute. Instead of not looking at the actual contribution, could we discuss whether the approach is viable and different from what you previously tried, since you haven't commented on it yet ✌️ Where do we go from here - any hints?

@bryantbiggs (Member)

I'm sorry, but you lost me - are you saying the example on main today does not work, that you are looking for hints on making your solution work, or something else?

@wcarlsen (Author) commented Feb 8, 2023

Both. I'm editing things totally unrelated to the actual change in the example. Why would I do that if it worked out of the box? But please comment on the approach and whether it is any different from what you already tried.

@wcarlsen (Author) commented Feb 9, 2023

Yeah examples/eks_managed_node_group/ is broken alright.

So, reverting my commits and running on commit 4b7fe46, I initially get this error related to subnets and IPv6:

Error: setting EC2 Subnet (subnet-0e66ca5dcc8d25e73) AssignIpv6AddressOnCreation: InvalidParameterValue: Invalid value 'true' for assign-ipv6-address-on-creation. Cannot set assign-ipv6-address-on-creation to true unless the subnet (subnet-0e66ca5dcc8d25e73) has an IPv6 CIDR block associated with it.
│ status code: 400, request id: 35801c5c-747b-45bf-88e5-168c3546df1c
│
│ with module.vpc.aws_subnet.intra[0],
│ on .terraform/modules/vpc/main.tf line 569, in resource "aws_subnet" "intra":
│ 569: resource "aws_subnet" "intra" {

Adding intra_subnet_ipv6_prefixes = [6, 7, 8] to the vpc module fixes the issue.
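For reference, a sketch of where that workaround lands in the example's VPC module block (surrounding arguments abbreviated; only the added line is the change described above):

```hcl
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 3.0"

  # ... rest of the example's VPC configuration ...

  # Workaround: give the intra subnets IPv6 prefixes so that
  # assign-ipv6-address-on-creation can be enabled on them.
  intra_subnet_ipv6_prefixes = [6, 7, 8]
}
```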

Then running into this error:
Error: "block_device_mappings.0.ebs.0.kms_key_id" (2fcb3f5e-d474-4820-8dc1-ac6a5b283d39) is an invalid ARN: arn: invalid prefix │ │ with module.eks.module.eks_managed_node_group["complete"].aws_launch_template.this[0], │ on ../../modules/eks-managed-node-group/main.tf line 36, in resource "aws_launch_template" "this": │ 36: resource "aws_launch_template" "this" {

Fixed by changing kms_key_id = module.ebs_kms_key.key_id to kms_key_id = module.ebs_kms_key.key_arn.
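A sketch of the corrected reference inside the node group's block_device_mappings (the device name and other arguments below are illustrative, not copied from the example):

```hcl
block_device_mappings = {
  xvda = {
    device_name = "/dev/xvda"
    ebs = {
      encrypted = true
      # The launch template expects the full KMS key ARN here, despite the argument name:
      kms_key_id = module.ebs_kms_key.key_arn
      # kms_key_id = module.ebs_kms_key.key_id   # fails with: "is an invalid ARN"
    }
  }
}
```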

Then I ran into conflicting subnet CIDRs because of a previous failed run:

Error: creating EC2 Subnet: InvalidSubnet.Conflict: The CIDR '10.9.53.0/24' conflicts with another subnet
│ status code: 400, request id: 71d6e4cb-38d3-449e-b94a-3f4e7e180921
│
│ with module.vpc.aws_subnet.intra[1],
│ on .terraform/modules/vpc/main.tf line 569, in resource "aws_subnet" "intra":
│ 569: resource "aws_subnet" "intra" {
Fixed by deleting the conflicting subnets via the management console.

And I haven't even gotten close to spawning a cluster yet, nor tested my change.

And then I end up with the error you reported for the complete node group:

│ Error: error waiting for EKS Node Group (ex-eks-managed-node-group:complete-eks-mng-20230209080102945300000017) to create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: 2 errors occurred:
│ * eks-complete-eks-mng-20230209080102945300000017-0ac31a98-9cd0-8656-d06c-f77a3eddfa5b: AsgInstanceLaunchFailures: Instance became unhealthy while waiting for instance to be in InService state. Termination Reason: Client.InternalError: Client error on launch
│ * i-01380d86759033bd8, i-0293cb48cccc4c4a8, i-065ef228affe9c3ab, i-0b758c29f131929bb, i-0f7c8eb273291bdc9: NodeCreationFailure: Instances failed to join the kubernetes cluster
│
│ with module.eks.module.eks_managed_node_group["complete"].aws_eks_node_group.this[0],
│ on ../../modules/eks-managed-node-group/main.tf line 308, in resource "aws_eks_node_group" "this":
│ 308: resource "aws_eks_node_group" "this" {
So the error you refer to was already there prior to my changes, and this last one I have no idea how to fix. But if I comment out this node group and apply my ASG tagging changes, it works like a dream. And just to make it completely clear: the above errors occur with only the above fixes applied on top of commit 4b7fe46, without my ASG tagging changes.

@wcarlsen (Author) commented Feb 9, 2023

I think fixing that last error is completely out of scope and asking too much for a change that just adds tags to ASGs. I would suggest commenting out the complete = {...} node group section, since it is broken on master (i.e. 4b7fe46), and having someone fix that at a later time.

@abstrask commented Feb 9, 2023

Error: setting EC2 Subnet (subnet-0e66ca5dcc8d25e73) AssignIpv6AddressOnCreation: InvalidParameterValue: Invalid value 'true' for assign-ipv6-address-on-creation. Cannot set assign-ipv6-address-on-creation to true unless the subnet (subnet-0e66ca5dcc8d25e73) has an IPv6 CIDR block associated with it.
│ status code: 400, request id: 35801c5c-747b-45bf-88e5-168c3546df1c
│
│ with module.vpc.aws_subnet.intra[0],
│ on .terraform/modules/vpc/main.tf line 569, in resource "aws_subnet" "intra":
│ 569: resource "aws_subnet" "intra" {

Adding intra_subnet_ipv6_prefixes = [6, 7, 8] to the vpc module fixes the issue.

I suspect this problem with the example was introduced with release 3.16.0 of the VPC module, which introduces a lot of changes regarding IPv6.

The version constraint on the VPC module referenced in the example allows for minor version upgrades. Is this intentional or should it be more specific and only allow patch upgrades?
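For illustration (the exact constraint used in the example isn't quoted in this thread, so the version numbers below are placeholders), the difference between the two constraint styles in HCL would be:

```hcl
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  # Allows any 3.x release from 3.14 upward, so 3.16.0 is pulled in automatically:
  # version = "~> 3.14"

  # Only allows patch releases of 3.14, i.e. >= 3.14.0 and < 3.15.0:
  version = "~> 3.14.0"
}
```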

As for the problem with kms_key_id, I think it stems from the fact that the input in the ebs block of the aws_launch_template resource expects the ARN of the KMS key, not the ID as the name suggests.

Changing the reference to the KMS key ARN, as @wcarlsen suggested, seems to fix it.

I haven't dived into the subnet conflict issue 🤷.

@bryantbiggs (Member)

Thanks for the call-out - not sure why some of those examples stopped working, but I have just updated them to ensure they are functioning as intended.

@wcarlsen (Author) commented Feb 9, 2023

Thanks @bryantbiggs 🥇 Anything more you need from me to make this PR ready for a second review?

@wcarlsen (Author) commented Feb 9, 2023

Just did another from-scratch apply on the example, pulling in your fixes, and the ASG tag propagation still works as intended. Just to be safe ✌️

@wcarlsen (Author)

So @bryantbiggs, any chance you have had time to look at the actual changes intended by this PR (node group taint and label propagation to the ASG), or is it stuck because of something else I'm missing?

@wcarlsen (Author)

Just wanted to add that my team and I are quite happy with having automatic propagation of node group labels and taints as tags on the ASGs when using cluster-autoscaler. This propagation can be implemented outside the module, as shown in issue #2448, but it can also be implemented inside the module, as shown in this PR, even though I think it is mostly an AWS responsibility, since we never really configure the ASGs ourselves. I hope I can convince someone to have a second look at this PR and give it an actual chance.

@bryantbiggs (Member)

The more I think about this, the less I see why this needs to be built into the module. Users have the ability to do this today as the module is written. I don't think the added complexity, the added variables, and the expanded interface are warranted to support this.
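For anyone looking for the "outside the module" route referred to here: a sketch of the pattern using the AWS provider's aws_autoscaling_group_tag resource and the module's node_group_autoscaling_group_names output. The label key/value and the node group iteration are placeholders for illustration, not the exact code from #2448:

```hcl
# Sketch: tag every ASG behind the managed node groups with a
# cluster-autoscaler node-template label hint (placeholder label).
locals {
  node_group_asg_names = flatten([
    for ng in module.eks.eks_managed_node_groups : ng.node_group_autoscaling_group_names
  ])
}

resource "aws_autoscaling_group_tag" "node_template_label" {
  for_each = toset(local.node_group_asg_names)

  autoscaling_group_name = each.value

  tag {
    key                 = "k8s.io/cluster-autoscaler/node-template/label/workload"
    value               = "batch"
    propagate_at_launch = false
  }
}
```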

@wcarlsen (Author) commented Apr 2, 2023

I fully respect the explicit choice not to include an ASG tagging feature in the module. That said, it is not as if this topic hasn't shown up multiple times in this project's issues, and the statement made here (#2448 (comment)) about this not being possible from the module's perspective is simply not true.

@wcarlsen wcarlsen closed this Apr 2, 2023
@github-actions bot commented May 3, 2023

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 3, 2023