✨feat(awsmachinepool): custom lifecyclehooks for machinepools #4875

Open
wants to merge 12 commits into
base: main

Conversation

sebltm

@sebltm sebltm commented Mar 18, 2024

What type of PR is this?
/kind feature

What this PR does / why we need it:

This PR extends the v1beta2 definitions of AWSMachinePool and AWSManagedMachinePool with a new field, lifecycleHooks, which is a list of entries with the following fields:

name: <the name of the lifecycle hook>
notificationTargetARN: <ARN of resource where to send the lifecycle event; optional>
roleARN: <ARN of role to be used when sending notifications; optional>
lifecycleTransition: <autoscaling:EC2_INSTANCE_LAUNCHING/EC2_INSTANCE_TERMINATING>
heartbeatTimeout: <duration of the heartbeat timeout; optional>
defaultResult: <CONTINUE/ABANDON; optional>
notificationMetadata: <some metadata to add to the notification; optional>
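
As a rough illustration of the field list above, the following Go sketch models one list entry and checks the two enumerated fields. The type name `LifecycleHookSketch`, the Go field types, and the helper `checkEnums` are assumptions made for this example; they are not the types or webhook code from this PR.

```go
package main

import (
	"fmt"
	"time"
)

// Values taken from the field list in the description.
const (
	TransitionLaunching   = "autoscaling:EC2_INSTANCE_LAUNCHING"
	TransitionTerminating = "autoscaling:EC2_INSTANCE_TERMINATING"
	ResultContinue        = "CONTINUE"
	ResultAbandon         = "ABANDON"
)

// LifecycleHookSketch is a hypothetical model of one lifecycleHooks entry.
// Optional fields are pointers so that "unset" can be told apart from "empty".
type LifecycleHookSketch struct {
	Name                  string
	NotificationTargetARN *string
	RoleARN               *string
	LifecycleTransition   string
	HeartbeatTimeout      *time.Duration
	DefaultResult         *string
	NotificationMetadata  *string
}

// checkEnums rejects values outside the sets listed in the description.
func checkEnums(h LifecycleHookSketch) error {
	switch h.LifecycleTransition {
	case TransitionLaunching, TransitionTerminating:
	default:
		return fmt.Errorf("unsupported lifecycleTransition %q", h.LifecycleTransition)
	}
	if h.DefaultResult != nil {
		switch *h.DefaultResult {
		case ResultContinue, ResultAbandon:
		default:
			return fmt.Errorf("unsupported defaultResult %q", *h.DefaultResult)
		}
	}
	return nil
}
```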

The matching webhooks are updated to validate the lifecycle hooks as they are added to the Custom Resource.
The matching reconcilers are updated to reconcile those lifecycle hooks: a hook that is declared in the Custom Resource but missing in the cloud is created, and a hook that is present in the cloud but not declared in the Custom Resource is removed.
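
The create/remove rule described above can be sketched as a plain set difference keyed by hook name. This is a minimal illustration, not the reconciler from this PR: `Hook` and `diffLifecycleHooks` are hypothetical names, and the sketch deliberately ignores hooks that exist on both sides but differ in their settings.

```go
package main

// Hook is a simplified, hypothetical stand-in for a lifecycle hook;
// only the name is used to match declared hooks against cloud hooks.
type Hook struct {
	Name                string
	LifecycleTransition string
}

// diffLifecycleHooks compares the hooks declared in the custom resource
// (desired) with the hooks found on the Auto Scaling group (existing).
// It returns the hooks to create and the names of the hooks to delete,
// each in the order in which they appear in the input slices.
func diffLifecycleHooks(desired, existing []Hook) (toCreate []Hook, toDelete []string) {
	existingNames := make(map[string]struct{}, len(existing))
	for _, h := range existing {
		existingNames[h.Name] = struct{}{}
	}

	desiredNames := make(map[string]struct{}, len(desired))
	for _, h := range desired {
		desiredNames[h.Name] = struct{}{}
		// Declared but not present in the cloud: create it.
		if _, found := existingNames[h.Name]; !found {
			toCreate = append(toCreate, h)
		}
	}

	for _, h := range existing {
		// Present in the cloud but not declared: remove it.
		if _, found := desiredNames[h.Name]; !found {
			toDelete = append(toDelete, h.Name)
		}
	}
	return toCreate, toDelete
}
```

With declared hooks `a` and `b` and cloud hooks `b` and `c`, the sketch reports `a` for creation and `c` for deletion, and leaves `b` untouched.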

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4020

AWS supports Lifecycle Hooks that run before or after certain actions on an ASG. For example, before scaling in (removing) a node, the ASG can publish an event to an SQS queue, which can then be consumed by the node-termination-handler to ensure the node's proper removal from Kubernetes (it cordons and drains the node, then waits for a period of time for applications to be removed before allowing the Auto Scaling group to terminate the instance).

This allows Kubernetes or other components to be aware of the node's lifecycle and take appropriate actions.

Special notes for your reviewer:

Checklist:

  • squashed commits
  • includes documentation
  • includes emojis
  • adds unit tests
  • adds or updates e2e tests

Release note:

Adding support for custom Lifecycle Hooks in AWSMachinePools for external hooks (e.g. support for the aws-node-termination-handler with SQS).

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Mar 18, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority labels Mar 18, 2024
@k8s-ci-robot
Contributor

Welcome @sebltm!

It looks like this is your first PR to kubernetes-sigs/cluster-api-provider-aws 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api-provider-aws has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 18, 2024
@k8s-ci-robot
Contributor

Hi @sebltm. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sebltm sebltm changed the title feat(awsmachinepool): add the ability to add lifecycle hooks ✨feat(awsmachinepool): add the ability to add lifecycle hooks Mar 18, 2024
@sebltm sebltm marked this pull request as ready for review April 16, 2024 11:41
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 16, 2024
@sebltm sebltm changed the title ✨feat(awsmachinepool): add the ability to add lifecycle hooks ✨feat(awsmachinepool): custom lifecyclehooks for machinepools May 10, 2024
@AndiDog
Contributor

AndiDog commented Jul 3, 2024

I have two requests before getting to the review:

  • Neither the title nor the PR description describes the change. Lifecycle hooks and reacting to node shutdown are great, but what is this PR doing and achieving? Also, the release note entry in the PR template must be filled in.
  • You're moving lots of code. Please revert those changes as much as possible so the PR becomes reviewable. Refactoring and file renames can be done separately.

@AndiDog
Contributor

AndiDog commented Jul 3, 2024

/assign

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jul 4, 2024
@sebltm
Author

sebltm commented Jul 4, 2024

@AndiDog sorry I hadn't cleaned up the PR, I didn't know if it would get some traction :)
I've updated the PR, updated the description. Let me know if it looks good, I'll write some docs and add release notes

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jul 13, 2024
@sebltm
Author

sebltm commented Jul 13, 2024

@AndiDog let me know if this looks good or if there's anything else I should take a look at :)

@AndiDog
Contributor

AndiDog commented Jul 15, 2024

The PR is definitely reviewable now. I don't have much experience with lifecycle hooks and aws-node-termination-handler (is that your actual use case?). Maybe MachinePool machines (#4527) give us a good way to detect node shutdown and have CAPI/CAPA take care of it? In other words, I'm not fully confident reviewing this with my current knowledge, but maybe others know more. Please feel free to ping or discuss in Slack (#cluster-api-aws) so we can find someone to check this feature request.

@AndiDog
Contributor

AndiDog commented Jul 15, 2024

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 15, 2024
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 24, 2024
@sebltm
Author

sebltm commented Nov 24, 2024

Sorry I've been away, thank you @AndiDog for picking this one up

@@ -133,7 +133,6 @@ func (r *AWSMachinePool) validateAdditionalSecurityGroups() field.ErrorList {
}
return allErrs
}

Contributor

For consistency's sake, I'd keep this.

Contributor

Done, that empty line was a mistake from merging

Comment on lines 178 to 190
{
    name: "Should fail if either roleARN or notifcationARN is set but not both",
    pool: &AWSMachinePool{
        Spec: AWSMachinePoolSpec{
            AWSLifecycleHooks: []AWSLifecycleHook{
                {
                    RoleARN: aws.String("role-arn"),
                },
            },
        },
    },
    wantErr: true,
},
Contributor

I'd add a case for only setting roleARN, and another one only setting notificationARN

Contributor

done

@@ -298,6 +298,11 @@ func (r *AWSMachinePoolReconciler) reconcileNormal(ctx context.Context, machineP
return nil
}

if err := r.reconcileLifecycleHooks(machinePoolScope, asgsvc); err != nil {
Contributor

We have a ctx variable in this function, which we don't pass here, but later down the stack we end up creating a context.TODO(). May be worth passing the context that we already have. What do you think?

Author

@fiunchinho it looks like most of the other functions on ASGInterface and EC2Interface use the same pattern (they are called from places that have a context, and they create their own context.TODO context).
We could start breaking the pattern here, but it would diverge a bit from the rest of the code.

Contributor

Then, for consistency's sake, it'd be better to follow the same pattern for now. It could be addressed in a different PR later on.

Contributor

I think creating a context.TODO might have been a mistake when usage of *WithContext AWS SDK functions was introduced. Context should always be specified where possible in order to support timeouts, for instance. Some other interface functions are correctly taking such an argument already.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 25, 2024
@AndiDog
Contributor

AndiDog commented Nov 25, 2024

I noticed that CreateASG didn't handle the hooks. Likely, it's best if both are created atomically, so I added this as another commit.

@AndiDog
Contributor

AndiDog commented Nov 26, 2024

@sebltm I'll try to continue here to bring it through review

AndiDog added a commit to giantswarm/cluster-api-provider-aws that referenced this pull request Nov 27, 2024
@AndiDog
Contributor

AndiDog commented Nov 27, 2024

/test pull-cluster-api-provider-aws-e2e

Giving it a try, but E2E might be problematic right now.

Contributor

@fiunchinho fiunchinho left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 27, 2024
AndiDog added a commit to giantswarm/cluster-api-provider-aws that referenced this pull request Nov 28, 2024
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 28, 2024
@k8s-ci-robot
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot
Contributor

@sebltm: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-cluster-api-provider-aws-build-docker-release-2-6 | 2421ec3 | link | true | /test pull-cluster-api-provider-aws-build-docker-release-2-6
pull-cluster-api-provider-aws-build-release-2-6 | 2421ec3 | link | true | /test pull-cluster-api-provider-aws-build-release-2-6
pull-cluster-api-provider-aws-e2e | c35bbc3 | link | false | /test pull-cluster-api-provider-aws-e2e
pull-cluster-api-provider-aws-test | b95757c | link | true | /test pull-cluster-api-provider-aws-test

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

AndiDog added a commit to giantswarm/cluster-api-provider-aws that referenced this pull request Dec 2, 2024
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Lifecycle Hooks for MachinePool/ASG
4 participants