Add support for defining/using AWS ASG Lifecycle Hooks #8708

Closed
andersosthus opened this issue Mar 9, 2020 · 18 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@andersosthus
Contributor

We would like support for defining AWS ASG Lifecycle hooks on an InstanceGroup.

Use case:
We use AWS ASG Lifecycle hooks together with a custom script that watches for AWS Spot terminations. When a termination is detected, the script drains the node and then sends a COMPLETED signal to the Lifecycle hook, which allows the instance to be terminated.
Having this Lifecycle hook in place effectively "breaks" kops rolling-update cluster, since the instance won't be terminated until either the COMPLETED signal is sent or the hook reaches its timeout.
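
For context, a minimal sketch of the completion call such a script ends up making, assuming the aws-sdk-go client; the hook, group, and instance names below are placeholders. Note that the AWS API expresses the "COMPLETED" signal as a CompleteLifecycleAction call whose result is CONTINUE or ABANDON:

```go
// Hypothetical sketch of the completion signal a spot-termination watcher
// sends after draining the node. All names below are placeholders.
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := autoscaling.New(sess)

	// After the drain finishes, tell the ASG it may terminate the instance.
	_, err := svc.CompleteLifecycleAction(&autoscaling.CompleteLifecycleActionInput{
		AutoScalingGroupName:  aws.String("nodes.example.cluster"), // placeholder
		LifecycleHookName:     aws.String("drain-on-terminate"),    // placeholder
		InstanceId:            aws.String("i-0123456789abcdef0"),   // placeholder
		LifecycleActionResult: aws.String("CONTINUE"),
	})
	if err != nil {
		log.Fatalf("completing lifecycle action: %v", err)
	}
}
```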

Our proposed solution would look something like this:
Add an awsAsgLifecycle property to InstanceGroup where one can set the Lifecycle hook properties (name, transition, default result, heartbeat timeout, notification ARN, role ARN).

When kops drains a node and the InstanceGroup has awsAsgLifecycle set, it should send the COMPLETED signal once the drain is done.
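
A rough sketch of what the proposed field might look like on an InstanceGroup (entirely hypothetical; none of these field names exist in kops today):

```yaml
# Hypothetical shape for the proposed awsAsgLifecycle field; the ARNs, names,
# and the field layout itself are placeholders, not an existing kops API.
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  name: nodes
spec:
  machineType: m5.large
  awsAsgLifecycle:
  - name: drain-on-terminate
    lifecycleTransition: autoscaling:EC2_INSTANCE_TERMINATING
    defaultResult: CONTINUE
    heartbeatTimeoutSeconds: 300
    notificationTargetARN: arn:aws:sns:us-east-1:123456789012:asg-events
    roleARN: arn:aws:iam::123456789012:role/asg-lifecycle-notify
```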

If this sounds OK, I can do the implementation (though I will probably need some guidance, since I'm not that familiar with the kops codebase).

@paalkr

paalkr commented Mar 9, 2020

Yes, please!

@mikesplain
Contributor

Out of curiosity, is there a reason you chose to go this way and not use something like https://github.com/pusher/k8s-spot-termination-handler?

@paalkr

paalkr commented Mar 9, 2020

I'm actually working with @andersosthus on this one, so I'll answer. Yes, there is a good reason why we don't use the component you linked to: it only supports Spot termination, and we need a solution that also works for ASG termination. We do NOT use the common Cluster Autoscaler because we have memory-bound and spiky workloads. Instead, we export utilization metrics from Prometheus to CloudWatch and then use a combination of TargetTracking and StepScaling to adjust the Autoscaling Groups' desired counts.
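
For illustration only, a minimal sketch (assuming the aws-sdk-go CloudWatch client) of pushing such a utilization metric to CloudWatch for the scaling policies to act on; the namespace, metric, and dimension names are made up, and the Prometheus side of the pipeline is not shown:

```go
// Hypothetical sketch: publish a custom utilization metric to CloudWatch
// so TargetTracking/StepScaling policies can scale the ASG on it.
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatch"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := cloudwatch.New(sess)

	// In the real pipeline this value would come from a Prometheus query.
	_, err := svc.PutMetricData(&cloudwatch.PutMetricDataInput{
		Namespace: aws.String("Custom/Kubernetes"), // placeholder namespace
		MetricData: []*cloudwatch.MetricDatum{{
			MetricName: aws.String("NodeMemoryUtilization"), // placeholder
			Value:      aws.Float64(87.5),
			Unit:       aws.String("Percent"),
			Dimensions: []*cloudwatch.Dimension{{
				Name:  aws.String("AutoScalingGroupName"),
				Value: aws.String("nodes.example.cluster"), // placeholder
			}},
		}},
	})
	if err != nil {
		log.Fatalf("publishing metric: %v", err)
	}
}
```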

The solution proposed by @andersosthus would also allow attaching custom notification ARNs, such as a Lambda function, to perform arbitrary actions when machines are added to or removed from an Autoscaling group.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 7, 2020
@olemarkus
Member

This sounds good to me

@rifelpet
Member

/remove-lifecycle stale
/kind feature

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 18, 2020
@rdrgmnzs
Contributor

rdrgmnzs commented Sep 2, 2020

@andersosthus are you still interested in implementing this? If not I'm going to take a shot at it.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 1, 2020
@olemarkus
Member

/remove-lifecycle stale

Anyone had any luck with this issue?

This is to some extent related to #7119

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 1, 2020
@paalkr

paalkr commented Dec 2, 2020

Currently we handle this using a custom systemd unit in the instance group. But the use case will no longer apply once we get the NTH (aws-node-termination-handler) implementation in place.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 2, 2021
@bigman3

bigman3 commented Mar 15, 2021

Currently we handle this using a custom systemd unit in the instance group. But the use case will no longer apply once we get the NTH (aws-node-termination-handler) implementation in place.

@paalkr can you share how the systemd unit should look? I am trying to replicate it with increased timeouts (TimeoutStopSec, TimeoutSec), but I'm hitting a two-minute limit after which the unit is stopped ungracefully.
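
For reference, a unit of roughly this shape is one common way to wire up a drain-on-shutdown hook; this is a guess at the general pattern, not the actual unit used above, and all paths and names are placeholders:

```ini
# Illustrative sketch only; drain-and-complete.sh is a hypothetical script
# that drains the node, then calls `aws autoscaling complete-lifecycle-action`.
[Unit]
Description=Drain node on shutdown, then complete the ASG lifecycle action
# Units stop in reverse start order, so this stops before kubelet does.
After=kubelet.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/true
# Runs when the unit is stopped at system shutdown.
ExecStop=/usr/local/bin/drain-and-complete.sh
# Raise the per-unit stop timeout above systemd's 90-second default.
TimeoutStopSec=600

[Install]
WantedBy=multi-user.target
```

Also note that on a Spot interruption EC2 only gives about a two-minute warning before reclaiming the instance, which may be the two-minute wall you are hitting regardless of systemd timeouts.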

@bigman3

bigman3 commented Mar 31, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 31, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 29, 2021
@olemarkus
Member

With the support for NTH in SQS mode, are the use cases covered?
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 29, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 27, 2021
@olemarkus
Member

Closing it is
/close

@k8s-ci-robot
Contributor

@olemarkus: Closing this issue.

In response to this:

Closing it is
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
