Add support for defining/using AWS ASG Lifecycle Hooks #8708

Closed
andersosthus opened this issue Mar 9, 2020 · 18 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@andersosthus
Contributor

We would like support for defining AWS ASG Lifecycle hooks on an InstanceGroup.

Use case:
We use AWS ASG Lifecycle hooks together with a custom script that watches for AWS Spot terminations. When a termination is detected, the script drains the node and then sends a COMPLETED signal to the Lifecycle hook, which allows the instance to be terminated.
Having this Lifecycle hook in place effectively "breaks" kops rolling-update cluster, since the instance won't be terminated until either the COMPLETED signal is sent or the hook reaches its timeout.
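
For context, a minimal sketch of the completion call such a script ends up making, assuming the aws-sdk-go client; the hook, group, and instance names below are placeholders. Note that the AWS API expresses the "COMPLETED" signal as a CompleteLifecycleAction call whose result is CONTINUE or ABANDON:

```go
// Hypothetical sketch of the completion signal a spot-termination watcher
// sends after draining the node. All names below are placeholders.
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := autoscaling.New(sess)

	// After the drain finishes, tell the ASG it may terminate the instance.
	_, err := svc.CompleteLifecycleAction(&autoscaling.CompleteLifecycleActionInput{
		AutoScalingGroupName:  aws.String("nodes.example.cluster"), // placeholder
		LifecycleHookName:     aws.String("drain-on-terminate"),    // placeholder
		InstanceId:            aws.String("i-0123456789abcdef0"),   // placeholder
		LifecycleActionResult: aws.String("CONTINUE"),
	})
	if err != nil {
		log.Fatalf("completing lifecycle action: %v", err)
	}
}
```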

Our proposed solution would look something like this:
Add an awsAsgLifecycle property to InstanceGroup where one can set the Lifecycle hook properties (name, transition, default result, heartbeat timeout, notification ARN, role ARN).

When kops drains a node and the InstanceGroup has awsAsgLifecycle set, it should send the COMPLETED signal once the drain is done.
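
A rough sketch of what the proposed field might look like on an InstanceGroup (entirely hypothetical; none of these field names exist in kops today):

```yaml
# Hypothetical shape for the proposed awsAsgLifecycle field; the ARNs, names,
# and the field layout itself are placeholders, not an existing kops API.
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  name: nodes
spec:
  machineType: m5.large
  awsAsgLifecycle:
  - name: drain-on-terminate
    lifecycleTransition: autoscaling:EC2_INSTANCE_TERMINATING
    defaultResult: CONTINUE
    heartbeatTimeoutSeconds: 300
    notificationTargetARN: arn:aws:sns:us-east-1:123456789012:asg-events
    roleARN: arn:aws:iam::123456789012:role/asg-lifecycle-notify
```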

If this sounds OK, I can do the implementation (though I will probably need some guidance, since I'm not that familiar with the kops codebase).

@paalkr

paalkr commented Mar 9, 2020

Yes, please!

@mikesplain
Contributor

Out of curiosity, is there a reason you chose to go this way and not use something like https://github.com/pusher/k8s-spot-termination-handler?

@paalkr

paalkr commented Mar 9, 2020

I'm actually working with @andersosthus on this one, so I'll answer. Yes, there is a good reason why we don't use the component you linked to: it only supports Spot termination, and we need a solution that also works for ASG termination. We do NOT use the common Cluster Autoscaler because we have memory-bound and spiky workloads. Instead, we export utilization metrics from Prometheus to CloudWatch and then use a combination of TargetTracking and StepScaling to adjust the Autoscaling Groups' desired counts.
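
For illustration only, a minimal sketch (assuming the aws-sdk-go CloudWatch client) of pushing such a utilization metric to CloudWatch for the scaling policies to act on; the namespace, metric, and dimension names are made up, and the Prometheus side of the pipeline is not shown:

```go
// Hypothetical sketch: publish a custom utilization metric to CloudWatch
// so TargetTracking/StepScaling policies can scale the ASG on it.
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatch"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := cloudwatch.New(sess)

	// In the real pipeline this value would come from a Prometheus query.
	_, err := svc.PutMetricData(&cloudwatch.PutMetricDataInput{
		Namespace: aws.String("Custom/Kubernetes"), // placeholder namespace
		MetricData: []*cloudwatch.MetricDatum{{
			MetricName: aws.String("NodeMemoryUtilization"), // placeholder
			Value:      aws.Float64(87.5),
			Unit:       aws.String("Percent"),
			Dimensions: []*cloudwatch.Dimension{{
				Name:  aws.String("AutoScalingGroupName"),
				Value: aws.String("nodes.example.cluster"), // placeholder
			}},
		}},
	})
	if err != nil {
		log.Fatalf("publishing metric: %v", err)
	}
}
```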

The solution proposed by @andersosthus would also allow attaching custom notification ARNs, such as a Lambda function, to perform arbitrary actions when machines are added to or removed from an Autoscaling group.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 7, 2020
@olemarkus
Member

This sounds good to me

@rifelpet
Member

/remove-lifecycle stale
/kind feature

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 18, 2020
@rdrgmnzs
Contributor

rdrgmnzs commented Sep 2, 2020

@andersosthus are you still interested in implementing this? If not I'm going to take a shot at it.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 1, 2020
@olemarkus
Member

/remove-lifecycle stale

Anyone had any luck with this issue?

This is to some extent related to #7119

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 1, 2020
@paalkr

paalkr commented Dec 2, 2020

Currently we handle this using a custom systemd unit in the instance group. But the use case will no longer apply once we get the NTH (aws-node-termination-handler) implementation in place.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 2, 2021
@bigman3

bigman3 commented Mar 15, 2021

Currently we handle this using a custom systemd unit in the instance group. But the use case will no longer apply once we get the NTH (aws-node-termination-handler) implementation in place.

@paalkr can you share how the systemd unit should look? I am trying to replicate it with increased timeouts (TimeoutStopSec, TimeoutSec), but I'm hitting a two-minute limit after which the unit is stopped ungracefully.
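
For reference, a unit of roughly this shape is one common way to wire up a drain-on-shutdown hook; this is a guess at the general pattern, not the actual unit used above, and all paths and names are placeholders:

```ini
# Illustrative sketch only; drain-and-complete.sh is a hypothetical script
# that drains the node, then calls `aws autoscaling complete-lifecycle-action`.
[Unit]
Description=Drain node on shutdown, then complete the ASG lifecycle action
# Units stop in reverse start order, so this stops before kubelet does.
After=kubelet.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/true
# Runs when the unit is stopped at system shutdown.
ExecStop=/usr/local/bin/drain-and-complete.sh
# Raise the per-unit stop timeout above systemd's 90-second default.
TimeoutStopSec=600

[Install]
WantedBy=multi-user.target
```

Also note that on a Spot interruption EC2 only gives about a two-minute warning before reclaiming the instance, which may be the two-minute wall you are hitting regardless of systemd timeouts.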

@bigman3

bigman3 commented Mar 31, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 31, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 29, 2021
@olemarkus
Member

With the support for NTH in SQS mode, are the use cases covered?
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 29, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 27, 2021
@olemarkus
Member

Closing it is
/close

@k8s-ci-robot
Contributor

@olemarkus: Closing this issue.

In response to this:

Closing it is
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
