
Provide ability on when to actually rotate expired nodes #903

Closed
mikesir87 opened this issue Dec 3, 2021 · 6 comments
Labels
feature New feature or request termination Issues related to node termination

Comments

@mikesir87
Contributor

Tell us about your request
I'd love the ability to configure, on a Provisioner, when expiration/rotation of nodes should actually occur. The idea would be to separate when a node is marked as expired from when it is actually rotated (defaulting to rotation as soon as the node expires).
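To make the request concrete, here is a purely hypothetical sketch of what such a Provisioner might look like. `ttlSecondsUntilExpired` was a real `v1alpha5` field at the time; the `rotationWindow` block and its fields are invented for illustration and did not exist in Karpenter:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: sticky-apps
spec:
  # Real field: mark nodes as expired 24h after creation.
  ttlSecondsUntilExpired: 86400
  # Hypothetical field: only rotate already-expired nodes inside this window.
  rotationWindow:
    schedule: "0 3 * * *"   # 03:00 daily, when traffic is lowest
    duration: 2h
```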

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
Many of the applications I support are not able to run in HA modes and require sticky sessions. But, most of those systems currently restart once a day, early in the morning, when they have very little traffic. While the app team loves the idea of automatically rotating nodes to ensure their nodes are running the latest AMIs, they worry the expiration might happen during a bad time and impact users.

Are you currently working around this issue?
Currently, we're planning to not use expiration on those nodes but then use our own job that runs during the maintenance window and deletes nodes, effectively managing our own expiration time.
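The described workaround could be sketched roughly as the CronJob below. All names here are illustrative (the issue does not show the actual job); it assumes a `node-rotator` ServiceAccount with RBAC permission to delete nodes, and relies on Karpenter draining and replacing deleted nodes:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: rotate-sticky-nodes
spec:
  schedule: "0 3 * * *"   # run during the low-traffic maintenance window
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: node-rotator   # assumed RBAC to delete nodes
          restartPolicy: Never
          containers:
          - name: rotate
            image: bitnami/kubectl
            command:
            - /bin/sh
            - -c
            # Delete this provisioner's nodes; Karpenter cordons, drains,
            # and provisions fresh replacements on new AMIs.
            - kubectl delete node -l karpenter.sh/provisioner-name=sticky-apps
```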

Additional context
Not that I can think of currently.

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@mikesir87 mikesir87 added the feature New feature or request label Dec 3, 2021
@ellistarn
Contributor

Hey @mikesir87 thanks for writing this up. I'm definitely interested in supporting some sort of "maintenance window" for scaling down. We've talked in the team about the idea of a NodeDisruptionBudget that would work similarly to a PDB, but we could add additional fields like maintenance windows.

For your use case, there may be an easy alternative. Karpenter respects Pod Disruption Budgets as well as `terminationGracePeriodSeconds`. When node termination is triggered, the node is cordoned and its pods are evicted. PDBs prevent eviction, and Karpenter will never force-terminate a pod, instead delegating to `terminationGracePeriodSeconds`. Do you think these existing mechanisms are suitable for your use case?
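For reference, a minimal PodDisruptionBudget of the kind mentioned here might look like the following (the app name and label are illustrative). While fewer than one replica matching the selector is available, eviction — including eviction driven by Karpenter's node termination — is blocked:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: sticky-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: sticky-app
```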

I'd be happy to discuss either of these ideas further, either here or in the working group: https://github.com/aws/karpenter/blob/main/WORKING_GROUP.md

@mikesir87
Contributor Author

@ellistarn - Good to know about those options, but I don't think they'll work for this use case, since we may not always know what value the grace period should have. Imagine a node that expires at noon but that we want to rotate at 3am, while a scale-up event adds another node that will expire at 4pm. Because these windows slide over time, anything second-based will drift and doesn't guarantee the nodes rotate in the correct window (without something else monitoring the values and adjusting them).
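The drift argument above can be illustrated with a small sketch (times and the 03:00 window are taken from the example in this thread): nodes expiring at different times need *different* grace periods to land in the same fixed window, so no single second-based value works for all of them.

```python
from datetime import datetime, timedelta

WINDOW_START_HOUR = 3  # maintenance window opens at 03:00


def seconds_until_window(expiry: datetime) -> float:
    """Seconds from a node's expiry until the next 03:00 window opens."""
    window = expiry.replace(
        hour=WINDOW_START_HOUR, minute=0, second=0, microsecond=0
    )
    if window <= expiry:
        window += timedelta(days=1)  # window already passed today
    return (window - expiry).total_seconds()


# Node expiring at noon needs a 15h grace period to reach 03:00 ...
noon = datetime(2021, 12, 3, 12, 0)
print(seconds_until_window(noon))     # 54000.0 (15h)

# ... while one expiring at 16:00 needs only 11h.
four_pm = datetime(2021, 12, 3, 16, 0)
print(seconds_until_window(four_pm))  # 39600.0 (11h)
```

Since the required value depends on each node's expiry time, a single static grace period cannot pin rotation to the window without an external controller continuously recomputing it.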

And thanks for pointing me to the working group. I'll have to swing by and say hello to the team! 😄

@ellistarn
Contributor

Yeah this makes sense to me. What I'm hearing is that while it would be nice to specify everything at the pod level, it's important for the ops team to be able to protect dev teams with broadly applied policy. Thoughts on the NodeDisruptionBudget approach?
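A purely hypothetical shape for the NodeDisruptionBudget idea floated here — no such resource existed in Karpenter, and every field below is invented for illustration. The intent is a PDB-like, ops-owned policy that bounds concurrent node disruption and restricts it to maintenance windows:

```yaml
apiVersion: karpenter.sh/v1alpha1   # hypothetical
kind: NodeDisruptionBudget          # hypothetical
metadata:
  name: overnight-only
spec:
  selector:
    matchLabels:
      karpenter.sh/provisioner-name: sticky-apps
  maxUnavailable: 1                 # PDB-style bound on disrupted nodes
  maintenanceWindows:
  - schedule: "0 3 * * *"           # disruption allowed from 03:00
    duration: 2h
```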

@mikesir87
Contributor Author

Yeah... if we (as a platform/ops team) are going to automatically expire and rotate nodes, we want to make sure it doesn't occur during a time that affects our customers.

I'll have to see what a NodeDisruptionBudget might look like, but the hypothetical sounds reasonable so far.

@akestner akestner added the termination Issues related to node termination label Dec 6, 2021
@ellistarn
Contributor

Closing in favor of #1738

@njtran
Contributor

njtran commented Dec 14, 2023

solved with kubernetes-sigs/karpenter#849
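For readers landing here later: the linked change added disruption budgets to the NodePool API (the v1beta1+ successor to the Provisioner). Roughly — exact field names and semantics are best checked against the current Karpenter docs — a `nodes: "0"` budget active outside the maintenance window blocks voluntary disruption except during that window:

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: sticky-apps
spec:
  disruption:
    expireAfter: 24h
    budgets:
    # Allow no voluntary disruption for the 22h starting at 05:00,
    # leaving only the 03:00-05:00 window open for rotation.
    - nodes: "0"
      schedule: "0 5 * * *"
      duration: 22h
```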
