
Provide ability on when to actually rotate expired nodes #903

Closed
mikesir87 opened this issue Dec 3, 2021 · 6 comments
Labels
feature New feature or request termination Issues related to node termination

Comments

@mikesir87
Contributor

Tell us about your request
I'd love the ability to configure, on a Provisioner, when expiration/rotation of nodes should actually occur. The idea would be to separate when a node is marked as expired from when it is actually rotated (defaulting to rotation as soon as the node expires).
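To make the request concrete, here is a purely hypothetical sketch of what such a Provisioner might look like. `ttlSecondsUntilExpired` was a real `v1alpha5` field at the time; the `rotationWindow` block and its fields are invented for illustration and did not exist in Karpenter:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: sticky-apps
spec:
  # Real field: mark nodes as expired 24h after creation.
  ttlSecondsUntilExpired: 86400
  # Hypothetical field: only rotate already-expired nodes inside this window.
  rotationWindow:
    schedule: "0 3 * * *"   # 03:00 daily, when traffic is lowest
    duration: 2h
```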

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
Many of the applications I support are not able to run in HA modes and require sticky sessions. But, most of those systems currently restart once a day, early in the morning, when they have very little traffic. While the app team loves the idea of automatically rotating nodes to ensure their nodes are running the latest AMIs, they worry the expiration might happen during a bad time and impact users.

Are you currently working around this issue?
Currently, we're planning to not use expiration on those nodes but then use our own job that runs during the maintenance window and deletes nodes, effectively managing our own expiration time.
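The described workaround could be sketched roughly as the CronJob below. All names here are illustrative (the issue does not show the actual job); it assumes a `node-rotator` ServiceAccount with RBAC permission to delete nodes, and relies on Karpenter draining and replacing deleted nodes:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: rotate-sticky-nodes
spec:
  schedule: "0 3 * * *"   # run during the low-traffic maintenance window
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: node-rotator   # assumed RBAC to delete nodes
          restartPolicy: Never
          containers:
          - name: rotate
            image: bitnami/kubectl
            command:
            - /bin/sh
            - -c
            # Delete this provisioner's nodes; Karpenter cordons, drains,
            # and provisions fresh replacements on new AMIs.
            - kubectl delete node -l karpenter.sh/provisioner-name=sticky-apps
```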

Additional context
Not that I can think of currently.

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@mikesir87 mikesir87 added the feature New feature or request label Dec 3, 2021
@ellistarn
Contributor

Hey @mikesir87 thanks for writing this up. I'm definitely interested in supporting some sort of "maintenance window" for scaling down. We've talked in the team about the idea of a NodeDisruptionBudget that would work similarly to a PDB, but we could add additional fields like maintenance windows.

For your use case, there may be an easy alternative. Karpenter respects Pod Disruption Budgets as well as `terminationGracePeriodSeconds`. When node termination is triggered, the node is cordoned and its pods are evicted. PDBs prevent eviction, and Karpenter will never force-terminate a pod, instead delegating to `terminationGracePeriodSeconds`. Do you think these existing mechanisms are suitable for your use case?
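For reference, a minimal PodDisruptionBudget of the kind mentioned here might look like the following (the app name and label are illustrative). While fewer than one replica matching the selector is available, eviction — including eviction driven by Karpenter's node termination — is blocked:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: sticky-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: sticky-app
```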

I'd be happy to discuss either of these ideas further, either here or in the working group: https://github.com/aws/karpenter/blob/main/WORKING_GROUP.md

@mikesir87
Contributor Author

@ellistarn - Good to know about those options, but I don't think they'll work for this use case, since we may not always know what value the grace period should have. Imagine a node that expires at noon but that we want to rotate at 3am, while a scale-up event adds another node that will expire at 4pm. Because these windows slide over time, anything second-based will drift and doesn't guarantee the nodes rotate in the correct window (without something else monitoring the values and adjusting them).
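The drift argument above can be illustrated with a small sketch (times and the 03:00 window are taken from the example in this thread): nodes expiring at different times need *different* grace periods to land in the same fixed window, so no single second-based value works for all of them.

```python
from datetime import datetime, timedelta

WINDOW_START_HOUR = 3  # maintenance window opens at 03:00


def seconds_until_window(expiry: datetime) -> float:
    """Seconds from a node's expiry until the next 03:00 window opens."""
    window = expiry.replace(
        hour=WINDOW_START_HOUR, minute=0, second=0, microsecond=0
    )
    if window <= expiry:
        window += timedelta(days=1)  # window already passed today
    return (window - expiry).total_seconds()


# Node expiring at noon needs a 15h grace period to reach 03:00 ...
noon = datetime(2021, 12, 3, 12, 0)
print(seconds_until_window(noon))     # 54000.0 (15h)

# ... while one expiring at 16:00 needs only 11h.
four_pm = datetime(2021, 12, 3, 16, 0)
print(seconds_until_window(four_pm))  # 39600.0 (11h)
```

Since the required value depends on each node's expiry time, a single static grace period cannot pin rotation to the window without an external controller continuously recomputing it.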

And thanks for pointing me to the working group. I'll have to swing by and say hello to the team! 😄

@ellistarn
Contributor

Yeah this makes sense to me. What I'm hearing is that while it would be nice to specify everything at the pod level, it's important for the ops team to be able to protect dev teams with broadly applied policy. Thoughts on the NodeDisruptionBudget approach?
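A purely hypothetical shape for the NodeDisruptionBudget idea floated here — no such resource existed in Karpenter, and every field below is invented for illustration. The intent is a PDB-like, ops-owned policy that bounds concurrent node disruption and restricts it to maintenance windows:

```yaml
apiVersion: karpenter.sh/v1alpha1   # hypothetical
kind: NodeDisruptionBudget          # hypothetical
metadata:
  name: overnight-only
spec:
  selector:
    matchLabels:
      karpenter.sh/provisioner-name: sticky-apps
  maxUnavailable: 1                 # PDB-style bound on disrupted nodes
  maintenanceWindows:
  - schedule: "0 3 * * *"           # disruption allowed from 03:00
    duration: 2h
```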

@mikesir87
Contributor Author

Yeah... if we (as a platform/ops team) are going to automatically expire and rotate nodes, we want to make sure it doesn't occur during a time that affects our customers.

I'll have to see what a NodeDisruptionBudget might look like, but the hypothetical sounds reasonable so far.

@akestner akestner added the termination Issues related to node termination label Dec 6, 2021
@ellistarn
Contributor

Closing in favor of #1738

@njtran
Contributor

njtran commented Dec 14, 2023

solved with kubernetes-sigs/karpenter#849
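For readers landing here later: the linked change added disruption budgets to the NodePool API (the v1beta1+ successor to the Provisioner). Roughly — exact field names and semantics are best checked against the current Karpenter docs — a `nodes: "0"` budget active outside the maintenance window blocks voluntary disruption except during that window:

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: sticky-apps
spec:
  disruption:
    expireAfter: 24h
    budgets:
    # Allow no voluntary disruption for the 22h starting at 05:00,
    # leaving only the 03:00-05:00 window open for rotation.
    - nodes: "0"
      schedule: "0 5 * * *"
      duration: 22h
```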
