-
Notifications
You must be signed in to change notification settings - Fork 979
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace liveness and readiness with initialization node controller #1186
Conversation
✔️ Deploy Preview for karpenter-docs-prod canceled. 🔨 Explore the source changes: 382440e 🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/61e8a48a02c93b00075484ef |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
} | ||
|
||
if !node.IsReady(n) { | ||
if age := injectabletime.Now().Sub(n.GetCreationTimestamp().Time); age < InitializationTimeout { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could shorten with:
if age := injectabletime.Since(n.GetCreationTimestamp().Time); age < InitializationTimeout {
if age := injectabletime.Now().Sub(n.GetCreationTimestamp().Time); age < InitializationTimeout { | ||
return reconcile.Result{RequeueAfter: InitializationTimeout - age}, nil | ||
} | ||
logging.FromContext(ctx).Infof("Triggering termination for node that failed to become ready") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be helpful to log the node name here, wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is included in our logger. All reconcilers include resource name in their logger at the top level. It helps avoid redundant statements.
1. Issue, if available:
#1135
2. Description of changes:
Earlier we would only delete worker nodes iff the kubelet would have a status of
NodeStatusNeverUpdated
for more than 15 minutes. If this were the case, it meant the Kubelet never connected to the API Server at all.We now delete worker nodes iff the kubelet is in NotReady status for more than 15 minutes after startup. The mechanism we use to discover that a node has only been started is by looking at a taint we apply during node object creation. Once the taint has been removed, we declare the startup as complete and no longer re-evaluate the node for deletion. We will re-introduce something like a liveness controller in the future if necessary but we want to be very careful with that to maintain static stability of the cluster.
3. How was this change tested?
Case 1 - I removed the CNI policy from the Node Role. Nodes were stuck in NotReady and I can see them being terminated after 15 minutes.
Case 2 - I'm still looking to replicate the case, where a node does come up as healthy, but then goes into NotReady. I'm probably going to try a network partition to replicate this scenario but this is WIP.
4. Does this change impact docs?
I don't think this impact our docs but I'm happy to add a callout somewhere if needed.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.