Fixed a bug with liveness logic #614
Conversation
✔️ Deploy Preview for karpenter-docs-prod ready!
🔨 Explore the source changes: 186d75d
🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/611edd896dca65000825f13e
😎 Browse the preview: https://deploy-preview-614--karpenter-docs-prod.netlify.app
…o the node controller writing a status
pkg/controllers/node/suite_test.go
Outdated
ExpectReconcileSucceeded(ctx, controller, client.ObjectKeyFromObject(provisioner))

// Expect n not deleted
Can you make n -> node here
pkg/controllers/node/suite_test.go
Outdated
Expect(n.DeletionTimestamp.IsZero()).To(BeTrue())

// Delete pod and do another reconcile
Expect(env.Client.Delete(ctx, pod)).To(Succeed())
We shouldn’t need to create and delete a pod for this test to work; it’s sufficient to have the node be Ready=Unknown, which won’t trigger underutilization.
FWIW, I ported these tests from your liveness logic tests, which at some point I think relied on pods existing.
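For illustration, here is a minimal sketch of the test shape the reviewer suggests, reusing the suite fixtures already visible in this diff (ctx, env, controller, provisioner, ExpectReconcileSucceeded). The test.Node constructor, the ReadyStatus option, and the ExpectCreated/ExpectNodeExists helpers are assumptions about the suite's helpers, not confirmed APIs:

```go
It("should not terminate nodes that connected once but are now Ready=Unknown", func() {
	// Assumed helper: builds a node with a Ready condition of Unknown,
	// i.e. it reported status at some point and then lost contact.
	node := test.Node(test.NodeOptions{
		ReadyStatus: v1.ConditionUnknown,
	})
	ExpectCreated(env.Client, node) // assumed suite helper

	// No pod creation/deletion needed: Ready=Unknown alone should not
	// trigger underutilization or liveness cleanup.
	ExpectReconcileSucceeded(ctx, controller, client.ObjectKeyFromObject(provisioner))

	// Expect node not to be deleted
	node = ExpectNodeExists(env.Client, node.Name) // assumed suite helper
	Expect(node.DeletionTimestamp.IsZero()).To(BeTrue())
})
```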
pkg/controllers/node/suite_test.go
Outdated
ExpectReconcileSucceeded(ctx, controller, client.ObjectKeyFromObject(provisioner))

- // Expect node not deleted
+ // Expect node not be deleted
"not to be deleted" (for what it's worth, that's pretty much what the code says on line 220
Co-authored-by: Jonathan Innis <[email protected]>
Issue, if available:
#613, #601
Description of changes:
Fixes an issue where nodes don't terminate if they can't connect to the API Server, resulting in runaway scaling.
The liveness controller will terminate nodes that never become ready. This is critical because, by default, pods only survive on a NotReady node for 5 minutes (the default not-ready toleration) before being evicted. The controller must differentiate between a node that never connected (clean it up) and a node that connected but later lost its connection (e.g. a network partition; don't clean it up).
The previous code assumed that the timestamp would be nil. There is a note on this here: https://github.com/kubernetes/kubernetes/blob/release-1.17/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L1012. The nodelifecycle controller writes status here: https://github.com/kubernetes/kubernetes/blob/release-1.17/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L1131.
This change should catch both the case where the node controller never writes anything and the case where it writes NodeStatusNeverUpdated, which is the expected behavior. We will continue to take action only after the liveness timeout (5 minutes).
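For readers following along, a minimal sketch of the check described above, assuming the distinction is made by inspecting the node's Ready condition; this is illustrative, not the exact code in this PR (the helper name is made up, while the "NodeStatusNeverUpdated" reason is what the upstream nodelifecycle controller writes when a node registers but never reports status):

```go
package liveness

import (
	v1 "k8s.io/api/core/v1"
)

// neverConnected reports whether the node appears to have never successfully
// reported status to the API server, as opposed to having connected once and
// then lost contact (e.g. a network partition), which must not be cleaned up.
func neverConnected(node *v1.Node) bool {
	for _, condition := range node.Status.Conditions {
		if condition.Type != v1.NodeReady {
			continue
		}
		// The nodelifecycle controller sets this reason when the kubelet
		// has never written a status for the node.
		if condition.Reason == "NodeStatusNeverUpdated" {
			return true
		}
		// A zero heartbeat also means the node controller never wrote anything.
		return condition.LastHeartbeatTime.IsZero()
	}
	// No Ready condition at all: nothing has ever been written.
	return true
}
```

Either way, cleanup would still only happen once the liveness timeout (5 minutes) has elapsed.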
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.