
Fixed a bug with liveness logic #614

Merged — 3 commits into aws:main from notready on Aug 19, 2021
Conversation

ellistarn
Contributor

@ellistarn ellistarn commented Aug 13, 2021

Issue, if available:
#613, #601

Description of changes:
Fixes an issue where nodes don't terminate if they can't connect to the API Server, resulting in runaway scaling.

The liveness controller terminates nodes that never become ready. This is critical because, by default, pods only tolerate a NotReady node for 5 minutes (via the node.kubernetes.io/not-ready toleration) before being evicted. The controller must differentiate between a node that never connected to the API Server (clean up) and a node that connected but then lost its connection, e.g. due to a network partition (don't clean up).

The previous code assumed that the timestamp would be nil. There is a note on this here: https://github.com/kubernetes/kubernetes/blob/release-1.17/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L1012. The nodelifecycle controller writes status here: https://github.com/kubernetes/kubernetes/blob/release-1.17/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L1131.

This change should catch both the case where the node controller never writes anything and the case where it writes NodeStatusNeverUpdated, which is the expected behavior.

We will continue to take action only after the liveness timeout (5 minutes).

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@netlify

netlify bot commented Aug 13, 2021

✔️ Deploy Preview for karpenter-docs-prod ready!

🔨 Explore the source changes: 186d75d

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/611edd896dca65000825f13e

😎 Browse the preview: https://deploy-preview-614--karpenter-docs-prod.netlify.app

@ellistarn ellistarn changed the title Fixed a bug with liveness logic where nodes would not terminate due t… Fixed a bug with liveness logic Aug 13, 2021

ExpectReconcileSucceeded(ctx, controller, client.ObjectKeyFromObject(provisioner))

// Expect n not deleted
Contributor


Can you rename n -> node here?

Expect(n.DeletionTimestamp.IsZero()).To(BeTrue())

// Delete pod and do another reconcile
Expect(env.Client.Delete(ctx, pod)).To(Succeed())
Contributor


We shouldn't need to make and delete a pod for this test to work; it's sufficient for the node to be ready-unknown, which won't trigger underutilization.

Contributor Author


FWIW, I ported these tests from your liveness logic tests, which at some point I believe relied on pods existing.

@JacobGabrielson JacobGabrielson self-requested a review August 16, 2021 01:48
ExpectReconcileSucceeded(ctx, controller, client.ObjectKeyFromObject(provisioner))

// Expect node not deleted
// Expect node not be deleted
Contributor


"not to be deleted" (for what it's worth, that's pretty much what the code says on line 220)

@ellistarn ellistarn merged commit cb274e9 into aws:main Aug 19, 2021
@ellistarn ellistarn deleted the notready branch August 19, 2021 23:25
gfcroft pushed a commit to gfcroft/karpenter-provider-aws that referenced this pull request Nov 25, 2023