
Fixed a bug with liveness logic #614

Merged — 3 commits into aws:main from notready on Aug 19, 2021
Conversation

ellistarn
Contributor

@ellistarn ellistarn commented Aug 13, 2021

Issue, if available:
#613, #601

Description of changes:
Fixes an issue where nodes don't terminate if they can't connect to the API Server, resulting in runaway scaling.

The liveness controller terminates nodes that never become ready. This is critical because, by default, pods only tolerate a NotReady node for 5 minutes (via the node.kubernetes.io/not-ready toleration) before being evicted. The controller must differentiate between a node that never connected to the API Server (clean up) and a node that connected but then lost its connection, e.g. due to a network partition (don't clean up).

The previous code assumed that the timestamp would be nil. There is a note on this here: https://github.com/kubernetes/kubernetes/blob/release-1.17/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L1012. The nodelifecycle controller writes status here: https://github.com/kubernetes/kubernetes/blob/release-1.17/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L1131.

This change should catch both the case where the node controller never writes anything and the case where it writes NodeStatusNeverUpdated, which is the expected behavior.

We will continue to take action only after the liveness timeout (5 minutes).

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@netlify

netlify bot commented Aug 13, 2021

✔️ Deploy Preview for karpenter-docs-prod ready!

🔨 Explore the source changes: 186d75d

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/611edd896dca65000825f13e

😎 Browse the preview: https://deploy-preview-614--karpenter-docs-prod.netlify.app

@ellistarn ellistarn changed the title Fixed a bug with liveness logic where nodes would not terminate due t… Fixed a bug with liveness logic Aug 13, 2021

ExpectReconcileSucceeded(ctx, controller, client.ObjectKeyFromObject(provisioner))

// Expect n not deleted
Contributor


Can you rename n -> node here?

Expect(n.DeletionTimestamp.IsZero()).To(BeTrue())

// Delete pod and do another reconcile
Expect(env.Client.Delete(ctx, pod)).To(Succeed())
Contributor


We shouldn't need to make and delete a pod for this test to work; it's sufficient for the node to be ready-unknown, which won't trigger underutilization.

Contributor Author


FWIW, I ported these tests from your liveness logic tests, which at some point I believe relied on pods existing.

@JacobGabrielson JacobGabrielson self-requested a review August 16, 2021 01:48
ExpectReconcileSucceeded(ctx, controller, client.ObjectKeyFromObject(provisioner))

// Expect node not deleted
// Expect node not be deleted
Contributor


"not to be deleted" (for what it's worth, that's pretty much what the code says on line 220)

@ellistarn ellistarn merged commit cb274e9 into aws:main Aug 19, 2021
@ellistarn ellistarn deleted the notready branch August 19, 2021 23:25
gfcroft pushed a commit to gfcroft/karpenter-provider-aws that referenced this pull request Nov 25, 2023