Refactored expiration/utilization controller into node controller #594
Conversation
nice work. some comments on your comments
```go
				}
				return requests
			}),
		).
		WithOptions(controller.Options{MaxConcurrentReconciles: 10}).
```
I realize this was included in a prior PR, but why did we pick 10 here? Are there any detriments to bumping this number up?
We could consider setting this to kubeclient qps (default 20). I found that 10 was reasonably high parallelism without exhausting default qps. Perhaps we can evaluate this number as we do more scale testing.
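For context, here is a minimal sketch of the trade-off being discussed; the `Reconciler` type and `register` function are assumptions for illustration, not this PR's code. Keeping `MaxConcurrentReconciles` below the kube client's QPS budget leaves headroom for the API calls each reconcile makes.

```go
// Illustrative sketch only: relating reconcile concurrency to client QPS.
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
)

// Reconciler is a stand-in for the node controller.
type Reconciler struct{}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

func register(mgr ctrl.Manager, r *Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Node{}).
		// 10 parallel reconciles leaves headroom under a 20 QPS client budget,
		// since a single reconcile may issue more than one API call.
		WithOptions(controller.Options{MaxConcurrentReconciles: 10}).
		Complete(r)
}
```

Raising the concurrency without also raising the client QPS mainly moves the bottleneck to the client-side rate limiter, which is the concern raised above.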
pkg/controllers/node/suite_test.go (outdated)
```go
	node = ExpectNodeExists(env.Client, node.Name)
	Expect(node.DeletionTimestamp.IsZero()).To(BeTrue())
})
It("should terminate nodes after expiry", func() {
```
Do we have the ability to simulate time going by in these tests? I worry a bit that some code paths aren't quite exercised here, compared to setting a non-zero TTL (say 30), proving that at 15 seconds the node is not yet marked for termination, and that at 30+ seconds it actually is.
These tests are just copied over since it's a refactor, but I'll look into mocking time for these tests.
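One common way to make TTL paths like this testable is an injectable clock the test can advance deterministically. The sketch below is an assumption about how that could look, not necessarily what this PR ended up doing; the `Clock`, `realClock`, and `fakeClock` names are illustrative.

```go
// Minimal injectable clock sketch: the reconciler depends on Clock instead of
// calling time.Now() directly, so tests can step time forward.
package example

import (
	"sync"
	"time"
)

// Clock is the seam the reconciler would depend on.
type Clock interface {
	Now() time.Time
}

// realClock is what production code would use.
type realClock struct{}

func (realClock) Now() time.Time { return time.Now() }

// fakeClock lets a test advance time deterministically.
type fakeClock struct {
	mu  sync.Mutex
	now time.Time
}

func (f *fakeClock) Now() time.Time {
	f.mu.Lock()
	defer f.mu.Unlock()
	return f.now
}

// Step moves the fake clock forward by d.
func (f *fakeClock) Step(d time.Duration) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.now = f.now.Add(d)
}
```

With something like this, a test could set the TTL to 30, call `Step(15 * time.Second)` and assert the node is not yet marked for termination, then step past 30 seconds and assert that it is.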
done
```go
	}{
		c.readiness,
		c.liveness,
		c.expiration,
		c.emptiness,
```
If a node doesn't have a finalizer but will be expired for emptiness/expiration, we could choose to delete this node before we add the finalizer since the expiration and emptiness controllers send a delete request before we patch the node.
In addition, since we're running these in parallel, is it possible that the order in which we execute these controllers changes?
Semantically, I want the order of these to not matter. I know that in theory there are some edge cases (e.g., node comes online, is instantly ready, we can't reconcile, and the empty TTL is 0), but I think they're contrived enough that I'd much rather define an invariant where the order of these reconcilers doesn't matter.
I agree that these cases probably won't happen naturally. I also think our logging is in a good enough state that we would be able to debug these cases if they do happen. From there, I'd be happy to reconsider this approach if necessary.
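For illustration, a sketch of the order-independent subreconciler loop being described; the names and signatures are assumptions, not the PR's code. Each subreconciler acts only on the node's current state, and the controller merges their requeue requests, so shuffling the slice should not change the outcome.

```go
// Sketch of an order-independent subreconciler loop for the node controller.
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

type subreconciler interface {
	Reconcile(ctx context.Context, node *corev1.Node) (reconcile.Result, error)
}

func reconcileNode(ctx context.Context, node *corev1.Node, subs []subreconciler) (reconcile.Result, error) {
	var result reconcile.Result
	for _, s := range subs {
		res, err := s.Reconcile(ctx, node)
		if err != nil {
			return reconcile.Result{}, err
		}
		// Requeue at the soonest interval requested by any subreconciler.
		if res.RequeueAfter > 0 && (result.RequeueAfter == 0 || res.RequeueAfter < result.RequeueAfter) {
			result.RequeueAfter = res.RequeueAfter
		}
	}
	return result, nil
}
```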
```go
const LivenessTimeout = 5 * time.Minute

// Liveness is a subreconciler that deletes nodes if its determined to be unrecoverable
```
I think this comment might be a little misleading. From reading it, I would assume that a node that has joined but then entered an unrecoverable state would also be subject to the workings of this controller.
I think in theory, that's what this controller should do. I think we do want to implement something like "auto repair" in here.
I agree. Yet, while we don't have that functionality in place, keeping the comment in line with the actual state of the code makes more sense to me, since we don't know whether we want to or will implement "auto-repair".
That sounds good to me. I'd like to keep the comments roughly the same irrespective of the content.
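For readers following along, one plausible reading of the current behavior being discussed, that liveness only covers nodes which never become ready within LivenessTimeout, might look roughly like the sketch below. This is an assumption-laden illustration, not this PR's actual code.

```go
// Illustrative liveness check: delete a node that failed to become ready
// within LivenessTimeout. Names and structure are assumptions.
package example

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

const LivenessTimeout = 5 * time.Minute

func isReady(node *corev1.Node) bool {
	for _, c := range node.Status.Conditions {
		if c.Type == corev1.NodeReady && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

func reconcileLiveness(ctx context.Context, kube client.Client, node *corev1.Node) (reconcile.Result, error) {
	if isReady(node) {
		return reconcile.Result{}, nil
	}
	age := time.Since(node.CreationTimestamp.Time)
	if age < LivenessTimeout {
		// Check again once the timeout could have elapsed.
		return reconcile.Result{RequeueAfter: LivenessTimeout - age}, nil
	}
	// The node never became ready within the timeout; treat it as unrecoverable.
	if err := kube.Delete(ctx, node); err != nil {
		return reconcile.Result{}, err
	}
	return reconcile.Result{}, nil
}
```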
pkg/controllers/node/emptiness.go (outdated)
```go
	if !empty {
		if _, ok := n.Annotations[v1alpha3.EmptinessTimestampAnnotationKey]; ok {
			delete(n.Annotations, v1alpha3.EmptinessTimestampAnnotationKey)
			logging.FromContext(ctx).Infof("Removed emptiness TTL from node %s", n.Name)
		}
		return reconcile.Result{}, nil
	}
	// 3. Set TTL if not set
	n.Annotations = functional.UnionStringMaps(n.Annotations)
	ttl := time.Duration(ptr.Int64Value(provisioner.Spec.TTLSecondsAfterEmpty)) * time.Second
	emptinessTimestamp, ok := n.Annotations[v1alpha3.EmptinessTimestampAnnotationKey]
```
WDYT?
Suggested change:

```diff
-	if !empty {
-		if _, ok := n.Annotations[v1alpha3.EmptinessTimestampAnnotationKey]; ok {
+	emptinessTimestamp, ok := n.Annotations[v1alpha3.EmptinessTimestampAnnotationKey]
+	if !empty {
+		if ok {
 			delete(n.Annotations, v1alpha3.EmptinessTimestampAnnotationKey)
 			logging.FromContext(ctx).Infof("Removed emptiness TTL from node %s", n.Name)
 		}
 		return reconcile.Result{}, nil
 	}
 	// 3. Set TTL if not set
 	n.Annotations = functional.UnionStringMaps(n.Annotations)
 	ttl := time.Duration(ptr.Int64Value(provisioner.Spec.TTLSecondsAfterEmpty)) * time.Second
-	emptinessTimestamp, ok := n.Annotations[v1alpha3.EmptinessTimestampAnnotationKey]
```
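As a hedged illustration of how the annotation set above might later be consumed, a subsequent reconcile can compare the recorded emptiness timestamp against TTLSecondsAfterEmpty and delete the node once the TTL has elapsed. The annotation key constant and RFC3339 timestamp format below are assumptions for the sketch, not taken from this PR.

```go
// Illustrative follow-up: deciding whether an empty node's TTL has elapsed.
package example

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Assumed key name for the sketch; the real constant lives in the API package.
const emptinessTimestampAnnotation = "example.karpenter.sh/emptiness-timestamp"

func maybeTerminateEmptyNode(ctx context.Context, kube client.Client, node *corev1.Node, ttl time.Duration) error {
	raw, ok := node.Annotations[emptinessTimestampAnnotation]
	if !ok {
		return nil // TTL hasn't started yet
	}
	emptySince, err := time.Parse(time.RFC3339, raw)
	if err != nil {
		return fmt.Errorf("parsing emptiness timestamp, %w", err)
	}
	if time.Since(emptySince) < ttl {
		return nil // still within the TTL window
	}
	// TTL elapsed while the node stayed empty; request deletion.
	return kube.Delete(ctx, node)
}
```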
LGTM! Super nice work. Excited to see this work.
Issue, if available:
Description of changes:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.