
Implemented topology spread constraints for zone and hostname #619

Merged
4 commits merged into aws:main from the nodeaffinity branch on Aug 30, 2021

Conversation

@ellistarn (Contributor) commented on Aug 15, 2021

Issue, if available:
#481

Description of changes:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
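
For context, this is the kind of pod-level constraint the PR teaches Karpenter to honor. Below is a minimal sketch using the standard Kubernetes API types; the pod name, app label, and container image are illustrative and not taken from this PR:

package example

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// examplePod returns a pod that asks to be spread across zones and hostnames,
// the two topology keys this PR adds support for.
func examplePod() *v1.Pod {
	selector := &metav1.LabelSelector{MatchLabels: map[string]string{"app": "inflate"}}
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "inflate", Labels: map[string]string{"app": "inflate"}},
		Spec: v1.PodSpec{
			TopologySpreadConstraints: []v1.TopologySpreadConstraint{
				{MaxSkew: 1, TopologyKey: "topology.kubernetes.io/zone", WhenUnsatisfiable: v1.DoNotSchedule, LabelSelector: selector},
				{MaxSkew: 1, TopologyKey: "kubernetes.io/hostname", WhenUnsatisfiable: v1.DoNotSchedule, LabelSelector: selector},
			},
			Containers: []v1.Container{{Name: "app", Image: "registry.k8s.io/pause:3.9"}},
		},
	}
}

With MaxSkew: 1 and DoNotSchedule, the number of matching pods in any two zones (or on any two hosts) may differ by at most one, which is the skew the provisioner now has to respect when deciding where to create capacity.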

netlify bot commented on Aug 15, 2021

✔️ Deploy Preview for karpenter-docs-prod canceled.

🔨 Explore the source changes: 43216b8

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/612d198a40f8d60008313d63

@ellistarn force-pushed the nodeaffinity branch 3 times, most recently from ffec861 to cb564dd, on August 25, 2021 at 21:37
@ellistarn changed the title from "[WIP][Experimental] Implemented topology spread constraints for zone and hostname" to "Implemented topology spread constraints for zone and hostname" on Aug 25, 2021
@ellistarn force-pushed the nodeaffinity branch 2 times, most recently from cc85f59 to 6589255, on August 25, 2021 at 23:25
@ellistarn force-pushed the nodeaffinity branch 4 times, most recently from fe6f800 to 143a20f, on August 27, 2021 at 00:19
// getSchedules separates pods into a set of schedules. All pods in each group
// contain compatible scheduling constraints and can be deployed together on the
// same node, or multiple similar nodes if the pods exceed one node's capacity.
func (s *Scheduler) getSchedules(ctx context.Context, provisioner *v1alpha3.Provisioner, pods []*v1.Pod) ([]*Schedule, error) {
Contributor (Author):
Keep in mind, this code is untouched and a direct port of constraints.go (now removed)

}
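
As a rough illustration of the grouping described in the doc comment above (not the actual ported constraints.go logic), one can think of it as bucketing pods by a key derived from their scheduling requirements:

package scheduling

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// groupBySchedulingKey is a simplified sketch: pods whose node selectors are
// identical land in the same bucket and could share a node (or a set of
// similar nodes). The real implementation also has to account for affinity,
// tolerations, and topology constraints.
func groupBySchedulingKey(pods []*v1.Pod) map[string][]*v1.Pod {
	groups := map[string][]*v1.Pod{}
	for _, pod := range pods {
		key := fmt.Sprintf("%v", pod.Spec.NodeSelector) // fmt prints maps in sorted key order
		groups[key] = append(groups[key], pod)
	}
	return groups
}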

// computeZonalTopology for the topology group. Zones include viable zones for
// the {cloudprovider, provisioner, pod }. If these zones change over time,
Contributor:
nit: extra space before }?

}
domain, ok := node.Labels[topologyGroup.Constraint.TopologyKey]
if !ok {
continue // Don't include pods if node doesn't contain domain
Contributor:
nit: Could you change the comment to explain why not including the domain matters? As it is now, the comment basically just restates what the code already says.

Contributor (Author):
Added a link to all the details

Contributor:
Just checking - I don't see the update but maybe I'm missing it?

type TopologyGroup struct {
Constraint v1.TopologySpreadConstraint
Pods []*v1.Pod
// spread is an internal field used to track current spread
Contributor:
but what is spread in the first place, conceptually?

Contributor (Author):
Contributor:
Ok, maybe just clarify you mean the upstream sense of spread?

}
}

// Increment increments a domain if known
Contributor:
doesn't it increment the "spread" of a domain?

Contributor (Author):
yep!

}
}

// Spread chooses a domain that minimizes skew and increments its count
Contributor:
Maybe call this NextSpread() or something? Right now the name of the function sounds kinda stateless.

Contributor (Author):
Haha, I toyed with the idea of calling it Next() or Spread(). What do you think about NextDomain()?
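
For readers following the thread, here is a minimal sketch of the state being discussed: a per-domain counter ("spread" in the upstream sense) plus a pick-the-least-loaded-domain step. The names echo the review comments (Increment, NextDomain); the actual implementation in this PR may differ:

package topology

// topologyGroupSketch tracks, for one topology spread constraint, how many
// pods have been assigned to each domain (e.g. each zone or each hostname).
type topologyGroupSketch struct {
	spread map[string]int // domain -> number of pods scheduled into it
}

// Increment records another pod in a domain, but only if the domain is known.
func (t *topologyGroupSketch) Increment(domain string) {
	if _, ok := t.spread[domain]; ok {
		t.spread[domain]++
	}
}

// NextDomain picks the domain with the smallest count, i.e. the choice that
// minimizes skew, and increments its counter.
func (t *topologyGroupSketch) NextDomain() string {
	if len(t.spread) == 0 {
		return ""
	}
	var minDomain string
	minCount := -1
	for domain, count := range t.spread {
		if minCount == -1 || count < minCount {
			minDomain, minCount = domain, count
		}
	}
	t.spread[minDomain]++
	return minDomain
}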

@JacobGabrielson (Contributor):
Overall I didn't see any problems, just a few comments about documentation/comments.

@@ -46,15 +47,15 @@ import (

 const (
 	maxBatchWindow   = 10 * time.Second
-	batchIdleTimeout = 2 * time.Second
+	batchIdleTimeout = 1 * time.Second
Contributor:
🔥
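
The pair of constants implies a familiar batching pattern: keep collecting pods while they keep arriving within the idle timeout, but never hold a batch open longer than the max window. A generic sketch of that pattern (not Karpenter's actual batcher):

package batch

import "time"

const (
	maxBatchWindow   = 10 * time.Second
	batchIdleTimeout = 1 * time.Second
)

// collect drains items until the stream is idle for batchIdleTimeout or the
// overall maxBatchWindow elapses, whichever happens first.
func collect(in <-chan interface{}) []interface{} {
	var batch []interface{}
	window := time.After(maxBatchWindow)
	for {
		select {
		case item, ok := <-in:
			if !ok {
				return batch // channel closed
			}
			batch = append(batch, item)
		case <-time.After(batchIdleTimeout):
			return batch // no new item arrived within the idle timeout
		case <-window:
			return batch // hard cap on total wait time
		}
	}
}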


// 2. Separate pods into schedules of compatible scheduling constraints.
schedules, err := s.getSchedules(ctx, provisioner, pods)
if err != nil {
Contributor:
If this errors for some reason, then the injected labels do not get removed. It might be fine if we're always retrying a reconciliation so we rebuild everything. If something was parallelized while solving, the pods would be getting additional labels added and removed. This just seems a bit leaky. Do you think copying the pods would be too much overhead? Or maybe deferring the label deletion?

Contributor (Author):
Agreed. It felt a little bit dirty doing this in-memory mutation in general, but it really, really simplified the code. The extra annoying bit is that I can leave the injected zone in place, because it'll end up being the same string, but I can't know the hostname (since I can't specify EC2 instance IDs), so I have to rip out the hostname after scheduling.

The key detail is that the injected NodeSelector is never deleted from the pod in this reconcile loop, but it is rebuilt on the next reconcile loop. I'm wary of the complexity it adds to remove these, but it might be tenable if I do it in this function and don't allow it to leak anywhere else. What do you think?

Contributor:
I think it's worthwhile to remove it in this function. I'm not entirely sure where a leak would cause an issue, but if it does, it might be difficult to spot.
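
A sketch of the deferred-cleanup idea from this exchange (the function and parameter names are illustrative, as is the simplifying assumption that the injected keys weren't already set by the user):

package scheduling

import v1 "k8s.io/api/core/v1"

// withInjectedSelectors adds temporary node selector entries to the pods, runs
// the provided scheduling step, and strips the injected entries again before
// returning, even if scheduling fails. For simplicity this assumes the
// injected keys were not already present on the pods.
func withInjectedSelectors(pods []*v1.Pod, injected map[string]string, schedule func([]*v1.Pod) error) error {
	for _, pod := range pods {
		if pod.Spec.NodeSelector == nil {
			pod.Spec.NodeSelector = map[string]string{}
		}
		for k, v := range injected {
			pod.Spec.NodeSelector[k] = v
		}
	}
	defer func() {
		for _, pod := range pods {
			for k := range injected {
				delete(pod.Spec.NodeSelector, k)
			}
		}
	}()
	return schedule(pods)
}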

@JacobGabrielson JacobGabrielson self-requested a review August 27, 2021 22:08
@bwagner5 (Contributor) left a comment:
Had a small comment on the leakiness of the label injection to make the topology scheduling simpler, but overall, feel free to merge, looks great!

@ellistarn ellistarn merged commit a3e2d8c into aws:main Aug 30, 2021
@ellistarn ellistarn deleted the nodeaffinity branch August 30, 2021 18:01
gfcroft pushed a commit to gfcroft/karpenter-provider-aws that referenced this pull request Nov 25, 2023