Add affinity and anti-affinity support #1626
Conversation
// included for topology counting purposes. This is only used with topology spread constraints as affinities/anti-affinities
// always count across all nodes. A nil or zero-value TopologyNodeFilter behaves well and the filter returns true for
// all nodes.
type TopologyNodeFilter []v1alpha5.Requirements
Now that this type has been simplified, it occurs to me that it might be more straightforward to fold this into topologygroup, e.g.
type TopologyGroup struct {
    ...
    nodeFilter []v1alpha5.Requirements
}
It should be ~40 lines of code.
I think it's less clear if it's merged in. As it stands, it has the discrete functionality of being a filter. The fact that it's internally represented as a type definition for a list of requirements is an implementation detail.
If it's merged into the topology group, it's no longer a filter and is instead just a list of requirements that may or may not get matched against a node or other set of requirements depending on the topology type. The methods can't hang off the list of requirements unless they remain a type, in which case it's just concatenating two source files. The reader can't treat it as a black-box "filter" concept as easily.
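For what it's worth, a minimal sketch of that black-box idea, using hypothetical names and plain label matching in place of v1alpha5.Requirements (so this is not the PR's actual code):

package sketch

// topologyNodeFilter is a simplified stand-in for the PR's TopologyNodeFilter:
// a set of alternative label requirements, where a nil or zero-value filter
// matches every node. Because it is a named type, the matching behaviour hangs
// off the type itself and callers can treat it as an opaque filter.
type topologyNodeFilter []map[string]string

// matches reports whether the node's labels satisfy at least one requirement
// set, or the filter is empty.
func (f topologyNodeFilter) matches(nodeLabels map[string]string) bool {
	if len(f) == 0 {
		return true // zero-value filter behaves well: include all nodes
	}
	for _, required := range f {
		satisfied := true
		for key, value := range required {
			if nodeLabels[key] != value {
				satisfied = false
				break
			}
		}
		if satisfied {
			return true
		}
	}
	return false
}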
p.removeRequiredNodeAffinityTerm,
p.removePreferredPodAffinityTerm,
p.removePreferredPodAntiAffinityTerm,
p.removePreferredNodeAffinityTerm,
p.removeTopologySpreadScheduleAnyway,
p.toleratePreferNoScheduleTaints,
Can the order in which these are applied change the end result? Is this the order suggested by Kubernetes?
Yes, it can definitely change the result. There isn't an order specified by K8s; from what I can tell, they perform scheduling in a different way (find all compatible nodes, then sort by score). We schedule on the first compatible node, treating all preferences as required and removing one preference at a time until the pod either schedules or there are no more preferences to remove and it fails.
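Roughly, that relaxation loop could be sketched like this (hypothetical names and signatures, not the PR's actual code):

package sketch

import v1 "k8s.io/api/core/v1"

// relaxation removes one preference from the pod and reports whether it
// changed anything. The order of the relaxation list matters: preferences most
// likely to block scheduling should be removed first.
type relaxation func(pod *v1.Pod) bool

// scheduleWithRelaxation treats all preferences as required; if the pod can't
// be scheduled, it removes one preference at a time until the pod either
// schedules or there is nothing left to relax.
func scheduleWithRelaxation(pod *v1.Pod, relaxations []relaxation, trySchedule func(*v1.Pod) bool) bool {
	for {
		if trySchedule(pod) {
			return true
		}
		relaxed := false
		for _, relax := range relaxations {
			if relax(pod) {
				relaxed = true
				break // remove only one preference per scheduling attempt
			}
		}
		if !relaxed {
			return false // nothing left to relax; the pod fails to schedule
		}
	}
}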
return func(i, j int) bool {
return instanceTypes[i].Price() < instanceTypes[j].Price()
func (s *Scheduler) scheduleExisting(pod *v1.Pod, nodes []*Node) *Node {
// Try nodes in ascending order of number of pods to more evenly distribute nodes, 100ms at 2000 nodes.
Are there any other dimensions that could affect this speed? I could imagine that as constraints tighten and requirements increase, this will naturally take longer to solve. Additionally, does this number depend on the hardware it runs on? A number like this could be misleading if so.
It may vary depending on CPU, but that should be it. We're just sorting a list of pointers by the pod count, which is a constant-time value to retrieve.
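As a rough illustration of that sort (hypothetical node type, not the scheduler's actual one):

package sketch

import "sort"

// node is a stand-in for the scheduler's in-flight node; the pod count is a
// plain slice length, so retrieving it is constant time.
type node struct {
	podNames []string
}

// sortByAscendingPodCount orders nodes emptiest-first so pods are distributed
// more evenly across in-flight nodes.
func sortByAscendingPodCount(nodes []*node) {
	sort.Slice(nodes, func(i, j int) bool {
		return len(nodes[i].podNames) < len(nodes[j].podNames)
	})
}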
if relaxed {
// The pod has changed, so topology needs to be recomputed
if err := topology.Update(ctx, pod); err != nil {
logging.FromContext(ctx).With("pod", client.ObjectKeyFromObject(pod)).Errorf("updating topology, %s", err)
How does this get surfaced in the logs? I can imagine this would get REALLY noisy.
It only logs on a topology update error; IIRC that error only occurs when a kube-apiserver call fails, which shouldn't happen often.
Keep in mind that this call is cached; it should only fail due to a code bug.
pods = append(pods, makePodAntiAffinityPods(count/7, v1.LabelHostname)...)
pods = append(pods, makePodAntiAffinityPods(count/7, v1.LabelTopologyZone)...)
// We intentionally don't do anti-affinity by zone as that creates tons of unschedulable pods.
//pods = append(pods, makePodAntiAffinityPods(count/7, v1.LabelTopologyZone)...)
Should we just remove this line and keep the comment since we aren't doing it?
It's a benchmark test and I left it in for now as I keep adding/removing that line to perform some benchmarking just to get the numbers.
}
// TopologyGroup is a set of pods that share a topology spread constraint
// TopologyGroup is used to track pod counts that match a selector by the topology domain (e.g. SELECT COUNT(*) FROM pods GROUP BY(topology_ke
I like the surprise SQL appearance :)
Suggested change:
// TopologyGroup is used to track pod counts that match a selector by the topology domain (e.g. SELECT COUNT(*) FROM pods GROUP BY(topology_ke
// TopologyGroup is used to track pod counts that match a selector by the topology domain (e.g. SELECT COUNT(*) FROM pods GROUP BY(topology_key))
😂
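To make the GROUP BY analogy concrete, a rough sketch with hypothetical names (the real TopologyGroup does considerably more bookkeeping):

package sketch

// domainCounts is a simplified illustration: for a single topology key, track
// how many matching pods landed in each domain (each zone, each hostname, ...),
// i.e. COUNT(*) ... GROUP BY(topology_key).
type domainCounts struct {
	topologyKey string
	counts      map[string]int32
}

// record counts one matching pod against the domain given by the node's value
// for the topology key.
func (d *domainCounts) record(nodeLabels map[string]string) {
	if d.counts == nil {
		d.counts = map[string]int32{}
	}
	d.counts[nodeLabels[d.topologyKey]]++
}

// minDomain returns the domain with the fewest matching pods, which is where a
// topology spread constraint would want the next pod to go.
func (d *domainCounts) minDomain() (string, int32) {
	var best string
	bestCount := int32(-1)
	for domain, count := range d.counts {
		if bestCount < 0 || count < bestCount {
			best, bestCount = domain, count
		}
	}
	return best, bestCount
}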
Let's get one more approver, but I'm good to go.
Hi, how long will it take for this to be released as part of a new Karpenter version? Thank you :)
- implement affinity/anti-affinity
- rework topology spread support
Node affinity more than likely prevents scheduling on a provisioner, so remove it first. This prevents the current selection process from removing several other preferred terms before removing the one that is preventing selection.
For anti-affinities we need to block out every possible domain.
Previously, topology spread didn't work with match expressions; we had no tests covering this case. The operators have different string values, so just casting types isn't correct.
In this scenario, we can only schedule to the min domain. We also rework the requirement-collapsing code so that collapsing occurs during topology domain selection.
We only count nodes that match the pod's required node affinities.
We were carrying around tons of duplicate requirements, and the requirement Add() function had to process these every time it added. When this occurred, the set-based requirements would narrow down, but the node selector version would just keep appending possibly huge requirements to the list.
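As a rough sketch of the deduplication idea (hypothetical helper, not the PR's actual Add() implementation), dropping exact duplicates keeps the node selector list from growing without bound:

package sketch

import (
	"fmt"
	"sort"

	v1 "k8s.io/api/core/v1"
)

// dedupeRequirements drops exact-duplicate node selector requirements so that
// repeatedly merging the same pod and provisioner constraints doesn't grow the
// list, and the work done on every subsequent merge, without bound.
func dedupeRequirements(reqs []v1.NodeSelectorRequirement) []v1.NodeSelectorRequirement {
	seen := map[string]struct{}{}
	var deduped []v1.NodeSelectorRequirement
	for _, r := range reqs {
		// Key on key/operator/sorted-values so the ordering of Values doesn't matter.
		values := append([]string(nil), r.Values...)
		sort.Strings(values)
		key := fmt.Sprintf("%s/%s/%v", r.Key, r.Operator, values)
		if _, ok := seen[key]; ok {
			continue
		}
		seen[key] = struct{}{}
		deduped = append(deduped, r)
	}
	return deduped
}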
We plan to make a snapshot image available soon so it can be tested out before the next release.
Also closes #985?
1. Issue, if available:
Fixes #942 and #985
2. Description of changes:
Adds support for pod affinity and anti-affinity
3. How was this change tested?
Unit testing and on my local EKS cluster.
4. Does this change impact docs?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.