Add affinity and anti-affinity support #1626
Conversation
// included for topology counting purposes. This is only used with topology spread constraints as affinities/anti-affinities
// always count across all nodes. A nil or zero-value TopologyNodeFilter behaves well and the filter returns true for
// all nodes.
type TopologyNodeFilter []v1alpha5.Requirements
Now that this type has been simplified, it occurs to me that it might be more straightforward to fold this into topologygroup, e.g.
type TopologyGroup struct {
    ...
    nodeFilter []v1alpha5.Requirements
}
It should be ~40 lines of code.
I think it's less clear if it's merged in. As it stands, it has the discrete functionality of being a filter. The fact that it's internally represented as a type definition for a list of requirements is an implementation detail.
If it's merged into the topology group, it's no longer a filter and is instead just a list of requirements that may or may not get matched against a node or other set of requirements depending on the topology type. The methods can't hang off the list of requirements unless they remain a type, in which case it's just concatenating two source files. The reader can't treat it as a black-box "filter" concept as easily.
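For what it's worth, a minimal sketch of that black-box idea, using hypothetical names and plain label matching in place of v1alpha5.Requirements (so this is not the PR's actual code):

package sketch

// topologyNodeFilter is a simplified stand-in for the PR's TopologyNodeFilter:
// a set of alternative label requirements, where a nil or zero-value filter
// matches every node. Because it is a named type, the matching behaviour hangs
// off the type itself and callers can treat it as an opaque filter.
type topologyNodeFilter []map[string]string

// matches reports whether the node's labels satisfy at least one requirement
// set, or the filter is empty.
func (f topologyNodeFilter) matches(nodeLabels map[string]string) bool {
	if len(f) == 0 {
		return true // zero-value filter behaves well: include all nodes
	}
	for _, required := range f {
		satisfied := true
		for key, value := range required {
			if nodeLabels[key] != value {
				satisfied = false
				break
			}
		}
		if satisfied {
			return true
		}
	}
	return false
}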
p.removeRequiredNodeAffinityTerm,
p.removePreferredPodAffinityTerm,
p.removePreferredPodAntiAffinityTerm,
p.removePreferredNodeAffinityTerm,
p.removeTopologySpreadScheduleAnyway,
p.toleratePreferNoScheduleTaints,
Can the order in which these are applied change the end result? Is this the order suggested by Kubernetes?
Yes, it can definitely change the result. There isn't an order specified by K8s; from what I can tell, they perform scheduling in a different way (find all compatible nodes, then sort by score). We schedule on the first compatible node, treating all preferences as required and removing one preference at a time until the pod either schedules or there are no more preferences to remove and it fails.
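Roughly, that relaxation loop could be sketched like this (hypothetical names and signatures, not the PR's actual code):

package sketch

import v1 "k8s.io/api/core/v1"

// relaxation removes one preference from the pod and reports whether it
// changed anything. The order of the relaxation list matters: preferences most
// likely to block scheduling should be removed first.
type relaxation func(pod *v1.Pod) bool

// scheduleWithRelaxation treats all preferences as required; if the pod can't
// be scheduled, it removes one preference at a time until the pod either
// schedules or there is nothing left to relax.
func scheduleWithRelaxation(pod *v1.Pod, relaxations []relaxation, trySchedule func(*v1.Pod) bool) bool {
	for {
		if trySchedule(pod) {
			return true
		}
		relaxed := false
		for _, relax := range relaxations {
			if relax(pod) {
				relaxed = true
				break // remove only one preference per scheduling attempt
			}
		}
		if !relaxed {
			return false // nothing left to relax; the pod fails to schedule
		}
	}
}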
return func(i, j int) bool {
return instanceTypes[i].Price() < instanceTypes[j].Price()
func (s *Scheduler) scheduleExisting(pod *v1.Pod, nodes []*Node) *Node {
// Try nodes in ascending order of number of pods to more evenly distribute nodes, 100ms at 2000 nodes.
Are there any other dimensions that could affect this speed? I could imagine that as constraints tighten and requirements increase, this will naturally take longer to solve. Additionally, does this number depend on the hardware it runs on? A number like this could be misleading if so.
It may vary depending on CPU, but that should be it. We're just sorting a list of pointers by the pod count, which is a constant-time value to retrieve.
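As a rough illustration of that sort (hypothetical node type, not the scheduler's actual one):

package sketch

import "sort"

// node is a stand-in for the scheduler's in-flight node; the pod count is a
// plain slice length, so retrieving it is constant time.
type node struct {
	podNames []string
}

// sortByAscendingPodCount orders nodes emptiest-first so pods are distributed
// more evenly across in-flight nodes.
func sortByAscendingPodCount(nodes []*node) {
	sort.Slice(nodes, func(i, j int) bool {
		return len(nodes[i].podNames) < len(nodes[j].podNames)
	})
}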
if relaxed {
// The pod has changed, so topology needs to be recomputed
if err := topology.Update(ctx, pod); err != nil {
logging.FromContext(ctx).With("pod", client.ObjectKeyFromObject(pod)).Errorf("updating topology, %s", err)
How does this get surfaced in the logs? I can imagine this would get REALLY noisy.
It only logs on a topology update error; IIRC that error only occurs when a kube-apiserver call fails, which shouldn't happen often.
Keep in mind that this call is cached; it should only fail due to a code bug.
pods = append(pods, makePodAntiAffinityPods(count/7, v1.LabelHostname)...)
pods = append(pods, makePodAntiAffinityPods(count/7, v1.LabelTopologyZone)...)
// We intentionally don't do anti-affinity by zone as that creates tons of unschedulable pods.
//pods = append(pods, makePodAntiAffinityPods(count/7, v1.LabelTopologyZone)...)
Should we just remove this line and keep the comment since we aren't doing it?
It's a benchmark test and I left it in for now as I keep adding/removing that line to perform some benchmarking just to get the numbers.
}
// TopologyGroup is a set of pods that share a topology spread constraint
// TopologyGroup is used to track pod counts that match a selector by the topology domain (e.g. SELECT COUNT(*) FROM pods GROUP BY(topology_ke
I like the surprise SQL appearance :)
Suggested change:
// TopologyGroup is used to track pod counts that match a selector by the topology domain (e.g. SELECT COUNT(*) FROM pods GROUP BY(topology_ke
// TopologyGroup is used to track pod counts that match a selector by the topology domain (e.g. SELECT COUNT(*) FROM pods GROUP BY(topology_key))
😂
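To make the GROUP BY analogy concrete, a rough sketch with hypothetical names (the real TopologyGroup does considerably more bookkeeping):

package sketch

// domainCounts is a simplified illustration: for a single topology key, track
// how many matching pods landed in each domain (each zone, each hostname, ...),
// i.e. COUNT(*) ... GROUP BY(topology_key).
type domainCounts struct {
	topologyKey string
	counts      map[string]int32
}

// record counts one matching pod against the domain given by the node's value
// for the topology key.
func (d *domainCounts) record(nodeLabels map[string]string) {
	if d.counts == nil {
		d.counts = map[string]int32{}
	}
	d.counts[nodeLabels[d.topologyKey]]++
}

// minDomain returns the domain with the fewest matching pods, which is where a
// topology spread constraint would want the next pod to go.
func (d *domainCounts) minDomain() (string, int32) {
	var best string
	bestCount := int32(-1)
	for domain, count := range d.counts {
		if bestCount < 0 || count < bestCount {
			best, bestCount = domain, count
		}
	}
	return best, bestCount
}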
Let's get one more approver, but I'm good to go.
Hi, how long will it take for this to be released as part of a new Karpenter version? Thank you :)
- implement affinity/anti-affinity
- rework topology spread support
Node affinity more than likely prevents scheduling on a provisioner, so remove it first. This prevents the current selection process from removing several other preferred terms before removing the one that is preventing selection.
For anti-affinities we need to block out every possible domain.
Previously, topology spread didn't work with match expressions; we had no tests covering this case. The operators have different string values, so just casting types isn't correct.
In this scenario, we can only schedule to the min domain. We also rework the requirement-collapsing code so that collapsing occurs during topology domain selection.
We only count nodes that match the pod's required node affinities.
We were carrying around tons of duplicate requirements, and the requirement Add() function had to process these every time it added. When this occurred, the set-based requirements would narrow down, but the node selector version would just keep appending possibly huge requirements to the list.
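As a rough sketch of the deduplication idea (hypothetical helper, not the PR's actual Add() implementation), dropping exact duplicates keeps the node selector list from growing without bound:

package sketch

import (
	"fmt"
	"sort"

	v1 "k8s.io/api/core/v1"
)

// dedupeRequirements drops exact-duplicate node selector requirements so that
// repeatedly merging the same pod and provisioner constraints doesn't grow the
// list, and the work done on every subsequent merge, without bound.
func dedupeRequirements(reqs []v1.NodeSelectorRequirement) []v1.NodeSelectorRequirement {
	seen := map[string]struct{}{}
	var deduped []v1.NodeSelectorRequirement
	for _, r := range reqs {
		// Key on key/operator/sorted-values so the ordering of Values doesn't matter.
		values := append([]string(nil), r.Values...)
		sort.Strings(values)
		key := fmt.Sprintf("%s/%s/%v", r.Key, r.Operator, values)
		if _, ok := seen[key]; ok {
			continue
		}
		seen[key] = struct{}{}
		deduped = append(deduped, r)
	}
	return deduped
}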
We plan to make a snapshot image available soon so it can be tested out before the next release.
Also closes #985?
1. Issue, if available:
Fixes #942 and #985
2. Description of changes:
Adds support for pod affinity and anti-affinity
3. How was this change tested?
Unit testing and on my local EKS cluster.
4. Does this change impact docs?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.