
podAntiAffinity #942

Closed
joebowbeer opened this issue Dec 8, 2021 · 34 comments · Fixed by #1626
Labels
feature (New feature or request) · scheduling (Issues related to Kubernetes scheduling)

Comments

@joebowbeer
Contributor

joebowbeer commented Dec 8, 2021

Tell us about your request

Please implement podAntiAffinity.

I don't see anything about this on the roadmap. Is this planned?

(In fact, I see almost nothing on the roadmap currently. Issue?)

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

I want to deploy k8ssandra but cannot, because it uses pod anti-affinity to prevent multiple pods from being scheduled on the same worker:

Cass Operator uses pod anti-affinity by default to ensure that Cassandra pods are isolated from one another.

and Karpenter refuses to schedule.
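
For reference, the blocking rule is a required pod anti-affinity of roughly this shape (the label selector is a placeholder here, not the operator's exact default):

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: cassandra  # placeholder label
        topologyKey: kubernetes.io/hostname    # at most one matching pod per node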

I suspect there are a lot of high-availability operators that will currently prefer or require podAntiAffinity.

Prometheus: prometheus-operator/kube-prometheus#1537

Are you currently working around this issue?

allowMultipleNodesPerWorker: true

From: https://docs.k8ssandra.io/tasks/scale/

Additional context

What should these operators be using instead?

I see in the FAQ that this is not a priority for Karpenter, but podAntiAffinity is fairly common among popular operators, and it is not clear what they should be doing instead.

Some discussion here: #695 (comment)

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@akestner
Contributor

akestner commented Dec 8, 2021

Apologies for the lack of detail on the roadmap; we're working to get that updated to reflect the next steps for Karpenter after the launch at re:Invent last week.

Regarding podAntiAffinity, this is something we've discussed in the past, but it presents a number of challenges to Karpenter's scheduling capabilities. Often, as seems to be the case with k8ssandra and the other projects mentioned, topology spread could be a suitable alternative. I'll let @ellistarn weigh in as well since I know he's thought quite a bit about this too.
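
As a rough sketch (the selector and values are placeholders), a hostname anti-affinity preference can often be rewritten as a topology spread constraint along these lines:

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway  # soft spread; use DoNotSchedule for a hard requirement
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: cassandra  # placeholder label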

@joebowbeer
Contributor Author

joebowbeer commented Dec 8, 2021

@akestner Thanks for the background.

I suspect several CNCF projects are configured with podAntiAffinity now.

For example, there are several in prometheus-operator, and it's not clear to everyone that this configuration is inappropriate. (See the issue linked in the description.)

In the interim, can Karpenter implement [soft] podAntiAffinity as if it were topology spread?

[Edited 12/14/2021] K8ssandra is actually requiring podAntiAffinity. I suspect most of the existing uses of podAntiAffinity are required rather than preferred.

@ellistarn
Contributor

In the interim, can Karpenter implement soft podAntiAffinity as if it were topology spread?

This is definitely how we'd do it. We already mutate pods in memory for similar things (e.g. nodeSelectors are modeled as nodeAffinity in memory). The spec is very broad -- I can see supporting a limited subset of pod anti-affinity and making sure it maps well to common charts like k8ssandra.
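
To illustrate that in-memory modeling (the zone value below is arbitrary), a nodeSelector such as:

nodeSelector:
  topology.kubernetes.io/zone: us-west-2a  # example label

is treated as the equivalent required node affinity:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values: ["us-west-2a"]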

Pod affinity (not anti) is extremely challenging. I don't know how to do it in a reasonable way -- it doesn't really make sense for provisioned clusters. I'd need to see an airtight technical proposal before we could make progress here.

@bdwyertech
Contributor

EKS built-in CoreDNS leverages a pod anti-affinity rule.

Stumbled across this while wondering why Karpenter was spinning up nodes in the same region despite a topology-zone anti-affinity rule being configured 😄 Many Helm charts expose affinity but do not expose topologySpreadConstraints.

+1

@Noksa

Noksa commented Jan 12, 2022

I've added some explanation of why we need PodAntiAffinity:
#1010 (comment)

@ellistarn
Contributor

I'm on board 👍

@SamPeng87

SamPeng87 commented Jan 20, 2022

This is a very widely used feature, and if Karpenter is deployed it has to account for it, for example by calculating a new machine to provision up front from the request. It is the simplest scheduling strategy, but it solves 90% of the problems. Karpenter's current strategy is to ignore it completely, which is a very unreasonable scheduling strategy. Many companies require this kind of scheduling control in production environments.

@jlarfors

Seems like almost every third-party Helm chart has a default affinity of podAntiAffinity and doesn't expose topologySpreadConstraints... This really should be a high priority if Karpenter is to be a real alternative to Cluster Autoscaler, which I think it should be, as Karpenter is awesome otherwise.

@ellistarn
Contributor

Doing a bit of research -- it looks like podAffinity is not supported for daemonsets: kubernetes/kubernetes#29276. This should make our implementation a little easier.

@ellistarn
Contributor

Further, looking at https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity:

For requiredDuringSchedulingIgnoredDuringExecution pod anti-affinity, the admission controller LimitPodHardAntiAffinityTopology was introduced to limit topologyKey to kubernetes.io/hostname. If you want to make it available for custom topologies, you may modify the admission controller, or disable it.

Initial support may be constrained to kubernetes.io/hostname.

@ellistarn
Contributor

ellistarn commented Jan 25, 2022

It looks like we may be able to reuse our topology implementation for anti-affinity. I have a prototype here: https://github.com/aws/karpenter/pull/1221/files

This approach comes with some limitations:

  • No support for namespaces
  • No support for namespaceSelector
  • No support for keys outside of kubernetes.io/hostname

However, it's considered best practice to avoid use of these features anyway, as outlined here: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity

@joebowbeer
Contributor Author

joebowbeer commented Jan 26, 2022

@ellistarn can you clarify what these namespace restrictions mean in practice?

For example, would they still support the antiAffinity uses in the Prometheus Operator? (#942 (comment))

I'm not sure I follow your statement that best practice is to avoid use of these features. By features do you mean pod affinity/antiAffinity? The only statement I find is that it is best to avoid them in clusters larger than several hundred nodes.

@ellistarn
Contributor

ellistarn commented Jan 26, 2022

Cross-namespace affinity gets expensive quickly -- worse with wildcards. It's not that we can't do it, it's just that we need to spend the cycles figuring out how to get it into the scheduling code.

IIUC the prometheus operator is using namespaces: [prometheus]. This is how topology works by default (i.e. same namespace), so it's trivial to open it up to that. Supporting cross-namespace is definitely a bigger lift.
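
For reference, the shape in question is an anti-affinity term that pins the namespace explicitly, roughly like this (labels are placeholders, not the operator's exact manifest):

podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        namespaces: ["prometheus"]  # explicit, but matches the default (the pod's own namespace) when the pod runs in "prometheus"
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: prometheus  # placeholder label
        topologyKey: kubernetes.io/hostname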

@joebowbeer
Contributor Author

IIUC the prometheus operator is using namespaces: [prometheus]. This is how topology works by default (i.e. same namespace), so its trivial to open it up to that.

Yes, that is my reading of the code as well. Thanks for clarifying.

@bakayolo

bakayolo commented Feb 3, 2022

The PR that was opened is for "hostname", but we are using "zone". Can we get that added there too?

@zswanson

zswanson commented Feb 26, 2022

In the meantime, could the log message about this be bumped from DEBUG to ERROR? I had been pulling my hair out wondering why my pods were stuck unscheduled for hours until I changed the log level on Karpenter. Or add a big warning to the docs? I searched for 'affinity' on the docs site and there's only a reference to nodes.

DEBUG controller.selection Ignoring pod, pod anti-affinity is not supported

@ellistarn
Contributor

@matthiasbertschtng

We also stumbled across this in combination with Tekton's Affinity Assistant. It uses inter-pod affinity and anti-affinity to schedule TaskRuns that share a Workspace to the same node by using an affinity assistant pod (with an anti-affinity to other affinity assistant pods). Currently, Karpenter simply refuses to schedule these pods.

Support for podAntiAffinity would also help here.

@cebernardi
Contributor

cebernardi commented Mar 16, 2022

We are also encountering this issue, but with podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution.
I was wondering what the community's thoughts are on relaxing these soft podAntiAffinity configurations, as was done for nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution: https://github.com/aws/karpenter/blob/ba89ba83cb8f67c726dea0a1a0ba2ea57bbe1787/pkg/controllers/selection/preferences.go#L86-L99

From a user point of view it feels like preferred affinities shouldn't prevent pods from being scheduled.

What are the maintainers' thoughts? Can we help with this?

@ellistarn
Contributor

I would absolutely accept a PR that does this.

@adjain131995

I can see this issue has been open for quite a while, and it is a feature in high demand. Is there any ETA for when this will be implemented in Karpenter? I do not see any other alternative we could use: topology constraints are hard constraints, and the alternatives we tried also fail once kube-scheduler kicks in. Here is the issue tracking that:

#1528

@adjain131995

adjain131995 commented Mar 17, 2022

Also, in my use case I would like to use pod anti-affinity:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: {{ .Release.Name }}
          topologyKey: kubernetes.io/hostname
        weight: 100

to ensure pods prefer to spread and do not land on a node where the app is already running, but still get deployed if there is no other option. The only way to counter this in Karpenter is using topologyKey: kubernetes.io/hostname, which forces Karpenter to spin up a huge number of nodes when there is such a requirement.
How can this be handled in Karpenter using pod affinity/anti-affinity?

@cebernardi
Contributor

@ellistarn I'm taking a stab at it.

@ellistarn
Contributor

@cebernardi
Contributor

@ellistarn #1541

@cebernardi
Contributor

Oops, I only saw your comment now. I took another approach; please let me know your thoughts.

@ellistarn
Contributor

Either works. Removing it in preferences will survive when we implement affinity for real.

@cebernardi
Contributor

cebernardi commented Mar 18, 2022

I can include this in the PR, of course!
Would you suggest removing them iteratively, as is done for nodeAffinity, or removing all terms at once?

@ellistarn
Contributor

Iteratively SGTM

@marcportabellaclotet-mt

marcportabellaclotet-mt commented Apr 5, 2022

It would be great if this feature (anti-affinity) were supported by Karpenter.
Our use case is jenkins-operator: we spread every job to a new node using anti-affinity, and when the jobs are finished the slave hosts scale to 0.
Karpenter is an amazing project!!

tzneal added a commit that referenced this issue Apr 13, 2022
 implement pod affinity & anti-affinity

- implement pod affinity/anti-affinity
- rework topology spread support

Fixes #942 and #985 

Co-authored-by: Ellis Tarn <[email protected]>
@tzneal
Contributor

tzneal commented Apr 14, 2022

If you've got a fairly recent version of Karpenter already running, you can drop in this snapshot controller image to test out pod affinity & pod anti-affinity. This is just a snapshot, not meant for production, etc.:

kubectl set image deployment/karpenter -n karpenter controller=public.ecr.aws/z4v8y7u8/controller:08ac9ec303aabe2da56edb0ee0e235a60a287206

@iaacautomation

It's also useful for the Elasticsearch operator: if you create a 3-node cluster, topology spread can schedule it 2,1, while the preferred layout for HA is 1,1,1.

@tzneal
Contributor

tzneal commented May 6, 2022

@iaacautomation Pod Affinity & Anti-Affinity support was released in v0.9.0. Feel free to open a different issue if you are having problems with them.
