
podAntiAffinity #942

Closed
joebowbeer opened this issue Dec 8, 2021 · 34 comments · Fixed by #1626
Labels
feature (New feature or request) · scheduling (Issues related to Kubernetes scheduling)

Comments

@joebowbeer
Contributor

joebowbeer commented Dec 8, 2021

Tell us about your request

Please implement podAntiAffinity.

I don't see anything about this on the roadmap. Is this planned?

(In fact, I see almost nothing on the roadmap currently. Issue?)

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

I want to deploy k8ssandra but cannot, because it uses pod anti-affinity to prevent multiple pods from being scheduled on the same worker:

Cass Operator uses pod anti-affinity by default to ensure that Cassandra pods are isolated from one another.

and Karpenter refuses to schedule.
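
For reference, the blocking rule is a required pod anti-affinity of roughly this shape (the label selector is a placeholder here, not the operator's exact default):

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: cassandra  # placeholder label
        topologyKey: kubernetes.io/hostname    # at most one matching pod per node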

I suspect there are a lot of high-availability operators that will currently prefer or require podAntiAffinity.

Prometheus: prometheus-operator/kube-prometheus#1537

Are you currently working around this issue?

allowMultipleNodesPerWorker: true

From: https://docs.k8ssandra.io/tasks/scale/

Additional context

What should these operators be using instead?

I see in the FAQ that this is not a priority for Karpenter, but podAntiAffinity is fairly common among popular operators, and it is not clear what they should be doing instead.

Some discussion here: #695 (comment)

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@akestner
Contributor

akestner commented Dec 8, 2021

Apologies for the lack of detail on the roadmap; we're working to get that updated to reflect the next steps for Karpenter after the launch at re:Invent last week.

Regarding podAntiAffinity, this is something we've discussed in the past, but it presents a number of challenges to Karpenter's scheduling capabilities. Often, as seems to be the case with k8ssandra and the other projects mentioned, topology spread could be a suitable alternative. I'll let @ellistarn weigh in as well since I know he's thought quite a bit about this too.
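
As a rough sketch (the selector and values are placeholders), a hostname anti-affinity preference can often be rewritten as a topology spread constraint along these lines:

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway  # soft spread; use DoNotSchedule for a hard requirement
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: cassandra  # placeholder label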

@joebowbeer
Contributor Author

joebowbeer commented Dec 8, 2021

@akestner Thanks for the background.

I suspect several CNCF projects are configured with podAntiAffinity now.

For example, there are several in prometheus-operator, and it's not clear to everyone that this configuration is inappropriate. (See the issue linked in the description.)

In the interim, can Karpenter implement [soft] podAntiAffinity as if it were topology spread?

[Edited 12/14/2021] K8ssandra is actually requiring podAntiAffinity. I suspect most of the existing uses of podAntiAffinity are required rather than preferred.

@ellistarn
Contributor

In the interim, can Karpenter implement soft podAntiAffinity as if it were topology spread?

This is definitely how we'd do it. We already mutate pods in memory for similar things (e.g. nodeSelectors are modeled as nodeAffinity in memory). The spec is very broad -- I can see supporting a limited subset of pod anti-affinity and making sure it maps well to common charts like k8ssandra.
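
To illustrate that in-memory modeling (the zone value below is arbitrary), a nodeSelector such as:

nodeSelector:
  topology.kubernetes.io/zone: us-west-2a  # example label

is treated as the equivalent required node affinity:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values: ["us-west-2a"]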

Pod affinity (not anti) is extremely challenging. I don't know how to do it in a reasonable way -- it doesn't really make sense for provisioned clusters. I'd need to see an airtight technical proposal before we could make progress here.

@bdwyertech
Contributor

EKS built-in CoreDNS leverages a pod anti-affinity rule.

Stumbled across this while wondering why Karpenter was spinning up nodes in the same region despite a topology-zone anti-affinity rule being configured 😄 Many Helm charts expose affinity but do not expose topologySpreadConstraints.

+1

@Noksa

Noksa commented Jan 12, 2022

I've added some explanation of why we need PodAntiAffinity:
#1010 (comment)

@ellistarn
Contributor

I'm on board 👍

@SamPeng87

SamPeng87 commented Jan 20, 2022

This is a very widely used feature, and if Karpenter is deployed it has to account for it, for example by calculating a new machine to provision up front from the request. It is the simplest scheduling strategy, but it solves 90% of the problems. Karpenter's current strategy is to ignore it completely, which is a very unreasonable scheduling strategy. Many companies require this kind of scheduling control in production environments.

@jlarfors

Seems like almost every third-party Helm chart has a default affinity of podAntiAffinity and doesn't expose topologySpreadConstraints... This really should be a high priority if Karpenter is to be a real alternative to Cluster Autoscaler, which I think it should be, as Karpenter is awesome otherwise.

@ellistarn
Contributor

Doing a bit of research -- it looks like podAffinity is not supported for daemonsets: kubernetes/kubernetes#29276. This should make our implementation a little easier.

@ellistarn
Contributor

Further, looking at https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity:

For requiredDuringSchedulingIgnoredDuringExecution pod anti-affinity, the admission controller LimitPodHardAntiAffinityTopology was introduced to limit topologyKey to kubernetes.io/hostname. If you want to make it available for custom topologies, you may modify the admission controller, or disable it.

Initial support may be constrained to kubernetes.io/hostname.

@ellistarn
Contributor

ellistarn commented Jan 25, 2022

It looks like we may be able to reuse our topology implementation for anti-affinity. I have a prototype here: https://github.com/aws/karpenter/pull/1221/files

This approach comes with some limitations:

  • No support for namespaces
  • No support for namespaceSelector
  • No support for keys outside of kubernetes.io/hostname

However, it's considered best practice to avoid use of these features anyway, as outlined here: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity

@joebowbeer
Contributor Author

joebowbeer commented Jan 26, 2022

@ellistarn can you clarify what these namespace restrictions mean in practice?

For example, would they still support the antiAffinity uses in the Prometheus Operator? (#942 (comment))

I'm not sure I follow your statement that best practice is to avoid use of these features. By features do you mean pod affinity/antiAffinity? The only statement I find is that it is best to avoid them in clusters larger than several hundred nodes.

@ellistarn
Contributor

ellistarn commented Jan 26, 2022

Cross-namespace affinity gets expensive quickly -- worse with wildcards. It's not that we can't do it, it's just that we need to spend the cycles figuring out how to get it into the scheduling code.

IIUC the prometheus operator is using namespaces: [prometheus]. This is how topology works by default (i.e. same namespace), so it's trivial to open it up to that. Supporting cross-namespace is definitely a bigger lift.
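
For reference, the shape in question is an anti-affinity term that pins the namespace explicitly, roughly like this (labels are placeholders, not the operator's exact manifest):

podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        namespaces: ["prometheus"]  # explicit, but matches the default (the pod's own namespace) when the pod runs in "prometheus"
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: prometheus  # placeholder label
        topologyKey: kubernetes.io/hostname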

@joebowbeer
Contributor Author

IIUC the prometheus operator is using namespaces: [prometheus]. This is how topology works by default (i.e. same namespace), so its trivial to open it up to that.

Yes, that is my reading of the code as well. Thanks for clarifying.

@bakayolo

bakayolo commented Feb 3, 2022

The PR that was opened is for "hostname", but we are using "zone". Can we get that added there too?

@zswanson

zswanson commented Feb 26, 2022

In the meantime, could the log message about this be bumped from DEBUG to ERROR? I had been pulling my hair out wondering why my pods were stuck unscheduled for hours until I changed the log level on Karpenter. Or add a big warning to the docs? I searched for 'affinity' on the docs site and there's only a reference to nodes.

DEBUG controller.selection Ignoring pod, pod anti-affinity is not supported

@ellistarn
Contributor

@matthiasbertschtng

We also stumbled across this in combination with Tekton's Affinity Assistant. It uses inter-pod affinity and anti-affinity to schedule TaskRuns that share a Workspace to the same node by using an affinity assistant pod (with an anti-affinity to other affinity assistant pods). Currently, Karpenter simply refuses to schedule these pods.

Support for podAntiAffinity would also help here.

@cebernardi
Contributor

cebernardi commented Mar 16, 2022

We are also encountering this issue, but with podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution.
I was wondering what the community's thoughts are on relaxing these soft podAntiAffinity configurations, as was done for nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution: https://github.com/aws/karpenter/blob/ba89ba83cb8f67c726dea0a1a0ba2ea57bbe1787/pkg/controllers/selection/preferences.go#L86-L99

From a user point of view it feels like preferred affinities shouldn't prevent pods from being scheduled.

What are the maintainers' thoughts? Can we help with this?

@ellistarn
Contributor

I would absolutely accept a PR that does this.

@adjain131995

I can see this issue has been open for quite a while, and it is a feature in high demand. Is there any ETA for when this will be implemented in Karpenter? I do not see any other alternative we could use: topology constraints are hard constraints, and the alternatives we tried also fail once kube-scheduler kicks in. Here is the issue tracking that:

#1528

@adjain131995

adjain131995 commented Mar 17, 2022

Also, in my use case I would like to use pod anti-affinity:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: {{ .Release.Name }}
          topologyKey: kubernetes.io/hostname
        weight: 100

to ensure pods prefer to spread and do not land on a node where the app is already running, but still get deployed if there is no other option. The only way to counter this in Karpenter is using topologyKey: kubernetes.io/hostname, which forces Karpenter to spin up a huge number of nodes when there is such a requirement.
How can this be handled in Karpenter using pod affinity/anti-affinity?

@cebernardi
Contributor

@ellistarn I'm taking a stab at it.

@ellistarn
Contributor

@cebernardi
Contributor

@ellistarn #1541

@cebernardi
Contributor

Oops, I only saw your comment now. I took another approach; please let me know your thoughts.

@ellistarn
Contributor

Either works. Removing it in preferences will survive when we implement affinity for real.

@cebernardi
Contributor

cebernardi commented Mar 18, 2022

I can include this in the PR, of course!
Would you suggest removing them iteratively, as is done for nodeAffinity, or removing all terms at once?

@ellistarn
Contributor

Iteratively SGTM

@marcportabellaclotet-mt

marcportabellaclotet-mt commented Apr 5, 2022

It would be great if this feature (anti-affinity) were supported by Karpenter.
Our use case is jenkins-operator: we spread every job to a new node using anti-affinity, and when the jobs are finished the slave hosts scale to 0.
Karpenter is an amazing project!!

tzneal added a commit that referenced this issue Apr 13, 2022
 implement pod affinity & anti-affinity

- implement pod affinity/anti-affinity
- rework topology spread support

Fixes #942 and #985 

Co-authored-by: Ellis Tarn <[email protected]>
@tzneal
Contributor

tzneal commented Apr 14, 2022

If you've got a fairly recent version of Karpenter already running, you can drop in this snapshot controller image to test out pod affinity & pod anti-affinity. This is just a snapshot, not meant for production, etc.:

kubectl set image deployment/karpenter -n karpenter controller=public.ecr.aws/z4v8y7u8/controller:08ac9ec303aabe2da56edb0ee0e235a60a287206

@iaacautomation

It's also useful for the Elasticsearch operator: if you create a 3-node cluster, topology spread can schedule it 2,1, while the preferred layout for HA is 1,1,1.

@tzneal
Contributor

tzneal commented May 6, 2022

@iaacautomation Pod Affinity & Anti-Affinity support was released in v0.9.0. Feel free to open a different issue if you are having problems with them.
