podAntiAffinity #942
Comments
Apologies for the lack of detail on the roadmap; we're working to get that updated to reflect the next steps for Karpenter after the launch at re:Invent last week. Regarding
@akestner Thanks for the background. I suspect several CNCF projects are configured with podAntiAffinity now. For example, there are several in prometheus-operator, and it's not clear to everyone that this configuration is inappropriate. (See issue linked in description.) In the interim, can Karpenter implement [soft] podAntiAffinity as if it were topology spread? [Edited 12/14/2021] K8ssandra actually requires podAntiAffinity. I suspect most existing uses of podAntiAffinity are required rather than preferred.
This is definitely how we'd do it. We already mutate pods in memory for similar things (i.e. nodeSelectors are modeled as nodeAffinity in memory). The spec is very broad -- I can see supporting a limited set of pod anti-affinity and making sure it maps well to common charts like k8ssandra. Pod affinity (not anti) is extremely challenging. I don't know how to do it in a reasonable way -- it doesn't really make sense for provisioned clusters. I'd need to see an airtight technical proposal before we could make progress here.
EKS' built-in CoreDNS leverages a pod anti-affinity rule. Stumbled across this while wondering why Karpenter is spinning up nodes in the same region with a topology-zone anti-affinity rule configured 😄 Many Helm charts expose affinity but do not expose topologySpreadConstraints. +1
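For readers less familiar with these stanzas, here is a minimal sketch of the two shapes being discussed; the label values are illustrative and not taken from CoreDNS or any particular chart:

```yaml
# 1. The kind of required anti-affinity rule discussed in this thread:
#    no two matching pods may share a node.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: my-app                       # assumed label
        topologyKey: kubernetes.io/hostname

# 2. The topologySpreadConstraints alternative that Karpenter already handles,
#    and which most third-party charts do not expose as a value.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname       # or topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app                           # assumed label
```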
I've added some explanation of why we need
I'm on board 👍
This is a very popular feature, and if Karpenter is deployed it has to account for it, for example when calculating which new machine to provision for a request. It is the simplest scheduling strategy, yet it solves 90% of the problems. Right now Karpenter's strategy is to ignore it completely, which is a very unreasonable scheduling strategy. Many companies need control over this flag for scheduling in production environments.
Seems like almost every 3rd party Helm chart has a default affinity of
Doing a bit of research -- it looks like podAffinity is not supported for daemonsets: kubernetes/kubernetes#29276. This should make our implementation a little easier.
Further, looking at https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity:
Initial support may be constrained to
It looks like we may be able to reuse our topology implementation for anti-affinity. I have a prototype here: https://github.com/aws/karpenter/pull/1221/files. This approach comes with some limitations:
However, it's considered best practice to avoid these features anyway, as outlined here: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity:
@ellistarn can you clarify what these namespace restrictions mean in practice? For example, would they still support the antiAffinity uses in the Prometheus Operator? (#942 (comment)) I'm not sure I follow your statement that best practice is to avoid use of these features. By features do you mean pod affinity/antiAffinity? The only statement I find is that it is best to avoid them in clusters larger than several hundred nodes.
Cross-namespace affinity gets expensive quickly -- worse with wildcards. It's not that we can't do it, it's just that we need to spend the cycles figuring out how to get it into the scheduling code. IIUC the prometheus operator is using
Yes, that is my reading of the code as well. Thanks for clarifying.
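For readers following along, a rough sketch of the namespace scoping being discussed; the label and namespace name are assumptions, not taken from the prometheus-operator manifests. Omitting the namespace fields keeps a term scoped to the incoming pod's own namespace, which is the cheap case; the fields below are what widen it:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: prometheus                    # assumed label
        topologyKey: kubernetes.io/hostname
        # If both fields below are omitted, only pods in the incoming pod's
        # own namespace are considered (the inexpensive default).
        namespaces: ["monitoring"]             # assumed namespace name
        # An empty selector matches every namespace; this is the wildcard
        # case that gets expensive.
        namespaceSelector: {}
```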
The PR that was opened covers "hostname", but we are using "zone". Can we get that added there too?
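To make the hostname/zone distinction concrete, a small sketch (label assumed); only the topologyKey changes:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: my-app                        # assumed label
        # Zone-level spreading: no two matching pods in the same availability
        # zone, rather than on the same node (kubernetes.io/hostname).
        topologyKey: topology.kubernetes.io/zone
```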
In the meantime, could the log about this get bumped from DEBUG to ERROR? I have been pulling my hair out wondering why I had pods stuck unscheduled for hours until I changed the log level on Karpenter. Or add a big warning to the docs? I searched for 'affinity' on the docs site and there's only a reference to nodes.
Good suggestion. https://github.com/aws/karpenter/pull/1423/files
We also stumbled across this in combination with Tekton's Affinity Assistant. It uses inter-pod affinity and anti-affinity to schedule
Support for
We are also encountering this issue, but with a preferred affinity. From a user point of view, it feels like preferred affinities shouldn't prevent pods from being scheduled. What are the maintainers' thoughts? Can we help with this?
I would absolutely accept a PR that does this.
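For reference, the soft form being discussed looks roughly like this; the weight and label are illustrative:

```yaml
# A *preferred* (soft) anti-affinity term. The kube-scheduler treats this as a
# scoring preference rather than a hard requirement, which is why the comment
# above argues it should never leave pods stuck Pending.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: my-app                  # assumed label
          topologyKey: kubernetes.io/hostname
```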
I can see this issue has been open for quite a while, and it is a feature in high demand. Do we have any ETA for when this will be implemented in Karpenter? I do not see any other alternative we could use; topology constraints are hard ones, and the alternatives we tried also fail when the kube-scheduler kicks in. Here is the track of that one
Also, in my use case I would like to use pod anti-affinity
@ellistarn I'm taking a stab
Should be straightforward here: https://github.com/aws/karpenter/blob/main/pkg/controllers/selection/preferences.go#L51
Oops, I only saw your comment now. I took another approach; please let me know your thoughts.
Either works. Removing it in preferences will survive when we implement affinity for real.
I can include this in the PR, of course!
Iteratively SGTM
It would be great if this feature (anti-affinity) is supported by Karpenter.
implement pod affinity & anti-affinity
- implement pod affinity/anti-affinity
- rework topology spread support
Fixes #942 and #985
Co-authored-by: Ellis Tarn <[email protected]>
If you've got a fairly recent version of Karpenter already running, you can drop in this snapshot controller image to test out pod affinity & pod anti-affinity. This is just a snapshot, not meant for production, etc.:
It's also useful for the Elasticsearch operator: if you create a 3-node cluster, topology spread can schedule 2,1 while the preferred layout for HA is 1,1,1.
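To illustrate the difference being described, a sketch with three replicas and an illustrative label: even a hard spread constraint tolerates a 2/1 split across two nodes, because the skew is still 1, whereas a required hostname anti-affinity rule is what forces the 1/1/1 layout.

```yaml
# With 3 replicas and only 2 candidate nodes, a 2/1 placement has a skew of
# 2 - 1 = 1, so it still satisfies this constraint; no third node is forced.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: elasticsearch                 # assumed label
```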
@iaacautomation Pod Affinity & Anti-Affinity support was released in v0.9.0. Feel free to open a different issue if you are having problems with them.
Tell us about your request
Please implement podAntiAffinity. I don't see anything about this on the roadmap. Is this planned?
(In fact, I see almost nothing on the roadmap currently. Issue?)
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
I want to deploy k8ssandra but cannot, because it uses pod anti-affinity to prevent multiple pods from being scheduled on the same worker, and Karpenter refuses to schedule them.
I suspect there are a lot of high-availability operators that will currently prefer or require podAntiAffinity.
Prometheus: prometheus-operator/kube-prometheus#1537
Are you currently working around this issue?
allowMultipleNodesPerWorker: true
From: https://docs.k8ssandra.io/tasks/scale/
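For context, a sketch of where that workaround lives in the chart's values; the `cassandra:` key path is an assumption based on the docs linked above and may differ between chart versions:

```yaml
# values.yaml fragment for the k8ssandra Helm chart (key path assumed).
# Allowing multiple Cassandra nodes per Kubernetes worker relaxes the
# operator's anti-affinity so several pods can share a node.
cassandra:
  allowMultipleNodesPerWorker: true
```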
Additional context
What should these operators be using instead?
I see in the FAQ that this is not a priority for Karpenter, but I think podAntiAffinity is fairly common among some popular operators, and it is not clear what they should be doing instead.
Some discussion here: #695 (comment)