
Replace podAntiAffinity addon by a topologySpreadConstraints addon? #1537

Open
maxbrunet opened this issue Dec 6, 2021 · 10 comments

Comments

@maxbrunet
Contributor

What is missing?

I stumbled upon this new recommendation while reading the karpenter.sh documentation [1]:

Note
Don’t use podAffinity and podAntiAffinity to schedule pods on the same or different nodes as other pods. Kubernetes SIG scalability recommends against these features and Karpenter doesn’t support them. Instead, the Karpenter project recommends topologySpreadConstraints to reduce blast radius and nodeSelectors and taints to implement colocation.

More in the official documentation about the motivations: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/#comparison-with-podaffinity-podantiaffinity
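
For illustration only, here is a sketch (placeholder label selector, not what the addon generates today) of how the preferred node-level anti-affinity could be expressed as a single topologySpreadConstraints entry on the pod spec:

```jsonnet
// Sketch only: roughly the "preferred" flavour of node-level anti-affinity,
// expressed as a topology spread constraint. The labelSelector is a placeholder.
{
  spec+: {
    topologySpreadConstraints: [{
      maxSkew: 1,
      topologyKey: 'kubernetes.io/hostname',
      // ScheduleAnyway ~ preferred; DoNotSchedule ~ required.
      whenUnsatisfiable: 'ScheduleAnyway',
      labelSelector: {
        matchLabels: { 'app.kubernetes.io/name': 'prometheus' },
      },
    }],
  },
}
```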

It may not be worth pursuing improvements to the podAntiAffinity addon like #1090.

Thoughts?

Why do we need it?

To follow ever-evolving Kubernetes best practices.

Environment

  • kube-prometheus version:

    Insert Git SHA here

Anything else we need to know?:

anti-affinity addon: https://github.com/prometheus-operator/kube-prometheus/blob/6d013d4e4f980ba99cfdafa9432819d484e2f829/jsonnet/kube-prometheus/addons/anti-affinity.libsonnet

Footnotes

  1. https://karpenter.sh/docs/concepts/

@paulfantom
Member

Can you share more about the SIG Scalability recommendation here? I cannot find much info related to the use of affinity vs topologySpreadConstraints. The only things I found are the KEP introducing the feature and a note that affinity and topologySpreadConstraints can work together.

From what I see, topologySpreadConstraints matters more when you want to have multiple constraints using multiple failure domains (something not easily possible with affinity). However, in our case we are using only one failure domain (node), which raises the question: what is the benefit of switching to a different solution?
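
To illustrate what I mean by multiple constraints across multiple failure domains, a sketch (placeholder labels only) would be something like:

```jsonnet
// Sketch only: spread the same pods across zones and across nodes at once,
// with two constraints the scheduler evaluates together.
{
  topologySpreadConstraints: [
    {
      maxSkew: 1,
      topologyKey: 'topology.kubernetes.io/zone',
      whenUnsatisfiable: 'ScheduleAnyway',
      labelSelector: { matchLabels: { 'app.kubernetes.io/name': 'prometheus' } },
    },
    {
      maxSkew: 1,
      topologyKey: 'kubernetes.io/hostname',
      whenUnsatisfiable: 'DoNotSchedule',
      labelSelector: { matchLabels: { 'app.kubernetes.io/name': 'prometheus' } },
    },
  ],
}
```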

Don't get me wrong, I like the idea of using topologySpreadConstraints, but I just want to know more ;)

@maxbrunet
Contributor Author

I am at the same point as you on all that. Reading the KEP further, PodAntiAffinity seems to cause them issues, and they are even thinking about deprecating it in the linked issue:

Currently PodAntiAffinity supports arbitrary topology domain, but sadly this causes a slow down in scheduling (see Rethink pod affinity/anti-affinity). We're evaluating solutions such as limit topology domain to node, or internally implement a fast/slow path handling that. If this KEP gets implemented, we can simply achieve the semantics of "PodAntiAffinity in zones" via a combination of "Even pods spreading in zones" plus "PodAntiAffinity in nodes" which could be an extra benefit of this KEP.

From this, it seems to be more about how it works internally rather than the potential gain in flexibility for users. My guess is that when the Karpenter documentation refers to the "Kubernetes SIG scalability" recommendation, it is not about the scalability of your deployment, but about the scalability of the Kubernetes scheduler.

@philipgough
Contributor

philipgough commented Dec 8, 2021

they are even thinking about deprecating it in the linked issue

I'm not sure I see anything concrete in that issue which states that there will be any deprecation of affinity or anti-affinity. It is an almost three-year-old issue.

In fact, I see various related enhancements landing in 1.22: kubernetes/enhancements#2249

To be clear, I am not saying we shouldn't consider this or it is a bad idea. I just want to be sure we are not being motivated by the wrong source.

@maxbrunet
Contributor Author

I got the Karpenter documentation clarified: aws/karpenter-provider-aws#948

So in short, pod anti-affinities are a strain on the Kubernetes scheduler at the moment.
There does not seem to be a clear decision on whether they will be deprecated or improved, but topologySpreadConstraints is here as an alternative.

I have finally found the literal recommendation:

Note: Inter-pod affinity and anti-affinity require substantial amount of processing which can slow down scheduling in large clusters significantly. We do not recommend using them in clusters larger than several hundred nodes.

https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity

Now the questions for kube-prometheus:

  1. Are we concerned about the Kubernetes Scheduler performance?
  2. Do we want a topologySpreadConstraints addon?
  3. Do we want to remove the podAntiAffinity addon if doing 2.?

@paulfantom
Member

Are we concerned about the Kubernetes Scheduler performance?

I would say yes. If we can improve our users' environments by using a better performing feature, we should use it.

Do we want a topologySpreadConstraints addon?

I vote yes.

Do we want to remove the podAntiAffinity addon if doing 2.?

No, I think this would break too many workflows. We can, however, put a trace notice in the podAntiAffinity addon to advertise the new topologySpreadConstraints addon. Additionally, we would need to create better documentation about topology spread and the usage of podAntiAffinity as well as topologySpreadConstraints, with an explanation of why not to use podAntiAffinity.

By trace notice I mean something like this: https://github.com/prometheus-operator/kube-prometheus/compare/main...paulfantom:trace-notice?expand=1
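
For anyone unfamiliar with it, std.trace(message, value) prints the message to stderr at evaluation time and returns the value unchanged, so the rendered manifests stay identical. A minimal sketch (the function name and message are illustrative, not the actual patch):

```jsonnet
// Sketch only: wrap whatever the addon returns in std.trace() so importing it
// emits a one-line notice without changing the generated output.
local notice = 'NOTICE: the podAntiAffinity addon may be superseded by a topologySpreadConstraints addon';

{
  antiAffinity(labels):: std.trace(notice, {
    podAntiAffinity: {
      preferredDuringSchedulingIgnoredDuringExecution: [{
        weight: 100,
        podAffinityTerm: {
          labelSelector: { matchLabels: labels },
          topologyKey: 'kubernetes.io/hostname',
        },
      }],
    },
  }),
}
```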


cc @simonpasquier @philipgough you might be interested in this as AFAIR CMO is using anti-affinity heavily.

@philipgough
Contributor

I have no issues with us adding support for the addon while retaining support for inter-pod affinity.

Yes, CMO is using podAntiAffinity for both Alertmanager and the platform Prometheus. I know we have deployed on very large clusters but am not aware of any issues with scheduling performance.

I'd be curious to learn whether the impact would be more substantial for high-churn workloads, as opposed to relatively static platform infra, and whether adding podAntiAffinity or podAffinity to a workload affects overall scheduling performance across the cluster in general or is limited to just those workloads.

@maxbrunet
Contributor Author

@paulfantom sounds good, thank you for pointing to std.trace(); I had been wondering how to output things like deprecation warnings in Jsonnet for some time.

I am going to experiment with topologySpreadConstraints in my clusters, and I will try to contribute an addon later
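
As a starting point, I imagine the addon could look roughly like this. This is only a sketch: the object paths assume the usual kube-prometheus layout, the labels are placeholders, and it relies on the Prometheus CRD exposing spec.topologySpreadConstraints.

```jsonnet
// Sketch only: patch topologySpreadConstraints onto the Prometheus CR, much
// like the anti-affinity addon patches spec.affinity. Alertmanager would get
// the same treatment.
{
  prometheus+: {
    prometheus+: {
      spec+: {
        topologySpreadConstraints: [{
          maxSkew: 1,
          topologyKey: 'kubernetes.io/hostname',
          whenUnsatisfiable: 'ScheduleAnyway',
          labelSelector: {
            matchLabels: { 'app.kubernetes.io/name': 'prometheus' },
          },
        }],
      },
    },
  },
}
```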

@migueleliasweb

This seems to be already implemented: https://prometheus-operator.dev/docs/operator/api/

Should this issue be closed?

@joebowbeer

Karpenter implemented enough pod anti-affinity support to satisfy the Prometheus use case. It was released in v0.9.0.

@simonpasquier
Contributor

@migueleliasweb I think that this feature request is still valid because the ask is to implement a jsonnet addon that uses topologySpreadConstraints and this doesn't exist yet (though the prometheus-operator CRDs support it).
