
Replace PodAntiAffinity with TopologySpreadConstraints #8826

Closed · wants to merge 5 commits

Conversation

@mikutas (Contributor) commented Jul 7, 2022

Fixes #8168

Signed-off-by: Takumi Sue <[email protected]>
@mikutas marked this pull request as ready for review July 7, 2022 13:22
@mikutas requested a review from a team as a code owner July 7, 2022 13:22
- {{ .component }}
topologyKey: kubernetes.io/hostname
{{- end }}

{{ define "linkerd.node-affinity" -}}
nodeAffinity:
{{- toYaml .Values.nodeAffinity | trim | nindent 2 }}
{{- end }}

{{ define "linkerd.affinity" -}}
Member:

Since node-affinity is now the only thing in this define, I think we can just inline the node-affinity define here to simplify.
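
A minimal sketch of what that inlining could look like, reusing the nodeAffinity block from the diff above (the guard on .Values.nodeAffinity is an assumption, not something shown in this diff):

{{ define "linkerd.affinity" -}}
{{/* sketch: emit node affinity directly, since it's the only content left in this define */}}
{{- if .Values.nodeAffinity }}
nodeAffinity:
{{- toYaml .Values.nodeAffinity | trim | nindent 2 }}
{{- end }}
{{- end }}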

RemoteMirrorServiceAccount bool `json:"remoteMirrorServiceAccount"`
RemoteMirrorServiceAccountName string `json:"remoteMirrorServiceAccountName"`
TargetClusterName string `json:"targetClusterName"`
EnableTopologySpreadConstraints bool `json:"enableTopologySpreadConstraints"`
Member:

I think in order to safely remove the EnablePodAntiAffinity value and replace it with EnableTopologySpreadConstraints we'd need to go through a deprecation cycle where we temporarily support both values. It's a bit unfortunate because both values have the same intent (to spread control plane pods across nodes) even if what they actually do is slightly different.

What do you think about keeping the EnablePodAntiAffinity value but just having it enable topology spread instead of pod anti-affinity? This would mean the name of the value is a bit of a lie, but it avoids the need for users to migrate from one value to the other.

@alpeb do you have any thoughts on this?
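
A rough sketch of that option, keeping the existing flag but emitting a spread constraint instead of anti-affinity (the template below is illustrative; the label selector and maxSkew values are assumptions, not taken from this diff):

{{- if .Values.enablePodAntiAffinity }}
{{/* sketch only: same enablePodAntiAffinity value, different behavior */}}
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      linkerd.io/control-plane-component: {{ .component }}
{{- end }}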

Member:

I'd rather not change the meaning of a setting like that, as it may confuse users ("why do I have multiple replicas on the same node even though I enabled pod anti-affinity?"). As for the deprecation cycle, IMO it's enough to call this out in the upgrade notes; 2.12 will require manually migrating Helm configs anyway, so we might as well include this change as well.

Member:

Sounds good. Will we run into any serialization/deserialization errors during upgrade because the Values struct is changing?

Member:

I tested upgrading using both the CLI and Helm, and the config seems to behave well. The old enablePodAntiAffinity value remains in the ConfigMap, but that should be fine.

{{- if .Values.enableTopologySpreadConstraints }}
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: failure-domain.beta.kubernetes.io/zone
Member:

This label has been declared deprecated. Can you use topology.kubernetes.io/zone instead?
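
For example, the constraint above would become something like:

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone  # non-deprecated replacement for failure-domain.beta.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway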

Comment on lines 8 to 12
matchExpressions:
- key: {{ default "linkerd.io/control-plane-component" .label }}
  operator: In
  values:
  - {{ .component }}
Member:

I think this can be simplified to this (same applies to the rule below):

Suggested change:

matchExpressions:
- key: {{ default "linkerd.io/control-plane-component" .label }}
  operator: In
  values:
  - {{ .component }}

becomes:

matchLabels:
  {{ default "linkerd.io/control-plane-component" .label }}: {{ .component }}
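
For reference, with the default label and a hypothetical component name of destination, the suggested selector would render roughly as:

matchLabels:
  linkerd.io/control-plane-component: destination  # "destination" is just an example component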

@alpeb (Member) left a comment:

Thanks @mikutas for the thorough changes. From my testing using --ha, I see pods are well distributed among hosts when installing, and they don't get blocked when the number of nodes is lower than the number of desired replicas, unlike when using pod anti-affinity. However, when upgrading from the latest edge to this branch, they usually don't end up well distributed, unlike with the previous approach. Any ideas how we could improve that?

@mikutas (Contributor, Author) commented Jul 9, 2022

Hmm, can we recommend that users use the Descheduler if necessary in the upgrade notes? (that's all I have for now 💭) @alpeb

@alpeb (Member) commented Jul 11, 2022

@mikutas Ok, I see in the docs that topologySpreadConstraints are really only enforced when the pods get created, and as things evolve there's no guarantee the constraints continue to be respected, so they recommend coupling with the Descheduler as you point out, which is a bummer 🤷‍♂️

There's no guarantee that the constraints remain satisfied when Pods are removed. For example, scaling down a Deployment may result in imbalanced Pods distribution. You can use Descheduler to rebalance the Pods distribution.

@jeremychase @adleong WDYT? I think this new approach is still better because it avoids locking things down as in the scenario described in #8168, but it can be surprising that it stops being enforced after upgrades, so we should at least document that in the Helm config. And people requiring a stricter spreading policy can put up with the extra work.
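
If we do point users at the Descheduler in the upgrade notes, the relevant piece would be something along these lines (a sketch assuming the v1alpha1 DeschedulerPolicy format; the exact schema should be checked against the Descheduler docs):

apiVersion: descheduler/v1alpha1
kind: DeschedulerPolicy
strategies:
  RemovePodsViolatingTopologySpreadConstraint:  # evicts pods so they can be rescheduled in line with the constraints
    enabled: true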

@adleong (Member) commented Jul 12, 2022

That's a disappointing wrinkle that TopologySpreadConstraints don't work like we expected. Given that, I'd be inclined to stick with PodAntiAffinity since that gives us better guarantees that control plane pods will be spread between nodes and that the control plane will be resilient to node failure.

I realize that this doesn't address #8168 but it's unclear to me why one would want more control plane replicas than nodes. One replica should be more than enough to handle a node's worth of pods and HA mode isn't really appropriate for clusters with a very small number of nodes anyway.
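
For comparison, the anti-affinity approach being defended here follows the general pattern below (a sketch of the pattern, not the exact chart output; "destination" is just an example component):

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          linkerd.io/control-plane-component: destination
      topologyKey: kubernetes.io/hostname  # hard requirement: replicas must land on distinct nodes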

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: failure-domain.beta.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway
Contributor:

@alpeb @adleong

I'm inclined to think that if we switched this to DoNotSchedule the unexpected behavior may not be a problem.

For this to work we would need to ensure that the pods get scheduled after the condition is satisfied again.

Thoughts?
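
A sketch of the variant being floated here, with the hard setting in place of ScheduleAnyway:

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule  # pods stay Pending until the skew can be satisfied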

Member:

According to my testing, if that label is not there (which is the case in k3d, for example), switching that will cause pods to not be scheduled at all. Removing that rule and leaving only the one for kubernetes.io/hostname still causes pods not to be distributed as expected during upgrades, unfortunately.

mikutas added 2 commits July 13, 2022 14:53
failure-domain.beta.kubernetes.io/zone is deprecated

Signed-off-by: Takumi Sue <[email protected]>
Labels: none yet
Projects: none yet
Development: successfully merging this pull request may close the issue "Implement topology spread constraints"
4 participants