Replace PodAntiAffinity with TopologySpreadConstraints #8826
Conversation
Fixes linkerd#8168 Signed-off-by: Takumi Sue <[email protected]>
Signed-off-by: Takumi Sue <[email protected]>
      - {{ .component }}
  topologyKey: kubernetes.io/hostname
{{- end }}

{{ define "linkerd.node-affinity" -}}
nodeAffinity:
{{- toYaml .Values.nodeAffinity | trim | nindent 2 }}
{{- end }}

{{ define "linkerd.affinity" -}}
Since node-affinity is now the only thing in this define, I think we can just inline the node-affinity define here to simplify.
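A minimal sketch of what that inlining could look like, based on the node-affinity define shown above (the guard on .Values.nodeAffinity is my assumption about the surrounding template, not part of this diff):

{{ define "linkerd.affinity" -}}
{{- if .Values.nodeAffinity -}}
nodeAffinity:
{{- toYaml .Values.nodeAffinity | trim | nindent 2 }}
{{- end }}
{{- end }}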
RemoteMirrorServiceAccount bool `json:"remoteMirrorServiceAccount"`
RemoteMirrorServiceAccountName string `json:"remoteMirrorServiceAccountName"`
TargetClusterName string `json:"targetClusterName"`
EnableTopologySpreadConstraints bool `json:"enableTopologySpreadConstraints"`
I think in order to safely remove the EnablePodAntiAffinity value and replace it with EnableTopologySpreadConstraints, we'd need to go through a deprecation cycle where we temporarily support both values. It's a bit unfortunate, because both values have the same intent (to spread control plane pods across nodes) even if what they actually do is slightly different.
What do you think about keeping the EnablePodAntiAffinity value but just having it enable topology spread instead of pod anti-affinity? This would mean the name of the value is a bit of a lie, but it avoids the need for users to migrate from one value to the other.
@alpeb do you have any thoughts on this?
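For illustration, one rough way the template could honor both flags during a deprecation window, with either value producing the new topology spread output (a hypothetical sketch using the value names from this thread, not the PR's actual change; the label key and the whenUnsatisfiable choice are assumptions):

{{- if or .Values.enableTopologySpreadConstraints .Values.enablePodAntiAffinity }}
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      {{ default "linkerd.io/control-plane-component" .label }}: {{ .component }}
{{- end }}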
I'd rather not change the meaning of a setting like that, as it may confuse users ("why do I have multiple replicas on the same node even though I enabled pod anti-affinity?"). As for the deprecation cycle, IMO it's enough to call this out in the upgrade notes; 2.12 will require manually migrating Helm configs anyway, so we might as well include this change.
Sounds good. Will we run into any serialization/deserialization errors during upgrade because the Values struct is changing?
I tested upgrading using both the CLI and Helm, and the config seems to behave well. The old enablePodAntiAffinity remains in the CM, but that should be fine.
Signed-off-by: Takumi Sue <[email protected]>
{{- if .Values.enableTopologySpreadConstraints }}
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: failure-domain.beta.kubernetes.io/zone
This label has been declared deprecated. Can you use topology.kubernetes.io/zone instead?
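That is, the constraint above keeps the same shape and only the topologyKey is swapped to the supported label:

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone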
matchExpressions:
- key: {{ default "linkerd.io/control-plane-component" .label }}
  operator: In
  values:
  - {{ .component }}
I think this can be simplified to the following (same applies to the rule below). Instead of:

matchExpressions:
- key: {{ default "linkerd.io/control-plane-component" .label }}
  operator: In
  values:
  - {{ .component }}

use:

matchLabels:
  {{ default "linkerd.io/control-plane-component" .label }}: {{ .component }}
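For illustration, with no .label override and a hypothetical component value of "destination", the suggested selector renders to roughly:

matchLabels:
  linkerd.io/control-plane-component: destination

Since a matchLabels entry is equivalent to a single-value In expression, the two forms select exactly the same pods.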
Thanks @mikutas for the thorough changes. From my testing using --ha, I see pods are well distributed among hosts when installing, and they don't get blocked if the number of nodes is lower than the number of desired replicas, unlike when using pod anti-affinity. However, when upgrading from the latest edge to this branch, they usually don't get well distributed, unlike with the previous approach. Any ideas how we could improve that?
Hmm, can we recommend that users run the descheduler if necessary in the upgrade notes? (that's all I have for now 💭) @alpeb
@mikutas Ok, I see in the docs that TopologySpreadConstraints are really only intended for when the pods get created; as things evolve there's no guarantee the constraints continue to be respected, so they recommend coupling with the Descheduler as you point out, which is a bummer 🤷‍♂️
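For anyone following along, a rough sketch of the kind of Descheduler policy that could re-balance pods after an upgrade (this uses the v1alpha1 policy format from the kubernetes-sigs/descheduler project; the strategy and field names depend on the descheduler version, so verify against its docs rather than taking this as authoritative):

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingTopologySpreadConstraint":
    enabled: true
    params:
      # also act on soft (ScheduleAnyway) constraints, not only DoNotSchedule ones
      includeSoftConstraints: true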
@jeremychase @adleong WDYT? I think this new approach is still better because it avoids locking things down as in the scenario described in #8168, but it can be surprising that it stops being relevant after upgrades, so we should at least document that in the Helm config. And for people requiring a stricter spreading policy, well, they can put up with the extra work.
That's a disappointing wrinkle; TopologySpreadConstraints don't work like we expected. Given that, I'd be inclined to stick with PodAntiAffinity, since that gives us better guarantees that control plane pods will be spread between nodes and that the control plane will be resilient to node failure. I realize this doesn't address #8168, but it's unclear to me why one would want more control plane replicas than nodes. One replica should be more than enough to handle a node's worth of pods, and HA mode isn't really appropriate for clusters with a very small number of nodes anyway.
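For context, the anti-affinity rule being defended here has roughly this shape (a sketch based on the pod-affinity partial visible in this diff, not the verbatim chart output; "destination" stands in for any control plane component):

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          linkerd.io/control-plane-component: destination
      topologyKey: kubernetes.io/hostname

Because the rule is required at scheduling time, a replica that cannot get a node to itself stays Pending, which is the trade-off described in #8168.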
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: failure-domain.beta.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway
According to my testing, if that label is not there (which is the case in k3d, for example), switching that will cause pods not to be scheduled at all. Removing that rule and leaving only the one for kubernetes.io/hostname still causes pods not to be allocated as expected during upgrades, unfortunately.
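For reference, the hostname-only variant discussed here would look roughly like this (a sketch; the labelSelector contents and keeping ScheduleAnyway are my assumptions):

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      linkerd.io/control-plane-component: destination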
failure-domain.beta.kubernetes.io/zone is deprecated Signed-off-by: Takumi Sue <[email protected]>
Signed-off-by: Takumi Sue <[email protected]>
Fixes #8168
Signed-off-by: Takumi Sue [email protected]