[operator] Scaling up number of apiserver pods will result in pods being assigned to the same node #126
When we only have 1 apiserver pod, Karpenter launches a single node, so our topology.kubernetes.io/zone topology has 1 node and kubernetes.io/hostname has 1 node. With our configured PodTopologySpread, we allow a maxSkew of 1, meaning any 2 zones may differ in their number of apiserver pods by 1 and any 2 nodes may differ in their number of apiserver pods by 1. As a result, the maxSkew of 1 is still satisfied when we create 2 additional apiserver pods and both are assigned to the existing node.

When we create the ControlPlane object with 3 replicas initially, Karpenter launches three nodes, and at that point the scheduler has to distribute the pods among the 3 nodes to respect the maxSkew of 1. To get the desired behavior when scaling up from 1 to 3, the 2 additional nodes would need to already exist in the cluster, which isn't possible because we'd only launch those 2 nodes once the pods were stuck pending.

We need pod anti-affinity to ensure that the pods aren't scheduled together when there's one node, since this isn't guaranteed by the PodTopologySpread. However, Karpenter does not support pod anti-affinity, so if we configure it for apiserver pods, Karpenter will not launch nodes for those pending pods. See: aws/karpenter-provider-aws#695 (comment)

Options:
This is the exact edge case with topology that convinced me to support anti-affinity in Karpenter.
Ideally we'd want to replace this topology spread in the apiserver deployment spec:
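The exact spec isn't reproduced in this thread; as a rough sketch, assuming a hypothetical app: example-apiserver pod label, the topology spread being replaced looks something like:

topologySpreadConstraints:
  # Spread apiserver pods across zones, allowing a skew of at most 1 pod.
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: example-apiserver
  # Spread apiserver pods across individual nodes, also with a skew of at most 1.
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: example-apiserver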
With this anti-affinity:
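Sketched under the same assumption about the pod label (the actual spec isn't reproduced here):

affinity:
  podAntiAffinity:
    # Hard requirement: no two apiserver pods may share a node (hostname).
    requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            app: example-apiserver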
And we'd keep the existing nodeSelectors:
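The operator's actual node labels aren't shown in this thread, so the key and value below are purely hypothetical placeholders:

nodeSelector:
  # Hypothetical label that pins apiserver pods to the dedicated control-plane nodes.
  example.com/control-plane: "true"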
@ellistarn Is this already merged/available in the latest release?
Built Karpenter from this PR to add pod anti-affinity support: aws/karpenter-provider-aws#1221. Used this anti-affinity spec in the apiserver deployment, and it fixes the issue when scaling up from 1 apiserver to 3, where they previously got assigned to the same existing node.
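One quick way to confirm the fix, assuming the example-apiserver pod name prefix seen in the reproduction below:

# Each apiserver pod should now report a different NODE value.
$ kubectl get pods -o wide | grep example-apiserver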
If we apply the default ControlPlane object, we will launch one apiserver pod and Karpenter will launch one control plane node for this pod. If we scale up the number of replicas in the apiserver deployment from one to three, the two new apiserver pods will be launched on the existing node rather than being assigned their own nodes. Note that creating the ControlPlane object initially with three replicas would result in three nodes being launched.
Reproduction:
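The exact commands aren't captured in the issue, so this is a minimal sketch assuming the ControlPlane manifest is saved as controlplane.yaml and the deployment is named example-apiserver, as the pod names below suggest:

# Create the default ControlPlane object (1 apiserver replica); Karpenter launches one node for it.
$ kubectl apply -f controlplane.yaml

# Scale the apiserver deployment from 1 to 3 replicas.
$ kubectl scale deployment example-apiserver --replicas=3

# Describing the original node (output below) shows all three apiserver pods assigned to it.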
$ kubectl describe node ip-192-168-191-222.us-west-2.compute.internal
...
Non-terminated Pods:          (8 in total)
  Namespace    Name                                  CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------    ----                                  ------------  ----------  ---------------  -------------  ---
  default      example-apiserver-7f588874d9-6qjtq    1 (1%)        0 (0%)      0 (0%)           0 (0%)         25s
  default      example-apiserver-7f588874d9-7ldgm    1 (1%)        0 (0%)      0 (0%)           0 (0%)         2m3s
  default      example-apiserver-7f588874d9-j7l7d    1 (1%)        0 (0%)      0 (0%)           0 (0%)         25s
  default      example-authenticator-wsvxp           0 (0%)        0 (0%)      0 (0%)           0 (0%)         119s
  default      example-controller-manager-hqrck      1 (1%)        0 (0%)      0 (0%)           0 (0%)         119s
  default      example-scheduler-5vf8g               1 (1%)        0 (0%)      0 (0%)           0 (0%)         119s
  kube-system  aws-node-7h2kr                        25m (0%)      0 (0%)      0 (0%)           0 (0%)         77s
  kube-system  kube-proxy-qwwvb                      100m (0%)     0 (0%)      0 (0%)           0 (0%)         77s