
[operator] Scaling up number of apiserver pods will result in pods being assigned to the same node #126

Open
natherz97 opened this issue Jan 28, 2022 · 5 comments
Assignees: natherz97
Labels: bug, operator

@natherz97 (Contributor)

If we apply the default ControlPlane object, we will launch one apiserver pod and Karpenter will launch one control plane node for this pod. If we scale up the number of replicas in the apiserver deployment from one to three, the two new apiserver pods will be launched on the existing node rather than being assigned their own nodes. Note that creating the ControlPlane object initially with three replicas would result in three nodes being launched.

Reproduction:

  1. Create a default ControlPlane object which defaults to one apiserver replica:
$ cat <<EOF | kubectl apply -f -
apiVersion: kit.k8s.sh/v1alpha1
kind: ControlPlane
metadata:
  name: ${GUEST_CLUSTER_NAME} # Desired Cluster name
spec: {}
EOF
  2. Edit the object and increase the number of apiserver replicas from one to three (a kubectl patch example is included after the reproduction steps). The object after the edit:
$ kubectl get controlplane example -o yaml
apiVersion: kit.k8s.sh/v1alpha1
kind: ControlPlane
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kit.k8s.sh/v1alpha1","kind":"ControlPlane","metadata":{"annotations":{},"name":"example","namespace":"default"},"spec":{}}
  creationTimestamp: "2022-01-28T21:38:40Z"
  finalizers:
  - kit.k8s.sh/control-plane
  generation: 2
  name: example
  namespace: default
  resourceVersion: "925126"
  uid: 98474866-0d03-443d-a501-bff16a7f8a9b
spec:
  etcd:
    replicas: 3
  kubernetesVersion: "1.19"
  master:
    apiServer:
      replicas: 3
status:
...
  3. Check placement of apiserver pods:
$ kubectl describe node ip-192-168-191-222.us-west-2.compute.internal
...
Non-terminated Pods:          (8 in total)
  Namespace                   Name                                  CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                  ------------  ----------  ---------------  -------------  ---
  default                     example-apiserver-7f588874d9-6qjtq    1 (1%)        0 (0%)      0 (0%)           0 (0%)         25s
  default                     example-apiserver-7f588874d9-7ldgm    1 (1%)        0 (0%)      0 (0%)           0 (0%)         2m3s
  default                     example-apiserver-7f588874d9-j7l7d    1 (1%)        0 (0%)      0 (0%)           0 (0%)         25s
  default                     example-authenticator-wsvxp           0 (0%)        0 (0%)      0 (0%)           0 (0%)         119s
  default                     example-controller-manager-hqrck      1 (1%)        0 (0%)      0 (0%)           0 (0%)         119s
  default                     example-scheduler-5vf8g               1 (1%)        0 (0%)      0 (0%)           0 (0%)         119s
  kube-system                 aws-node-7h2kr                        25m (0%)      0 (0%)      0 (0%)           0 (0%)         77s
  kube-system                 kube-proxy-qwwvb                      100m (0%)     0 (0%)      0 (0%)           0 (0%)         77s
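
For reference, a merge patch is one way to make the scale-up edit in step 2 (kubectl edit controlplane example works equally well); this is just a sketch using the spec path shown in the YAML above:

$ kubectl patch controlplane example --type merge \
    -p '{"spec":{"master":{"apiServer":{"replicas":3}}}}'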
@natherz97 self-assigned this Jan 28, 2022
@natherz97 added the 0.1 megaxl and operator labels Jan 28, 2022
@natherz97 (Contributor, Author)

When we only have 1 apiserver pod, Karpenter launches a single node, so the topology.kubernetes.io/zone and kubernetes.io/hostname topology keys each have only one domain. With our configured PodTopologySpread, we allow a maxSkew of 1, meaning any 2 zones may differ in their number of apiserver pods by 1 and any 2 nodes may differ in their number of apiserver pods by 1. Because skew is only evaluated across topology domains that already exist, the single existing node trivially satisfies both constraints, and the 2 additional apiserver pods are both assigned to it.
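
To make the skew arithmetic concrete (node and zone names below are placeholders, not from the cluster above):

    # Before scale-up: one matching node in one zone
    #   kubernetes.io/hostname       { node-1: 1 }             -> skew 0
    #   topology.kubernetes.io/zone  { us-west-2a: 1 }         -> skew 0
    # After scheduling the 2 new replicas onto the same node
    #   kubernetes.io/hostname       { node-1: 3 }             -> still a single domain, skew 0 <= maxSkew 1
    # Spreading is only forced once a second domain exists, e.g.
    #   kubernetes.io/hostname       { node-1: 3, node-2: 0 }  -> placing another pod on node-1 gives skew 4 > maxSkew 1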

When we create the ControlPlane object with 3 replicas initially, Karpenter launches three nodes, and at that point the scheduler has to distribute the pods among the 3 nodes to respect the maxSkew of 1. To get the desired behavior when scaling up from 1 to 3, the 2 additional nodes would need to already exist in the cluster, which isn't possible because Karpenter only launches new nodes in response to pods that are stuck pending.

We need pod anti-affinity to ensure that the pods aren't scheduled together when there's only one node, since this isn't guaranteed by the PodTopologySpread constraints. Karpenter does not support pod anti-affinity, so if we configure it for apiserver pods, Karpenter will not launch nodes for those pending pods. See: aws/karpenter-provider-aws#695 (comment)

Options:

  • Wait until Karpenter supports pod anti-affinity.
  • Don't support changing the number of replicas after ControlPlane object creation.
  • Delete the existing nodes hosting the apiservers when scaling up the number of replicas (either by directly deleting the node objects or by waiting for Karpenter to do it after scaling the deployment to zero). Note that this would temporarily leave the KIT cluster without any apiservers.

@ellistarn (Contributor)

This is the exact edge case with topology that convinced me to support anti-affinity in Karpenter.

@natherz97 (Contributor, Author)

Ideally, we'd replace these topology spread constraints in the apiserver deployment spec:

      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            kit.k8s.sh/app: example-apiserver
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
      - labelSelector:
          matchLabels:
            kit.k8s.sh/app: example-apiserver
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule

With this anti-affinity:

      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: kit.k8s.sh/app
                operator: In
                values:
                - example-apiserver
            topologyKey: topology.kubernetes.io/zone
          - labelSelector:
              matchExpressions:
              - key: kit.k8s.sh/app
                operator: In
                values:
                - example-apiserver
            topologyKey: kubernetes.io/hostname

And we'd keep the existing nodeSelectors:


      nodeSelector:
        kit.k8s.sh/app: example-apiserver
        kit.k8s.sh/control-plane-name: example
        node.kubernetes.io/instance-type: m5.16xlarge

@prateekgogia (Member)

> This is the exact edge case with topology that convinced me to support anti-affinity in Karpenter.

@ellistarn Is this already merged/available in the latest release?

@natherz97 (Contributor, Author) commented Jan 31, 2022

Built Karpenter from this PR to add pod anti-affinity support: aws/karpenter-provider-aws#1221

Used this anti-affinity spec in the apiserver deployment, and it fixes the issue where pods are assigned to the same existing node when scaling up from 1 apiserver replica to 3 (a quick way to verify placement is shown after the spec).

    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: kit.k8s.sh/app
                operator: In
                values:
                - example-apiserver
            topologyKey: kubernetes.io/hostname
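
With that spec in place, placement can be sanity-checked with something like the following (the label selector reuses the kit.k8s.sh/app label from the specs above); the NODE column should show three distinct nodes:

$ kubectl get pods -l kit.k8s.sh/app=example-apiserver -o wide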

@prateekgogia added the bug label Feb 7, 2022
@prateekgogia removed the 0.1 megaxl label Mar 21, 2022