
[operator] Scaling up number of apiserver pods will result in pods being assigned to the same node #126

Open
natherz97 opened this issue Jan 28, 2022 · 5 comments
Assignees: natherz97
Labels: bug, operator

@natherz97 (Contributor)

If we apply the default ControlPlane object, we will launch one apiserver pod and Karpenter will launch one control plane node for this pod. If we scale up the number of replicas in the apiserver deployment from one to three, the two new apiserver pods will be launched on the existing node rather than being assigned their own nodes. Note that creating the ControlPlane object initially with three replicas would result in three nodes being launched.

Reproduction:

  1. Create a default ControlPlane object which defaults to one apiserver replica:
$ cat <<EOF | kubectl apply -f -
apiVersion: kit.k8s.sh/v1alpha1
kind: ControlPlane
metadata:
  name: ${GUEST_CLUSTER_NAME} # Desired Cluster name
spec: {}
EOF
  2. Edit the object and increase the number of apiserver replicas from one to three (a kubectl patch example is included after the reproduction steps). The object after the edit:
$ kubectl get controlplane example -o yaml
apiVersion: kit.k8s.sh/v1alpha1
kind: ControlPlane
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kit.k8s.sh/v1alpha1","kind":"ControlPlane","metadata":{"annotations":{},"name":"example","namespace":"default"},"spec":{}}
  creationTimestamp: "2022-01-28T21:38:40Z"
  finalizers:
  - kit.k8s.sh/control-plane
  generation: 2
  name: example
  namespace: default
  resourceVersion: "925126"
  uid: 98474866-0d03-443d-a501-bff16a7f8a9b
spec:
  etcd:
    replicas: 3
  kubernetesVersion: "1.19"
  master:
    apiServer:
      replicas: 3
status:
...
  3. Check placement of apiserver pods:
$ kubectl describe node ip-192-168-191-222.us-west-2.compute.internal
...
Non-terminated Pods:          (8 in total)
  Namespace                   Name                                  CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                  ------------  ----------  ---------------  -------------  ---
  default                     example-apiserver-7f588874d9-6qjtq    1 (1%)        0 (0%)      0 (0%)           0 (0%)         25s
  default                     example-apiserver-7f588874d9-7ldgm    1 (1%)        0 (0%)      0 (0%)           0 (0%)         2m3s
  default                     example-apiserver-7f588874d9-j7l7d    1 (1%)        0 (0%)      0 (0%)           0 (0%)         25s
  default                     example-authenticator-wsvxp           0 (0%)        0 (0%)      0 (0%)           0 (0%)         119s
  default                     example-controller-manager-hqrck      1 (1%)        0 (0%)      0 (0%)           0 (0%)         119s
  default                     example-scheduler-5vf8g               1 (1%)        0 (0%)      0 (0%)           0 (0%)         119s
  kube-system                 aws-node-7h2kr                        25m (0%)      0 (0%)      0 (0%)           0 (0%)         77s
  kube-system                 kube-proxy-qwwvb                      100m (0%)     0 (0%)      0 (0%)           0 (0%)         77s
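
For reference, a merge patch is one way to make the scale-up edit in step 2 (kubectl edit controlplane example works equally well); this is just a sketch using the spec path shown in the YAML above:

$ kubectl patch controlplane example --type merge \
    -p '{"spec":{"master":{"apiServer":{"replicas":3}}}}'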
@natherz97 self-assigned this Jan 28, 2022
@natherz97 added the 0.1 megaxl and operator labels Jan 28, 2022
@natherz97 (Contributor, Author)

When we only have 1 apiserver pod, Karpenter launches a single node, so the topology.kubernetes.io/zone and kubernetes.io/hostname topology keys each have only one domain. With our configured PodTopologySpread, we allow a maxSkew of 1, meaning any 2 zones may differ in their number of apiserver pods by 1 and any 2 nodes may differ in their number of apiserver pods by 1. Because skew is only evaluated across topology domains that already exist, the single existing node trivially satisfies both constraints, and the 2 additional apiserver pods are both assigned to it.
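
To make the skew arithmetic concrete (node and zone names below are placeholders, not from the cluster above):

    # Before scale-up: one matching node in one zone
    #   kubernetes.io/hostname       { node-1: 1 }             -> skew 0
    #   topology.kubernetes.io/zone  { us-west-2a: 1 }         -> skew 0
    # After scheduling the 2 new replicas onto the same node
    #   kubernetes.io/hostname       { node-1: 3 }             -> still a single domain, skew 0 <= maxSkew 1
    # Spreading is only forced once a second domain exists, e.g.
    #   kubernetes.io/hostname       { node-1: 3, node-2: 0 }  -> placing another pod on node-1 gives skew 4 > maxSkew 1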

When we create the ControlPlane object with 3 replicas initially, Karpenter launches three nodes, and at that point the scheduler has to distribute the pods among the 3 nodes to respect the maxSkew of 1. To get the desired behavior when scaling up from 1 to 3, the 2 additional nodes would need to already exist in the cluster, which isn't possible because Karpenter only launches new nodes in response to pods that are stuck pending.

We need pod anti-affinity to ensure that the pods aren't scheduled together when there's only one node, since this isn't guaranteed by the PodTopologySpread constraints. Karpenter does not support pod anti-affinity, so if we configure it for apiserver pods, Karpenter will not launch nodes for those pending pods. See: aws/karpenter-provider-aws#695 (comment)

Options:

  • Wait until Karpenter supports pod anti-affinity.
  • Don't support changing the number of replicas after ControlPlane object creation.
  • Delete the existing nodes hosting the apiservers when scaling up the number of replicas (either by directly deleting the node objects or by waiting for Karpenter to do it after scaling the deployment to zero). Note that this would temporarily leave the KIT cluster without any apiservers.

@ellistarn (Contributor)

This is the exact edge case with topology that convinced me to support anti-affinity in Karpenter.

@natherz97 (Contributor, Author)

Ideally, we'd replace these topology spread constraints in the apiserver deployment spec:

      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            kit.k8s.sh/app: example-apiserver
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
      - labelSelector:
          matchLabels:
            kit.k8s.sh/app: example-apiserver
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule

With this anti-affinity:

      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: kit.k8s.sh/app
                operator: In
                values:
                - example-apiserver
            topologyKey: topology.kubernetes.io/zone
          - labelSelector:
              matchExpressions:
              - key: kit.k8s.sh/app
                operator: In
                values:
                - example-apiserver
            topologyKey: kubernetes.io/hostname

And we'd keep the existing nodeSelectors:


      nodeSelector:
        kit.k8s.sh/app: example-apiserver
        kit.k8s.sh/control-plane-name: example
        node.kubernetes.io/instance-type: m5.16xlarge

@prateekgogia (Member)

> This is the exact edge case with topology that convinced me to support anti-affinity in Karpenter.

@ellistarn Is this already merged/available in the latest release?

@natherz97 (Contributor, Author) commented Jan 31, 2022

Built Karpenter from this PR to add pod anti-affinity support: aws/karpenter-provider-aws#1221

Used this anti-affinity spec in the apiserver deployment, and it fixes the issue where pods are assigned to the same existing node when scaling up from 1 apiserver replica to 3 (a quick way to verify placement is shown after the spec).

    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: kit.k8s.sh/app
                operator: In
                values:
                - example-apiserver
            topologyKey: kubernetes.io/hostname
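
With that spec in place, placement can be sanity-checked with something like the following (the label selector reuses the kit.k8s.sh/app label from the specs above); the NODE column should show three distinct nodes:

$ kubectl get pods -l kit.k8s.sh/app=example-apiserver -o wide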

@prateekgogia added the bug label Feb 7, 2022
@prateekgogia removed the 0.1 megaxl label Mar 21, 2022