Statefulset pod created in wrong zone #5599

Closed

Vormillion opened this issue Feb 5, 2024 · 4 comments
Labels
bug Something isn't working

Comments


Vormillion commented Feb 5, 2024

Description

I am creating a new app deployment using Helm.
There are 3 deployments with podAffinity and 1 statefulset (all single replica); all 4 fit on my node.

The statefulset creates a PVC using the default storage class (EBS, zone eu-west-1c).
The problem is that the 3 deployments start on a node in zone eu-west-1a/b/c (chosen randomly), so when the EBS volume is created in the requested zone eu-west-1c, the statefulset pod ends up together with the deployments in a different zone.

I saw ticket #1015 where in theory such a scenario should be covered, but it looks like Karpenter does not take the statefulset's PV zone constraint into consideration when selecting a zone.

Karpenter version: 0.33

Statefulset

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rabbitmq
  namespace: default
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Retain
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      type: rabbitmq
  serviceName: rabbitmq-headless
  template:
    metadata:
      creationTimestamp: null
      labels:
        role: rabbitmq
        type: rabbitmq
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: role
                operator: In
                values:
                - frontend
            topologyKey: kubernetes.io/hostname
      containers:
      - command:
        - bash
        - -ec
        - |
          mkdir -p /opt/bitnami/rabbitmq/.rabbitmq/
          mkdir -p /opt/bitnami/rabbitmq/etc/rabbitmq/
          touch /opt/bitnami/rabbitmq/var/lib/rabbitmq/.start
          #persist the erlang cookie in both places for server and cli tools
          echo $RABBITMQ_ERL_COOKIE > /opt/bitnami/rabbitmq/var/lib/rabbitmq/.erlang.cookie
          cp /opt/bitnami/rabbitmq/var/lib/rabbitmq/.erlang.cookie /opt/bitnami/rabbitmq/.rabbitmq/
          #change permission so only the user has access to the cookie file
          chmod 600 /opt/bitnami/rabbitmq/.rabbitmq/.erlang.cookie /opt/bitnami/rabbitmq/var/lib/rabbitmq/.erlang.cookie
          #copy the mounted configuration to both places
          cp  /opt/bitnami/rabbitmq/conf/* /opt/bitnami/rabbitmq/etc/rabbitmq
          # Apply resources limits
          ulimit -n "${RABBITMQ_ULIMIT_NOFILES}"
          #replace the default password that is generated
          sed -i "/CHANGEME/cdefault_pass=${RABBITMQ_PASSWORD//\\/\\\\}" /opt/bitnami/rabbitmq/etc/rabbitmq/rabbitmq.conf
          #api check for probes
          cat > /opt/bitnami/rabbitmq/sbin/rabbitmq-api-check <<EOF
          #!/bin/sh
          set -e
          URL=\$1
          EXPECTED=\$2
          ACTUAL=\$(curl --silent --show-error --fail "\${URL}")
          echo "\${ACTUAL}"
          test "\${EXPECTED}" = "\${ACTUAL}"
          EOF
          chmod a+x /opt/bitnami/rabbitmq/sbin/rabbitmq-api-check
          #health check for probes, handle period during rabbitmq sync
          cat > /opt/bitnami/rabbitmq/sbin/rabbitmq-health-check <<EOF
          #!/bin/sh
          START_FLAG=/opt/bitnami/rabbitmq/var/lib/rabbitmq/.start
          if [ -f \${START_FLAG} ]; then
             rabbitmqctl node_health_check
             RESULT=\$?
             if [ \$RESULT -ne 0 ]; then
                rabbitmqctl status
                exit $?
             fi
             rm -f \${START_FLAG}
             exit \${RESULT}
          fi
          rabbitmq-api-check \$1 \$2
          EOF
          chmod a+x /opt/bitnami/rabbitmq/sbin/rabbitmq-health-check
          exec rabbitmq-server
        env:
        - name: BITNAMI_DEBUG
          value: "false"
        - name: MY_POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: K8S_SERVICE_NAME
          value: rabbitmq-headless
        - name: K8S_ADDRESS_TYPE
          value: hostname
        - name: RABBITMQ_NODENAME
          value: rabbit@$(MY_POD_NAME).$(K8S_SERVICE_NAME).$(MY_POD_NAMESPACE).svc.cluster.local
        - name: K8S_HOSTNAME_SUFFIX
          value: .$(K8S_SERVICE_NAME).$(MY_POD_NAMESPACE).svc.cluster.local
        - name: RABBITMQ_LOGS
          value: '-'
        - name: RABBITMQ_ULIMIT_NOFILES
          value: "65536"
        - name: RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS
          value: +S 2:1
        - name: RABBITMQ_USE_LONGNAME
          value: "true"
        - name: RABBITMQ_ERL_COOKIE
          valueFrom:
            secretKeyRef:
              key: rabbitmq-erlang-cookie
              name: rabbitmq
        - name: RABBITMQ_PASSWORD
          valueFrom:
            secretKeyRef:
              key: rabbitmq-password
              name: rabbitmq
        image: xxx
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - sh
            - -c
            - rabbitmq-api-check "http://admin:[email protected]:15672/api/healthchecks/node"
              '{"status":"ok"}'
          failureThreshold: 6
          initialDelaySeconds: 120
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 20
        name: rabbitmq
        ports:
        - containerPort: 4369
          name: epmd
          protocol: TCP
        - containerPort: 5672
          name: amqp
          protocol: TCP
        - containerPort: 25672
          name: dist
          protocol: TCP
        - containerPort: 15672
          name: stats
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - sh
            - -c
            - rabbitmq-health-check "http://admin:[email protected]:15672/api/healthchecks/node"
              '{"status":"ok"}'
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 20
        resources:
          limits:
            cpu: 100m
            memory: 250Mi
          requests:
            cpu: 100m
            memory: 250Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /opt/bitnami/rabbitmq/conf
          name: config-volume
        - mountPath: /opt/bitnami/rabbitmq/var/lib/rabbitmq
          name: rabbitmq-data
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: regcred
      nodeSelector:
        karpenter.sh/nodepool: custompool
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1001
        runAsUser: 1001
      serviceAccount: rabbitmq
      serviceAccountName: rabbitmq
      terminationGracePeriodSeconds: 10
      tolerations:
      - effect: NoSchedule
        key: custompool-taint
        operator: Exists
      volumes:
      - configMap:
          defaultMode: 420
          items:
          - key: rabbitmq.conf
            path: rabbitmq.conf
          - key: enabled_plugins
            path: enabled_plugins
          name: rabbitmq-config
        name: config-volume
      - name: rabbitmq-data
        persistentVolumeClaim:
          claimName: data-rabbitmq-0
  updateStrategy:
    type: RollingUpdate

PVC

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app: rabbitmq
  name: data-rabbitmq-0
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: encrypted-gp3-rabbit
  volumeMode: Filesystem

Storageclass

allowedTopologies:
- matchLabelExpressions:
  - key: topology.ebs.csi.aws.com/zone
    values:
    - eu-west-1c
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: encrypted-gp3-rabbit
parameters:
  encrypted: "true"
  fsType: ext4
  type: gp3
provisioner: ebs.csi.aws.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

Deployments - created 3 times with different names

apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend-nginx-php
  namespace: default
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      type: frontend-nginx-php
  strategy:
    rollingUpdate:
      maxSurge: 50%
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        role: frontend
        type: frontend-nginx-php
    spec:
      containers:
      - env:
        - name: PREFIX
          value: xx
        envFrom:
        - configMapRef:
            name: xx
        image: xx
        imagePullPolicy: IfNotPresent
        lifecycle:
          postStart:
            exec:
              command:
              - /bin/sh
              - -c
              - |
                whoami
          preStop:
            exec:
              command:
              - /bin/bash
              - -c
              - /bin/sleep 1; kill -QUIT 1
        name: frontend-php
        ports:
        - containerPort: 9000
          name: php-fpm
          protocol: TCP
        resources:
          limits:
            cpu: "1"
            memory: 2Gi
          requests:
            cpu: "1"
            memory: 2Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      nodeSelector:
        karpenter.sh/nodepool: custompool
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: custompool-taint
        operator: Exists
Vormillion added the bug and needs-triage labels on Feb 5, 2024
tzneal (Contributor) commented Feb 5, 2024

Can you supply the Pod, PVC and PV specs?

tzneal removed the needs-triage label on Feb 5, 2024
Vormillion (Author) commented:

I updated my first comment.

If it makes a difference, I start the 3 deployments first, then the rabbitmq statefulset.

So far I have worked around this issue by strictly setting the zone to eu-west-1c in my nodepool (custompool) to match the storageclass allowedTopologies.
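For reference, a minimal sketch of what that zone pinning could look like on the NodePool (assuming the karpenter.sh/v1beta1 API used by v0.33; the rest of the custompool spec is omitted here):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: custompool
spec:
  template:
    spec:
      requirements:
      # Assumption: restrict the pool to the zone allowed by the storageclass,
      # so the pods and the EBS volume all land in eu-west-1c together.
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - eu-west-1c
      # other existing requirements (instance types, arch, capacity type) unchanged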

engedaam (Contributor) commented Feb 7, 2024

Can you provide the karpenter logs as well?

Vormillion (Author) commented:

Finally solved by forcing the nodepool where the STS will land to schedule in the same zone as the EBS volume.
