There is a potential issue with our StatefulSets (STS), which is as follows:
Our STSs use an updateStrategy of RollingUpdate, which means that when you make certain edits to the STS (for example, changing the image version, or the pod template's labels or annotations), the STS begins a rolling update of its pods.
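For context, the relevant pieces of the STS spec look roughly like this (an illustrative sketch, not our full manifest; the image tag and container name are just examples):

```yaml
# Illustrative sketch only; field names are standard Kubernetes API, values are examples.
apiVersion: apps/v1
kind: StatefulSet
spec:
  updateStrategy:
    type: RollingUpdate          # pods are replaced one at a time, highest ordinal first
  template:
    spec:
      containers:
      - name: cockroachdb
        image: cockroachdb/cockroach:v20.2.0   # changing this (or pod-template labels/annotations)
                                               # triggers a rolling update of the pods
```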
As the STS cycles through the pods, each pod's status moves from ContainerCreating to Running. Then the readinessProbe kicks in and, once it passes, the pod is marked Ready (the livenessProbe also starts at this point, but only restarts the container if it fails).
As soon as the pod is Ready, the STS moves on and starts terminating and updating the next pod in the STS.
At this point, from the CockroachDB cluster's standpoint, the node has rejoined, but the cluster may still be resolving under-replicated ranges, etc. From the CRDB perspective, it would be better to wait for these issues to be resolved and for the cluster to stabilize before taking the next pod down, especially when the cluster is under load.
To remedy this, I propose that we add a startupProbe to the STS. The startupProbe is supported in k8s 1.16+. When defined, it delays the start of the livenessProbe and readinessProbe; once it succeeds, the other probes take over. If it never succeeds (i.e., it fails failureThreshold times), the container is killed and becomes subject to the pod's restartPolicy.
Here is a startupProbe that I tested successfully:

```yaml
startupProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - |
      # Poll for under-replicated ranges for up to 30 iterations (10s apart)
      # before handing control over to the liveness/readiness probes.
      # Note: {1..30} relies on brace expansion; a strictly POSIX sh would run the body once.
      for i in {1..30};
      do
        UR=$(/cockroach/cockroach sql \
          --certs-dir=/cockroach/cockroach-certs/ \
          -e "SELECT SUM((metrics->>'ranges.underreplicated')::DECIMAL)::INT8 AS ranges_underreplicated FROM crdb_internal.kv_store_status S INNER JOIN crdb_internal.gossip_liveness L ON S.node_id = L.node_id WHERE L.decommissioning <> true;" \
          --format raw \
          --host=cockroachdb-public | awk '{if(NR>3)print}' | awk '{if(NR==1)print}'
        );
        echo "Under-replicated ranges: $UR" >> /usr/share/message;
        if [ -z "$UR" ];
        then
          echo "No under-replicated ranges reported. Sleeping for 10 seconds - iteration $i" >> /usr/share/message;
          sleep 10;
          continue;
        fi
        if [ "$UR" -gt 0 ];
        then
          echo "Sleeping for 10 seconds - iteration $i" >> /usr/share/message;
          sleep 10;
        else
          echo "breaking out of loop" >> /usr/share/message;
          break;
        fi
      done
      exit 0;
  failureThreshold: 1
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1  # exec probe timeouts were not enforced before k8s 1.20, so the long loop above is not cut short there
```
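Once the probe is in the pod spec, the rollout can be triggered and watched in the usual way. A sketch, assuming the STS is named `cockroachdb` and the manifest lives in `cockroachdb-statefulset.yaml`:

```sh
# Apply the updated manifest (or edit the STS in place), then watch the rolling update.
kubectl apply -f cockroachdb-statefulset.yaml
kubectl rollout status statefulset/cockroachdb
```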
Here is a script that can be used to monitor pods in the STS as they are cycled:
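As a rough sketch of such a watcher (not the original script; it assumes the pods carry the usual `app=cockroachdb` label):

```sh
#!/bin/sh
# Rough sketch: print each pod's phase and readiness every 5 seconds
# while the STS cycles through its pods during the rolling update.
while true; do
  date
  kubectl get pods -l app=cockroachdb \
    -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,READY:.status.containerStatuses[0].ready
  echo "---"
  sleep 5
done
```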
A few things I haven't totally worked through that need further consideration and testing:

- How well does it handle long-running startups (for instance, downloading an image for the first time)?
- Various cluster configs (single-region, multi-region, single-node)
- How does it react when running on k8s versions prior to 1.16?
- Besides under-replicated ranges, are there other scenarios that the probe should consider? Resources available? Running jobs? Storage capacity? Gossip established with x% of the nodes in the cluster? (One possible extension is sketched right after this list.)
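On that last point, one possible extension (a sketch only, untested) is to have the probe also look at unavailable ranges, using the same crdb_internal tables the current query reads:

```sql
-- Sketch only (untested): count unavailable ranges alongside under-replicated ranges.
SELECT
  SUM((metrics->>'ranges.underreplicated')::DECIMAL)::INT8 AS ranges_underreplicated,
  SUM((metrics->>'ranges.unavailable')::DECIMAL)::INT8 AS ranges_unavailable
FROM crdb_internal.kv_store_status S
INNER JOIN crdb_internal.gossip_liveness L ON S.node_id = L.node_id
WHERE L.decommissioning <> true;
```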
Tested the script on EKS 1.17 and found that EKS doesn't support alpha features in 1.17. startupProbe is an alpha feature in 1.16 and became a beta feature in 1.18 (aws/containers-roadmap#947).
Jira Issue: DOC-1071