
All LWS groups get updated all together during rolling update when LWS controller pod is restarted or controller re-sync happens #225

Closed · Tracked by #221 · Fixed by #229
xiaohongchen1991 opened this issue Sep 24, 2024 · 8 comments
Labels: kind/bug

xiaohongchen1991 commented Sep 24, 2024

What happened:
When deploying changes that update the leader and worker templates, a rolling update is triggered. During the rolling update, I observed two scenarios in which the LWS controller reconciles all worker StatefulSets and issues "patch" events that move every worker StatefulSet to the new version at the same time. This causes service downtime, since all model replicas are updated simultaneously. Here are the two scenarios observed:

  1. The lws controller pod is restarted during the rolling update. This happens when I need to update the LWS template and patch the underlying nodes at the same time, so rolling updates are triggered for both the LWS and the underlying instances/nodes. When the node running the lws controller is terminated, a new lws controller pod is started. We observed that the controller pod restart triggers reconciliations, and the pod controller then updates all worker StatefulSets at once.

  2. The lws controller appears to use controller-runtime's default SyncPeriod of about 10 hours (https://github.com/kubernetes-sigs/controller-runtime/blob/v0.17.2/pkg/cache/cache.go#L171). On every resync, the pod controller reconciles the worker StatefulSets. If a resync happens during an LWS rolling update in which the worker template is updated, all worker StatefulSets are updated at once, causing service downtime (see the sketch after this list).
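
For reference, a minimal controller-runtime sketch (not lws code) illustrating the mechanism behind scenario 2: the manager's cache re-lists everything once per SyncPeriod (~10 hours by default), and that periodic resync re-queues reconciles for all objects even when nothing changed. The explicit period below only spells out the default for visibility; it is not a proposed configuration change.

```go
// Minimal sketch, assuming a standard controller-runtime manager setup.
package sketch

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

func newManager() (ctrl.Manager, error) {
	resync := 10 * time.Hour // controller-runtime's default resync period
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		// Every cached object is re-listed and re-queued for reconciliation
		// once per SyncPeriod, which is what makes the worker-template patch
		// fire outside the rolling-update order.
		Cache: cache.Options{SyncPeriod: &resync},
	})
}
```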

What you expected to happen:

I would expect the worker StatefulSet updates to follow the same order as the leader pod rolling update. A controller pod restart or a controller resync should not trigger worker StatefulSet updates.

The LWS rolling update is fully controlled by the leader StatefulSet's UpdateStrategy. The worker StatefulSet currently does not configure a strategy explicitly and therefore uses the default RollingUpdate strategy. I switched the worker StatefulSet to the OnDelete strategy and it bypasses the issue. This works because the LWS is configured with the RecreateGroupOnPodRestart RestartPolicy: during the leader pod rolling update, all worker pods are restarted, which triggers the worker StatefulSet update under the OnDelete UpdateStrategy. In this case, the worker StatefulSet update follows the leader pod rolling-update order and ignores the pod controller reconciliations from the two scenarios above.

How to reproduce it (as minimally and precisely as possible):

  • Set up a running LWS.
  • Update the LWS definition with changes to both the leader and worker templates, e.g. change the image URL or the environment variables for both templates.
  • Run kubectl delete pod <lws controller pod> -n lws-system to kill the lws controller pod and force a controller pod restart (scenario 1).
  • After a few minutes, all worker StatefulSets get updated at once instead of following the rolling-update order.

Anything else we need to know?:
I tried the following change to the pod controller (https://github.com/kubernetes-sigs/lws/blob/main/pkg/controllers/pod_controller.go#L307-L316) to bypass the issue:

```diff
modified   pkg/controllers/pod_controller.go
@@ -311,6 +311,7 @@ func constructWorkerStatefulSetApplyConfiguration(leaderPod corev1.Pod, lws lead
       WithPodManagementPolicy(appsv1.ParallelPodManagement).
       WithTemplate(&podTemplateApplyConfiguration).
       WithOrdinals(appsapplyv1.StatefulSetOrdinals().WithStart(1)).
+      WithUpdateStrategy(appsapplyv1.StatefulSetUpdateStrategy().WithType(appsv1.OnDeleteStatefulSetStrategyType)).
       WithSelector(metaapplyv1.LabelSelector().
         WithMatchLabels(selectorMap))).
     WithLabels(labelMap)
```

Environment: I set up the LWS using AWS EKS

  • Kubernetes version (use kubectl version): 1.29
  • LWS version (use git describe --tags --dirty --always): 0.3.0
  • Cloud provider or hardware configuration: AWS EKS
  • OS (e.g: cat /etc/os-release): Amazon Linux
  • Kernel (e.g. uname -a): Linux ip-21-16-69-244.ec2.internal 5.10.224-212.876.amzn2.x86_64 #1 SMP Thu Aug 22 16:55:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Others:
xiaohongchen1991 added the kind/bug label Sep 24, 2024
liurupeng (Collaborator) commented:

Great root cause and workaround! Do you want to submit a quick fix to the repo?

xiaohongchen1991 (Author) commented Sep 25, 2024

> Great root cause and workaround! Do you want to submit a quick fix to the repo?

I feel this solution of using the OnDelete UpdateStrategy for the worker StatefulSet only works with the RecreateGroupOnPodRestart RestartPolicy: that policy forces worker pod recreation, which is what rolls the worker StatefulSet forward under the OnDelete UpdateStrategy.

But if the Default RestartPolicy is used, I feel a leader pod update won't trigger the worker StatefulSet update under the OnDelete UpdateStrategy, so I am not sure this solution works for the Default RestartPolicy as well.

liurupeng (Collaborator) commented:

RestartPolicy only controls how a single replica handles failure. If OnDelete is enabled for the worker StatefulSet, rolling updates will work in all cases, since each group (leader pod + worker sts) is recreated and the worker sts is deleted along with its leader. For failure restarts with the "default" value, the worker sts's update strategy does not affect the individual pod restart either, right? The pod should be recreated from the existing worker StatefulSet's pod template. Let me know if I have missed anything.

ahg-g (Contributor) commented Oct 1, 2024

@kerthcet for thoughts as well

ahg-g (Contributor) commented Oct 5, 2024

The issue is that we are patching the worker statefulset all the time:

patch := &unstructured.Unstructured{

So any update to the worker pod template is patched into the worker sts on every reconcile of the leader pod. This happens when the controller is restarted, because the restart triggers a reconcile of all pods.

Changing the update strategy of the worker statefulset to OnDelete makes sense and should fix the issue.
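
To make that concrete, here is an illustrative sketch of the pattern being described, not the actual lws code; the helper name and the apply-configuration conversion step are assumptions:

```go
// Illustrative sketch only: the worker StatefulSet is server-side-applied on
// every leader-pod reconcile. A reconcile triggered by a controller restart or
// a cache resync therefore pushes any pending template change to all worker
// StatefulSets at once, ignoring the leader rolling-update order.
package sketch

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime"
	appsapplyv1 "k8s.io/client-go/applyconfigurations/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func applyWorkerStatefulSet(ctx context.Context, c client.Client,
	cfg *appsapplyv1.StatefulSetApplyConfiguration) error {
	// Assumes cfg already carries apiVersion/kind (WithAPIVersion/WithKind).
	obj, err := runtime.DefaultUnstructuredConverter.ToUnstructured(cfg)
	if err != nil {
		return err
	}
	patch := &unstructured.Unstructured{Object: obj}
	// The patch is sent unconditionally, so any diff between the current and
	// desired pod template is rolled out immediately on this reconcile.
	return c.Patch(ctx, patch, client.Apply,
		client.FieldOwner("lws-controller"), client.ForceOwnership)
}
```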

ahg-g (Contributor) commented Oct 6, 2024

Actually, what we need to do is not patch at all in

patch := &unstructured.Unstructured{

We should only create the sts if it doesn't exist.

I don't think there is a case where we need to patch; updates always recreate the worker sts. In any update, lws forces an update to the leader sts, causing each leader pod to be deleted and created again. The deletion of each leader triggers deletion of its corresponding worker sts, because the worker sts is owned by its leader. So the only thing we need to do on leader pod reconcile is create the worker sts if it doesn't exist (see the sketch below).
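
A minimal sketch of that create-only approach, assuming a controller-runtime client; the function and variable names are illustrative, not the actual lws implementation or the eventual fix in #229:

```go
// Illustrative sketch only: create the worker StatefulSet when it does not
// exist yet, and otherwise leave it untouched. Controller restarts and cache
// resyncs then become no-ops for existing worker StatefulSets, while rolling
// updates still work because deleting a leader pod deletes its owned worker
// StatefulSet, which is then recreated from the new template.
package sketch

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func ensureWorkerStatefulSet(ctx context.Context, c client.Client, desired *appsv1.StatefulSet) error {
	var existing appsv1.StatefulSet
	err := c.Get(ctx, types.NamespacedName{Namespace: desired.Namespace, Name: desired.Name}, &existing)
	if err == nil {
		// Already present: do nothing; any template change reaches this group
		// only when its leader (and hence this StatefulSet) is recreated.
		return nil
	}
	if !apierrors.IsNotFound(err) {
		return err
	}
	return c.Create(ctx, desired)
}
```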

Edwinhr716 (Contributor) commented:

/assign @Edwinhr716

kerthcet (Contributor) commented Oct 8, 2024

Agree, worker sts creation is the only case we should consider.
