
All LWS groups get updated all together during rolling update when LWS controller pod is restarted or controller re-sync happens #225

Closed · Tracked by #221 · Fixed by #229
xiaohongchen1991 opened this issue Sep 24, 2024 · 8 comments
Labels: kind/bug

xiaohongchen1991 commented Sep 24, 2024

What happened:
When deploying changes that update the leader and worker templates, a rolling update is triggered. During the rolling update, I observed two scenarios in which the LWS controller reconciles all worker StatefulSets and issues "patch" events that move every worker StatefulSet to the new version at the same time. This causes service downtime, since all model replicas are updated simultaneously. Here are the two scenarios observed:

  1. The lws controller pod is restarted during the rolling update. This happens when I need to update the LWS template and patch the underlying nodes at the same time, so rolling updates are triggered for both the LWS and the underlying instances/nodes. When the node running the lws controller is terminated, a new lws controller pod is started. We observed that the controller pod restart triggers reconciliations, and the pod controller then updates all worker StatefulSets at once.

  2. The lws controller appears to use controller-runtime's default SyncPeriod of about 10 hours (https://github.com/kubernetes-sigs/controller-runtime/blob/v0.17.2/pkg/cache/cache.go#L171). On every resync, the pod controller reconciles the worker StatefulSets. If a resync happens during an LWS rolling update in which the worker template is updated, all worker StatefulSets are updated at once, causing service downtime (see the sketch after this list).
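
For reference, a minimal controller-runtime sketch (not lws code) illustrating the mechanism behind scenario 2: the manager's cache re-lists everything once per SyncPeriod (~10 hours by default), and that periodic resync re-queues reconciles for all objects even when nothing changed. The explicit period below only spells out the default for visibility; it is not a proposed configuration change.

```go
// Minimal sketch, assuming a standard controller-runtime manager setup.
package sketch

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

func newManager() (ctrl.Manager, error) {
	resync := 10 * time.Hour // controller-runtime's default resync period
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		// Every cached object is re-listed and re-queued for reconciliation
		// once per SyncPeriod, which is what makes the worker-template patch
		// fire outside the rolling-update order.
		Cache: cache.Options{SyncPeriod: &resync},
	})
}
```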

What you expected to happen:

I would expect the worker StatefulSet updates to follow the same order as the leader pod rolling update. A controller pod restart or a controller resync should not trigger worker StatefulSet updates.

The LWS rolling update is fully controlled by the leader StatefulSet's UpdateStrategy. The worker StatefulSet currently does not configure a strategy explicitly and therefore uses the default RollingUpdate strategy. I switched the worker StatefulSet to the OnDelete strategy and it bypasses the issue. This works because the LWS is configured with the RecreateGroupOnPodRestart RestartPolicy: during the leader pod rolling update, all worker pods are restarted, which triggers the worker StatefulSet update under the OnDelete UpdateStrategy. In this case, the worker StatefulSet update follows the leader pod rolling-update order and ignores the pod controller reconciliations from the two scenarios above.

How to reproduce it (as minimally and precisely as possible):

  • Set up a running LWS.
  • Update the LWS definition with changes to both the leader and worker templates, e.g. change the image URL or the environment variables for both templates.
  • Run kubectl delete pod <lws controller pod> -n lws-system to kill the lws controller pod and force a controller pod restart (scenario 1).
  • After a few minutes, all worker StatefulSets get updated at once instead of following the rolling-update order.

Anything else we need to know?:
I tried the following change to the pod controller (https://github.com/kubernetes-sigs/lws/blob/main/pkg/controllers/pod_controller.go#L307-L316) to bypass the issue:

```diff
modified   pkg/controllers/pod_controller.go
@@ -311,6 +311,7 @@ func constructWorkerStatefulSetApplyConfiguration(leaderPod corev1.Pod, lws lead
       WithPodManagementPolicy(appsv1.ParallelPodManagement).
       WithTemplate(&podTemplateApplyConfiguration).
       WithOrdinals(appsapplyv1.StatefulSetOrdinals().WithStart(1)).
+      WithUpdateStrategy(appsapplyv1.StatefulSetUpdateStrategy().WithType(appsv1.OnDeleteStatefulSetStrategyType)).
       WithSelector(metaapplyv1.LabelSelector().
         WithMatchLabels(selectorMap))).
     WithLabels(labelMap)
```

Environment: I set up the LWS using AWS EKS

  • Kubernetes version (use kubectl version): 1.29
  • LWS version (use git describe --tags --dirty --always): 0.3.0
  • Cloud provider or hardware configuration: AWS EKS
  • OS (e.g: cat /etc/os-release): Amazon Linux
  • Kernel (e.g. uname -a): Linux ip-21-16-69-244.ec2.internal 5.10.224-212.876.amzn2.x86_64 #1 SMP Thu Aug 22 16:55:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Others:
xiaohongchen1991 added the kind/bug label Sep 24, 2024
liurupeng (Collaborator) commented:

Great root cause and workaround! Do you want to submit a quick fix to the repo?

xiaohongchen1991 (Author) commented Sep 25, 2024

> Great root cause and workaround! Do you want to submit a quick fix to the repo?

I feel this solution of using the OnDelete UpdateStrategy for the worker StatefulSet only works with the RecreateGroupOnPodRestart RestartPolicy: that policy forces worker pod recreation, which is what rolls the worker StatefulSet forward under the OnDelete UpdateStrategy.

But if the Default RestartPolicy is used, I feel a leader pod update won't trigger the worker StatefulSet update under the OnDelete UpdateStrategy, so I am not sure this solution works for the Default RestartPolicy as well.

liurupeng (Collaborator) commented:

RestartPolicy only controls how a single replica handles failure. If OnDelete is enabled for the worker StatefulSet, rolling updates will work in all cases, since each group (leader pod + worker sts) is recreated and the worker sts is deleted along with its leader. For failure restarts with the "default" value, the worker sts's update strategy does not affect the individual pod restart either, right? The pod should be recreated from the existing worker StatefulSet's pod template. Let me know if I have missed anything.

ahg-g (Contributor) commented Oct 1, 2024

@kerthcet for thoughts as well

ahg-g (Contributor) commented Oct 5, 2024

The issue is that we are patching the worker statefulset all the time:

patch := &unstructured.Unstructured{

So any update to the worker pod template is patched into the worker sts on every reconcile of the leader pod. This happens when the controller is restarted, because the restart triggers a reconcile of all pods.

Changing the update strategy of the worker statefulset to OnDelete makes sense and should fix the issue.
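
To make that concrete, here is an illustrative sketch of the pattern being described, not the actual lws code; the helper name and the apply-configuration conversion step are assumptions:

```go
// Illustrative sketch only: the worker StatefulSet is server-side-applied on
// every leader-pod reconcile. A reconcile triggered by a controller restart or
// a cache resync therefore pushes any pending template change to all worker
// StatefulSets at once, ignoring the leader rolling-update order.
package sketch

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime"
	appsapplyv1 "k8s.io/client-go/applyconfigurations/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func applyWorkerStatefulSet(ctx context.Context, c client.Client,
	cfg *appsapplyv1.StatefulSetApplyConfiguration) error {
	// Assumes cfg already carries apiVersion/kind (WithAPIVersion/WithKind).
	obj, err := runtime.DefaultUnstructuredConverter.ToUnstructured(cfg)
	if err != nil {
		return err
	}
	patch := &unstructured.Unstructured{Object: obj}
	// The patch is sent unconditionally, so any diff between the current and
	// desired pod template is rolled out immediately on this reconcile.
	return c.Patch(ctx, patch, client.Apply,
		client.FieldOwner("lws-controller"), client.ForceOwnership)
}
```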

ahg-g (Contributor) commented Oct 6, 2024

Actually, what we need to do is not patch at all in

patch := &unstructured.Unstructured{

We should only create the sts if it doesn't exist.

I don't think there is a case where we need to patch; updates always recreate the worker sts. In any update, lws forces an update to the leader sts, causing each leader pod to be deleted and created again. The deletion of each leader triggers deletion of its corresponding worker sts, because the worker sts is owned by its leader. So the only thing we need to do on leader pod reconcile is create the worker sts if it doesn't exist (see the sketch below).
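
A minimal sketch of that create-only approach, assuming a controller-runtime client; the function and variable names are illustrative, not the actual lws implementation or the eventual fix in #229:

```go
// Illustrative sketch only: create the worker StatefulSet when it does not
// exist yet, and otherwise leave it untouched. Controller restarts and cache
// resyncs then become no-ops for existing worker StatefulSets, while rolling
// updates still work because deleting a leader pod deletes its owned worker
// StatefulSet, which is then recreated from the new template.
package sketch

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func ensureWorkerStatefulSet(ctx context.Context, c client.Client, desired *appsv1.StatefulSet) error {
	var existing appsv1.StatefulSet
	err := c.Get(ctx, types.NamespacedName{Namespace: desired.Namespace, Name: desired.Name}, &existing)
	if err == nil {
		// Already present: do nothing; any template change reaches this group
		// only when its leader (and hence this StatefulSet) is recreated.
		return nil
	}
	if !apierrors.IsNotFound(err) {
		return err
	}
	return c.Create(ctx, desired)
}
```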

Edwinhr716 (Contributor) commented:

/assign @Edwinhr716

kerthcet (Contributor) commented Oct 8, 2024

Agree, worker sts creation is the only case we should consider.
