All LWS groups get updated all together during rolling update when LWS controller pod is restarted or controller re-sync happens #225
Comments
Great root cause and workaround! Do you want to submit a quick fix to the repo?
I feel this solution to use […]. But if […]
RestartPolicy only controls how a single replica handles failure. If OnDelete is enabled for the worker StatefulSet, rolling update will work for all cases, since each group (leader pod + worker sts) is recreated and the worker sts is deleted. For failure restart with the "default" value, the worker sts's update strategy will not impact the individual pod restart either, right? The pod should be recreated with the existing worker StatefulSet's pod template. Let me know if I have missed anything.
@kerthcet for thoughts as well
The issue is that we are patching the workers statefulset all the time: lws/pkg/controllers/pod_controller.go, line 149 at 274422f.
So, any update to the workers pod template is being patched into the workers sts on any reconcile of the leader pod. This will happen when the controller is restarted because the restart triggers a reconcile on all pods. Changing the update strategy of the workers statefulset to OnDelete makes sense and should fix the issue.
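For context, a minimal sketch of what such an unconditional per-reconcile patch typically looks like with controller-runtime server-side apply; PodReconciler, buildWorkerStatefulSet, and the field-owner string are hypothetical names, not the exact code at line 149:

```go
package controllers

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// PodReconciler is a stand-in for the real reconciler type.
type PodReconciler struct {
	client.Client
}

// reconcileWorkerSts sketches the problematic pattern: an unconditional
// server-side-apply patch of the worker StatefulSet on every leader pod
// reconcile.
func (r *PodReconciler) reconcileWorkerSts(ctx context.Context, leader *corev1.Pod) error {
	// Render the desired worker StatefulSet from the current LWS template.
	sts := buildWorkerStatefulSet(leader)

	// Because this patch is issued unconditionally, any change to the worker
	// pod template lands on every group as soon as its leader pod is
	// reconciled, e.g. right after a controller restart or a periodic resync.
	return r.Patch(ctx, sts, client.Apply,
		client.FieldOwner("lws-controller"), client.ForceOwnership)
}

// buildWorkerStatefulSet is a hypothetical helper that renders the desired
// worker StatefulSet (named after its leader pod) from the LWS spec.
func buildWorkerStatefulSet(leader *corev1.Pod) *appsv1.StatefulSet {
	sts := &appsv1.StatefulSet{
		TypeMeta: metav1.TypeMeta{APIVersion: "apps/v1", Kind: "StatefulSet"},
	}
	sts.Name = leader.Name
	sts.Namespace = leader.Namespace
	// ... pod template, replicas, owner reference to the leader pod, etc. ...
	return sts
}
```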
Actually, what we need to do is not to patch at all in lws/pkg/controllers/pod_controller.go, line 149 at 274422f.
I don't think there is a case where we need to patch; updates always recreate the workers sts. In any update, lws forces an update to the leaders sts, causing each leader to be deleted and created again. The deletion of each leader triggers deletion of its corresponding workers sts because it is owned by its leader, so the only thing we need to do is create the workers sts if it doesn't exist on leader pod reconcile.
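A minimal sketch of that create-only direction, assuming the worker StatefulSet shares its leader pod's name and reusing the hypothetical buildWorkerStatefulSet helper from the sketch above:

```go
package controllers

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ensureWorkerSts creates the worker StatefulSet for a leader pod only if it
// is missing, and never patches it afterwards. Template updates are then
// driven entirely by the leader rolling update: recreating a leader pod
// deletes its owned worker StatefulSet, and the next reconcile recreates it
// from the new template.
func ensureWorkerSts(ctx context.Context, c client.Client, leader *corev1.Pod) error {
	var existing appsv1.StatefulSet
	err := c.Get(ctx, types.NamespacedName{Namespace: leader.Namespace, Name: leader.Name}, &existing)
	if err == nil {
		// Already present: leave it alone so resyncs and controller restarts
		// cannot roll every group at once.
		return nil
	}
	if !apierrors.IsNotFound(err) {
		return err
	}
	return c.Create(ctx, buildWorkerStatefulSet(leader))
}
```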
/assign @Edwinhr716
Agree, only worker sts creation is what we should consider.
What happened:
When deploying changes that update the leader and worker templates, a rolling update is triggered. During the rolling update, I observed 2 scenarios where the LWS controller reconciles all worker statefulsets and triggers a "patch" event that updates all of them to the new version at the same time. This causes service downtime since all model replicas are updated at once. Here are the 2 scenarios observed:
1. Controller pod restart during rolling update. This happens when I need to update the LWS template and patch the underlying nodes at the same time, so a rolling update is triggered for both the LWS and the underlying instances/nodes. When the node running the lws controller is terminated, a new lws controller pod is set up. We observed that the controller pod restart triggers reconciliations, and the pod controller then updates all worker statefulsets at once.
2. Controller resync. The lws controller seems to use the default 10-hour controller SyncPeriod (https://github.com/kubernetes-sigs/controller-runtime/blob/v0.17.2/pkg/cache/cache.go#L171), so every sync period the pod controller reconciles the worker statefulsets (see the sketch below). If this happens during an LWS rolling update where the worker template has changed, it updates all worker statefulsets at once and causes service downtime.
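For reference, the resync interval is controlled by controller-runtime's cache options; a minimal sketch of where it would be configured (newManager is a hypothetical wrapper, and the 10-hour value just mirrors the default):

```go
package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// newManager shows where the resync interval lives in controller-runtime
// (v0.17.x). Leaving SyncPeriod nil gives the ~10h default (with jitter), so
// every cached object is periodically re-reconciled even when nothing changed.
func newManager() (ctrl.Manager, error) {
	syncPeriod := 10 * time.Hour // mirrors the default, for illustration
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{SyncPeriod: &syncPeriod},
	})
}
```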
What you expected to happen:
I would expect the worker statefulset updates to follow the same order as the leader pods' rolling update. Controller pod restarts and controller resyncs should not trigger worker statefulset updates.
The LWS rolling update is fully controlled by the leader statefulset's UpdateStrategy. The worker statefulset is currently not configured explicitly and uses the default RollingUpdate strategy. I switched the worker statefulset to the OnDelete strategy and it bypasses the issue. This works because the LWS is configured with the RecreateGroupOnPodRestart RestartPolicy: during the leader pods' rolling update, all worker pods are restarted, which triggers the worker StatefulSet update when using the OnDelete UpdateStrategy. In this case, the worker StatefulSet updates follow the leader pod rolling update order and ignore the pod controller reconciliations from the above 2 scenarios.

How to reproduce it (as minimally and precisely as possible):
Run kubectl delete pod <lws controller pod> -n lws-system to kill the lws controller pod and force a controller pod restart (scenario 1).

Anything else we need to know?:
Tried the following change to the pod controller (https://github.com/kubernetes-sigs/lws/blob/main/pkg/controllers/pod_controller.go#L307-L316) to bypass the issue.
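The exact diff isn't shown above; presumably it switches the worker StatefulSet's update strategy to OnDelete, as described earlier. A minimal sketch of that kind of change, using plain apps/v1 types (the repository may build the StatefulSet differently, e.g. via apply configurations):

```go
package controllers

import (
	appsv1 "k8s.io/api/apps/v1"
)

// useOnDeleteForWorkers switches the worker StatefulSet to the OnDelete update
// strategy, so the pod controller's restart/resync-triggered patches no longer
// roll every group at once; workers only pick up a new template when their
// group is recreated during the leader rolling update (which happens with the
// RecreateGroupOnPodRestart restart policy).
func useOnDeleteForWorkers(spec *appsv1.StatefulSetSpec) {
	spec.UpdateStrategy = appsv1.StatefulSetUpdateStrategy{
		Type: appsv1.OnDeleteStatefulSetUpdateStrategyType,
	}
}
```

With OnDelete, the StatefulSet controller only updates pods when they are deleted, which here happens naturally as each group is recreated in leader rolling-update order.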
Environment:
- I set up the LWS using AWS EKS
- Kubernetes version (kubectl version): 1.29
- LWS version (git describe --tags --dirty --always): 0.3.0
- OS (cat /etc/os-release): Amazon Linux
- Kernel (uname -a): Linux ip-21-16-69-244.ec2.internal 5.10.224-212.876.amzn2.x86_64 #1 SMP Thu Aug 22 16:55:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux