[Serve] Kubernetes operator reapplies Serve config too frequently #28652
Comments
I just retested this with a Kuberay operator built from the newest commit on the Kuberay master branch (4ca05ab). I could see in the logs that the serve deployment requests are now sent twice every 2 seconds from the operator to the Ray cluster, but 2 seconds still does not seem to be enough time for Ray Serve to report the startup status of the serve deployments. (When I apply my workaround to the Kuberay operator code, which prevents serve deployment requests from being sent more often than every 5 seconds, the cluster starts up fine.) As you can see in this log from the Serve Controller actor, the serve deployment requests go on for more than 5 minutes. My actors never start up before the cluster runs into the restart timeout and restarts.
The log from the kuberay operator constantly states:
Log snippet from kuberay operator log:
Thanks for the update. I think we know the solution; we just need to get it merged.
cc @sihanwang41
also cc @wilsonwang371 @Jeffwan
Do we need to make the change in the Kuberay controller code?
It should be a code change for Update to
Hi @Martin4R, thank you for trying it out! I will test and verify the fix. If it works, I will merge it.
I'm seeing the same issue with Ray 2.2.0 and both Kuberay 0.4.0 and nightly. The above merge suggests this should have been fixed in 0.4.0, though. I can see the same behavior. Excerpt from kuberay-operator logs:
This line looks suspicious since the logs show a serve deployment had started 2s earlier:
Hi @pandemosth, thanks for following up with the logs. It looks like this is still an issue, so we'll take a closer look.
I can confirm the issue. I can't deploy the example app because of this.
By default, the kuberay operator expects deployments to respond within 2 seconds. Hack found from ray issue: ray-project/ray#28652
Thanks for the share, @Martin4R. I've ported this workaround to my fork and it has mitigated the issue. This feels like it needs a proper mechanism to wait, though. Seeing this issue on v0.4.0. Anecdotally, this seems to happen when my app is larger in terms of overall app bundle size.
We're trying to host LLMs from alpa.ai via Ray on a k8s cluster and are also facing this issue. One workaround for us is to deploy the cluster and then run the deployment separately from a separate pod that connects to the cluster, but that is of course not the intended behavior.
Hi @pascalwhoop @bewestphal @uthark @Martin4R, can you give the nightly wheel a try? We recently landed a fix in Kuberay for the issue with long runtime env preparation times. Hopefully it fixes your issues; please let me know if the issue still exists.
The issue is still there with Kuberay 0.5.0 and Ray 2.4.0. The serve deployment request is still applied at least every 2 seconds. The actors never manage to start up, and at some point the whole cluster gets restarted. Logs from Serve Controller:
I then tried changing rayservice_controller.go to contain "ServiceDefaultRequeueDuration = 6 * time.Second", but this did not help. I had to reapply the "currentTime" workaround from above to make it work.
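For reference, here is a minimal sketch of the kind of throttle the "currentTime" workaround describes: skip re-sending the Serve config if the previous request went out less than 5 seconds ago. The type and function names below are hypothetical and do not correspond to the actual rayservice_controller.go code.

```go
package main

import (
	"fmt"
	"time"
)

// serveConfigThrottle is a hypothetical helper illustrating the workaround
// described above: skip re-sending the Serve config when the previous
// request was sent less than minInterval ago. The names do not match the
// real KubeRay operator code.
type serveConfigThrottle struct {
	lastSent    time.Time
	minInterval time.Duration
}

// shouldSend reports whether enough time has passed since the last request,
// and records the send time when it allows a request through.
func (t *serveConfigThrottle) shouldSend(now time.Time) bool {
	if now.Sub(t.lastSent) < t.minInterval {
		return false
	}
	t.lastSent = now
	return true
}

func main() {
	throttle := &serveConfigThrottle{minInterval: 5 * time.Second}
	// Simulate a reconcile loop that fires every 2 seconds, as described above.
	for i := 0; i < 5; i++ {
		if throttle.shouldSend(time.Now()) {
			fmt.Println("sending Serve config to the Ray cluster")
		} else {
			fmt.Println("skipping: previous Serve config request was sent less than 5s ago")
		}
		time.Sleep(2 * time.Second)
	}
}
```

With a 2-second reconcile interval, this sketch lets roughly every third reconcile actually re-send the config, which is the kind of spacing the workaround aims for.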
Hi @Martin4R, thank you for trying it out. Unfortunately my change is not in 0.5.0; can you use the nightly wheel to verify?
Hi @sihanwang41, the Kuberay clusters started up and the dashboard was available, but they did not start any actors, not even the ServeController actor and the HTTPProxy actors which usually appear. To me it seems that the freshly started clusters never received the serve deployment request from the operator. We never saw startup behavior like this in the 7 months we have used Kuberay, so it must be something that came in recently.

I saw that the bugfix in the Kuberay operator was to remove the logic which repeatedly reapplies the serve deployment config. My guess would be that there is some kind of race condition when a cluster freshly starts up: in the best case it receives the first serve deployment request from the operator because the cluster is already available; in the worst case the cluster is not available yet, misses the first serve deployment request, and then stays in this state without any actors because the serve deployment request is now never resent. Can you recheck whether the bugfix can lead to a behavior like this?
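As an illustration of the behavior suggested in the comment above, here is a rough, hypothetical sketch of a resend-until-acknowledged loop: instead of sending the Serve config exactly once (and losing it if the cluster is not reachable yet), the operator would keep retrying until the cluster confirms receipt. submitServeConfig is a stand-in, not a real KubeRay function, and this is not the fix that was actually merged.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// submitServeConfig is a placeholder for the call that sends the Serve config
// to the Ray cluster. Here it fails for the first two attempts to simulate
// the startup race described above.
func submitServeConfig(attempt int) error {
	if attempt < 2 {
		return errors.New("cluster not reachable yet")
	}
	return nil
}

// applyUntilAcknowledged keeps re-sending the Serve config until the cluster
// accepts it (or the retry budget runs out), instead of sending it once and
// never retrying. This is only a sketch of the suggested behavior.
func applyUntilAcknowledged(maxAttempts int, backoff time.Duration) error {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err := submitServeConfig(attempt); err != nil {
			fmt.Printf("attempt %d failed: %v; retrying in %s\n", attempt, err, backoff)
			time.Sleep(backoff)
			continue
		}
		fmt.Println("Serve config acknowledged by the cluster")
		return nil
	}
	return errors.New("cluster never accepted the Serve config")
}

func main() {
	if err := applyUntilAcknowledged(5, 5*time.Second); err != nil {
		fmt.Println("error:", err)
	}
}
```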
Hi @Martin4R, it looks like that is a different issue; please create a separate ticket for it. Closing this ticket now.
What happened + What you expected to happen
From a Ray Slack thread:
The user posted logs from the Serve controller and the Kubernetes operator that imply the controller was receiving new Serve config deployment requests more than once per second. E.g.:
This likely didn't give Serve enough time to deploy the config completely before the request was canceled and re-issued, preventing the Serve application from being deployed. This in turn likely caused the cluster to be marked unhealthy and restarted.
Ideally, the Serve application should have enough time to be started without being interrupted.
Versions / Dependencies
Ray 2.0.0 and Kuberay.
Reproduction script
See the Ray Slack thread for logs.
The user observed the issue on the FruitStand example.
Issue Severity
No response