[Serve] Kubernetes operator reapplies Serve config too frequently #28652

Closed
shrekris-anyscale opened this issue Sep 20, 2022 · 20 comments
Labels: bug (Something that is supposed to be working; but isn't), P1 (Issue that should be fixed within a few weeks), serve (Ray Serve Related Issue)


@shrekris-anyscale
Contributor

What happened + What you expected to happen

From a Ray Slack thread:

[A user is] working with Kuberay and Ray 2.0.0 and trying to deploy a RayService. It was not easy to get the cluster to start up, because it failed and restarted often. [The user] managed to get it running, but [they] saw that the “task:run_graph” got aborted a lot of times on startup.

The user posted logs from the Serve controller and the Kubernetes operator that imply the controller was receiving new Serve config deployment requests more than once per second. E.g.:

INFO 2022-09-20 00:39:34,046 controller 160 controller.py:439 - Received new config deployment request. Cancelling previous request.
INFO 2022-09-20 00:39:34,288 controller 160 controller.py:439 - Received new config deployment request. Cancelling previous request.
INFO 2022-09-20 00:39:34,446 controller 160 controller.py:439 - Received new config deployment request. Cancelling previous request.
INFO 2022-09-20 00:39:34,572 controller 160 controller.py:439 - Received new config deployment request. Cancelling previous request.
INFO 2022-09-20 00:39:34,750 controller 160 controller.py:439 - Received new config deployment request. Cancelling previous request.

This likely doesn't give enough time for Serve to deploy the config completely before the request is canceled and re-issued, preventing the Serve application from being deployed. This in turn likely caused the cluster to be marked unhealthy and restarted.

Ideally, the Serve application should have enough time to start up without being interrupted.

Versions / Dependencies

Ray 2.0.0 and Kuberay.

Reproduction script

See the Ray Slack thread for logs.

The user observed the issue on the FruitStand example.

Issue Severity

No response

shrekris-anyscale added the bug (Something that is supposed to be working; but isn't) and serve (Ray Serve Related Issue) labels Sep 20, 2022
@Martin4R

Martin4R commented Dec 7, 2022

I just retested this with a Kuberay operator built from the newest commit on the Kuberay master branch (4ca05ab).
The issue still exists.

I can see in the logs that the serve deployment requests are now sent twice every 2 seconds from the operator to the Ray cluster, but 2 seconds still does not seem to be enough time for Ray Serve to report the startup status of the serve deployments. (When I apply my workaround to the Kuberay operator code, which prevents serve deployment requests from being sent more often than every 5 seconds, the cluster starts up fine.)
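
For illustration, here is a minimal sketch of that kind of throttle (hypothetical names, not actual KubeRay code; the exact change I used is posted further down in this thread):

package main

import (
	"fmt"
	"time"
)

// submissionGate allows a Serve config submission only if at least minGap has
// passed since the previous allowed submission. Sketch only: a real operator
// would keep this state per RayCluster, e.g. alongside the cached Serve config.
type submissionGate struct {
	lastSubmit time.Time
	minGap     time.Duration
}

func (g *submissionGate) allow(now time.Time) bool {
	if now.Sub(g.lastSubmit) < g.minGap {
		return false // too soon since the last allowed submission; skip this reconcile
	}
	g.lastSubmit = now
	return true
}

func main() {
	gate := submissionGate{minGap: 5 * time.Second}
	start := time.Now()
	// Simulate a reconcile every 2 seconds: only the rounds that land at least
	// 5 seconds after the previous allowed one get through.
	for i := 0; i < 6; i++ {
		now := start.Add(time.Duration(2*i) * time.Second)
		fmt.Printf("t=%2ds shouldUpdateServe=%v\n", 2*i, gate.allow(now))
	}
}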

As you can see in this log from the Serve Controller actor, the serve deployment requests go on for more than 5 minutes. My actors never start up, and eventually the cluster hits the restart timeout and restarts.

:actor_name:ServeController
INFO 2022-12-07 06:55:35,551 controller 108 http_state.py:129 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-9546f56f15cd6912255876a805c2298ec7a4556c696a8912f8430aa0' on node '9546f56f15cd6912255876a805c2298ec7a4556c696a8912f8430aa0' listening on '0.0.0.0:8000'
INFO 2022-12-07 06:55:35,565 controller 108 http_state.py:129 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-ed9609ee385a26802a8e2e9f00113100ea4096b00c27e4c246bc8101' on node 'ed9609ee385a26802a8e2e9f00113100ea4096b00c27e4c246bc8101' listening on '0.0.0.0:8000'
INFO 2022-12-07 06:55:36,957 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:37,021 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:37,963 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:38,017 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:40,017 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:40,064 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:42,073 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:42,142 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:44,152 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:44,219 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:46,221 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:46,272 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:48,277 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:48,331 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:50,330 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:50,384 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:52,382 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:52,438 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:54,427 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:54,883 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:56,476 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:56,546 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:58,786 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:58,905 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:56:00,910 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:56:00,955 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:56:01,032 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:56:02,956 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:56:03,058 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.

...

INFO 2022-12-07 07:00:14,021 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:14,142 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:16,157 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:16,249 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
2022-12-07 07:00:16,253	ERROR worker.py:400 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more Information.
INFO 2022-12-07 07:00:18,222 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:18,327 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:20,350 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:20,457 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:22,441 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:22,671 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:24,538 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:24,646 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:26,648 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:26,724 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:28,742 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:28,930 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:30,801 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:30,953 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:31,290 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:32,878 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
2022-12-07 07:00:32,884	ERROR worker.py:400 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more Information.
INFO 2022-12-07 07:00:33,044 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:33,124 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
2022-12-07 07:00:33,142	ERROR worker.py:400 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more Information.
INFO 2022-12-07 07:00:34,967 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:35,126 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:37,046 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:37,110 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:39,112 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:39,178 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
2022-12-07 07:00:39,181	ERROR worker.py:400 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more Information.
INFO 2022-12-07 07:00:41,174 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:41,285 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.

The log from the kuberay operator constantly states:

"shouldUpdateServe": true, "reason": "No Serve deployments have started deploying for cluster  test-raycluster-85lmq with key ray/legal/legal-raycluster-85lm

Log snippet from kuberay operator log:

│ 2022-12-07T14:51:53.385Z    DEBUG    controllers.RayService    updateServeDeployment    {"json config": "{\"import_path\":\"inference.cluster.testcluster.testcluster.graphs\",\"runtime_env\":{\"pip\":[\"fastapi\",\"deepface==0.0.75\",\ │
│ 2022-12-07T14:51:53.440Z    DEBUG    controllers.RayService    updateServeDeployment    {"message": "Cached Serve config for Ray cluster test-raycluster-85lmq with key ray/test/test-raycluster-85lmq"}                                   │
│ 2022-12-07T14:51:53.440Z    DEBUG    events    Normal    {"object": {"kind":"RayService","namespace":"ray","name":"test","uid":"241547c1-7a51-41de-b917-24242fb756fe","apiVersion":"ray.io/v1alpha1","resourceVersion":"100704780"}, "reason │
│ 2022-12-07T14:51:53.447Z    DEBUG    controllers.RayService    getAndCheckServeStatus    {"statusMap": {}, "serveStatuses": {"app_status":{"status":"DEPLOYING","lastUpdateTime":"2022-12-07T14:51:53Z","healthLastUpdateTime":"2022-12-07T14 │
│ 2022-12-07T14:51:53.447Z    INFO    controllers.RayService    Check serve health    {"ServiceName": "ray/test", "isHealthy": true, "isReady": false, "isActive": false}                                                                      │
│ 2022-12-07T14:51:53.466Z    INFO    controllers.RayService    Mark cluster as waiting for Serve deployments    {"ServiceName": "ray/test", "rayCluster": {"apiVersion": "ray.io/v1alpha1", "kind": "RayCluster", "namespace": "ray", "name": │
│ 2022-12-07T14:51:53.466Z    INFO    controllers.RayService    Cluster is healthy but not ready: checking again in 2s    {"ServiceName": "ray/test"}                                                                                          │
│ 2022-12-07T14:51:53.466Z    DEBUG    events    Normal    {"object": {"kind":"RayService","namespace":"ray","name":"test","uid":"241547c1-7a51-41de-b917-24242fb756fe","apiVersion":"ray.io/v1alpha1","resourceVersion":"100704780"}, "reason │
│ 2022-12-07T14:51:55.381Z    INFO    controllers.RayService    Reconciling the cluster component.    {"ServiceName": "ray/test"}                                                                                                              │
│ 2022-12-07T14:51:55.382Z    DEBUG    controllers.RayService    fetchDashboardAgentURL     {"dashboard agent service found": "test-raycluster-85lmq-dashboard-svc"}                                                                           │
│ 2022-12-07T14:51:55.382Z    DEBUG    controllers.RayService    fetchDashboardAgentURL     {"dashboardURL": "test-raycluster-85lmq-dashboard-svc.ray.svc.cluster.local:52365"}                                                                │
│ 2022-12-07T14:51:55.383Z    DEBUG    controllers.RayService    shouldUpdate    {"shouldUpdateServe": true, "reason": "No Serve deployments have started deploying for cluster test-raycluster-85lmq with key ray/test/test-raycluster-85lm │
│ 2022-12-07T14:51:55.383Z    DEBUG    controllers.RayService    updateServeDeployment    {"config": {"importPath":"inference.cluster.testcluster.testcluster.graphs","runtimeEnv":"pip: [\n  \"fastapi\",\n  \"deepface==0.0.75\",\n  \"pyth │
│ 2022-12-07T14:51:55.383Z    DEBUG    controllers.RayService    updateServeDeployment    {"json config": "{\"import_path\":\"inference.cluster.testcluster.testcluster.graphs\",\"runtime_env\":{\"pip\":[\"fastapi\",\"deepface==0.0.75\",\ │
│ 2022-12-07T14:51:55.397Z    DEBUG    controllers.RayService    updateServeDeployment    {"message": "Cached Serve config for Ray cluster test-raycluster-85lmq with key ray/test/test-raycluster-85lmq"}                                   │
│ 2022-12-07T14:51:55.397Z    DEBUG    events    Normal    {"object": {"kind":"RayService","namespace":"ray","name":"test","uid":"241547c1-7a51-41de-b917-24242fb756fe","apiVersion":"ray.io/v1alpha1","resourceVersion":"100704780"}, "reason │
│ 2022-12-07T14:51:55.405Z    DEBUG    controllers.RayService    getAndCheckServeStatus    {"statusMap": {}, "serveStatuses": {"app_status":{"status":"DEPLOYING","lastUpdateTime":"2022-12-07T14:51:55Z","healthLastUpdateTime":"2022-12-07T14 │
│ 2022-12-07T14:51:55.405Z    INFO    controllers.RayService    Check serve health    {"ServiceName": "ray/test", "isHealthy": true, "isReady": false, "isActive": false}                                                                      │
│ 2022-12-07T14:51:55.427Z    INFO    controllers.RayService    Mark cluster as waiting for Serve deployments    {"ServiceName": "ray/test", "rayCluster": {"apiVersion": "ray.io/v1alpha1", "kind": "RayCluster", "namespace": "ray", "name": │
│ 2022-12-07T14:51:55.427Z    INFO    controllers.RayService    Cluster is healthy but not ready: checking again in 2s    {"ServiceName": "ray/test"}                                                                                          │
│ 2022-12-07T14:51:55.427Z    DEBUG    events    Normal    {"object": {"kind":"RayService","namespace":"ray","name":"test","uid":"241547c1-7a51-41de-b917-24242fb756fe","apiVersion":"ray.io/v1alpha1","resourceVersion":"100704795"}, "reason │
│ 2022-12-07T14:51:55.427Z    INFO    controllers.RayService    Reconciling the cluster component.    {"ServiceName": "ray/test"}                                                                                                              │
│ 2022-12-07T14:51:55.428Z    DEBUG    controllers.RayService    fetchDashboardAgentURL     {"dashboard agent service found": "test-raycluster-85lmq-dashboard-svc"}                                                                           │
│ 2022-12-07T14:51:55.428Z    DEBUG    controllers.RayService    fetchDashboardAgentURL     {"dashboardURL": "test-raycluster-85lmq-dashboard-svc.ray.svc.cluster.local:52365"}                                                                │
│ 2022-12-07T14:51:55.429Z    DEBUG    controllers.RayService    shouldUpdate    {"shouldUpdateServe": true, "reason": "No Serve deployments have started deploying for cluster test-raycluster-85lmq with key ray/test/test-raycluster-85lm │
│ 2022-12-07T14:51:55.429Z    DEBUG    controllers.RayService    updateServeDeployment    {"config": {"importPath":"inference.cluster.testcluster.testcluster.graphs","runtimeEnv":"pip: [\n  \"fastapi\",\n  \"deepface==0.0.75\",\n  \"pyth │
│ 2022-12-07T14:51:55.429Z    DEBUG    controllers.RayService    updateServeDeployment    {"json config": "{\"import_path\":\"inference.cluster.testcluster.testcluster.graphs\",\"runtime_env\":{\"pip\":[\"fastapi\",\"deepface==0.0.75\",\ │
│ 2022-12-07T14:51:55.445Z    DEBUG    controllers.RayService    updateServeDeployment    {"message": "Cached Serve config for Ray cluster test-raycluster-85lmq with key ray/test/test-raycluster-85lmq"}                                   │
│ 2022-12-07T14:51:55.445Z    DEBUG    events    Normal    {"object": {"kind":"RayService","namespace":"ray","name":"test","uid":"241547c1-7a51-41de-b917-24242fb756fe","apiVersion":"ray.io/v1alpha1","resourceVersion":"100704795"}, "reason │
│ 2022-12-07T14:51:55.452Z    DEBUG    controllers.RayService    getAndCheckServeStatus    {"statusMap": {}, "serveStatuses": {"app_status":{"status":"DEPLOYING","lastUpdateTime":"2022-12-07T14:51:55Z","healthLastUpdateTime":"2022-12-07T14 │
│ 2022-12-07T14:51:55.452Z    INFO    controllers.RayService    Check serve health    {"ServiceName": "ray/test", "isHealthy": true, "isReady": false, "isActive": false}                                                                      │
│ 2022-12-07T14:51:55.470Z    INFO    controllers.RayService    Mark cluster as waiting for Serve deployments    {"ServiceName": "ray/test", "rayCluster": {"apiVersion": "ray.io/v1alpha1", "kind": "RayCluster", "namespace": "ray", "name": │
│ 2022-12-07T14:51:55.470Z    INFO    controllers.RayService    Cluster is healthy but not ready: checking again in 2s    {"ServiceName": "ray/test"}                                                                                          │
│ 2022-12-07T14:51:55.471Z    DEBUG    events    Normal    {"object": {"kind":"RayService","namespace":"ray","name":"test","uid":"241547c1-7a51-41de-b917-24242fb756fe","apiVersion":"ray.io/v1alpha1","resourceVersion":"100704795"}, "reason │
│ 2022-12-07T14:51:57.428Z    INFO    controllers.RayService    Reconciling the cluster component.    {"ServiceName": "ray/test"}                                                                                                              │
│ 2022-12-07T14:51:57.429Z    DEBUG    controllers.RayService    fetchDashboardAgentURL     {"dashboard agent service found": "test-raycluster-85lmq-dashboard-svc"}                                                                           │
│ 2022-12-07T14:51:57.430Z    DEBUG    controllers.RayService    fetchDashboardAgentURL     {"dashboardURL": "test-raycluster-85lmq-dashboard-svc.ray.svc.cluster.local:52365"}                                                                │
│ 2022-12-07T14:51:57.430Z    DEBUG    controllers.RayService    shouldUpdate    {"shouldUpdateServe": true, "reason": "No Serve deployments have started deploying for cluster test-raycluster-85lmq with key ray/test/test-raycluster-85lm │
│ 2022-12-07T14:51:57.430Z    DEBUG    controllers.RayService    updateServeDeployment    {"config": {"importPath":"inference.cluster.testcluster.testcluster.graphs","runtimeEnv":"pip: [\n  \"fastapi\",\n  \"deepface==0.0.75\",\n  \"pyth │
│ 2022-12-07T14:51:57.430Z    DEBUG    controllers.RayService    updateServeDeployment    {"json config": "{\"import_path\":\"inference.cluster.testcluster.testcluster.graphs\",\"runtime_env\":{\"pip\":[\"fastapi\",\"deepface==0.0.75\",\ │
│ 2022-12-07T14:51:57.447Z    DEBUG    controllers.RayService    updateServeDeployment    {"message": "Cached Serve config for Ray cluster test-raycluster-85lmq with key ray/test/test-raycluster-85lmq"}                                   │
│ 2022-12-07T14:51:57.447Z    DEBUG    events    Normal    {"object": {"kind":"RayService","namespace":"ray","name":"test","uid":"241547c1-7a51-41de-b917-24242fb756fe","apiVersion":"ray.io/v1alpha1","resourceVersion":"100704795"}, "reason │
│ 2022-12-07T14:51:57.551Z    DEBUG    controllers.RayService    getAndCheckServeStatus    {"statusMap": {}, "serveStatuses": {"app_status":{"status":"DEPLOYING","lastUpdateTime":"2022-12-07T14:51:57Z","healthLastUpdateTime":"2022-12-07T14 │
│ 2022-12-07T14:51:57.551Z    INFO    controllers.RayService    Check serve health    {"ServiceName": "ray/test", "isHealthy": true, "isReady": false, "isActive": false}                                                                      │
│ 2022-12-07T14:51:57.574Z    INFO    controllers.RayService    Mark cluster as waiting for Serve deployments    {"ServiceName": "ray/test", "rayCluster": {"apiVersion": "ray.io/v1alpha1", "kind": "RayCluster", "namespace": "ray", "name": │
│ 2022-12-07T14:51:57.574Z    INFO    controllers.RayService    Cluster is healthy but not ready: checking again in 2s    {"ServiceName": "ray/test"}                                                                                          │
│ 2022-12-07T14:51:57.574Z    DEBUG    events    Normal    {"object": {"kind":"RayService","namespace":"ray","name":"test","uid":"241547c1-7a51-41de-b917-24242fb756fe","apiVersion":"ray.io/v1alpha1","resourceVersion":"100704808"}, "reason │
│ 2022-12-07T14:51:57.575Z    INFO    controllers.RayService    Reconciling the cluster component.    {"ServiceName": "ray/test"}                                                                                                              │
│ 2022-12-07T14:51:57.575Z    DEBUG    controllers.RayService    fetchDashboardAgentURL     {"dashboard agent service found": "test-raycluster-85lmq-dashboard-svc"}                                                                           │
│ 2022-12-07T14:51:57.575Z    DEBUG    controllers.RayService    fetchDashboardAgentURL     {"dashboardURL": "test-raycluster-85lmq-dashboard-svc.ray.svc.cluster.local:52365"}                                                                │
│ 2022-12-07T14:51:57.575Z    DEBUG    controllers.RayService    shouldUpdate    {"shouldUpdateServe": true, "reason": "No Serve deployments have started deploying for cluster test-raycluster-85lmq with key ray/test/test-raycluster-85lm │
│ 2022-12-07T14:51:57.575Z    DEBUG    controllers.RayService    updateServeDeployment    {"config": {"importPath":"inference.cluster.testcluster.testcluster.graphs","runtimeEnv":"pip: [\n  \"fastapi\",\n  \"deepface==0.0.75\",\n  \"pyth │
│ 2022-12-07T14:51:57.576Z    DEBUG    controllers.RayService    updateServeDeployment    {"json config": "{\"import_path\":\"inference.cluster.testcluster.testcluster.graphs\",\"runtime_env\":{\"pip\":[\"fastapi\",\"deepface==0.0.75\",\ │
│ 2022-12-07T14:51:57.622Z    DEBUG    controllers.RayService    updateServeDeployment    {"message": "Cached Serve config for Ray cluster test-raycluster-85lmq with key ray/test/test-raycluster-85lmq"}                                   │
│ 2022-12-07T14:51:57.622Z    DEBUG    events    Normal    {"object": {"kind":"RayService","namespace":"ray","name":"test","uid":"241547c1-7a51-41de-b917-24242fb756fe","apiVersion":"ray.io/v1alpha1","resourceVersion":"100704808"}, "reason │
│ 2022-12-07T14:51:57.628Z    DEBUG    controllers.RayService    getAndCheckServeStatus    {"statusMap": {}, "serveStatuses": {"app_status":{"status":"DEPLOYING","lastUpdateTime":"2022-12-07T14:51:57Z","healthLastUpdateTime":"2022-12-07T14 │
│ 2022-12-07T14:51:57.628Z    INFO    controllers.RayService    Check serve health    {"ServiceName": "ray/test", "isHealthy": true, "isReady": false, "isActive": false}                                                                      │
│ 2022-12-07T14:51:57.650Z    INFO    controllers.RayService    Mark cluster as waiting for Serve deployments    {"ServiceName": "ray/test", "rayCluster": {"apiVersion": "ray.io/v1alpha1", "kind": "RayCluster", "namespace": "ray", "name": │
│ 2022-12-07T14:51:57.650Z    INFO    controllers.RayService    Cluster is healthy but not ready: checking again in 2s    {"ServiceName": "ray/test"}  

@brucez-anyscale
Contributor

Thanks for the update. I think we know the solution and just need to get it merged.
We need to let the controller ignore reconciles triggered by status updates, using:

builder.WithPredicates(predicate.GenerationChangedPredicate{})

cc @sihanwang41

@brucez-anyscale
Contributor

also cc @wilsonwang371 @Jeffwan

@wilsonwang371
Contributor

also cc @wilsonwang371 @Jeffwan

Do we need to make the change in kuberay controller code?

@brucez-anyscale
Contributor

brucez-anyscale commented Dec 8, 2022

also cc @wilsonwang371 @Jeffwan

Do we need to make the change in kuberay controller code?

It should be a code change to SetupWithManager.

Update it to:

// With GenerationChangedPredicate, reconciles are only triggered when the
// RayService spec changes (metadata.generation is bumped), not by status-only
// updates that the operator writes back itself.
func (r *RayServiceReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&rayv1alpha1.RayService{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		Owns(&rayv1alpha1.RayCluster{}).
		Owns(&corev1.Service{}).
		Owns(&networkingv1.Ingress{}).
		Complete(r)
}

@sihanwang41
Contributor

Hi @Martin4R, thank you for trying it out! I will test and verify the fix. If it works, I will merge it.

@pandemosth

I'm seeing the same issue with Ray 2.2.0 and both Kuberay 0.4.0 and nightly, even though the merge above suggests this should have been fixed in 0.4.0.

I can see the same "Received new config deployment request. Cancelling previous request." messages in the Serve Controller log, and the kuberay-operator repeating the serve deployment every 2s until the cluster is restarted.

Excerpt from kuberay-operator logs:

2023-01-16T21:54:16.664Z	INFO	controllers.RayService	Reconciling the cluster component.	{"ServiceName": "default/dev"}
2023-01-16T21:54:16.666Z	DEBUG	controllers.RayService	fetchDashboardAgentURL 	{"dashboard agent service found": "dev-raycluster-prjd5-dashboard-svc"}
2023-01-16T21:54:16.666Z	DEBUG	controllers.RayService	fetchDashboardAgentURL 	{"dashboardURL": "dev-raycluster-prjd5-dashboard-svc.default.svc.cluster.local:52365"}
2023-01-16T21:54:16.666Z	DEBUG	controllers.RayService	shouldUpdate	{"shouldUpdateServe": true, "reason": "Nothing has been cached for cluster dev-raycluster-prjd5 with key default/dev/dev-raycluster-prjd5"}
2023-01-16T21:54:16.666Z	DEBUG	controllers.RayService	updateServeDeployment	{"config": ...}
2023-01-16T21:54:16.666Z	DEBUG	controllers.RayService	updateServeDeployment	{"json config": ...}
2023-01-16T21:54:17.560Z	DEBUG	controllers.RayService	updateServeDeployment	{"message": "Cached Serve config for Ray cluster dev-raycluster-prjd5 with key default/dev/dev-raycluster-prjd5"}
2023-01-16T21:54:17.560Z	DEBUG	events	Normal	{"object": {"kind":"RayService","namespace":"default","name":"dev","uid":"88554a22-bba3-45d3-99db-a318527cc7c4","apiVersion":"ray.io/v1alpha1","resourceVersion":"125680128"}, "reason": "SubmittedServeDeployment", "message": "Controller sent API request to update Serve deployments on cluster dev-raycluster-prjd5"}
2023-01-16T21:54:17.565Z	DEBUG	controllers.RayService	getAndCheckServeStatus	{"statusMap": {}, "serveStatuses": {"app_status":{"status":"DEPLOYING","lastUpdateTime":"2023-01-16T21:54:17Z","healthLastUpdateTime":"2023-01-16T21:54:17Z"}}}
2023-01-16T21:54:17.565Z	INFO	controllers.RayService	Check serve health	{"ServiceName": "default/dev", "isHealthy": true, "isReady": false, "isActive": false}
2023-01-16T21:54:17.591Z	INFO	controllers.RayService	Mark cluster as waiting for Serve deployments	{"ServiceName": "default/dev", "rayCluster": {"apiVersion": "ray.io/v1alpha1", "kind": "RayCluster", "namespace": "default", "name": "dev-raycluster-prjd5"}}
2023-01-16T21:54:17.591Z	INFO	controllers.RayService	Cluster is healthy but not ready: checking again in 2s	{"ServiceName": "default/dev"}
2023-01-16T21:54:17.592Z	DEBUG	events	Normal	{"object": {"kind":"RayService","namespace":"default","name":"dev","uid":"88554a22-bba3-45d3-99db-a318527cc7c4","apiVersion":"ray.io/v1alpha1","resourceVersion":"125680157"}, "reason": "ServiceNotReady", "message": "The service is not ready yet. Controller will perform a round of actions in 2s."}
2023-01-16T21:54:19.592Z	INFO	controllers.RayService	Reconciling the cluster component.	{"ServiceName": "default/dev"}
2023-01-16T21:54:19.595Z	DEBUG	controllers.RayService	fetchDashboardAgentURL 	{"dashboard agent service found": "dev-raycluster-prjd5-dashboard-svc"}
2023-01-16T21:54:19.595Z	DEBUG	controllers.RayService	fetchDashboardAgentURL 	{"dashboardURL": "dev-raycluster-prjd5-dashboard-svc.default.svc.cluster.local:52365"}
2023-01-16T21:54:19.595Z	DEBUG	controllers.RayService	shouldUpdate	{"shouldUpdateServe": true, "reason": "No Serve deployments have started deploying for cluster dev-raycluster-prjd5 with key default/dev/dev-raycluster-prjd5", "cachedServeConfig": ...}
2023-01-16T21:54:19.595Z	DEBUG	controllers.RayService	updateServeDeployment	{"config": ...}
2023-01-16T21:54:19.596Z	DEBUG	controllers.RayService	updateServeDeployment	{"json config": ...}
2023-01-16T21:54:19.607Z	DEBUG	controllers.RayService	updateServeDeployment	{"message": "Cached Serve config for Ray cluster dev-raycluster-prjd5 with key default/dev/dev-raycluster-prjd5"}
2023-01-16T21:54:19.608Z	DEBUG	events	Normal	{"object": {"kind":"RayService","namespace":"default","name":"dev","uid":"88554a22-bba3-45d3-99db-a318527cc7c4","apiVersion":"ray.io/v1alpha1","resourceVersion":"125680157"}, "reason": "SubmittedServeDeployment", "message": "Controller sent API request to update Serve deployments on cluster dev-raycluster-prjd5"}
2023-01-16T21:54:19.614Z	DEBUG	controllers.RayService	getAndCheckServeStatus	{"statusMap": {}, "serveStatuses": {"app_status":{"status":"DEPLOYING","lastUpdateTime":"2023-01-16T21:54:19Z","healthLastUpdateTime":"2023-01-16T21:54:17Z"}}}
2023-01-16T21:54:19.614Z	INFO	controllers.RayService	Check serve health	{"ServiceName": "default/dev", "isHealthy": true, "isReady": false, "isActive": false}
2023-01-16T21:54:19.655Z	INFO	controllers.RayService	Mark cluster as waiting for Serve deployments	{"ServiceName": "default/dev", "rayCluster": {"apiVersion": "ray.io/v1alpha1", "kind": "RayCluster", "namespace": "default", "name": "dev-raycluster-prjd5"}}
2023-01-16T21:54:19.655Z	INFO	controllers.RayService	Cluster is healthy but not ready: checking again in 2s	{"ServiceName": "default/dev"}
2023-01-16T21:54:19.655Z	DEBUG	events	Normal	{"object": {"kind":"RayService","namespace":"default","name":"dev","uid":"88554a22-bba3-45d3-99db-a318527cc7c4","apiVersion":"ray.io/v1alpha1","resourceVersion":"125680181"}, "reason": "ServiceNotReady", "message": "The service is not ready yet. Controller will perform a round of actions in 2s."}

This line looks suspicious since the logs show a serve deployment had started 2s earlier:

2023-01-16T21:54:19.595Z DEBUG controllers.RayService shouldUpdate {"shouldUpdateServe": true, "reason": "No Serve deployments have started deploying for cluster dev-raycluster-prjd5 with key default/dev/dev-raycluster-prjd5", "cachedServeConfig": ...}

@shrekris-anyscale
Contributor Author

Hi @pandemosth, thanks for following up with the logs. It looks like this is still an issue, so we'll take a closer look.

@Martin4R

If this prevents the cluster from starting up, the extremely dirty temporary workaround is to compile the Kuberay operator from source with the following change to rayservice_controller.go:

// Only let shouldUpdate pass during seconds whose Unix timestamp is divisible
// by 5, so Serve config submissions happen at most roughly once every 5 seconds.
currentTime := time.Now().UTC().Unix()
paused := true
if currentTime%5 == 0 {
	paused = false
}
shouldUpdate = !paused

This makes the gaps between serve deployment requests 5 seconds long, which in my case is enough for the actors to report their status correctly.

@shrekris-anyscale
Contributor Author

Thanks for posting the workaround @Martin4R! cc'ing @spillai to the issue.

@uthark

uthark commented Jan 25, 2023

I can confirm the issue. I can't deploy the example app because of this.

bewestphal added a commit to bewestphal/kuberay that referenced this issue Mar 9, 2023
By default, the kuberay operator expects deployments to respond within 2 seconds.

Hack found from ray issue:
ray-project/ray#28652
bewestphal added a commit to bewestphal/kuberay that referenced this issue Mar 9, 2023
@bewestphal
Contributor

Thanks for sharing, @Martin4R. I've ported this workaround to my fork and it has mitigated the issue.
https://github.com/bewestphal/kuberay/tree/bewestphal/service-controller-update-speed-ten

This feels like it needs a proper wait mechanism, though. I'm seeing this issue on v0.4.0.

Anecdotally, I want to say this happens when my app is larger, in terms of overall app bundle size.

@pascalwhoop

We're trying to host LLMs from alpa.ai via Ray on a k8s cluster and are also facing this issue. One workaround for us is to deploy the cluster and then have a separate pod connect to it and perform the deployment, but that is of course not the intended behavior.

bewestphal added a commit to bewestphal/kuberay that referenced this issue Mar 21, 2023
By default, the kuberay operator expects deployments to respond within 2 seconds.

Hack found from ray issue:
ray-project/ray#28652
bewestphal added a commit to bewestphal/kuberay that referenced this issue Mar 21, 2023
sihanwang41 added the P1 (Issue that should be fixed within a few weeks) label Mar 23, 2023
@sihanwang41
Contributor

Hi @pascalwhoop @bewestphal @uthark @Martin4R, can you give the nightly wheel a try? We recently landed a fix in kuberay for the long runtime_env preparation time issue. Hopefully it fixes your issues; please let me know if the issue still exists.

@Martin4R

The issue is still there with Kuberay 0.5.0 and Ray 2.4.0. The serve deployment request is still applied at least every 2 seconds. The actors never manage to start up, and at some point the whole cluster gets restarted.

Logs from Serve Controller:

INFO 2023-04-27 06:04:01,931 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:01,933 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.
INFO 2023-04-27 06:04:04,169 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:04,171 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.
INFO 2023-04-27 06:04:06,223 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:06,225 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.
INFO 2023-04-27 06:04:07,271 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:07,272 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.
INFO 2023-04-27 06:04:08,384 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:08,386 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.
INFO 2023-04-27 06:04:10,527 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:10,528 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.
INFO 2023-04-27 06:04:12,259 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:12,260 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.
INFO 2023-04-27 06:04:12,728 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:12,729 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.
INFO 2023-04-27 06:04:14,831 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:14,832 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.
INFO 2023-04-27 06:04:16,981 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:16,982 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.
INFO 2023-04-27 06:04:17,372 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:17,373 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.
INFO 2023-04-27 06:04:19,328 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:19,330 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.
INFO 2023-04-27 06:04:21,444 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:21,445 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.
INFO 2023-04-27 06:04:22,684 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:22,686 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.
INFO 2023-04-27 06:04:23,672 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:23,673 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.

I then tried changing rayservice_controller.go to use "ServiceDefaultRequeueDuration = 6 * time.Second", but this did not help. I had to reapply the "currentTime" workaround from above to make it work.
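
For reference, the change I tried was essentially the following (declaration style and surrounding constants may differ between KubeRay versions):

// This value appears to control how soon the RayService reconciler requeues a
// service that is "healthy but not ready"; the operator logs above show the
// stock behavior checking again every 2s. Raising it alone did not help here.
var ServiceDefaultRequeueDuration = 6 * time.Second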

@sihanwang41
Contributor

The issue is still there with Kuberay 0.5.0 and Ray 2.4.0. [...] I had to reapply the "currentTime" workaround from above to make it work.

Hi @Martin4R, thank you for trying it out. Unfortunately my change is not in 0.5.0; can you verify with the nightly wheel?

@Martin4R

Hi @sihanwang41,
I tried it now with the nightly Docker image of the Kuberay operator and it works fine.
Thanks! :-)

@Martin4R

Martin4R commented May 9, 2023

Hi @sihanwang41,
yesterday and today we saw strange behavior on two of our Kuberay clusters, which may be related to the fix in the Kuberay operator you mentioned above. We were running the exact same nightly Kuberay operator version that I used in my comment above.

The Kuberay clusters started up and the dashboard was available, but they did not start any actors, not even the ServeController actor or the HTTPProxy actors that usually appear. It seems to me that the freshly started clusters never received the serve deployment request from the operator.
One cluster did not recover even after redeployments, but then suddenly worked again, and nobody knows why.
The other cluster showed the same "actorless" startup, then got automatically restarted after the serviceUnhealthySecondThreshold/deploymentUnhealthySecondThreshold (which we set to 600 seconds) and got back to normal after this restart, without us changing anything in its deployment or config.

We never saw startup behavior like this in the last 7 months of using Kuberay, so it must be something that came in recently.

I saw that the bugfix in the Kuberay operator was to remove the logic that repeatedly reapplies the serve deployment config. My guess would be that there is some kind of race condition when a cluster freshly starts up. In the best case, the cluster is already available and receives the first serve deployment request from the operator. In the worst case, the cluster is not available yet, misses the first serve deployment request, and then stays in this state without any actors because the request is never resent.

Can you recheck if the bugfix can lead to a behavior like this?

@sihanwang41
Contributor

Hi @Martin4R,
If the request is not received, the operator should get an error and trigger another deploy request. We check the dashboard health before sending requests.
Are you able to collect the dashboard log and Serve controller log? That would help me understand the scenario better.

@sihanwang41
Contributor

Hi @Martin4R, it looks like that is a different issue; please create a separate ticket for it. Closing this ticket now.
Thanks!
