[Serve] Kubernetes operator reapplies Serve config too frequently #28652

Closed
shrekris-anyscale opened this issue Sep 20, 2022 · 20 comments
Labels: bug (Something that is supposed to be working; but isn't), P1 (Issue that should be fixed within a few weeks), serve (Ray Serve Related Issue)


@shrekris-anyscale
Contributor

What happened + What you expected to happen

From a Ray Slack thread:

[A user is] working with Kuberay and Ray 2.0.0 and trying to deploy a RayService. It was not easy to get the cluster to start up, because it failed and restarted often. [The user] managed to get it running, but [they] saw that the “task:run_graph” got aborted a lot of times on startup.

The user posted logs from the Serve controller and the Kubernetes operator that imply the controller was receiving new Serve config deployment requests more than once per second. E.g.:

INFO 2022-09-20 00:39:34,046 controller 160 controller.py:439 - Received new config deployment request. Cancelling previous request.
INFO 2022-09-20 00:39:34,288 controller 160 controller.py:439 - Received new config deployment request. Cancelling previous request.
INFO 2022-09-20 00:39:34,446 controller 160 controller.py:439 - Received new config deployment request. Cancelling previous request.
INFO 2022-09-20 00:39:34,572 controller 160 controller.py:439 - Received new config deployment request. Cancelling previous request.
INFO 2022-09-20 00:39:34,750 controller 160 controller.py:439 - Received new config deployment request. Cancelling previous request.

This likely doesn't give enough time for Serve to deploy the config completely before the request is canceled and re-issued, preventing the Serve application from being deployed. This in turn likely caused the cluster to be marked unhealthy and restarted.

Ideally, the Serve application should have enough time to start up without being interrupted.

Versions / Dependencies

Ray 2.0.0 and Kuberay.

Reproduction script

See the Ray Slack thread for logs.

The user observed the issue on the FruitStand example.

Issue Severity

No response

shrekris-anyscale added the bug (Something that is supposed to be working; but isn't) and serve (Ray Serve Related Issue) labels Sep 20, 2022
@Martin4R

Martin4R commented Dec 7, 2022

I just retested this with a Kuberay operator built from the newest commit on the Kuberay master branch (4ca05ab).
The issue still exists.

I can see in the logs that the serve deployment requests are now sent twice every 2 seconds from the operator to the Ray cluster, but 2 seconds still does not seem to be enough time for Ray Serve to report the startup status of the serve deployments. (When I apply my workaround to the Kuberay operator code, which prevents serve deployment requests from being sent more often than every 5 seconds, the cluster starts up fine.)
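
For illustration, here is a minimal sketch of that kind of throttle (hypothetical names, not actual KubeRay code; the exact change I used is posted further down in this thread):

package main

import (
	"fmt"
	"time"
)

// submissionGate allows a Serve config submission only if at least minGap has
// passed since the previous allowed submission. Sketch only: a real operator
// would keep this state per RayCluster, e.g. alongside the cached Serve config.
type submissionGate struct {
	lastSubmit time.Time
	minGap     time.Duration
}

func (g *submissionGate) allow(now time.Time) bool {
	if now.Sub(g.lastSubmit) < g.minGap {
		return false // too soon since the last allowed submission; skip this reconcile
	}
	g.lastSubmit = now
	return true
}

func main() {
	gate := submissionGate{minGap: 5 * time.Second}
	start := time.Now()
	// Simulate a reconcile every 2 seconds: only the rounds that land at least
	// 5 seconds after the previous allowed one get through.
	for i := 0; i < 6; i++ {
		now := start.Add(time.Duration(2*i) * time.Second)
		fmt.Printf("t=%2ds shouldUpdateServe=%v\n", 2*i, gate.allow(now))
	}
}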

As you can see in this log from the Serve Controller actor, the serve deployment requests go on for more than 5 minutes. My actors never start up, and eventually the cluster hits the restart timeout and restarts.

:actor_name:ServeController
INFO 2022-12-07 06:55:35,551 controller 108 http_state.py:129 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-9546f56f15cd6912255876a805c2298ec7a4556c696a8912f8430aa0' on node '9546f56f15cd6912255876a805c2298ec7a4556c696a8912f8430aa0' listening on '0.0.0.0:8000'
INFO 2022-12-07 06:55:35,565 controller 108 http_state.py:129 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-ed9609ee385a26802a8e2e9f00113100ea4096b00c27e4c246bc8101' on node 'ed9609ee385a26802a8e2e9f00113100ea4096b00c27e4c246bc8101' listening on '0.0.0.0:8000'
INFO 2022-12-07 06:55:36,957 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:37,021 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:37,963 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:38,017 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:40,017 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:40,064 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:42,073 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:42,142 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:44,152 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:44,219 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:46,221 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:46,272 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:48,277 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:48,331 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:50,330 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:50,384 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:52,382 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:52,438 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:54,427 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:54,883 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:56,476 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:56,546 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:58,786 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:55:58,905 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:56:00,910 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:56:00,955 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:56:01,032 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:56:02,956 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 06:56:03,058 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.

...

INFO 2022-12-07 07:00:14,021 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:14,142 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:16,157 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:16,249 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
2022-12-07 07:00:16,253	ERROR worker.py:400 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more Information.
INFO 2022-12-07 07:00:18,222 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:18,327 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:20,350 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:20,457 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:22,441 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:22,671 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:24,538 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:24,646 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:26,648 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:26,724 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:28,742 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:28,930 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:30,801 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:30,953 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:31,290 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:32,878 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
2022-12-07 07:00:32,884	ERROR worker.py:400 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more Information.
INFO 2022-12-07 07:00:33,044 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:33,124 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
2022-12-07 07:00:33,142	ERROR worker.py:400 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more Information.
INFO 2022-12-07 07:00:34,967 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:35,126 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:37,046 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:37,110 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:39,112 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:39,178 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
2022-12-07 07:00:39,181	ERROR worker.py:400 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more Information.
INFO 2022-12-07 07:00:41,174 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.
INFO 2022-12-07 07:00:41,285 controller 108 controller.py:456 - Received new config deployment request. Cancelling previous request.

The log from the kuberay operator constantly states:

"shouldUpdateServe": true, "reason": "No Serve deployments have started deploying for cluster  test-raycluster-85lmq with key ray/legal/legal-raycluster-85lm

Log snippet from kuberay operator log:

│ 2022-12-07T14:51:53.385Z    DEBUG    controllers.RayService    updateServeDeployment    {"json config": "{\"import_path\":\"inference.cluster.testcluster.testcluster.graphs\",\"runtime_env\":{\"pip\":[\"fastapi\",\"deepface==0.0.75\",\ │
│ 2022-12-07T14:51:53.440Z    DEBUG    controllers.RayService    updateServeDeployment    {"message": "Cached Serve config for Ray cluster test-raycluster-85lmq with key ray/test/test-raycluster-85lmq"}                                   │
│ 2022-12-07T14:51:53.440Z    DEBUG    events    Normal    {"object": {"kind":"RayService","namespace":"ray","name":"test","uid":"241547c1-7a51-41de-b917-24242fb756fe","apiVersion":"ray.io/v1alpha1","resourceVersion":"100704780"}, "reason │
│ 2022-12-07T14:51:53.447Z    DEBUG    controllers.RayService    getAndCheckServeStatus    {"statusMap": {}, "serveStatuses": {"app_status":{"status":"DEPLOYING","lastUpdateTime":"2022-12-07T14:51:53Z","healthLastUpdateTime":"2022-12-07T14 │
│ 2022-12-07T14:51:53.447Z    INFO    controllers.RayService    Check serve health    {"ServiceName": "ray/test", "isHealthy": true, "isReady": false, "isActive": false}                                                                      │
│ 2022-12-07T14:51:53.466Z    INFO    controllers.RayService    Mark cluster as waiting for Serve deployments    {"ServiceName": "ray/test", "rayCluster": {"apiVersion": "ray.io/v1alpha1", "kind": "RayCluster", "namespace": "ray", "name": │
│ 2022-12-07T14:51:53.466Z    INFO    controllers.RayService    Cluster is healthy but not ready: checking again in 2s    {"ServiceName": "ray/test"}                                                                                          │
│ 2022-12-07T14:51:53.466Z    DEBUG    events    Normal    {"object": {"kind":"RayService","namespace":"ray","name":"test","uid":"241547c1-7a51-41de-b917-24242fb756fe","apiVersion":"ray.io/v1alpha1","resourceVersion":"100704780"}, "reason │
│ 2022-12-07T14:51:55.381Z    INFO    controllers.RayService    Reconciling the cluster component.    {"ServiceName": "ray/test"}                                                                                                              │
│ 2022-12-07T14:51:55.382Z    DEBUG    controllers.RayService    fetchDashboardAgentURL     {"dashboard agent service found": "test-raycluster-85lmq-dashboard-svc"}                                                                           │
│ 2022-12-07T14:51:55.382Z    DEBUG    controllers.RayService    fetchDashboardAgentURL     {"dashboardURL": "test-raycluster-85lmq-dashboard-svc.ray.svc.cluster.local:52365"}                                                                │
│ 2022-12-07T14:51:55.383Z    DEBUG    controllers.RayService    shouldUpdate    {"shouldUpdateServe": true, "reason": "No Serve deployments have started deploying for cluster test-raycluster-85lmq with key ray/test/test-raycluster-85lm │
│ 2022-12-07T14:51:55.383Z    DEBUG    controllers.RayService    updateServeDeployment    {"config": {"importPath":"inference.cluster.testcluster.testcluster.graphs","runtimeEnv":"pip: [\n  \"fastapi\",\n  \"deepface==0.0.75\",\n  \"pyth │
│ 2022-12-07T14:51:55.383Z    DEBUG    controllers.RayService    updateServeDeployment    {"json config": "{\"import_path\":\"inference.cluster.testcluster.testcluster.graphs\",\"runtime_env\":{\"pip\":[\"fastapi\",\"deepface==0.0.75\",\ │
│ 2022-12-07T14:51:55.397Z    DEBUG    controllers.RayService    updateServeDeployment    {"message": "Cached Serve config for Ray cluster test-raycluster-85lmq with key ray/test/test-raycluster-85lmq"}                                   │
│ 2022-12-07T14:51:55.397Z    DEBUG    events    Normal    {"object": {"kind":"RayService","namespace":"ray","name":"test","uid":"241547c1-7a51-41de-b917-24242fb756fe","apiVersion":"ray.io/v1alpha1","resourceVersion":"100704780"}, "reason │
│ 2022-12-07T14:51:55.405Z    DEBUG    controllers.RayService    getAndCheckServeStatus    {"statusMap": {}, "serveStatuses": {"app_status":{"status":"DEPLOYING","lastUpdateTime":"2022-12-07T14:51:55Z","healthLastUpdateTime":"2022-12-07T14 │
│ 2022-12-07T14:51:55.405Z    INFO    controllers.RayService    Check serve health    {"ServiceName": "ray/test", "isHealthy": true, "isReady": false, "isActive": false}                                                                      │
│ 2022-12-07T14:51:55.427Z    INFO    controllers.RayService    Mark cluster as waiting for Serve deployments    {"ServiceName": "ray/test", "rayCluster": {"apiVersion": "ray.io/v1alpha1", "kind": "RayCluster", "namespace": "ray", "name": │
│ 2022-12-07T14:51:55.427Z    INFO    controllers.RayService    Cluster is healthy but not ready: checking again in 2s    {"ServiceName": "ray/test"}                                                                                          │
│ 2022-12-07T14:51:55.427Z    DEBUG    events    Normal    {"object": {"kind":"RayService","namespace":"ray","name":"test","uid":"241547c1-7a51-41de-b917-24242fb756fe","apiVersion":"ray.io/v1alpha1","resourceVersion":"100704795"}, "reason │
│ 2022-12-07T14:51:55.427Z    INFO    controllers.RayService    Reconciling the cluster component.    {"ServiceName": "ray/test"}                                                                                                              │
│ 2022-12-07T14:51:55.428Z    DEBUG    controllers.RayService    fetchDashboardAgentURL     {"dashboard agent service found": "test-raycluster-85lmq-dashboard-svc"}                                                                           │
│ 2022-12-07T14:51:55.428Z    DEBUG    controllers.RayService    fetchDashboardAgentURL     {"dashboardURL": "test-raycluster-85lmq-dashboard-svc.ray.svc.cluster.local:52365"}                                                                │
│ 2022-12-07T14:51:55.429Z    DEBUG    controllers.RayService    shouldUpdate    {"shouldUpdateServe": true, "reason": "No Serve deployments have started deploying for cluster test-raycluster-85lmq with key ray/test/test-raycluster-85lm │
│ 2022-12-07T14:51:55.429Z    DEBUG    controllers.RayService    updateServeDeployment    {"config": {"importPath":"inference.cluster.testcluster.testcluster.graphs","runtimeEnv":"pip: [\n  \"fastapi\",\n  \"deepface==0.0.75\",\n  \"pyth │
│ 2022-12-07T14:51:55.429Z    DEBUG    controllers.RayService    updateServeDeployment    {"json config": "{\"import_path\":\"inference.cluster.testcluster.testcluster.graphs\",\"runtime_env\":{\"pip\":[\"fastapi\",\"deepface==0.0.75\",\ │
│ 2022-12-07T14:51:55.445Z    DEBUG    controllers.RayService    updateServeDeployment    {"message": "Cached Serve config for Ray cluster test-raycluster-85lmq with key ray/test/test-raycluster-85lmq"}                                   │
│ 2022-12-07T14:51:55.445Z    DEBUG    events    Normal    {"object": {"kind":"RayService","namespace":"ray","name":"test","uid":"241547c1-7a51-41de-b917-24242fb756fe","apiVersion":"ray.io/v1alpha1","resourceVersion":"100704795"}, "reason │
│ 2022-12-07T14:51:55.452Z    DEBUG    controllers.RayService    getAndCheckServeStatus    {"statusMap": {}, "serveStatuses": {"app_status":{"status":"DEPLOYING","lastUpdateTime":"2022-12-07T14:51:55Z","healthLastUpdateTime":"2022-12-07T14 │
│ 2022-12-07T14:51:55.452Z    INFO    controllers.RayService    Check serve health    {"ServiceName": "ray/test", "isHealthy": true, "isReady": false, "isActive": false}                                                                      │
│ 2022-12-07T14:51:55.470Z    INFO    controllers.RayService    Mark cluster as waiting for Serve deployments    {"ServiceName": "ray/test", "rayCluster": {"apiVersion": "ray.io/v1alpha1", "kind": "RayCluster", "namespace": "ray", "name": │
│ 2022-12-07T14:51:55.470Z    INFO    controllers.RayService    Cluster is healthy but not ready: checking again in 2s    {"ServiceName": "ray/test"}                                                                                          │
│ 2022-12-07T14:51:55.471Z    DEBUG    events    Normal    {"object": {"kind":"RayService","namespace":"ray","name":"test","uid":"241547c1-7a51-41de-b917-24242fb756fe","apiVersion":"ray.io/v1alpha1","resourceVersion":"100704795"}, "reason │
│ 2022-12-07T14:51:57.428Z    INFO    controllers.RayService    Reconciling the cluster component.    {"ServiceName": "ray/test"}                                                                                                              │
│ 2022-12-07T14:51:57.429Z    DEBUG    controllers.RayService    fetchDashboardAgentURL     {"dashboard agent service found": "test-raycluster-85lmq-dashboard-svc"}                                                                           │
│ 2022-12-07T14:51:57.430Z    DEBUG    controllers.RayService    fetchDashboardAgentURL     {"dashboardURL": "test-raycluster-85lmq-dashboard-svc.ray.svc.cluster.local:52365"}                                                                │
│ 2022-12-07T14:51:57.430Z    DEBUG    controllers.RayService    shouldUpdate    {"shouldUpdateServe": true, "reason": "No Serve deployments have started deploying for cluster test-raycluster-85lmq with key ray/test/test-raycluster-85lm │
│ 2022-12-07T14:51:57.430Z    DEBUG    controllers.RayService    updateServeDeployment    {"config": {"importPath":"inference.cluster.testcluster.testcluster.graphs","runtimeEnv":"pip: [\n  \"fastapi\",\n  \"deepface==0.0.75\",\n  \"pyth │
│ 2022-12-07T14:51:57.430Z    DEBUG    controllers.RayService    updateServeDeployment    {"json config": "{\"import_path\":\"inference.cluster.testcluster.testcluster.graphs\",\"runtime_env\":{\"pip\":[\"fastapi\",\"deepface==0.0.75\",\ │
│ 2022-12-07T14:51:57.447Z    DEBUG    controllers.RayService    updateServeDeployment    {"message": "Cached Serve config for Ray cluster test-raycluster-85lmq with key ray/test/test-raycluster-85lmq"}                                   │
│ 2022-12-07T14:51:57.447Z    DEBUG    events    Normal    {"object": {"kind":"RayService","namespace":"ray","name":"test","uid":"241547c1-7a51-41de-b917-24242fb756fe","apiVersion":"ray.io/v1alpha1","resourceVersion":"100704795"}, "reason │
│ 2022-12-07T14:51:57.551Z    DEBUG    controllers.RayService    getAndCheckServeStatus    {"statusMap": {}, "serveStatuses": {"app_status":{"status":"DEPLOYING","lastUpdateTime":"2022-12-07T14:51:57Z","healthLastUpdateTime":"2022-12-07T14 │
│ 2022-12-07T14:51:57.551Z    INFO    controllers.RayService    Check serve health    {"ServiceName": "ray/test", "isHealthy": true, "isReady": false, "isActive": false}                                                                      │
│ 2022-12-07T14:51:57.574Z    INFO    controllers.RayService    Mark cluster as waiting for Serve deployments    {"ServiceName": "ray/test", "rayCluster": {"apiVersion": "ray.io/v1alpha1", "kind": "RayCluster", "namespace": "ray", "name": │
│ 2022-12-07T14:51:57.574Z    INFO    controllers.RayService    Cluster is healthy but not ready: checking again in 2s    {"ServiceName": "ray/test"}                                                                                          │
│ 2022-12-07T14:51:57.574Z    DEBUG    events    Normal    {"object": {"kind":"RayService","namespace":"ray","name":"test","uid":"241547c1-7a51-41de-b917-24242fb756fe","apiVersion":"ray.io/v1alpha1","resourceVersion":"100704808"}, "reason │
│ 2022-12-07T14:51:57.575Z    INFO    controllers.RayService    Reconciling the cluster component.    {"ServiceName": "ray/test"}                                                                                                              │
│ 2022-12-07T14:51:57.575Z    DEBUG    controllers.RayService    fetchDashboardAgentURL     {"dashboard agent service found": "test-raycluster-85lmq-dashboard-svc"}                                                                           │
│ 2022-12-07T14:51:57.575Z    DEBUG    controllers.RayService    fetchDashboardAgentURL     {"dashboardURL": "test-raycluster-85lmq-dashboard-svc.ray.svc.cluster.local:52365"}                                                                │
│ 2022-12-07T14:51:57.575Z    DEBUG    controllers.RayService    shouldUpdate    {"shouldUpdateServe": true, "reason": "No Serve deployments have started deploying for cluster test-raycluster-85lmq with key ray/test/test-raycluster-85lm │
│ 2022-12-07T14:51:57.575Z    DEBUG    controllers.RayService    updateServeDeployment    {"config": {"importPath":"inference.cluster.testcluster.testcluster.graphs","runtimeEnv":"pip: [\n  \"fastapi\",\n  \"deepface==0.0.75\",\n  \"pyth │
│ 2022-12-07T14:51:57.576Z    DEBUG    controllers.RayService    updateServeDeployment    {"json config": "{\"import_path\":\"inference.cluster.testcluster.testcluster.graphs\",\"runtime_env\":{\"pip\":[\"fastapi\",\"deepface==0.0.75\",\ │
│ 2022-12-07T14:51:57.622Z    DEBUG    controllers.RayService    updateServeDeployment    {"message": "Cached Serve config for Ray cluster test-raycluster-85lmq with key ray/test/test-raycluster-85lmq"}                                   │
│ 2022-12-07T14:51:57.622Z    DEBUG    events    Normal    {"object": {"kind":"RayService","namespace":"ray","name":"test","uid":"241547c1-7a51-41de-b917-24242fb756fe","apiVersion":"ray.io/v1alpha1","resourceVersion":"100704808"}, "reason │
│ 2022-12-07T14:51:57.628Z    DEBUG    controllers.RayService    getAndCheckServeStatus    {"statusMap": {}, "serveStatuses": {"app_status":{"status":"DEPLOYING","lastUpdateTime":"2022-12-07T14:51:57Z","healthLastUpdateTime":"2022-12-07T14 │
│ 2022-12-07T14:51:57.628Z    INFO    controllers.RayService    Check serve health    {"ServiceName": "ray/test", "isHealthy": true, "isReady": false, "isActive": false}                                                                      │
│ 2022-12-07T14:51:57.650Z    INFO    controllers.RayService    Mark cluster as waiting for Serve deployments    {"ServiceName": "ray/test", "rayCluster": {"apiVersion": "ray.io/v1alpha1", "kind": "RayCluster", "namespace": "ray", "name": │
│ 2022-12-07T14:51:57.650Z    INFO    controllers.RayService    Cluster is healthy but not ready: checking again in 2s    {"ServiceName": "ray/test"}  

@brucez-anyscale
Contributor

Thanks for the update. I think we know the solution and just need to get it merged.
We need to let the controller ignore reconciles triggered by status updates, using:

builder.WithPredicates(predicate.GenerationChangedPredicate{})

cc @sihanwang41

@brucez-anyscale
Contributor

also cc @wilsonwang371 @Jeffwan

@wilsonwang371
Contributor

also cc @wilsonwang371 @Jeffwan

Do we need to make the change in kuberay controller code?

@brucez-anyscale
Contributor

brucez-anyscale commented Dec 8, 2022

also cc @wilsonwang371 @Jeffwan

Do we need to make the change in kuberay controller code?

It should be a code change to SetupWithManager.

Update it to:

// With GenerationChangedPredicate, reconciles are only triggered when the
// RayService spec changes (metadata.generation is bumped), not by status-only
// updates that the operator writes back itself.
func (r *RayServiceReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&rayv1alpha1.RayService{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		Owns(&rayv1alpha1.RayCluster{}).
		Owns(&corev1.Service{}).
		Owns(&networkingv1.Ingress{}).
		Complete(r)
}

@sihanwang41
Contributor

Hi @Martin4R, thank you for trying it out! I will test and verify the fix. If it works, I will merge it.

@pandemosth

I'm seeing the same issue with Ray 2.2.0 and both Kuberay 0.4.0 and nightly, even though the merge above suggests this should have been fixed in 0.4.0.

I can see the same "Received new config deployment request. Cancelling previous request." messages in the Serve Controller log, and the kuberay-operator repeating the serve deployment every 2s until the cluster is restarted.

Excerpt from kuberay-operator logs:

2023-01-16T21:54:16.664Z	INFO	controllers.RayService	Reconciling the cluster component.	{"ServiceName": "default/dev"}
2023-01-16T21:54:16.666Z	DEBUG	controllers.RayService	fetchDashboardAgentURL 	{"dashboard agent service found": "dev-raycluster-prjd5-dashboard-svc"}
2023-01-16T21:54:16.666Z	DEBUG	controllers.RayService	fetchDashboardAgentURL 	{"dashboardURL": "dev-raycluster-prjd5-dashboard-svc.default.svc.cluster.local:52365"}
2023-01-16T21:54:16.666Z	DEBUG	controllers.RayService	shouldUpdate	{"shouldUpdateServe": true, "reason": "Nothing has been cached for cluster dev-raycluster-prjd5 with key default/dev/dev-raycluster-prjd5"}
2023-01-16T21:54:16.666Z	DEBUG	controllers.RayService	updateServeDeployment	{"config": ...}
2023-01-16T21:54:16.666Z	DEBUG	controllers.RayService	updateServeDeployment	{"json config": ...}
2023-01-16T21:54:17.560Z	DEBUG	controllers.RayService	updateServeDeployment	{"message": "Cached Serve config for Ray cluster dev-raycluster-prjd5 with key default/dev/dev-raycluster-prjd5"}
2023-01-16T21:54:17.560Z	DEBUG	events	Normal	{"object": {"kind":"RayService","namespace":"default","name":"dev","uid":"88554a22-bba3-45d3-99db-a318527cc7c4","apiVersion":"ray.io/v1alpha1","resourceVersion":"125680128"}, "reason": "SubmittedServeDeployment", "message": "Controller sent API request to update Serve deployments on cluster dev-raycluster-prjd5"}
2023-01-16T21:54:17.565Z	DEBUG	controllers.RayService	getAndCheckServeStatus	{"statusMap": {}, "serveStatuses": {"app_status":{"status":"DEPLOYING","lastUpdateTime":"2023-01-16T21:54:17Z","healthLastUpdateTime":"2023-01-16T21:54:17Z"}}}
2023-01-16T21:54:17.565Z	INFO	controllers.RayService	Check serve health	{"ServiceName": "default/dev", "isHealthy": true, "isReady": false, "isActive": false}
2023-01-16T21:54:17.591Z	INFO	controllers.RayService	Mark cluster as waiting for Serve deployments	{"ServiceName": "default/dev", "rayCluster": {"apiVersion": "ray.io/v1alpha1", "kind": "RayCluster", "namespace": "default", "name": "dev-raycluster-prjd5"}}
2023-01-16T21:54:17.591Z	INFO	controllers.RayService	Cluster is healthy but not ready: checking again in 2s	{"ServiceName": "default/dev"}
2023-01-16T21:54:17.592Z	DEBUG	events	Normal	{"object": {"kind":"RayService","namespace":"default","name":"dev","uid":"88554a22-bba3-45d3-99db-a318527cc7c4","apiVersion":"ray.io/v1alpha1","resourceVersion":"125680157"}, "reason": "ServiceNotReady", "message": "The service is not ready yet. Controller will perform a round of actions in 2s."}
2023-01-16T21:54:19.592Z	INFO	controllers.RayService	Reconciling the cluster component.	{"ServiceName": "default/dev"}
2023-01-16T21:54:19.595Z	DEBUG	controllers.RayService	fetchDashboardAgentURL 	{"dashboard agent service found": "dev-raycluster-prjd5-dashboard-svc"}
2023-01-16T21:54:19.595Z	DEBUG	controllers.RayService	fetchDashboardAgentURL 	{"dashboardURL": "dev-raycluster-prjd5-dashboard-svc.default.svc.cluster.local:52365"}
2023-01-16T21:54:19.595Z	DEBUG	controllers.RayService	shouldUpdate	{"shouldUpdateServe": true, "reason": "No Serve deployments have started deploying for cluster dev-raycluster-prjd5 with key default/dev/dev-raycluster-prjd5", "cachedServeConfig": ...}
2023-01-16T21:54:19.595Z	DEBUG	controllers.RayService	updateServeDeployment	{"config": ...}
2023-01-16T21:54:19.596Z	DEBUG	controllers.RayService	updateServeDeployment	{"json config": ...}
2023-01-16T21:54:19.607Z	DEBUG	controllers.RayService	updateServeDeployment	{"message": "Cached Serve config for Ray cluster dev-raycluster-prjd5 with key default/dev/dev-raycluster-prjd5"}
2023-01-16T21:54:19.608Z	DEBUG	events	Normal	{"object": {"kind":"RayService","namespace":"default","name":"dev","uid":"88554a22-bba3-45d3-99db-a318527cc7c4","apiVersion":"ray.io/v1alpha1","resourceVersion":"125680157"}, "reason": "SubmittedServeDeployment", "message": "Controller sent API request to update Serve deployments on cluster dev-raycluster-prjd5"}
2023-01-16T21:54:19.614Z	DEBUG	controllers.RayService	getAndCheckServeStatus	{"statusMap": {}, "serveStatuses": {"app_status":{"status":"DEPLOYING","lastUpdateTime":"2023-01-16T21:54:19Z","healthLastUpdateTime":"2023-01-16T21:54:17Z"}}}
2023-01-16T21:54:19.614Z	INFO	controllers.RayService	Check serve health	{"ServiceName": "default/dev", "isHealthy": true, "isReady": false, "isActive": false}
2023-01-16T21:54:19.655Z	INFO	controllers.RayService	Mark cluster as waiting for Serve deployments	{"ServiceName": "default/dev", "rayCluster": {"apiVersion": "ray.io/v1alpha1", "kind": "RayCluster", "namespace": "default", "name": "dev-raycluster-prjd5"}}
2023-01-16T21:54:19.655Z	INFO	controllers.RayService	Cluster is healthy but not ready: checking again in 2s	{"ServiceName": "default/dev"}
2023-01-16T21:54:19.655Z	DEBUG	events	Normal	{"object": {"kind":"RayService","namespace":"default","name":"dev","uid":"88554a22-bba3-45d3-99db-a318527cc7c4","apiVersion":"ray.io/v1alpha1","resourceVersion":"125680181"}, "reason": "ServiceNotReady", "message": "The service is not ready yet. Controller will perform a round of actions in 2s."}

This line looks suspicious since the logs show a serve deployment had started 2s earlier:

2023-01-16T21:54:19.595Z DEBUG controllers.RayService shouldUpdate {"shouldUpdateServe": true, "reason": "No Serve deployments have started deploying for cluster dev-raycluster-prjd5 with key default/dev/dev-raycluster-prjd5", "cachedServeConfig": ...}

@shrekris-anyscale
Contributor Author

Hi @pandemosth, thanks for following up with the logs. It looks like this is still an issue, so we'll take a closer look.

@Martin4R

If this prevents the cluster from starting up, the extremely dirty temporary workaround is to compile the Kuberay operator from source with the following change to rayservice_controller.go:

// Only let shouldUpdate pass during seconds whose Unix timestamp is divisible
// by 5, so Serve config submissions happen at most roughly once every 5 seconds.
currentTime := time.Now().UTC().Unix()
paused := true
if currentTime%5 == 0 {
	paused = false
}
shouldUpdate = !paused

This makes the gaps between serve deployment requests 5 seconds long, which in my case is enough for the actors to report their status correctly.

@shrekris-anyscale
Contributor Author

Thanks for posting the workaround @Martin4R! cc'ing @spillai to the issue.

@uthark

uthark commented Jan 25, 2023

I can confirm the issue. I can't deploy the example app because of this.

bewestphal added a commit to bewestphal/kuberay that referenced this issue Mar 9, 2023
By default, the kuberay operator expects deployments to respond within 2 seconds.

Hack found from ray issue:
ray-project/ray#28652
bewestphal added a commit to bewestphal/kuberay that referenced this issue Mar 9, 2023
@bewestphal
Contributor

Thanks for sharing, @Martin4R. I've ported this workaround to my fork and it has mitigated the issue.
https://github.com/bewestphal/kuberay/tree/bewestphal/service-controller-update-speed-ten

This feels like it needs a proper wait mechanism, though. I'm seeing this issue on v0.4.0.

Anecdotally, I want to say this happens when my app is larger, in terms of overall app bundle size.

@pascalwhoop

We're trying to host LLMs from alpa.ai via Ray on a k8s cluster and are also facing this issue. One workaround for us is to deploy the cluster and then have a separate pod connect to it and perform the deployment, but that is of course not the intended behavior.

bewestphal added a commit to bewestphal/kuberay that referenced this issue Mar 21, 2023
By default, the kuberay operator expects deployments to respond within 2 seconds.

Hack found from ray issue:
ray-project/ray#28652
bewestphal added a commit to bewestphal/kuberay that referenced this issue Mar 21, 2023
sihanwang41 added the P1 (Issue that should be fixed within a few weeks) label Mar 23, 2023
@sihanwang41
Contributor

Hi @pascalwhoop @bewestphal @uthark @Martin4R, can you give the nightly wheel a try? We recently landed a fix in kuberay for the long runtime_env preparation time issue. Hopefully it fixes your issues; please let me know if the issue still exists.

@Martin4R

The issue is still there with Kuberay 0.5.0 and Ray 2.4.0. The serve deployment request is still applied at least every 2 seconds. The actors never manage to start up, and at some point the whole cluster gets restarted.

Logs from Serve Controller:

INFO 2023-04-27 06:04:01,931 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:01,933 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.
INFO 2023-04-27 06:04:04,169 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:04,171 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.
INFO 2023-04-27 06:04:06,223 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:06,225 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.
INFO 2023-04-27 06:04:07,271 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:07,272 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.
INFO 2023-04-27 06:04:08,384 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:08,386 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.
INFO 2023-04-27 06:04:10,527 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:10,528 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.
INFO 2023-04-27 06:04:12,259 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:12,260 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.
INFO 2023-04-27 06:04:12,728 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:12,729 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.
INFO 2023-04-27 06:04:14,831 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:14,832 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.
INFO 2023-04-27 06:04:16,981 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:16,982 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.
INFO 2023-04-27 06:04:17,372 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:17,373 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.
INFO 2023-04-27 06:04:19,328 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:19,330 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.
INFO 2023-04-27 06:04:21,444 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:21,445 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.
INFO 2023-04-27 06:04:22,684 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:22,686 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.
INFO 2023-04-27 06:04:23,672 controller 272 controller.py:582 - Starting deploy_serve_application task for application .
INFO 2023-04-27 06:04:23,673 controller 272 application_state.py:345 - Received new config deployment for  request. Cancelling previous request.

I then tried changing rayservice_controller.go to use "ServiceDefaultRequeueDuration = 6 * time.Second", but this did not help. I had to reapply the "currentTime" workaround from above to make it work.
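
For reference, the change I tried was essentially the following (declaration style and surrounding constants may differ between KubeRay versions):

// This value appears to control how soon the RayService reconciler requeues a
// service that is "healthy but not ready"; the operator logs above show the
// stock behavior checking again every 2s. Raising it alone did not help here.
var ServiceDefaultRequeueDuration = 6 * time.Second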

@sihanwang41
Contributor

The issue is still there with Kuberay 0.5.0 and Ray 2.4.0. [...] I had to reapply the "currentTime" workaround from above to make it work.

Hi @Martin4R, thank you for trying it out. Unfortunately my change is not in 0.5.0; can you verify with the nightly wheel?

@Martin4R

Hi @sihanwang41,
I tried it now with the nightly Docker image of the Kuberay operator and it works fine.
Thanks! :-)

@Martin4R

Martin4R commented May 9, 2023

Hi @sihanwang41,
yesterday and today we saw strange behavior on two of our Kuberay clusters, which may be related to the fix in the Kuberay operator you mentioned above. We were running the exact same nightly Kuberay operator version that I used in my comment above.

The Kuberay clusters started up and the dashboard was available, but they did not start any actors, not even the ServeController actor or the HTTPProxy actors that usually appear. It seems to me that the freshly started clusters never received the serve deployment request from the operator.
One cluster did not recover even after redeployments, but then suddenly worked again, and nobody knows why.
The other cluster showed the same "actorless" startup, then got automatically restarted after the serviceUnhealthySecondThreshold/deploymentUnhealthySecondThreshold (which we set to 600 seconds) and got back to normal after this restart, without us changing anything in its deployment or config.

We never saw startup behavior like this in the last 7 months of using Kuberay, so it must be something that came in recently.

I saw that the bugfix in the Kuberay operator was to remove the logic that repeatedly reapplies the serve deployment config. My guess would be that there is some kind of race condition when a cluster freshly starts up. In the best case, the cluster is already available and receives the first serve deployment request from the operator. In the worst case, the cluster is not available yet, misses the first serve deployment request, and then stays in this state without any actors because the request is never resent.

Can you recheck if the bugfix can lead to a behavior like this?

@sihanwang41
Contributor

Hi @Martin4R,
If the request is not received, the operator should get an error and trigger another deploy request. We check the dashboard health before sending requests.
Are you able to collect the dashboard log and Serve controller log? That would help me understand the scenario better.

@sihanwang41
Contributor

Hi @Martin4R, it looks like that is a different issue; please create a separate ticket for it. Closing this ticket now.
Thanks!
