[Bug] Worker pods crash unexpectedly when gcs_server on head pod is killed #634

Closed
1 of 2 tasks
kevin85421 opened this issue Oct 13, 2022 · 3 comments · Fixed by #1036
Labels
bug (Something isn't working) · P1 (Issue that should be fixed within a few weeks)

Comments

@kevin85421
Member

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

Others

What happened + What you expected to happen

TODO

Reproduction script

TODO
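The author left the script as TODO. Based on the issue title, a reproduction would presumably look something like the following hypothetical sketch; the namespace, the ray.io/node-type label selectors, and the use of pkill inside the head container are all assumptions, not the author's actual script:

```python
# Hypothetical reproduction sketch inferred from the issue title; the
# namespace, label selectors, and availability of pkill in the head
# container image are all assumptions.
import subprocess
import time

NAMESPACE = "default"  # assumption

def get_pods(label_selector):
    """Return the names of pods matching the given label selector."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", NAMESPACE, "-l", label_selector,
         "-o", "jsonpath={.items[*].metadata.name}"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.split()

# Kill the gcs_server process on the head pod.
head_pod = get_pods("ray.io/node-type=head")[0]
subprocess.run(
    ["kubectl", "exec", "-n", NAMESPACE, head_pod, "--",
     "pkill", "gcs_server"],
    check=True,
)

# Expected: worker pods stay up while the GCS recovers.
# Observed (this bug): worker pods crash and are restarted.
time.sleep(60)
subprocess.run(
    ["kubectl", "get", "pods", "-n", NAMESPACE,
     "-l", "ray.io/node-type=worker"],
    check=True,
)
```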

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@kevin85421 kevin85421 added the bug Something isn't working label Oct 13, 2022
DmitriGekhtman pushed a commit that referenced this issue Dec 1, 2022
…configuration framework (#759)

Refactors for integration tests --

Test operator chart: this PR uses the kuberay-operator chart to install the KubeRay operator, so the chart itself is exercised by the tests.

Refactor: class CONST and class KubernetesClusterManager should be singleton classes. However, the singleton design pattern is often discouraged, so we need to consider it thoroughly before converting these two classes into singletons.
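For reference, a minimal sketch of one common way to express a singleton in Python; the class body here is hypothetical and only illustrates the pattern under discussion, not KubeRay's actual KubernetesClusterManager:

```python
# Minimal singleton sketch for illustration; this is not KubeRay's
# actual KubernetesClusterManager implementation.
class KubernetesClusterManager:
    _instance = None

    def __new__(cls, *args, **kwargs):
        # Create the shared instance on first use, then reuse it.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

assert KubernetesClusterManager() is KubernetesClusterManager()
```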

Refactor: Replace os.system with subprocess. The following paragraph is from Python's official documentation for os.system:

The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this function. See the Replacing Older Functions with the subprocess Module section in the subprocess documentation for some helpful recipes.
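For illustration, a minimal sketch of the substitution, assuming a kubectl invocation as the command being wrapped:

```python
import subprocess

# Before: os.system("kubectl get pods") returns only an exit status and
# writes output straight to the terminal.

# After: subprocess.run can capture output and raise CalledProcessError
# on a non-zero exit code.
result = subprocess.run(
    ["kubectl", "get", "pods"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```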

Skip test_kill_head due to:

[Bug] Head pod is deleted rather than restarted when gcs_server on head pod is killed. #638
[Bug] Worker pods crash unexpectedly when gcs_server on head pod is killed. #634
Refactor: Replace all existing k8s api clients with K8S_CLUSTER_MANAGER.

Refactor test_ray_serve_work and reduce its flakiness

working_dir is out of date (see this comment for more details), but the tests sometimes pass anyway because of a flaw in the original test logic: it checks only the exit code of the curl command rather than its STDOUT (see the sketch after this list). Solution: update working_dir in ray-service.yaml.template.
Even when the Pods are READY and RUNNING, the RayService still needs tens of seconds before it can serve requests. The time.sleep(60) call is a workaround and should be removed once [RayService] Track whether Serve app is ready before switching clusters #730 is merged.
Remove the NodePort service in RayServiceTestCase and use a curl Pod to communicate with Ray directly via the ClusterIP service. The original approach, a Docker container with network_mode='host' plus a NodePort service, was convoluted.
Refactor: remove the unused RayService templates ray-service-cluster-update.yaml.template and ray-service-serve-update.yaml.template. Because the original buggy test logic checked only the exit code rather than the STDOUT of the curl commands, the separate templates served no purpose in RayServiceTestCase.
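A minimal sketch of the stricter STDOUT check described above; the service URL and expected response are hypothetical placeholders:

```python
import subprocess

# Weak check: check=True only verifies curl's exit code, which can be 0
# even when the Serve app returns the wrong payload.
# Stricter check: also assert on STDOUT. The URL and expected response
# below are hypothetical placeholders.
result = subprocess.run(
    ["curl", "-s", "http://rayservice-sample-serve-svc:8000/"],
    capture_output=True, text=True, check=True,
)
assert result.stdout == "expected response", result.stdout
```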

Refactor: Because the APIServer is not exercised by any test case, remove everything related to the APIServer Docker image from the compatibility test.
@kevin85421 kevin85421 self-assigned this Dec 1, 2022
@kevin85421 kevin85421 added this to the v0.5.0 release milestone Dec 1, 2022
@DmitriGekhtman DmitriGekhtman added the P1 Issue that should be fixed within a few weeks label Dec 1, 2022
@DmitriGekhtman
Collaborator

cc @sihanwang41 @shrekris-anyscale
re: the potential impact on Ray Serve fault tolerance (FT)

@wilsonwang371
Collaborator

This might be because the worker health check failed, which results in the worker failure.

GCS failure -> worker raylet failure -> KubeRay detects the health problem and restarts the worker.

@DmitriGekhtman
Collaborator

Ah, then we should look into the worker health check. Probably @iycheng is the most knowledgeable there?

lowang-bh pushed a commit to lowang-bh/kuberay that referenced this issue Sep 24, 2023
…configuration framework (ray-project#759)

(Same commit message as the #759 commit quoted above.)