From 9a760fe51f38e594c95d9b348e85aae2e2e4ccb8 Mon Sep 17 00:00:00 2001
From: "chenyu.jiang"
Date: Mon, 3 Apr 2023 10:39:58 -0700
Subject: [PATCH] update docs

Signed-off-by: chenyu.jiang
---
 docs/best-practice/worker-head-reconnection.md | 12 ++++++++----
 docs/guidance/gcs-ft.md                        |  4 ++--
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/docs/best-practice/worker-head-reconnection.md b/docs/best-practice/worker-head-reconnection.md
index 40ed82716c3..f77a42e43e8 100644
--- a/docs/best-practice/worker-head-reconnection.md
+++ b/docs/best-practice/worker-head-reconnection.md
@@ -6,6 +6,9 @@ For a `RayCluster` with a head and several workers, if a worker is crashed, it w
 
 ## Explanation
 
+> **Note**
+> This issue only occurred in KubeRay versions older than 0.3.0. We recommend upgrading to the latest version.
+
 When the head pod was deleted, it will be recreated with a new IP by KubeRay controller,and the GCS server address is changed accordingly. The Raylets of all workers will try to get GCS address from Redis in `ReconnectGcsServer`, but the redis_clients always use the previous head IP, so they will always fail to get new GCS address. The Raylets will not exit until max retries are reached. There are two configurations determining this long delay:
 
 ```
@@ -18,13 +21,14 @@ RAY_CONFIG(int32_t, ping_gcs_rpc_server_max_retries, 600)
 https://github.com/ray-project/ray/blob/98be9fb5e08befbd6cac3ffbcaa477c5117b0eef/src/ray/gcs/gcs_client/gcs_client.cc#L294-L295
 ```
 
-It retries 600 times and each interval is 1s, resulting in total 600s timeout, i.e. 10 min. So immediately after 10-min wait for retries, each client exits and gets restarted while connecting to the new head IP. This issue exists in all stable ray versions (including 1.9.1). This has been reduced to 60s in recent commit in master.
+It retries 600 times with a 1s interval, resulting in a total timeout of 600s, i.e. 10 minutes. Only after this 10-minute wait does each client exit and get restarted to connect to the new head IP. This issue exists in stable Ray versions up to 1.9.1; the timeout has been reduced to 60s by a commit included in KubeRay 0.3.0.
 
-## Best Practice
+## Solution
 
+We recommend using the latest KubeRay version. As of 0.5.0, the GCS Fault Tolerance (FT) feature is stable and resolves this problem. To enable GCS FT, please refer to [Ray GCS Fault Tolerance](https://github.com/ray-project/kuberay/blob/master/docs/guidance/gcs-ft.md).
 
-The GCS Fault-Tolerance (FT) feature is alpha release. To enable GCS FT, please refer to [Ray GCS Fault Tolerance](https://github.com/ray-project/kuberay/blob/master/docs/guidance/gcs-ft.md)
+## Best Practice
 
-To reduce the chances of a lost worker-head connection, there are two other options:
+For older versions (KubeRay <= 0.4.0, Ray <= 2.1.0), to reduce the chances of a lost worker-head connection, there are two other options:
 
 - Make head more stable: when creating the cluster, allocate sufficient amount of resources on head pod such that it tends to be stable and not easy to crash. You can also set {"num-cpus": "0"} in "rayStartParams" of "headGroupSpec" such that Ray scheduler will skip the head node when scheduling workloads. This also helps to maintain the stability of the head.
 
diff --git a/docs/guidance/gcs-ft.md b/docs/guidance/gcs-ft.md
index ade2fc1bb2d..0166e990fc2 100644
--- a/docs/guidance/gcs-ft.md
+++ b/docs/guidance/gcs-ft.md
@@ -1,6 +1,6 @@
-## Ray GCS Fault Tolerance (GCS FT) (Alpha Release)
+## Ray GCS Fault Tolerance (GCS FT) (Beta Release)
 
-> Note: This feature is alpha.
+> **Note** This feature is beta.
 
 Ray GCS FT enables GCS server to use external storage backend. As a result, Ray clusters can tolerant GCS failures and recover from failures without affecting important services such as detached Actors & RayServe deployments.
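
---

For reviewers skimming this patch: the `{"num-cpus": "0"}` head-stability tip in `worker-head-reconnection.md` might look like the following in a `RayCluster` manifest. This is a minimal sketch, not part of the patch: only `rayStartParams` under `headGroupSpec` comes from the doc text; the `ray.io/ft-enabled` annotation reflects my understanding of how GCS FT is enabled (see `gcs-ft.md`), and the image and resource values are illustrative.

```yaml
# Minimal sketch of the head-group settings discussed in this patch.
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: raycluster-example
  annotations:
    ray.io/ft-enabled: "true"   # assumed annotation for enabling GCS FT; see gcs-ft.md
spec:
  headGroupSpec:
    rayStartParams:
      num-cpus: "0"             # Ray scheduler skips the head when placing workloads
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.1.0   # illustrative version
            resources:                    # generous resources keep the head stable
              requests:
                cpu: "2"
                memory: 4Gi
              limits:
                cpu: "2"
                memory: 4Gi
```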