From 9a760fe51f38e594c95d9b348e85aae2e2e4ccb8 Mon Sep 17 00:00:00 2001
From: "chenyu.jiang" <chenyu.jiang@bytedance.com>
Date: Mon, 3 Apr 2023 10:39:58 -0700
Subject: [PATCH 1/4] update docs

Signed-off-by: chenyu.jiang <chenyu.jiang@bytedance.com>
---
 docs/best-practice/worker-head-reconnection.md | 12 ++++++++----
 docs/guidance/gcs-ft.md                        |  4 ++--
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/docs/best-practice/worker-head-reconnection.md b/docs/best-practice/worker-head-reconnection.md
index 40ed82716c3..f77a42e43e8 100644
--- a/docs/best-practice/worker-head-reconnection.md
+++ b/docs/best-practice/worker-head-reconnection.md
@@ -6,6 +6,9 @@ For a `RayCluster` with a head and several workers, if a worker is crashed, it w
 
 ## Explanation
 
+> **Note**
+This issue only occurs in KubeRay versions earlier than 0.3.0. We recommend trying the latest version.
+
 When the head pod was deleted, it will be recreated with a new IP by KubeRay controller，and the GCS server address is changed accordingly. The Raylets of all workers will try to get GCS address from Redis in `ReconnectGcsServer`, but the redis_clients always use the previous head IP, so they will always fail to get new GCS address. The Raylets will not exit until max retries are reached. There are two configurations determining this long delay:
 
 ```
@@ -18,13 +21,14 @@ RAY_CONFIG(int32_t, ping_gcs_rpc_server_max_retries, 600)
 https://github.com/ray-project/ray/blob/98be9fb5e08befbd6cac3ffbcaa477c5117b0eef/src/ray/gcs/gcs_client/gcs_client.cc#L294-L295
 ```
 
-It retries 600 times and each interval is 1s, resulting in total 600s timeout, i.e. 10 min. So immediately after 10-min wait for retries, each client exits and gets restarted while connecting to the new head IP. This issue exists in all stable ray versions (including 1.9.1). This has been reduced to 60s in recent commit in master. 
+It retries 600 times and each interval is 1s, resulting in total 600s timeout, i.e. 10 min. So immediately after 10-min wait for retries, each client exits and gets restarted while connecting to the new head IP. This issue exists in stable ray versions under 1.9.1. This has been reduced to 60s in recent commit under Kuberay 0.3.0.
 
-## Best Practice
+## Solution
+We recommand to use the latest Kuberay version. After 0.5.0, the GCS Fault-Tolerance (FT) feature is now stable and it will resolve the problem. To enable GCS FT, please refer to [Ray GCS Fault Tolerance](https://github.com/ray-project/kuberay/blob/master/docs/guidance/gcs-ft.md).
 
-The GCS Fault-Tolerance (FT) feature is alpha release. To enable GCS FT, please refer to [Ray GCS Fault Tolerance](https://github.com/ray-project/kuberay/blob/master/docs/guidance/gcs-ft.md)
+## Best Practice
 
-To reduce the chances of a lost worker-head connection, there are two other options:
+For older versions (KubeRay <= 0.4.0, Ray <= 2.1.0), there are two other options to reduce the chances of a lost worker-head connection:
 
 - Make head more stable: when creating the cluster, allocate sufficient amount of resources on head pod such that it tends to be stable and not easy to crash. You can also set {"num-cpus": "0"} in "rayStartParams" of "headGroupSpec" such that Ray scheduler will skip the head node when scheduling workloads. This also helps to maintain the stability of the head. 
 
diff --git a/docs/guidance/gcs-ft.md b/docs/guidance/gcs-ft.md
index ade2fc1bb2d..0166e990fc2 100644
--- a/docs/guidance/gcs-ft.md
+++ b/docs/guidance/gcs-ft.md
@@ -1,6 +1,6 @@
-## Ray GCS Fault Tolerance (GCS FT) (Alpha Release)
+## Ray GCS Fault Tolerance (GCS FT) （Beta release）
 
-> Note: This feature is alpha.
+> **Note** This feature is beta.
 
 Ray GCS FT enables GCS server to use external storage backend. As a result, Ray clusters can tolerant GCS failures and recover from failures
 without affecting important services such as detached Actors & RayServe deployments.

From f88232b578b6f05622e04c1d71b50cb055be7380 Mon Sep 17 00:00:00 2001
From: Chenyu Jiang <38214590+scarlet25151@users.noreply.github.com>
Date: Wed, 5 Apr 2023 13:47:38 -0700
Subject: [PATCH 2/4] Update docs/guidance/gcs-ft.md

Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
Signed-off-by: Chenyu Jiang <38214590+scarlet25151@users.noreply.github.com>
---
 docs/guidance/gcs-ft.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/guidance/gcs-ft.md b/docs/guidance/gcs-ft.md
index 0166e990fc2..d9c73378e4b 100644
--- a/docs/guidance/gcs-ft.md
+++ b/docs/guidance/gcs-ft.md
@@ -1,6 +1,6 @@
 ## Ray GCS Fault Tolerance (GCS FT) （Beta release）
 
-> **Note** This feature is beta.
+> **Note**: This feature is beta.
 
 Ray GCS FT enables GCS server to use external storage backend. As a result, Ray clusters can tolerant GCS failures and recover from failures
 without affecting important services such as detached Actors & RayServe deployments.

From 6f55a840a9b32ba0b6eccf74ab9c1568a6a45635 Mon Sep 17 00:00:00 2001
From: Chenyu Jiang <38214590+scarlet25151@users.noreply.github.com>
Date: Wed, 5 Apr 2023 13:47:58 -0700
Subject: [PATCH 3/4] Update docs/best-practice/worker-head-reconnection.md

Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
Signed-off-by: Chenyu Jiang <38214590+scarlet25151@users.noreply.github.com>
---
 docs/best-practice/worker-head-reconnection.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/best-practice/worker-head-reconnection.md b/docs/best-practice/worker-head-reconnection.md
index f77a42e43e8..56b93522afc 100644
--- a/docs/best-practice/worker-head-reconnection.md
+++ b/docs/best-practice/worker-head-reconnection.md
@@ -24,7 +24,7 @@ https://github.com/ray-project/ray/blob/98be9fb5e08befbd6cac3ffbcaa477c5117b0eef
 It retries 600 times and each interval is 1s, resulting in total 600s timeout, i.e. 10 min. So immediately after 10-min wait for retries, each client exits and gets restarted while connecting to the new head IP. This issue exists in stable ray versions under 1.9.1. This has been reduced to 60s in recent commit under Kuberay 0.3.0.
 
 ## Solution
-We recommand to use the latest Kuberay version. After 0.5.0, the GCS Fault-Tolerance (FT) feature is now stable and it will resolve the problem. To enable GCS FT, please refer to [Ray GCS Fault Tolerance](https://github.com/ray-project/kuberay/blob/master/docs/guidance/gcs-ft.md).
+We recommend using the latest version of KubeRay. After version 0.5.0, the GCS Fault-Tolerance feature is now in beta and can help resolve this reconnection issue."
 
 ## Best Practice
 

From 6af2f665c9b75ee33686e5ebe65ef7c4e5d1e6f5 Mon Sep 17 00:00:00 2001
From: Kai-Hsun Chen <kaihsun@apache.org>
Date: Wed, 5 Apr 2023 13:53:08 -0700
Subject: [PATCH 4/4] Update docs/best-practice/worker-head-reconnection.md

Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
---
 docs/best-practice/worker-head-reconnection.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/best-practice/worker-head-reconnection.md b/docs/best-practice/worker-head-reconnection.md
index 56b93522afc..9a67c24b6e3 100644
--- a/docs/best-practice/worker-head-reconnection.md
+++ b/docs/best-practice/worker-head-reconnection.md
@@ -24,7 +24,7 @@ https://github.com/ray-project/ray/blob/98be9fb5e08befbd6cac3ffbcaa477c5117b0eef
 It retries 600 times and each interval is 1s, resulting in total 600s timeout, i.e. 10 min. So immediately after 10-min wait for retries, each client exits and gets restarted while connecting to the new head IP. This issue exists in stable ray versions under 1.9.1. This has been reduced to 60s in recent commit under Kuberay 0.3.0.
 
 ## Solution
-We recommend using the latest version of KubeRay. After version 0.5.0, the GCS Fault-Tolerance feature is now in beta and can help resolve this reconnection issue."
+We recommend using the latest version of KubeRay. As of version 0.5.0, the GCS Fault-Tolerance feature is in beta and can help resolve this reconnection issue.
 
 ## Best Practice