[RayService][Observability] Add more logging for RayService troubleshooting (#1230)

Add more logging for RayService troubleshooting
kevin85421 authored Jul 12, 2023
1 parent a0e59be commit 384a921
Showing 2 changed files with 147 additions and 16 deletions.
103 changes: 102 additions & 1 deletion docs/guidance/rayservice-troubleshooting.md
@@ -25,6 +25,7 @@ You can check the status and events of the RayService CR to see if there are any
### Method 3: Check logs of Ray Pods

You can also check the Ray Serve logs directly by accessing the log files on the pods. These log files contain system level logs from the Serve controller and HTTP proxy as well as access logs and user-level logs. See [Ray Serve Logging](https://docs.ray.io/en/latest/serve/production-guide/monitoring.html#ray-logging) and [Ray Logging](https://docs.ray.io/en/latest/ray-observability/user-guides/configure-logging.html#configure-logging) for more details.

```bash
kubectl exec -it $RAY_POD -n $YOUR_NAMESPACE -- bash
# Check the logs under /tmp/ray/session_latest/logs/serve/
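# A possible next step (a sketch; exact log file names vary with the Ray version and your Serve deployments):
# ls /tmp/ray/session_latest/logs/serve/
# tail -f /tmp/ray/session_latest/logs/serve/*.log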
@@ -133,7 +134,7 @@ Put "http://${HEAD_SVC_FQDN}:52365/api/serve/applications/": dial tcp $HEAD_IP:5
```
For RayService, the KubeRay operator submits a request to the RayCluster for creating Serve applications once the head Pod is ready.
It's important to note that the Dashboard and GCS may take a few seconds to start up after the head Pod is ready.
It's important to note that the Dashboard, Dashboard Agent and GCS may take a few seconds to start up after the head Pod is ready.
As a result, the request may fail a few times initially before the necessary components are fully operational.
If you continue to encounter this issue after 1 minute, there are several possible causes:
@@ -150,3 +151,103 @@ Some common issues related to `runtime_env`:
* The `working_dir` points to a private AWS S3 bucket, but the Ray Pods do not have the necessary permissions to access the bucket.
* The NetworkPolicy blocks the traffic between the Ray Pods and the external URLs specified in `runtime_env`.
### Issue 7: Failed to get Serve application statuses.
You may encounter the following error message when KubeRay tries to get Serve application statuses:
```
Get "http://${HEAD_SVC_FQDN}:52365/api/serve/applications/": dial tcp $HEAD_IP:52365: connect: connection refused"
```
As mentioned in [Issue 5](#issue-5-fail-to-create--update-serve-applications), the KubeRay operator submits a `Put` request to the RayCluster for creating Serve applications once the head Pod is ready.
After the `Put` request is successfully submitted to the dashboard agent, the KubeRay operator sends `Get` requests to the dashboard agent port (i.e., 52365) to retrieve the Serve application statuses.
The successful `Put` submission indicates that all the necessary components, including the dashboard agent, are fully operational.
Therefore, unlike Issue 5, the failure of the `Get` request is not expected.
If you consistently encounter this issue, there are several possible causes:
* The dashboard agent process on the head Pod is not running. You can check the `dashboard_agent.log` file located at `/tmp/ray/session_latest/logs/` on the head Pod for more information. You can also reproduce this cause by manually killing the dashboard agent process on the head Pod, as shown in the experiment below.
```bash
# Step 1: Log in to the head Pod
kubectl exec -it $HEAD_POD -n $YOUR_NAMESPACE -- bash
# Step 2: Check the PID of the dashboard agent process
ps aux
# [Example output]
# ray 156 ... 0:03 /.../python -u /.../ray/dashboard/agent.py --
# Step 3: Kill the dashboard agent process
kill 156
# Step 4: Check the logs
cat /tmp/ray/session_latest/logs/dashboard_agent.log
# [Example output]
# 2023-07-10 11:24:31,962 INFO web_log.py:206 -- 10.244.0.5 [10/Jul/2023:18:24:31 +0000] "GET /api/serve/applications/ HTTP/1.1" 200 13940 "-" "Go-http-client/1.1"
# 2023-07-10 11:24:34,001 INFO web_log.py:206 -- 10.244.0.5 [10/Jul/2023:18:24:33 +0000] "GET /api/serve/applications/ HTTP/1.1" 200 13940 "-" "Go-http-client/1.1"
# 2023-07-10 11:24:36,043 INFO web_log.py:206 -- 10.244.0.5 [10/Jul/2023:18:24:36 +0000] "GET /api/serve/applications/ HTTP/1.1" 200 13940 "-" "Go-http-client/1.1"
# 2023-07-10 11:24:38,082 INFO web_log.py:206 -- 10.244.0.5 [10/Jul/2023:18:24:38 +0000] "GET /api/serve/applications/ HTTP/1.1" 200 13940 "-" "Go-http-client/1.1"
# 2023-07-10 11:24:38,590 WARNING agent.py:531 -- Exiting with SIGTERM immediately...
# Step 5: Open a new terminal and check the logs of the KubeRay operator
kubectl logs $KUBERAY_OPERATOR_POD -n $YOUR_NAMESPACE | tee operator-log
# [Example output]
# Get \"http://rayservice-sample-raycluster-rqlsl-head-svc.default.svc.cluster.local:52365/api/serve/applications/\": dial tcp 10.96.7.154:52365: connect: connection refused
```
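
If you want to rule out the dashboard agent directly, one option is to query the same endpoint that the KubeRay operator polls. The snippet below is a minimal sketch: it assumes `kubectl port-forward` is permitted in your environment and that `curl` is installed on your local machine; the port (52365) and the `/api/serve/applications/` path come from the error message above.

```bash
# Forward the dashboard agent port (52365) of the head Pod to your local machine.
kubectl port-forward $HEAD_POD -n $YOUR_NAMESPACE 52365:52365 &

# Query the Serve REST API that the KubeRay operator polls.
# A JSON response means the dashboard agent is reachable; "connection refused"
# reproduces the error reported by the operator.
curl http://localhost:52365/api/serve/applications/
```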

### Issue 8: A loop of restarting the RayCluster occurs when the Kubernetes cluster runs out of resources.

> Note: Currently, the KubeRay operator does not have a clear plan to handle situations where the Kubernetes cluster runs out of resources.
> Therefore, we recommend ensuring that the Kubernetes cluster has sufficient resources to accommodate the serve application.

If the status of a serve application remains non-`RUNNING` for more than `serviceUnhealthySecondThreshold` seconds,
the KubeRay operator considers the RayCluster unhealthy and initiates the preparation of a new RayCluster.
A common cause of this issue is that the Kubernetes cluster does not have enough resources to accommodate the serve application.
In such cases, the KubeRay operator may continue to restart the RayCluster, leading to a loop of restarts.
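
One way to observe such a restart loop is to watch the RayCluster custom resources that the KubeRay operator creates and the events it records on the RayService. This is a minimal sketch, assuming the RayService is named `rayservice-sample`:

```bash
# In a restart loop, a new RayCluster with a fresh random suffix
# (e.g., rayservice-sample-raycluster-xxxxx) appears periodically.
kubectl get rayclusters.ray.io -n $YOUR_NAMESPACE --watch

# The operator also records a "Restarting" event on the RayService object.
kubectl get events -n $YOUR_NAMESPACE \
  --field-selector involvedObject.name=rayservice-sample
```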

We can also perform an experiment to reproduce this situation:

* A Kubernetes cluster with an 8-CPU node
* [ray-service.insufficient-resources.yaml](https://gist.github.com/kevin85421/6a7779308aa45b197db8015aca0c1faf)
* RayCluster:
* The cluster has 1 head Pod with 4 physical CPUs, but `num-cpus` is set to 0 in `rayStartParams` to prevent any serve replicas from being scheduled on the head Pod.
* The cluster also has 1 worker Pod with 1 CPU by default.
* `serveConfigV2` specifies 5 serve deployments, each with 1 replica and a requirement of 1 CPU.

```bash
# Step 1: Get the number of CPUs available on the node
kubectl get nodes -o custom-columns=NODE:.metadata.name,ALLOCATABLE_CPU:.status.allocatable.cpu

# [Example output]
# NODE ALLOCATABLE_CPU
# kind-control-plane 8

# Step 2: Install a KubeRay operator.

# Step 3: Create a RayService with autoscaling enabled.
kubectl apply -f ray-service.insufficient-resources.yaml

# Step 4: The Kubernetes cluster will not have enough resources to accommodate the serve application.
kubectl describe rayservices.ray.io rayservice-sample -n $YOUR_NAMESPACE

# [Example output]
# fruit_app_DAGDriver:
# Health Last Update Time: 2023-07-11T02:10:02Z
# Last Update Time: 2023-07-11T02:10:35Z
# Message: Deployment "fruit_app_DAGDriver" has 1 replicas that have taken more than 30s to be scheduled. This may be caused by waiting for the cluster to auto-scale, or waiting for a runtime environment to install. Resources required for each replica: {"CPU": 1.0}, resources available: {}.
# Status: UPDATING

# Step 5: A new RayCluster will be created after `serviceUnhealthySecondThreshold` (300 seconds in this example).
# Check the logs of the KubeRay operator to find the reason for restarting the RayCluster.
kubectl logs $KUBERAY_OPERATOR_POD -n $YOUR_NAMESPACE | tee operator-log

# [Example output]
# 2023-07-11T02:14:58.109Z INFO controllers.RayService Restart RayCluster {"appName": "fruit_app", "restart reason": "The status of the serve application fruit_app has not been RUNNING for more than 300.000000 seconds. Hence, KubeRay operator labels the RayCluster unhealthy and will prepare a new RayCluster."}
# 2023-07-11T02:14:58.109Z INFO controllers.RayService Restart RayCluster {"deploymentName": "fruit_app_FruitMarket", "appName": "fruit_app", "restart reason": "The status of the serve deployment fruit_app_FruitMarket or the serve application fruit_app has not been HEALTHY/RUNNING for more than 300.000000 seconds. Hence, KubeRay operator labels the RayCluster unhealthy and will prepare a new RayCluster. The message of the serve deployment is: Deployment \"fruit_app_FruitMarket\" has 1 replicas that have taken more than 30s to be scheduled. This may be caused by waiting for the cluster to auto-scale, or waiting for a runtime environment to install. Resources required for each replica: {\"CPU\": 1.0}, resources available: {}."}
# .
# .
# .
# 2023-07-11T02:14:58.122Z INFO controllers.RayService Restart RayCluster {"ServiceName": "default/rayservice-sample", "AvailableWorkerReplicas": 1, "DesiredWorkerReplicas": 5, "restart reason": "The serve application is unhealthy, restarting the cluster. If the AvailableWorkerReplicas is not equal to DesiredWorkerReplicas, this may imply that the Autoscaler does not have enough resources to scale up the cluster. Hence, the serve application does not have enough resources to run. Please check https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayservice-troubleshooting.md for more details.", "RayCluster": {"apiVersion": "ray.io/v1alpha1", "kind": "RayCluster", "namespace": "default", "name": "rayservice-sample-raycluster-hvd9f"}}
```
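
The operator log above highlights the gap between `AvailableWorkerReplicas` (1) and `DesiredWorkerReplicas` (5). A quick way to confirm that this gap comes from insufficient cluster resources is to inspect the Ray worker Pods that cannot be scheduled. This is a minimal sketch; `$PENDING_WORKER_POD` is a placeholder for one of the Pending worker Pods:

```bash
# KubeRay labels worker Pods with ray.io/node-type=worker.
# Workers that the autoscaler requests but the node cannot fit stay in Pending.
kubectl get pods -n $YOUR_NAMESPACE -l ray.io/node-type=worker

# The Events section of a Pending worker typically shows a scheduling failure
# such as "0/1 nodes are available: 1 Insufficient cpu."
kubectl describe pod $PENDING_WORKER_POD -n $YOUR_NAMESPACE
```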
60 changes: 45 additions & 15 deletions ray-operator/controllers/ray/rayservice_controller.go
@@ -719,33 +719,40 @@ func (r *RayServiceReconciler) updateServeDeployment(ctx context.Context, raySer
}
}

// getAndCheckServeStatus gets Serve applications' and deployments' statuses,
// updates health timestamps, and checks if the RayCluster is overall healthy.
// It takes as one of its inputs `serveConfigType`, which is used to decide whether
// to query the single-application Serve REST API or the multi-application Serve REST API
// It's return values should be interpreted as
// (Serve app healthy?, Serve app ready?, error if any)
// `getAndCheckServeStatus` gets Serve applications' and deployments' statuses, updates health timestamps,
// and checks if the RayCluster is overall healthy. It takes as one of its inputs `serveConfigType`, which
// is used to decide whether to query the single-application Serve REST API or the multi-application Serve
// REST API. Its return values should be interpreted as:
//
// (Serve app healthy?, Serve app ready?, error if failed to get Serve statuses)
func (r *RayServiceReconciler) getAndCheckServeStatus(ctx context.Context, dashboardClient utils.RayDashboardClientInterface, rayServiceServeStatus *rayv1alpha1.RayServiceStatus, serveConfigType utils.RayServeConfigType, unhealthySecondThreshold *int32) (bool, bool, error) {
// If the unhealthySecondThreshold value is non-nil, then we will use that value.
// Otherwise, we will use the value ServiceUnhealthySecondThreshold which can be set in a test
// This is used for testing purposes.
// If the `unhealthySecondThreshold` value is non-nil, then we will use that value. Otherwise, we will use the value ServiceUnhealthySecondThreshold
// which can be set in a test. This is used for testing purposes.
serviceUnhealthySecondThreshold := ServiceUnhealthySecondThreshold
if unhealthySecondThreshold != nil {
serviceUnhealthySecondThreshold = float64(*unhealthySecondThreshold)
}

// TODO (kevin85421): Separate the logic for retrieving Serve application statuses and checking Serve application statuses into two separate functions.
// Currently, the handling logic for `isHealthy` and `isReady` between these two behaviors is inconsistent. This can cause potential issues in the future.
var serveAppStatuses map[string]*utils.ServeApplicationStatus
var err error
if serveConfigType == utils.SINGLE_APP {
var singleApplicationStatus *utils.ServeApplicationStatus
if singleApplicationStatus, err = dashboardClient.GetSingleApplicationStatus(ctx); err != nil {
r.Log.Error(err, "Failed to get Serve deployment statuses from dashboard!")
err = fmt.Errorf(
"Failed to get Serve deployment statuses from the head's dashboard agent port (the head service's port with the name `dashboard-agent`). "+
"If you observe this error consistently, please check https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayservice-troubleshooting.md for more details. "+
"err: %v", err)
return false, false, err
}
serveAppStatuses = map[string]*utils.ServeApplicationStatus{common.DefaultServeAppName: singleApplicationStatus}
} else if serveConfigType == utils.MULTI_APP {
if serveAppStatuses, err = dashboardClient.GetMultiApplicationStatus(ctx); err != nil {
r.Log.Error(err, "Failed to get Serve deployment statuses from dashboard!")
err = fmt.Errorf(
"Failed to get Serve application statuses from the dashboard agent (the head service's port with the name `dashboard-agent`). "+
"If you observe this error consistently, please check https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayservice-troubleshooting.md for more details. "+
"err: %v", err)
return false, false, err
}
} else {
@@ -780,8 +787,12 @@ func (r *RayServiceReconciler) getAndCheckServeStatus(ctx context.Context, dashb
if prevApplicationStatus.Status != rayv1alpha1.ApplicationStatusEnum.RUNNING {
if prevApplicationStatus.HealthLastUpdateTime != nil {
applicationStatus.HealthLastUpdateTime = prevApplicationStatus.HealthLastUpdateTime

if time.Since(prevApplicationStatus.HealthLastUpdateTime.Time).Seconds() > serviceUnhealthySecondThreshold {
r.Log.Info("Restart RayCluster", "appName", appName, "restart reason",
fmt.Sprintf(
"The status of the serve application %s has not been RUNNING for more than %f seconds. "+
"Hence, KubeRay operator labels the RayCluster unhealthy and will prepare a new RayCluster. ",
appName, serviceUnhealthySecondThreshold))
isHealthy = false
}
}
@@ -804,7 +815,15 @@ func (r *RayServiceReconciler) getAndCheckServeStatus(ctx context.Context, dashb
if prevStatus.Status != rayv1alpha1.DeploymentStatusEnum.HEALTHY {
deploymentStatus.HealthLastUpdateTime = prevStatus.HealthLastUpdateTime

if prevStatus.HealthLastUpdateTime != nil && time.Since(prevStatus.HealthLastUpdateTime.Time).Seconds() > serviceUnhealthySecondThreshold {
if !isHealthy || (prevStatus.HealthLastUpdateTime != nil && time.Since(prevStatus.HealthLastUpdateTime.Time).Seconds() > serviceUnhealthySecondThreshold) {
// TODO (kevin85421): Without `!isHealthy`, this `if` statement is almost impossible to be reached because the `HealthLastUpdateTime` of a serve deployment
// is always later than the `HealthLastUpdateTime` of the serve application. Hence, the restart is always triggered by the serve application. If we
// can confirm that `isHealthy = false` is always set by the serve application check, we can remove the `time.Since` check here.
r.Log.Info("Restart RayCluster", "deploymentName", deploymentName, "appName", appName, "restart reason",
fmt.Sprintf(
"The status of the serve deployment %s or the serve application %s has not been HEALTHY/RUNNING for more than %f seconds. "+
"Hence, KubeRay operator labels the RayCluster unhealthy and will prepare a new RayCluster. "+
"The message of the serve deployment is: %s", deploymentName, appName, serviceUnhealthySecondThreshold, deploymentStatus.Message))
isHealthy = false
}
}
@@ -1094,8 +1113,19 @@ func (r *RayServiceReconciler) reconcileServe(ctx context.Context, rayServiceIns
return ctrl.Result{RequeueAfter: ServiceDefaultRequeueDuration}, false, false, err
}

logger.Info("Mark cluster as unhealthy", "rayCluster", rayClusterInstance)
r.Recorder.Eventf(rayServiceInstance, "Normal", "Restarting", "The cluster will restart after %s", ServiceRestartRequeueDuration)
availableWorkerReplicas := rayClusterInstance.Status.AvailableWorkerReplicas
desiredWorkerReplicas := rayClusterInstance.Status.DesiredWorkerReplicas
logger.Info(
"Restart RayCluster", "AvailableWorkerReplicas", availableWorkerReplicas, "DesiredWorkerReplicas", desiredWorkerReplicas,
"restart reason",
"The serve application is unhealthy, restarting the cluster. If the AvailableWorkerReplicas is not equal to DesiredWorkerReplicas, "+
"this may imply that the Autoscaler does not have enough resources to scale up the cluster. Hence, the serve application does not "+
"have enough resources to run. Please check https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayservice-troubleshooting.md for more details.",
"RayCluster", rayClusterInstance)
r.Recorder.Eventf(
rayServiceInstance, "Normal", "Restarting",
"Please check https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayservice-troubleshooting.md for more details. The cluster will restart after %s", ServiceRestartRequeueDuration,
"AvailableWorkerReplicas", availableWorkerReplicas, "DesiredWorkerReplicas", desiredWorkerReplicas)
// Wait a while for the cluster delete
return ctrl.Result{RequeueAfter: ServiceRestartRequeueDuration}, false, false, nil
}
