[Feature][Docs] Explain how to specify container command for head pod (…

…#912) Users want to execute some commands at two timings: (1) Before `ray start` (2) After `ray start`
ray-project · Feb 22, 2023 · 0564748 · 0564748
1 parent 4714892
commit 0564748
Showing 4 changed files with 264 additions and 0 deletions.
diff --git a/docs/guidance/pod-command.md b/docs/guidance/pod-command.md
@@ -0,0 +1,148 @@
+# Specify container commands for Ray head/worker Pods
+You can execute commands on the head/worker Pods at two timings:
+
+* (1) **Before `ray start`**: For example, you can set environment variables that `ray start` will use.
+
+* (2) **After `ray start` (RayCluster is ready)**: For example, you can launch a Ray Serve deployment once the RayCluster is ready.
+
+## Current KubeRay operator behavior for container commands
+* The current behavior for container commands is not finalized, and **may be updated in the future**.
+* See [code](https://github.com/ray-project/kuberay/blob/47148921c7d14813aea26a7974abda7cf22bbc52/ray-operator/controllers/ray/common/pod.go#L301-L326) for more details.
+
+## Timing 1: Before `ray start`
+For timing (1), set the container's `command` and `args` in the RayCluster specification. The operator currently places them in front of the `ray start` command.
+
+```yaml
+# ray-operator/config/samples/ray-cluster.head-command.yaml
+    rayStartParams:
+        ...
+    #pod template
+    template:
+      spec:
+        containers:
+        - name: ray-head
+          image: rayproject/ray:2.2.0
+          resources:
+            ...
+          ports:
+            ...
+          # `command` and `args` will become a part of `spec.containers.0.args` in the head Pod.
+          command: ["echo 123"]
+          args: ["456"]
+```
+* Ray head Pod
+  * `spec.containers.0.command` is hardcoded to `["/bin/bash", "-lc", "--"]`.
+  * `spec.containers.0.args` contains two parts:
+    * (Part 1) **user-specified command**: A string that concatenates `headGroupSpec.template.spec.containers.0.command` and `headGroupSpec.template.spec.containers.0.args` from the RayCluster.
+    * (Part 2) **ray start command**: The command is generated from the `rayStartParams` specified in the RayCluster. It looks like `ulimit -n 65536; ray start ...`.
+    * To summarize, `spec.containers.0.args` will be `$(user-specified command) && $(ray start command)`.
+
+* Example
+    ```sh
+    # Prerequisite: There is a KubeRay operator in the Kubernetes cluster.
+
+    # Path: kuberay/
+    kubectl apply -f ray-operator/config/samples/ray-cluster.head-command.yaml
+
+    # Check ${RAYCLUSTER_HEAD_POD}
+    kubectl get pod -l ray.io/node-type=head
+
+    # Check `spec.containers.0.command` and `spec.containers.0.args`.
+    kubectl describe pod ${RAYCLUSTER_HEAD_POD}
+
+    # Command:
+    #   /bin/bash
+    #   -lc
+    #   --
+    # Args:
+    #    echo 123  456  && ulimit -n 65536; ray start --head  --dashboard-host=0.0.0.0  --num-cpus=1  --block  --metrics-export-port=8080  --memory=2147483648
+    ```
+
+
+## Timing 2: After `ray start` (RayCluster is ready)
+There are two ways to execute commands after the RayCluster is ready. The main difference is that with Solution 1, users can check the command's logs via `kubectl logs`.
+
+### Solution 1: Container command (Recommended)
+As mentioned in the section "Timing 1: Before `ray start`", the user-specified command is executed before the `ray start` command. Hence, we can run `ray_cluster_resources.sh` in the background, where it waits until the cluster is ready, by updating `headGroupSpec.template.spec.containers.0.command` in `ray-cluster.head-command.yaml`.
+
+```yaml
+# ray-operator/config/samples/ray-cluster.head-command.yaml
+# The parentheses are required: without them, the trailing `&` would be followed by `&&`, which is a shell syntax error.
+command: ["(/home/ray/samples/ray_cluster_resources.sh&)"]
+
+# ray_cluster_resources.sh
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: ray-example
+data:
+  ray_cluster_resources.sh: |
+    #!/bin/bash
+
+    # wait for ray cluster to finish initialization
+    while true; do
+        ray health-check 2>/dev/null
+        if [ "$?" = "0" ]; then
+            break
+        else
+            echo "INFO: waiting for ray head to start"
+            sleep 1
+        fi
+    done
+
+    # Print the resources in the ray cluster after the cluster is ready.
+    python -c "import ray; ray.init(); print(ray.cluster_resources())"
+
+    echo "INFO: Print Ray cluster resources"
+```
+
+* Example
+    ```sh
+    # Path: kuberay/
+    # (1) Update `command` to ["(/home/ray/samples/ray_cluster_resources.sh&)"]
+    # (2) Comment out `postStart` and `args`.
+    kubectl apply -f ray-operator/config/samples/ray-cluster.head-command.yaml
+
+    # Check ${RAYCLUSTER_HEAD_POD}
+    kubectl get pod -l ray.io/node-type=head
+
+    # Check the logs
+    kubectl logs ${RAYCLUSTER_HEAD_POD}
+
+    # INFO: waiting for ray head to start
+    # .
+    # . => Cluster initialization
+    # .
+    # 2023-02-16 18:44:43,724 INFO worker.py:1231 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
+    # 2023-02-16 18:44:43,724 INFO worker.py:1352 -- Connecting to existing Ray cluster at address: 10.244.0.26:6379...
+    # 2023-02-16 18:44:43,735 INFO worker.py:1535 -- Connected to Ray cluster. View the dashboard at http://10.244.0.26:8265
+    # {'object_store_memory': 539679129.0, 'node:10.244.0.26': 1.0, 'CPU': 1.0, 'memory': 2147483648.0}
+    # INFO: Print Ray cluster resources
+    ```
+
+### Solution 2: postStart hook
+```yaml
+# ray-operator/config/samples/ray-cluster.head-command.yaml
+lifecycle:
+  postStart:
+    exec:
+      command: ["/bin/sh","-c","/home/ray/samples/ray_cluster_resources.sh"]
+```
+* We execute the script `ray_cluster_resources.sh` via the postStart hook. According to [this document](https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#container-hooks), there is no guarantee that the hook will execute before the container ENTRYPOINT. Hence, `ray_cluster_resources.sh` must wait for the RayCluster to finish initialization before it does its work.
+
+* Example
+    ```sh
+    # Path: kuberay/
+    kubectl apply -f ray-operator/config/samples/ray-cluster.head-command.yaml
+
+    # Check ${RAYCLUSTER_HEAD_POD}
+    kubectl get pod -l ray.io/node-type=head
+
+    # Forward the port of Dashboard
+    kubectl port-forward --address 0.0.0.0 ${RAYCLUSTER_HEAD_POD} 8265:8265
+
+    # Open the browser and check the Dashboard (${YOUR_IP}:8265/#/job).
+    # You should see a SUCCEEDED job with the following Entrypoint:
+    # 
+    # `python -c "import ray; ray.init(); print(ray.cluster_resources())"`
+    ```
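
Note on the doc above: the composition rule `$(user-specified command) && $(ray start command)` can be illustrated with a small sketch. This is not the operator's code (the operator is linked from the doc), and the function name `compose_container_args` is hypothetical. It assumes the user command and args are joined with spaces, and that with no user command only the `ray start` command is used.

```python
def compose_container_args(user_command, user_args, ray_start_command):
    """Join the user-specified command and args, then chain the ray start command.

    Hypothetical helper for illustration only; not part of KubeRay.
    """
    user_part = " ".join(list(user_command) + list(user_args)).strip()
    if not user_part:
        # Assumption: with no user command, only the ray start command remains.
        return ray_start_command
    return f"{user_part} && {ray_start_command}"


# Shortened ray start command, modeled on the `kubectl describe` output in the doc.
RAY_START = "ulimit -n 65536; ray start --head --block"

# Timing 1 example from the sample YAML.
print(compose_container_args(["echo 123"], ["456"], RAY_START))

# Solution 1 example: the parenthesized background command.
print(compose_container_args(
    ["(/home/ray/samples/ray_cluster_resources.sh&)"], [], RAY_START))
```

The second call shows why the parentheses matter: without them the string would contain `ray_cluster_resources.sh& && ulimit ...`, and bash rejects `&` immediately followed by `&&` as a syntax error.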
diff --git a/helm-chart/ray-cluster/templates/raycluster-cluster.yaml b/helm-chart/ray-cluster/templates/raycluster-cluster.yaml
@@ -52,6 +52,12 @@ spec:
             lifecycle:
             {{- toYaml .Values.head.lifecycle | nindent 14 }}
             {{- end }}
+            {{- if .Values.head.command }}
+            command: {{- toYaml .Values.head.command | nindent 14}}
+            {{- end }}
+            {{- if .Values.head.args }}
+            args: {{- toYaml .Values.head.args | nindent 14}}
+            {{- end }}
           {{- if .Values.head.sidecarContainers }}
           {{- toYaml .Values.head.sidecarContainers | nindent 10 }}
           {{- end }}
@@ -112,6 +118,12 @@ spec:
             lifecycle:
             {{- toYaml $values.lifecycle | nindent 14 }}
             {{- end }}
+            {{- if $values.command }}
+            command: {{- toYaml $values.command | nindent 14}}
+            {{- end }}
+            {{- if $values.args }}
+            args: {{- toYaml $values.args | nindent 14}}
+            {{- end }}
           {{- if $values.sidecarContainers }}
           {{- toYaml $values.sidecarContainers | nindent 10 }}
           {{- end }}
@@ -172,6 +184,12 @@ spec:
             lifecycle:
             {{- toYaml .Values.worker.lifecycle | nindent 14 }}
             {{- end }}
+            {{- if .Values.worker.command }}
+            command: {{- toYaml .Values.worker.command | nindent 14}}
+            {{- end }}
+            {{- if .Values.worker.args }}
+            args: {{- toYaml .Values.worker.args | nindent 14}}
+            {{- end }}
           {{- if .Values.worker.sidecarContainers }}
           {{- toYaml .Values.worker.sidecarContainers | nindent 10 }}
           {{- end }}

diff --git a/helm-chart/ray-cluster/values.yaml b/helm-chart/ray-cluster/values.yaml
@@ -85,6 +85,10 @@ head:
   # sidecarContainers specifies additional containers to attach to the Ray pod.
   # Follows standard K8s container spec.
   sidecarContainers: []
+  # See docs/guidance/pod-command.md for more details about how to specify
+  # container command for head Pod.
+  command: []
+  args: []
 
 
 worker:
@@ -141,6 +145,10 @@ worker:
   # sidecarContainers specifies additional containers to attach to the Ray pod.
   # Follows standard K8s container spec.
   sidecarContainers: []
+  # See docs/guidance/pod-command.md for more details about how to specify
+  # container command for worker Pod.
+  command: []
+  args: []
 
 # The map's key is used as the groupName.
 # For example, key:small-group in the map below
@@ -198,6 +206,10 @@ additionalWorkerGroups:
       - mountPath: /tmp/ray
         name: log-volume
     sidecarContainers: []
+    # See docs/guidance/pod-command.md for more details about how to specify
+    # container command for worker Pod.
+    command: []
+    args: []
 
 # Configuration for Head's Kubernetes Service
 service:

diff --git a/ray-operator/config/samples/ray-cluster.head-command.yaml b/ray-operator/config/samples/ray-cluster.head-command.yaml
@@ -0,0 +1,86 @@
+# This example config demonstrates how the head container command works.
+# See kuberay/docs/guidance/pod-command.md for more details.
+apiVersion: ray.io/v1alpha1
+kind: RayCluster
+metadata:
+  labels:
+    controller-tools.k8s.io: "1.0"
+    # A unique identifier for the head node and workers of this cluster.
+  name: raycluster-mini
+spec:
+  rayVersion: '2.2.0' # should match the Ray version in the image of the containers
+  # Ray head pod template
+  headGroupSpec:
+    serviceType: ClusterIP # optional
+    # the following params are used to complete the ray start: ray start --head --block --redis-port=6379 ...
+    rayStartParams:
+      dashboard-host: '0.0.0.0'
+      num-cpus: '1' # can be auto-completed from the limits
+      block: 'true'
+    #pod template
+    template:
+      spec:
+        containers:
+        - name: ray-head
+          image: rayproject/ray:2.2.0
+          resources:
+            limits:
+              cpu: 1
+              memory: 2Gi
+            requests:
+              cpu: 500m
+              memory: 2Gi
+          ports:
+          - containerPort: 6379
+            name: gcs-server
+          - containerPort: 8265 # Ray dashboard
+            name: dashboard
+          - containerPort: 10001
+            name: client
+          # `command` and `args` will become a part of `spec.containers.0.args` in the head Pod.
+          command: ["echo 123"]
+          args: ["456"]
+          # The script ray_cluster_resources.sh will wait until the cluster is ready and print 
+          # the resources in the ray cluster. Users can execute the script in either (1) `command`
+          # or (2) `postStart`. 
+          # 
+          # command: ["(/home/ray/samples/ray_cluster_resources.sh&)"]
+          lifecycle:
+            postStart:
+              exec:
+                command: ["/bin/sh","-c","/home/ray/samples/ray_cluster_resources.sh"]
+          volumeMounts:
+            - mountPath: /home/ray/samples
+              name: ray-example-configmap
+        volumes:
+          - name: ray-example-configmap
+            configMap:
+              name: ray-example
+              defaultMode: 0777
+              items:
+                - key: ray_cluster_resources.sh
+                  path: ray_cluster_resources.sh
+---
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: ray-example
+data:
+  ray_cluster_resources.sh: |
+    #!/bin/bash
+
+    # wait for ray cluster to finish initialization
+    while true; do
+        ray health-check 2>/dev/null
+        if [ "$?" = "0" ]; then
+            break
+        else
+            echo "INFO: waiting for ray head to start"
+            sleep 1
+        fi
+    done
+
+    # Print the resources in the ray cluster after the cluster is ready.
+    python -c "import ray; ray.init(); print(ray.cluster_resources())"
+
+    echo "INFO: Print Ray cluster resources"
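
Note on the wait loop in `ray_cluster_resources.sh`: it polls `ray health-check` forever. A bounded variant may be preferable in some setups, for instance so that a postStart hook cannot block indefinitely. The following is a minimal sketch, not part of this commit. The names `ray_is_healthy` and `wait_until_ready` and the default timeout are assumptions. The only Ray CLI call used is `ray health-check`, as in the script.

```python
import subprocess
import time


def ray_is_healthy():
    """Return True when `ray health-check` exits with status 0."""
    result = subprocess.run(
        ["ray", "health-check"],
        stdout=subprocess.DEVNULL,  # the shell script silences only stderr
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_until_ready(check=ray_is_healthy, timeout_s=300.0, interval_s=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll `check` until it succeeds or `timeout_s` elapses.

    Returns True if the check succeeded and False on timeout.
    `clock` and `sleep` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        print("INFO: waiting for ray head to start")
        sleep(interval_s)


# Example use inside the head container (requires the `ray` CLI on PATH):
#     ready = wait_until_ready(timeout_s=120)
```

Passing the health check in as a callable keeps the polling logic separate from the Ray CLI, so the timeout behavior can be exercised with a stub.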