[release] Redirect users to Ray website (ray-project#1431)
kevin85421 committed Oct 17, 2023
1 parent 9794249 commit 11bfdfa
Showing 20 changed files with 20 additions and 2,903 deletions.
61 changes: 1 addition & 60 deletions docs/guidance/FAQ.md
@@ -1,60 +1 @@
# Frequently Asked Questions

Welcome to the Frequently Asked Questions page for KubeRay. This document addresses common inquiries.
If you don't find an answer to your question here, please don't hesitate to connect with us via our [community channels](https://github.com/ray-project/kuberay#getting-involved).

# Contents
- [Worker init container](#worker-init-container)
- [Cluster domain](#cluster-domain)
- [RayService](#rayservice)

## Worker init container

The KubeRay operator will inject a default [init container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/) into every worker Pod.
This init container is responsible for waiting until the Global Control Service (GCS) on the head Pod is ready before establishing a connection to the head.
The init container will use `ray health-check` to check the GCS server status continuously.

The default worker init container may not work for all use cases, or users may want to customize the init container.

### 1. Init container troubleshooting

Some common causes for the worker init container to get stuck in `Init:0/1` status are:

* The GCS server process has failed in the head Pod. Please inspect the log directory `/tmp/ray/session_latest/logs/` in the head Pod for errors related to the GCS server.
* The `ray` executable is not included in the `$PATH` for the image, so the init container will fail to run `ray health-check`.
* The `CLUSTER_DOMAIN` environment variable is not set correctly. See the section [cluster domain](#cluster-domain) for more details.
* The worker init container shares the same ***ImagePullPolicy***, ***SecurityContext***, ***Env***, ***VolumeMounts***, and ***Resources*** as the worker Pod template. Sharing these settings can cause a deadlock. See [#1130](https://github.com/ray-project/kuberay/issues/1130) for more details.

If the init container remains stuck in `Init:0/1` status for 2 minutes, we will stop redirecting the output messages to `/dev/null` and instead print them to the worker Pod logs.
To troubleshoot further, you can inspect the logs using `kubectl logs`.
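
For example, a rough sketch of inspecting a stuck worker (the `ray.io/node-type=worker` label follows KubeRay's labeling convention; the init container name `wait-gcs-ready` is what recent KubeRay versions inject, so verify it with `kubectl describe pod` if your version differs):

```sh
# Find a worker Pod that is stuck in Init:0/1.
kubectl get pods -l ray.io/node-type=worker

# Inspect the init container's logs.
kubectl logs ${WORKER_POD} -c wait-gcs-ready
```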

### 2. Disable the init container injection

If you want to customize the worker init container, you can disable the init container injection and add your own.
To disable the injection, set the `ENABLE_INIT_CONTAINER_INJECTION` environment variable in the KubeRay operator to `false` (applicable from KubeRay v0.5.2).
Please refer to [#1069](https://github.com/ray-project/kuberay/pull/1069) and the [KubeRay Helm chart](https://github.com/ray-project/kuberay/blob/ddb5e528c29c2e1fb80994f05b1bd162ecbaf9f2/helm-chart/kuberay-operator/values.yaml#L83-L87) for instructions on how to set the environment variable.
Once disabled, you can add your custom init container to the worker Pod template.
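
As a rough sketch (the group name, image, head service FQDN, and GCS port are placeholders rather than KubeRay's exact defaults), a custom worker init container might look like this in the RayCluster spec:

```yaml
workerGroupSpecs:
- groupName: small-group
  replicas: 1
  rayStartParams: {}
  template:
    spec:
      initContainers:
      - name: wait-for-gcs          # custom replacement for the injected init container
        image: rayproject/ray:2.6.3 # must have the `ray` executable on $PATH
        command: ["/bin/bash", "-lc", "--"]
        args:
        # Poll the GCS on the head service until it responds, then let the worker start.
        - "until ray health-check --address raycluster-kuberay-head-svc.default.svc.cluster.local:6379 > /dev/null 2>&1; do echo waiting for GCS; sleep 2; done"
      containers:
      - name: ray-worker
        image: rayproject/ray:2.6.3
```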

## Cluster domain

In KubeRay, we use Fully Qualified Domain Names (FQDNs) to establish connections between workers and the head.
The FQDN of the head service is `${HEAD_SVC}.${NAMESPACE}.svc.${CLUSTER_DOMAIN}`.
The default [cluster domain](https://kubernetes.io/docs/tasks/administer-cluster/dns-custom-nameservers/#introduction) is `cluster.local`, which works for most Kubernetes clusters.
However, it's important to note that some clusters may have a different cluster domain.
You can check the cluster domain of your Kubernetes cluster by checking `/etc/resolv.conf` in a Pod.
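
For example (a sketch; the Pod name and nameserver IP are illustrative):

```sh
kubectl exec -it ${YOUR_HEAD_POD} -- cat /etc/resolv.conf

# Example output: the last entry of the `search` line is the cluster domain.
# search default.svc.cluster.local svc.cluster.local cluster.local
# nameserver 10.96.0.10
```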

To set a custom cluster domain, adjust the `CLUSTER_DOMAIN` environment variable in the KubeRay operator.
Helm chart users can make this modification [here](https://github.com/ray-project/kuberay/blob/ddb5e528c29c2e1fb80994f05b1bd162ecbaf9f2/helm-chart/kuberay-operator/values.yaml#L88-L91).
For more information, see [#951](https://github.com/ray-project/kuberay/pull/951) and [#938](https://github.com/ray-project/kuberay/pull/938).
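
For a quick experiment (a sketch; it assumes the operator Deployment is named `kuberay-operator`, and the change does not persist across Helm upgrades), you can also set the variable directly on the running operator:

```sh
kubectl set env deployment/kuberay-operator CLUSTER_DOMAIN=example.org
```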

## RayService

RayService is a Custom Resource Definition (CRD) designed for Ray Serve. In KubeRay, creating a RayService will first create a RayCluster and then
create Ray Serve applications once the RayCluster is ready. If the issue pertains to the data plane, specifically your Ray Serve scripts
or Ray Serve configurations (`serveConfigV2`), troubleshooting may be challenging. See [rayservice-troubleshooting](rayservice-troubleshooting.md) for more details.

## Questions

### Why are my changes to RayCluster/RayJob CR not taking effect?

Currently, only modifications to the `replicas` field in `RayCluster/RayJob` CR are supported. Changes to other fields may not take effect or could lead to unexpected results.
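
For example (a sketch; the RayCluster name and worker group index are placeholders), scaling a worker group by patching only its `replicas` field is supported:

```sh
kubectl patch raycluster raycluster-kuberay --type json \
  -p '[{"op": "replace", "path": "/spec/workerGroupSpecs/0/replicas", "value": 2}]'
```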
This document has been moved to the [Ray documentation](https://docs.ray.io/en/master/cluster/kubernetes/troubleshooting/troubleshooting.html#kuberay-troubleshootin-guides).
112 changes: 1 addition & 111 deletions docs/guidance/autoscaler.md
@@ -1,111 +1 @@
## Autoscaler (beta)

Ray Autoscaler integration has been in beta since KubeRay 0.3.0 and Ray 2.0.0.
While autoscaling functionality is stable, the details of autoscaler behavior and configuration may change in future releases.

See the [official Ray documentation](https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/configuring-autoscaling.html) for even more information about Ray autoscaling on Kubernetes.

### Prerequisite

* Follow this [document](https://github.com/ray-project/kuberay/blob/master/helm-chart/kuberay-operator/README.md) to install the latest stable KubeRay operator via Helm repository.

### Deploy a cluster with autoscaling enabled

Next, to deploy a sample autoscaling Ray cluster, run
```
kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/release-0.5/ray-operator/config/samples/ray-cluster.autoscaler.yaml
```

See the above config file for details on autoscaling configuration.

!!! note

Ray container resource requests and limits in the example configuration above are too small
to be used in production. For typical use-cases, you should use large Ray pods. If possible,
each Ray pod should be sized to take up its entire K8s node. We don't recommend
allocating less than 8 gigabytes of memory for Ray containers running in production.
For an autoscaling configuration more suitable for production, see
[ray-cluster.autoscaler.large.yaml](https://raw.githubusercontent.com/ray-project/kuberay/release-0.5/ray-operator/config/samples/ray-cluster.autoscaler.large.yaml).

The output of `kubectl get pods` should indicate the presence of
a Ray head pod with two containers,
the Ray container and the autoscaler container.
You should also see a Ray worker pod with a single Ray container.


```
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
raycluster-autoscaler-head-mgwwk 2/2 Running 0 4m41s
raycluster-autoscaler-worker-small-group-fg4fv 1/1 Running 0 4m41s
```

Check the autoscaler container's logs to confirm that the autoscaler is healthy.
Here's an example of logs from a healthy autoscaler.
```
kubectl logs -f raycluster-autoscaler-head-mgwwk autoscaler
2022-03-10 07:51:22,616 INFO monitor.py:226 -- Starting autoscaler metrics server on port 44217
2022-03-10 07:51:22,621 INFO monitor.py:243 -- Monitor: Started
2022-03-10 07:51:22,824 INFO node_provider.py:143 -- Creating KuberayNodeProvider.
2022-03-10 07:51:22,825 INFO autoscaler.py:282 -- StandardAutoscaler: {'provider': {'type': 'kuberay', 'namespace': 'default', 'disable_node_updaters': True, 'disable_launch_config_check': True}, 'cluster_name': 'raycluster-autoscaler', 'head_node_type': 'head-group', 'available_node_types': {'head-group': {'min_workers': 0, 'max_workers': 0, 'node_config': {}, 'resources': {'CPU': 1}}, 'small-group': {'min_workers': 1, 'max_workers': 300, 'node_config': {}, 'resources': {'CPU': 1}}}, 'max_workers': 300, 'idle_timeout_minutes': 5, 'upscaling_speed': 1, 'file_mounts': {}, 'cluster_synced_files': [], 'file_mounts_sync_continuously': False, 'initialization_commands': [], 'setup_commands': [], 'head_setup_commands': [], 'worker_setup_commands': [], 'head_start_ray_commands': [], 'worker_start_ray_commands': [], 'auth': {}, 'head_node': {}, 'worker_nodes': {}}
2022-03-10 07:51:23,027 INFO autoscaler.py:327 --
======== Autoscaler status: 2022-03-10 07:51:23.027271 ========
Node status
---------------------------------------------------------------
Healthy:
1 head-group
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/1.0 CPU
0.00/0.931 GiB memory
0.00/0.200 GiB object_store_memory
Demands:
(no resource demands)
```

#### Notes

1. To enable autoscaling, set your RayCluster CR's `spec.enableInTreeAutoscaling` field to true.
The operator will then automatically inject a preconfigured autoscaler container into the head pod.
The operator creates the service account, role, and role binding needed by the autoscaler out of the box.
The operator will also configure an empty-dir logging volume for the Ray head pod. The volume will be mounted into the Ray and
autoscaler containers; this is necessary to support the event logging introduced in [Ray PR #13434](https://github.com/ray-project/ray/pull/13434).

```
spec:
enableInTreeAutoscaling: true
```
2. If your RayCluster CR's `spec.rayVersion` field is at least `2.0.0`, the autoscaler container will use the same image as the Ray container.
For Ray versions older than 2.0.0, the image `rayproject/ray:2.0.0` will be used to run the autoscaler.
3. Autoscaling functionality is supported only with Ray versions at least as new as 1.11.0. Autoscaler support
is beta as of Ray 2.0.0 and KubeRay 0.3.0; while autoscaling functionality is stable, the details of autoscaler behavior and configuration may change in future releases.
### Test autoscaling
Let's now try out the autoscaler. Run the following commands to scale up the cluster:
```
export HEAD_POD=$(kubectl get pods -o custom-columns=POD:metadata.name | grep raycluster-autoscaler-head)
kubectl exec $HEAD_POD -it -c ray-head -- python -c "import ray;ray.init();ray.autoscaler.sdk.request_resources(num_cpus=4)"
```
You should then see two extra Ray nodes (pods) scale up to satisfy the 4 CPU demand.
```
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
raycluster-autoscaler-head-mgwwk 2/2 Running 0 4m41s
raycluster-autoscaler-worker-small-group-4d255 1/1 Running 0 40s
raycluster-autoscaler-worker-small-group-fg4fv 1/1 Running 0 4m41s
raycluster-autoscaler-worker-small-group-qzhvg 1/1 Running 0 40s
```
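
To scale back down, you can reset the resource request to zero (a sketch reusing the same `$HEAD_POD` variable); idle workers are then removed after `idle_timeout_minutes` (5 minutes in the configuration shown in the logs above):
```
kubectl exec $HEAD_POD -it -c ray-head -- python -c "import ray;ray.init();ray.autoscaler.sdk.request_resources(num_cpus=0)"
```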
This document has been moved to the [Ray documentation](https://docs.ray.io/en/master/cluster/kubernetes/user-guides/configuring-autoscaling.html#kuberay-autoscaling).
71 changes: 1 addition & 70 deletions docs/guidance/aws-eks-gpu-cluster.md
@@ -1,70 +1 @@
# Start Amazon EKS Cluster with GPUs for KubeRay

## Step 1: Create a Kubernetes cluster on Amazon EKS

Follow the first two steps in [this AWS documentation](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-console.html#)
to: (1) create your Amazon EKS cluster and (2) configure your computer to communicate with your cluster.

## Step 2: Create node groups for the Amazon EKS cluster

Follow "Step 3: Create nodes" in [this AWS documentation](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-console.html#) to create node groups. The following section provides more detailed information.

### Create a CPU node group

Typically, avoid running GPU workloads on the Ray head. Create a CPU node group for all Pods except Ray GPU
workers, such as the KubeRay operator, Ray head, and CoreDNS Pods.

Here's a common configuration that works for most KubeRay examples in the docs:
* Instance type: [**m5.xlarge**](https://aws.amazon.com/ec2/instance-types/m5/) (4 vCPU; 16 GB RAM)
* Disk size: 256 GB
* Desired size: 1, Min size: 0, Max size: 1

### Create a GPU node group

Create a GPU node group for Ray GPU workers.

1. Here's a common configuration that works for most KubeRay examples in the docs:
* AMI type: Bottlerocket NVIDIA (BOTTLEROCKET_x86_64_NVIDIA)
* Instance type: [**g5.xlarge**](https://aws.amazon.com/ec2/instance-types/g5/) (1 GPU; 24 GB GPU Memory; 4 vCPUs; 16 GB RAM)
* Disk size: 1024 GB
* Desired size: 1, Min size: 0, Max size: 1

> **Note:** If you encounter permission issues with `kubectl`, follow "Step 2: Configure your computer to communicate with your cluster"
in the [AWS documentation](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-console.html#).

2. Install the NVIDIA device plugin. Note: You don't need this if you used the `BOTTLEROCKET_x86_64_NVIDIA` AMI in the step above.
* Install the DaemonSet for the NVIDIA device plugin to run GPU-enabled containers in your Amazon EKS cluster. You can refer to the [Amazon EKS optimized accelerated Amazon Linux AMIs](https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html#gpu-ami)
or the [NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin) repository for more details.
* If the GPU nodes have taints, add `tolerations` to `nvidia-device-plugin.yml` to enable the DaemonSet to schedule Pods on the GPU nodes.

```sh
# Install the DaemonSet
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.9.0/nvidia-device-plugin.yml

# Verify that your nodes have allocatable GPUs. If the GPU node fails to detect GPUs,
# please verify whether the DaemonSet schedules the Pod on the GPU node.
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

# Example output:
# NAME GPU
# ip-....us-west-2.compute.internal 4
# ip-....us-west-2.compute.internal <none>
```

3. Add a Kubernetes taint to prevent scheduling CPU Pods on this GPU node group. For KubeRay examples, add the following taint to the GPU nodes: `Key: ray.io/node-type, Value: worker, Effect: NoSchedule`, and include the corresponding `tolerations` for GPU Ray worker Pods.
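
For example, a sketch of tainting an existing node with `kubectl` (the node name is a placeholder; you can also set the taint at the node-group level when creating the group):

```sh
# List the GPU nodes, then taint them so only Pods with a matching toleration are scheduled there.
kubectl get nodes
kubectl taint nodes ${YOUR_GPU_NODE_NAME} ray.io/node-type=worker:NoSchedule
```

The corresponding `tolerations` block on the GPU Ray worker Pods uses the same key, value, and effect.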

> Warning: GPU nodes are extremely expensive. Please remember to delete the cluster if you no longer need it.
## Step 3: Verify the node groups

> **Note:** If you encounter permission issues with `eksctl`, navigate to your AWS account's webpage and copy the
credential environment variables, including `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN`,
from the "Command line or programmatic access" page.

```sh
eksctl get nodegroup --cluster ${YOUR_EKS_NAME}

# CLUSTER NODEGROUP STATUS CREATED MIN SIZE MAX SIZE DESIRED CAPACITY INSTANCE TYPE IMAGE ID ASG NAME TYPE
# ${YOUR_EKS_NAME} cpu-node-group ACTIVE 2023-06-05T21:31:49Z 0 1 1 m5.xlarge AL2_x86_64 eks-cpu-node-group-... managed
# ${YOUR_EKS_NAME} gpu-node-group ACTIVE 2023-06-05T22:01:44Z 0 1 1 g5.12xlarge BOTTLEROCKET_x86_64_NVIDIA eks-gpu-node-group-... managed
```
This document has been moved to the [Ray documentation](https://docs.ray.io/en/master/cluster/kubernetes/user-guides/k8s-cluster-setup.html#kuberay-k8s-setup).
75 changes: 1 addition & 74 deletions docs/guidance/gcp-gke-gpu-cluster.md
@@ -1,74 +1 @@
# Start Google Cloud GKE Cluster with GPUs for KubeRay

## Step 1: Create a Kubernetes cluster on GKE

Run this command and all following commands on your local machine or on the [Google Cloud Shell](https://cloud.google.com/shell). If running from your local machine, you will need to install the [Google Cloud SDK](https://cloud.google.com/sdk/docs/install). The following command creates a Kubernetes cluster named `kuberay-gpu-cluster` with 1 CPU node in the `us-west1-b` zone. In this example, we use the `e2-standard-4` machine type, which has 4 vCPUs and 16 GB RAM.

```sh
gcloud container clusters create kuberay-gpu-cluster \
--num-nodes=1 --min-nodes 0 --max-nodes 1 --enable-autoscaling \
--zone=us-west1-b --machine-type e2-standard-4
```

> Note: You can also create a cluster from the [Google Cloud Console](https://console.cloud.google.com/kubernetes/list).
## Step 2: Create a GPU node pool

Run the following command to create a GPU node pool for Ray GPU workers.
(You can also create it from the Google Cloud Console; see the [GKE documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/node-taints#create_a_node_pool_with_node_taints) for more details.)

```sh
gcloud container node-pools create gpu-node-pool \
--accelerator type=nvidia-l4-vws,count=1 \
--zone us-west1-b \
--cluster kuberay-gpu-cluster \
--num-nodes 1 \
--min-nodes 0 \
--max-nodes 1 \
--enable-autoscaling \
--machine-type g2-standard-4 \
--node-taints=ray.io/node-type=worker:NoSchedule
```

The `--accelerator` flag specifies the type and number of GPUs for each node in the node pool. In this example, we use the [NVIDIA L4](https://cloud.google.com/compute/docs/gpus#l4-gpus) GPU. The machine type `g2-standard-4` has 1 GPU, 24 GB GPU Memory, 4 vCPUs and 16 GB RAM.

The taint `ray.io/node-type=worker:NoSchedule` prevents CPU-only Pods such as the KubeRay operator, Ray head, and CoreDNS Pods from being scheduled on this GPU node pool. This is because GPUs are expensive, so we want to use this node pool for Ray GPU workers only.

Concretely, any Pod that does not have the following toleration will not be scheduled on this GPU node pool:

```yaml
tolerations:
- key: ray.io/node-type
operator: Equal
value: worker
effect: NoSchedule
```
For more on taints and tolerations, see the [Kubernetes documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/).
## Step 3: Configure `kubectl` to connect to the cluster

Run the following command to download Google Cloud credentials and configure the Kubernetes CLI to use them.

```sh
gcloud container clusters get-credentials kuberay-gpu-cluster --zone us-west1-b
```

For more details, see the [GKE documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl).

## Step 4: Install NVIDIA GPU device drivers

This step is required for GPU support on GKE. See the [GKE documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers) for more details.

```sh
# Install NVIDIA GPU device driver
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml
# Verify that your nodes have allocatable GPUs
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
# Example output:
# NAME GPU
# gke-kuberay-gpu-cluster-gpu-node-pool-xxxxx 1
# gke-kuberay-gpu-cluster-default-pool-xxxxx <none>
```
This document has been moved to the [Ray documentation](https://docs.ray.io/en/master/cluster/kubernetes/user-guides/k8s-cluster-setup.html#kuberay-k8s-setup).
119 changes: 1 addition & 118 deletions docs/guidance/gcs-ft.md
@@ -1,118 +1 @@
## Ray GCS Fault Tolerance (GCS FT) (Beta release)

> **Note**: This feature is beta.

Ray GCS FT enables the GCS server to use an external storage backend. As a result, Ray clusters can tolerate GCS failures and recover from them
without affecting important services such as detached actors and RayServe deployments.

### Prerequisite

* Ray 2.0 is required.
* You need to provide an external Redis server for Ray. (A Redis HA cluster is highly recommended.)

### Enable Ray GCS FT

To enable Ray GCS FT in a KubeRay-managed Ray cluster, add an annotation to the RayCluster YAML file.

```yaml
...
kind: RayCluster
metadata:
  annotations:
    ray.io/ft-enabled: "true" # <- add this annotation to enable GCS FT
    ray.io/external-storage-namespace: "my-raycluster-storage-namespace" # <- optional, specify the external storage namespace
...
```
An example can be found at [ray-cluster.external-redis.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml)

When the annotation `ray.io/ft-enabled` is set to `true`, KubeRay enables the Ray GCS FT feature. This feature
consists of several components:

1. Newly created Ray clusters have a `Readiness Probe` and a `Liveness Probe` added to all head/worker nodes.
2. The KubeRay operator controller watches for `Event` object changes, which notify it of readiness probe failures so it can mark the affected nodes as `Unhealthy`.
3. The KubeRay operator controller kills and recreates any `Unhealthy` Ray head/worker node.

### Implementation Details

#### Readiness Probe vs Liveness Probe

These are the two types of probes used in Ray GCS FT.

The readiness probe notifies KubeRay of failures in the corresponding Ray cluster so that KubeRay can try its best to
recover the cluster. If KubeRay cannot recover the failed head/worker node, the liveness probe kicks in, deletes the old pod,
and creates a new pod.

By default, the liveness probe kicks in later than the readiness probe; it is the last resort for recovering the
Ray cluster. However, in the current implementation, readiness probe failures also cause KubeRay to kill and recreate the corresponding head/worker pod.

Currently, the readiness probe and the liveness probe use the same command. In the future, they may run
different commands.

On the Ray head node, the probes access a local Ray dashboard HTTP endpoint and a Raylet HTTP endpoint to make sure the head node is
healthy. Since the Ray dashboard does not run on Ray worker nodes, only the local Raylet HTTP endpoint is checked to verify that
a worker node is healthy.

#### Ray GCS FT Annotation

The Ray GCS FT feature checks whether the annotation `ray.io/ft-enabled` is set to `true` in the `RayCluster` YAML file. If so, KubeRay
also adds the annotation to each head/worker pod it creates.

#### Use External Redis Cluster

To use an external Redis cluster as the backend storage (required by Ray GCS FT),
add the `RAY_REDIS_ADDRESS` environment variable to the head node template.

You can also specify a storage namespace for your Ray cluster with the annotation `ray.io/external-storage-namespace`.

An example can be found at [ray-cluster.external-redis.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml)
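
A rough sketch of the relevant pieces of the RayCluster YAML (the Redis address and container image are placeholders; see the linked sample for a complete manifest):

```yaml
kind: RayCluster
metadata:
  annotations:
    ray.io/ft-enabled: "true"
    ray.io/external-storage-namespace: "my-raycluster-storage-namespace"
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.6.3
          env:
          - name: RAY_REDIS_ADDRESS
            value: "redis:6379" # address of your external Redis service (placeholder)
```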

To use SSL/TLS for the connection, prefix the Redis address with `rediss://` instead of `redis://`. This feature is only available in Ray 2.2 and above.

You can also specify additional environment variables in the head pod to customize the SSL configuration:

- `RAY_REDIS_CA_CERT` The location of the CA certificate (optional)
- `RAY_REDIS_CA_PATH` Path of trusted certificates (optional)
- `RAY_REDIS_CLIENT_CERT` File name of client certificate file (optional)
- `RAY_REDIS_CLIENT_KEY` File name of client private key (optional)
- `RAY_REDIS_SERVER_NAME` Server name to request (SNI) (optional)
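
A sketch of how these variables might be set on the head container (the certificate paths are placeholders and assume the files are mounted into the Pod, for example from a Kubernetes Secret):

```yaml
env:
- name: RAY_REDIS_ADDRESS
  value: "rediss://redis.example.com:6379" # note the rediss:// prefix for TLS
- name: RAY_REDIS_CA_CERT
  value: /etc/redis-tls/ca.crt
- name: RAY_REDIS_CLIENT_CERT
  value: /etc/redis-tls/tls.crt
- name: RAY_REDIS_CLIENT_KEY
  value: /etc/redis-tls/tls.key
```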


#### KubeRay Operator Controller

The KubeRay operator controller watches for new `Event` reconcile calls. If an `Event` object indicates a failed readiness probe,
the controller checks whether the affected pod has the annotation `ray.io/ft-enabled` set to `true`. If it does, the pod
belongs to a Ray cluster that has Ray GCS FT enabled.

The controller then tries to recover the failed pod. If it cannot, it adds an annotation named
`ray.io/health-state` with the value `Unhealthy` to the pod.

In every reconcile loop, the KubeRay operator controller looks for pods in the Ray cluster whose `ray.io/health-state` annotation
is set to `Unhealthy`. Any such pod is deleted and recreated.

#### External Storage Namespace

External storage namespaces can be used to share a single storage backend among multiple Ray clusters. By default, `ray.io/external-storage-namespace`
uses the RayCluster UID as its value when GCS FT is enabled. To use a customized external storage namespace instead,
add the `ray.io/external-storage-namespace` annotation to the RayCluster YAML file.

Whenever the `ray.io/external-storage-namespace` annotation is set, the head/worker nodes get the `RAY_external_storage_namespace` environment
variable, which Ray picks up later.

#### Known issues and limitations

1. For now, a Ray head/worker node that fails the readiness probe recovers by being restarted. More fine-grained control and recovery mechanisms are expected in the future.

### Test Ray GCS FT

Currently, two tests are responsible for ensuring Ray GCS FT is working correctly.

1. Detached actor test
2. RayServe test

In the detached actor test, a detached actor is created first. Then the head node is killed, and KubeRay brings up a
replacement head pod. The detached actor is still expected to be available. (Note: the client that created
the detached actor no longer exists and will retry if the Ray cluster returns a failure.)

In the RayServe test, a simple RayServe app is deployed on the Ray cluster. If the GCS server crashes, the RayServe app
should continue to be accessible after the head node recovers.
This document has been moved to the [Ray documentation](https://docs.ray.io/en/master/cluster/kubernetes/user-guides/kuberay-gcs-ft.html#kuberay-gcs-ft).
137 changes: 1 addition & 136 deletions docs/guidance/ingress.md
@@ -1,136 +1 @@
## Ingress Usage

Here we provide some examples to show how to use ingress to access your Ray cluster.

* [Example: AWS Application Load Balancer (ALB) Ingress support on AWS EKS](#example-aws-application-load-balancer-alb-ingress-support-on-aws-eks)
* [Example: Manually setting up NGINX Ingress on KinD](#example-manually-setting-up-nginx-ingress-on-kind)


> :warning: **Only expose Ingresses to authorized users.** The Ray Dashboard provides read and write access to the Ray Cluster. Anyone with access to this Ingress can execute arbitrary code on the Ray Cluster.
### Example: AWS Application Load Balancer (ALB) Ingress support on AWS EKS
#### Prerequisite
* Follow the document [Getting started with Amazon EKS – AWS Management Console and AWS CLI](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-console.html#eks-configure-kubectl) to create an EKS cluster.

* Follow the [installation instructions](https://kubernetes-sigs.github.io/aws-load-balancer-controller/latest/deploy/installation/) to set up the [AWS Load Balancer controller](https://github.com/kubernetes-sigs/aws-load-balancer-controller). Note that the repository maintains a webpage for each release. Please make sure you use the latest installation instructions.

* (Optional) Try [echo server example](https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/docs/examples/echo_server.md) in the [aws-load-balancer-controller](https://github.com/kubernetes-sigs/aws-load-balancer-controller) repository.

* (Optional) Read [how-it-works.md](https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/docs/how-it-works.md) to understand the mechanism of [aws-load-balancer-controller](https://github.com/kubernetes-sigs/aws-load-balancer-controller).

#### Instructions
```sh
# Step 1: Install KubeRay operator and CRD
pushd helm-chart/kuberay-operator/
helm install kuberay-operator .
popd

# Step 2: Install a RayCluster
pushd helm-chart/ray-cluster
helm install ray-cluster .
popd

# Step 3: Edit the `ray-operator/config/samples/ray-cluster-alb-ingress.yaml`
#
# (1) Annotation `alb.ingress.kubernetes.io/subnets`
# 1. Please include at least two subnets.
# 2. One Availability Zone (ex: us-west-2a) can only have at most 1 subnet.
# 3. In this example, you need to select public subnets (subnets that "Auto-assign public IPv4 address" is Yes on AWS dashboard)
#
# (2) Set the name of head pod service to `spec...backend.service.name`
eksctl get cluster ${YOUR_EKS_CLUSTER} # Check subnets on the EKS cluster

# Step 4: Apply the YAML file edited in Step 3.
kubectl apply -f ray-operator/config/samples/ray-cluster-alb-ingress.yaml

# Step 5: Check the ingress created by Step 4.
kubectl describe ingress ray-cluster-ingress

# [Example]
# Name: ray-cluster-ingress
# Labels: <none>
# Namespace: default
# Address: k8s-default-rayclust-....${REGION_CODE}.elb.amazonaws.com
# Default backend: default-http-backend:80 (<error: endpoints "default-http-backend" not found>)
# Rules:
# Host Path Backends
# ---- ---- --------
# *
# / ray-cluster-kuberay-head-svc:8265 (192.168.185.157:8265)
# Annotations: alb.ingress.kubernetes.io/scheme: internet-facing
# alb.ingress.kubernetes.io/subnets: ${SUBNET_1},${SUBNET_2}
# alb.ingress.kubernetes.io/tags: Environment=dev,Team=test
# alb.ingress.kubernetes.io/target-type: ip
# Events:
# Type Reason Age From Message
# ---- ------ ---- ---- -------
# Normal SuccessfullyReconciled 39m ingress Successfully reconciled

# Step 6: Check ALB on AWS (EC2 -> Load Balancing -> Load Balancers)
# The name of the ALB should be like "k8s-default-rayclust-......".

# Step 7: Check Ray Dashboard by ALB DNS Name. The name of the DNS Name should be like
# "k8s-default-rayclust-.....us-west-2.elb.amazonaws.com"

# Step 8: Delete the ingress, and AWS Load Balancer controller will remove ALB.
# Check ALB on AWS to make sure it is removed.
kubectl delete ingress ray-cluster-ingress
```

### Example: Manually setting up NGINX Ingress on KinD
```sh
# Step 1: Create a KinD cluster with `extraPortMappings` and `node-labels`
# Reference for the setting up of kind cluster: https://kind.sigs.k8s.io/docs/user/ingress/
cat <<EOF | kind create cluster --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
kubeadmConfigPatches:
- |
kind: InitConfiguration
nodeRegistration:
kubeletExtraArgs:
node-labels: "ingress-ready=true"
extraPortMappings:
- containerPort: 80
hostPort: 80
protocol: TCP
- containerPort: 443
hostPort: 443
protocol: TCP
EOF

# Step 2: Install NGINX ingress controller
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/kind/deploy.yaml
sleep 10 # Wait for the Kubernetes API Server to create the related resources
kubectl wait --namespace ingress-nginx \
--for=condition=ready pod \
--selector=app.kubernetes.io/component=controller \
--timeout=90s

# Step 3: Install KubeRay operator
pushd helm-chart/kuberay-operator
helm install kuberay-operator .
popd

# Step 4: Install RayCluster and create an ingress separately.
# If you want to change ingress settings, you can edit the ingress portion in
# `ray-operator/config/samples/ray-cluster.separate-ingress.yaml`.
# More information about the settings is documented in https://github.com/ray-project/kuberay/pull/699
# and `ray-operator/config/samples/ray-cluster.separate-ingress.yaml`
kubectl apply -f ray-operator/config/samples/ray-cluster.separate-ingress.yaml

# Step 5: Check the ingress created in Step 4.
kubectl describe ingress raycluster-ingress-head-ingress

# [Example]
# ...
# Rules:
# Host Path Backends
# ---- ---- --------
# *
# /raycluster-ingress/(.*) raycluster-ingress-head-svc:8265 (10.244.0.11:8265)
# Annotations: nginx.ingress.kubernetes.io/rewrite-target: /$1

# Step 6: Check `<ip>/raycluster-ingress/` on your browser. You will see the Ray Dashboard.
# [Note] The forward slash at the end of the address is necessary. `<ip>/raycluster-ingress`
# will report "404 Not Found".
```
This document has been moved to the [Ray documentation](https://docs.ray.io/en/master/cluster/kubernetes/k8s-ecosystem/ingress.html#kuberay-ingress).
110 changes: 1 addition & 109 deletions docs/guidance/kubeflow-integration.md
@@ -1,109 +1 @@
> Credit: This manifest draws heavily on the engineering blog ["Building a Machine Learning Platform with Kubeflow and Ray on Google Kubernetes Engine"](https://cloud.google.com/blog/products/ai-machine-learning/build-a-ml-platform-with-kubeflow-and-ray-on-gke) from Google Cloud.
# Kubeflow: an interactive development solution

The [Kubeflow](https://www.kubeflow.org/) project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable.

# Requirements
* Dependencies
* `kustomize`: v3.2.0 (Kubeflow manifest is sensitive to `kustomize` version.)
* `Kubernetes`: v1.23

* Computing resources:
* 16GB RAM
* 8 CPUs

# Example: Use Kubeflow to provide an interactive development environment
![image](../images/architecture.svg)

## Step 1: Create a Kubernetes cluster with Kind.
```sh
# Kubeflow is sensitive to Kubernetes version and Kustomize version.
kind create cluster --image=kindest/node:v1.23.0
kustomize version --short
# 3.2.0
```

## Step 2: Install Kubeflow v1.6-branch
* This example installs Kubeflow with the [v1.6-branch](https://github.com/kubeflow/manifests/tree/v1.6-branch).

* Install all Kubeflow official components and all common services using [one command](https://github.com/kubeflow/manifests/tree/v1.6-branch#install-with-a-single-command).
* If you do not want to install all components, you can comment out **KNative**, **Katib**, **Tensorboards Controller**, **Tensorboard Web App**, **Training Operator**, and **KServe** from [example/kustomization.yaml](https://github.com/kubeflow/manifests/blob/v1.6-branch/example/kustomization.yaml).

## Step 3: Install KubeRay operator
* Follow this [document](../../helm-chart/kuberay-operator/README.md) to install the latest stable KubeRay operator via Helm repository.

## Step 4: Install RayCluster
```sh
# Create a RayCluster CR, and the KubeRay operator will reconcile a Ray cluster
# with 1 head Pod and 1 worker Pod.
helm install raycluster kuberay/ray-cluster --version 0.6.0 --set image.tag=2.2.0-py38-cpu

# Check RayCluster
kubectl get pod -l ray.io/cluster=raycluster-kuberay
# NAME READY STATUS RESTARTS AGE
# raycluster-kuberay-head-bz77b 1/1 Running 0 64s
# raycluster-kuberay-worker-workergroup-8gr5q 1/1 Running 0 63s
```

* This step uses `rayproject/ray:2.2.0-py38-cpu` as its image. Ray is very sensitive to the Python versions and Ray versions between the server (RayCluster) and client (JupyterLab) sides. This image uses:
* Python 3.8.13
* Ray 2.2.0

## Step 5: Forward the port of Istio's Ingress-Gateway
* Follow the [instructions](https://github.com/kubeflow/manifests/tree/v1.6-branch#port-forward) to forward the port of Istio's Ingress-Gateway and log in to Kubeflow Central Dashboard.

## Step 6: Create a JupyterLab via Kubeflow Central Dashboard
* Click the "Notebooks" icon in the left panel.
* Click "New Notebook".
* Select `kubeflownotebookswg/jupyter-scipy:v1.6.1` as the OCI image.
* Click "Launch".
* Click "CONNECT" to connect to the JupyterLab instance.

## Step 7: Use Ray client in the JupyterLab to connect to the RayCluster
> Warning: Ray client has some known [limitations](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/ray-client.html#things-to-know) and is not actively maintained.
* As mentioned in Step 4, Ray is very sensitive to the Python versions and Ray versions between the server (RayCluster) and client (JupyterLab) sides. Open a terminal in the JupyterLab:
```sh
# Check Python version. The version's MAJOR and MINOR should match with RayCluster (i.e. Python 3.8)
python --version
# Python 3.8.10

# Install Ray 2.2.0
pip install -U ray[default]==2.2.0
```
* Connect to RayCluster via Ray client.
```python
# Open a new .ipynb page.
import ray
# ray://${RAYCLUSTER_HEAD_SVC}.${NAMESPACE}.svc.cluster.local:${RAY_CLIENT_PORT}
ray.init(address="ray://raycluster-kuberay-head-svc.default.svc.cluster.local:10001")
print(ray.cluster_resources())
# {'node:10.244.0.41': 1.0, 'memory': 3000000000.0, 'node:10.244.0.40': 1.0, 'object_store_memory': 805386239.0, 'CPU': 2.0}
# Try Ray task
@ray.remote
def f(x):
return x * x
futures = [f.remote(i) for i in range(4)]
print(ray.get(futures)) # [0, 1, 4, 9]
# Try Ray actor
@ray.remote
class Counter(object):
def __init__(self):
self.n = 0
def increment(self):
self.n += 1
def read(self):
return self.n
counters = [Counter.remote() for i in range(4)]
[c.increment.remote() for c in counters]
futures = [c.read.remote() for c in counters]
print(ray.get(futures)) # [1, 1, 1, 1]
```
This document has been moved to the [Ray documentation](https://docs.ray.io/en/master/cluster/kubernetes/k8s-ecosystem/kubeflow.html).
43 changes: 1 addition & 42 deletions docs/guidance/mobilenet-rayservice.md
@@ -1,42 +1 @@
# Serve a MobileNet image classifier using RayService

> **Note:** The Python files for the Ray Serve application and its client are in the repository [ray-project/serve_config_examples](https://github.com/ray-project/serve_config_examples).
## Step 1: Create a Kubernetes cluster with Kind.

```sh
kind create cluster --image=kindest/node:v1.23.0
```

## Step 2: Install KubeRay operator

Follow [this document](../../helm-chart/kuberay-operator/README.md) to install the latest stable KubeRay operator via Helm repository.
Please note that the YAML file in this example uses `serveConfigV2`, which is supported starting from KubeRay v0.6.0.

## Step 3: Install a RayService

```sh
# path: ray-operator/config/samples/
kubectl apply -f ray-service.mobilenet.yaml
```

* The [mobilenet.py](https://github.com/ray-project/serve_config_examples/blob/master/mobilenet/mobilenet.py) file requires `tensorflow` as a dependency. Hence, the YAML file uses `rayproject/ray-ml:2.5.0` instead of `rayproject/ray:2.5.0`.
* `python-multipart` is required for the request parsing function `starlette.requests.form()`, so the YAML file includes `python-multipart` in the runtime environment.

## Step 4: Forward the port of Serve

```sh
kubectl port-forward svc/rayservice-mobilenet-serve-svc 8000
```

Note that the Serve service will be created after the Serve applications are ready and running. This process may take approximately 1 minute after all Pods in the RayCluster are running.

## Step 5: Send a request to the ImageClassifier

* Step 5.1: Prepare an image file.
* Step 5.2: Update `image_path` in [mobilenet_req.py](https://github.com/ray-project/serve_config_examples/blob/master/mobilenet/mobilenet_req.py)
* Step 5.3: Send a request to the `ImageClassifier`.
```sh
python mobilenet_req.py
# sample output: {"prediction":["n02099601","golden_retriever",0.17944198846817017]}
```
This document has been moved to the [Ray documentation](https://docs.ray.io/en/master/cluster/kubernetes/examples/mobilenet-rayservice.html#kuberay-mobilenet-rayservice-example).
151 changes: 1 addition & 150 deletions docs/guidance/pod-command.md
@@ -1,150 +1 @@
# Specify container commands for Ray head/worker Pods
You can execute commands on the head/worker pods at two timings:

* (1) **Before `ray start`**: As an example, you can set up some environment variables that will be used by `ray start`.

* (2) **After `ray start` (RayCluster is ready)**: As an example, you can launch a Ray serve deployment when the RayCluster is ready.

## Current KubeRay operator behavior for container commands
* The current behavior for container commands is not finalized, and **may be updated in the future**.
* See [code](https://github.com/ray-project/kuberay/blob/47148921c7d14813aea26a7974abda7cf22bbc52/ray-operator/controllers/ray/common/pod.go#L301-L326) for more details.

## Timing 1: Before `ray start`
Currently, for timing (1), we can set the container's `Command` and `Args` in the RayCluster specification to achieve this.

```yaml
# ray-operator/config/samples/ray-cluster.head-command.yaml
rayStartParams:
...
#pod template
template:
spec:
containers:
- name: ray-head
image: rayproject/ray:2.6.3
resources:
...
ports:
...
# `command` and `args` will become a part of `spec.containers.0.args` in the head Pod.
command: ["echo 123"]
args: ["456"]
```
* Ray head Pod
* `spec.containers.0.command` is hardcoded with `["/bin/bash", "-lc", "--"]`.
* `spec.containers.0.args` contains two parts:
* (Part 1) **user-specified command**: A string that concatenates `headGroupSpec.template.spec.containers.0.command` and `headGroupSpec.template.spec.containers.0.args` from the RayCluster.
* (Part 2) **ray start command**: The command is created based on `rayStartParams` specified in RayCluster. The command will look like `ulimit -n 65536; ray start ...`.
* To summarize, `spec.containers.0.args` will be `$(user-specified command) && $(ray start command)`.

* Example
```sh
# Prerequisite: There is a KubeRay operator in the Kubernetes cluster.
# Path: kuberay/
kubectl apply -f ray-operator/config/samples/ray-cluster.head-command.yaml
# Check ${RAYCLUSTER_HEAD_POD}
kubectl get pod -l ray.io/node-type=head
# Check `spec.containers.0.command` and `spec.containers.0.args`.
kubectl describe pod ${RAYCLUSTER_HEAD_POD}

# Command:
# /bin/bash
# -lc
# --
# Args:
# echo 123 456 && ulimit -n 65536; ray start --head --dashboard-host=0.0.0.0 --num-cpus=1 --block --metrics-export-port=8080 --memory=2147483648
```


## Timing 2: After `ray start` (RayCluster is ready)
We have two solutions to execute commands after the RayCluster is ready. The main difference between them is that with Solution 1, users can check the logs via `kubectl logs`.

### Solution 1: Container command (Recommended)
As mentioned in the section "Timing 1: Before `ray start`", the user-specified command is executed before the `ray start` command. Hence, we can execute `ray_cluster_resources.sh` in the background by updating `headGroupSpec.template.spec.containers.0.command` in `ray-cluster.head-command.yaml`.

```yaml
# ray-operator/config/samples/ray-cluster.head-command.yaml
# The parentheses around the command are required.
command: ["(/home/ray/samples/ray_cluster_resources.sh&)"]
# ray_cluster_resources.sh
apiVersion: v1
kind: ConfigMap
metadata:
name: ray-example
data:
ray_cluster_resources.sh: |
#!/bin/bash
# wait for ray cluster to finish initialization
while true; do
ray health-check 2>/dev/null
if [ "$?" = "0" ]; then
break
else
echo "INFO: waiting for ray head to start"
sleep 1
fi
done
# Print the resources in the ray cluster after the cluster is ready.
python -c "import ray; ray.init(); print(ray.cluster_resources())"
echo "INFO: Print Ray cluster resources"
```

* Example
```sh
# Path: kuberay/
# (1) Update `command` to ["(/home/ray/samples/ray_cluster_resources.sh&)"]
# (2) Comment out `postStart` and `args`.
kubectl apply -f ray-operator/config/samples/ray-cluster.head-command.yaml

# Check ${RAYCLUSTER_HEAD_POD}
kubectl get pod -l ray.io/node-type=head

# Check the logs
kubectl logs ${RAYCLUSTER_HEAD_POD}

# INFO: waiting for ray head to start
# .
# . => Cluster initialization
# .
# 2023-02-16 18:44:43,724 INFO worker.py:1231 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
# 2023-02-16 18:44:43,724 INFO worker.py:1352 -- Connecting to existing Ray cluster at address: 10.244.0.26:6379...
# 2023-02-16 18:44:43,735 INFO worker.py:1535 -- Connected to Ray cluster. View the dashboard at http://10.244.0.26:8265
# {'object_store_memory': 539679129.0, 'node:10.244.0.26': 1.0, 'CPU': 1.0, 'memory': 2147483648.0}
# INFO: Print Ray cluster resources
```

### Solution 2: postStart hook
```yaml
# ray-operator/config/samples/ray-cluster.head-command.yaml
lifecycle:
postStart:
exec:
command: ["/bin/sh","-c","/home/ray/samples/ray_cluster_resources.sh"]
```

* We execute the script `ray_cluster_resources.sh` via the postStart hook. Based on [this document](https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#container-hooks), there is no guarantee that the hook will execute before the container ENTRYPOINT. Hence, `ray_cluster_resources.sh` needs to wait for the RayCluster to finish initialization.

* Example
```sh
# Path: kuberay/
kubectl apply -f ray-operator/config/samples/ray-cluster.head-command.yaml
# Check ${RAYCLUSTER_HEAD_POD}
kubectl get pod -l ray.io/node-type=head
# Forward the port of Dashboard
kubectl port-forward --address 0.0.0.0 ${RAYCLUSTER_HEAD_POD} 8265:8265
# Open the browser and check the Dashboard (${YOUR_IP}:8265/#/job).
# You should see a SUCCEEDED job with the following Entrypoint:
#
# `python -c "import ray; ray.init(); print(ray.cluster_resources())"`
```
This document has been moved to the [Ray documentation](https://docs.ray.io/en/master/cluster/kubernetes/user-guides/pod-command.html#kuberay-pod-command).
122 changes: 1 addition & 121 deletions docs/guidance/pod-security.md
@@ -1,121 +1 @@
# Pod Security

Kubernetes defines three Pod Security Standards, `privileged`, `baseline`, and `restricted`, to broadly
cover the security spectrum. The `privileged` standard allows known privilege escalations, and thus it is not
safe enough for security-critical applications.

This document describes how to configure the RayCluster YAML file to apply the `restricted` Pod Security Standard. The following
references can help you understand this document better:

* [Kubernetes - Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/#restricted)
* [Kubernetes - Pod Security Admission](https://kubernetes.io/docs/concepts/security/pod-security-admission/)
* [Kubernetes - Auditing](https://kubernetes.io/docs/tasks/debug/debug-cluster/audit/)
* [KinD - Auditing](https://kind.sigs.k8s.io/docs/user/auditing/)

# Step 1: Create a KinD cluster
```bash
# Path: kuberay/
kind create cluster --config ray-operator/config/security/kind-config.yaml --image=kindest/node:v1.24.0
```
The `kind-config.yaml` enables audit logging with the audit policy defined in `audit-policy.yaml`. The `audit-policy.yaml`
defines an auditing policy to listen to the Pod events in the namespace `pod-security`. With this policy, we can check
whether our Pods violate the policies in the `restricted` standard.

The [Pod Security Admission](https://kubernetes.io/docs/concepts/security/pod-security-admission/) feature was first
introduced in Kubernetes v1.22 (alpha) and became stable in Kubernetes v1.25. In addition, KubeRay currently supports
Kubernetes from v1.19 to v1.24. (At the time of writing, KubeRay has not been tested with Kubernetes v1.25.) Hence, this step uses **Kubernetes v1.24**.

# Step 2: Check the audit logs
```bash
docker exec kind-control-plane cat /var/log/kubernetes/kube-apiserver-audit.log
```
The log should be empty because the namespace `pod-security` does not exist.

# Step 3: Create the `pod-security` namespace
```bash
kubectl create ns pod-security
kubectl label --overwrite ns pod-security \
pod-security.kubernetes.io/warn=restricted \
pod-security.kubernetes.io/warn-version=latest \
pod-security.kubernetes.io/audit=restricted \
pod-security.kubernetes.io/audit-version=latest \
pod-security.kubernetes.io/enforce=restricted \
pod-security.kubernetes.io/enforce-version=latest
```
With the `pod-security.kubernetes.io` labels, the built-in Kubernetes Pod security admission controller will apply the
`restricted` Pod security standard to all Pods in the namespace `pod-security`. The label
`pod-security.kubernetes.io/enforce=restricted` means that a Pod will be rejected if it violates the policies defined in
the `restricted` security standard. See [Pod Security Admission](https://kubernetes.io/docs/concepts/security/pod-security-admission/) for more details about the labels.

# Step 4: Install the KubeRay operator
```bash
# Update the field securityContext in helm-chart/kuberay-operator/values.yaml
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault

# Path: kuberay/helm-chart/kuberay-operator
helm install -n pod-security kuberay-operator .
```

# Step 5: Create a RayCluster (Choose either Step 5.1 or Step 5.2)
* If you choose Step 5.1, no Pod will be created in the namespace `pod-security`.
* If you choose Step 5.2, Pods can be created successfully.

## Step 5.1: Create a RayCluster without proper `securityContext` configurations
```bash
# Path: kuberay/ray-operator/config/samples
kubectl apply -n pod-security -f ray-cluster.complete.yaml

# Wait 20 seconds and check audit logs for the error messages.
docker exec kind-control-plane cat /var/log/kubernetes/kube-apiserver-audit.log

# Example error messages
# "pods \"raycluster-complete-head-fkbf5\" is forbidden: violates PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (container \"ray-head\" must set securityContext.allowPrivilegeEscalation=false) ...

kubectl get pod -n pod-security
# NAME READY STATUS RESTARTS AGE
# kuberay-operator-8b6d55dbb-t8msf 1/1 Running 0 62s

# Clean up the RayCluster
kubectl delete rayclusters.ray.io -n pod-security raycluster-complete
# raycluster.ray.io "raycluster-complete" deleted
```
No Pod is created in the namespace `pod-security`; check the audit logs for the error messages.

## Step 5.2: Create a RayCluster with proper `securityContext` configurations
```bash
# Path: kuberay/ray-operator/config/security
kubectl apply -n pod-security -f ray-cluster.pod-security.yaml

# Wait for the RayCluster convergence and check audit logs for the messages.
docker exec kind-control-plane cat /var/log/kubernetes/kube-apiserver-audit.log

# Forward the dashboard port
kubectl port-forward --address 0.0.0.0 svc/raycluster-pod-security-head-svc -n pod-security 8265:8265

# Log in to the head Pod
kubectl exec -it -n pod-security ${YOUR_HEAD_POD} -- bash

# (Head Pod) Run a sample job in the Pod
python3 samples/xgboost_example.py

# Check the job status in the dashboard on your browser.
# http://127.0.0.1:8265/#/job => The job status should be "SUCCEEDED".

# (Head Pod) Make sure Python dependencies can be installed under `restricted` security standard
pip3 install jsonpatch
echo $? # Check the exit code of `pip3 install jsonpatch`. It should be 0.

# Clean up the RayCluster
kubectl delete -n pod-security -f ray-cluster.pod-security.yaml
# raycluster.ray.io "raycluster-pod-security" deleted
# configmap "xgboost-example" deleted
```
One head Pod and one worker Pod will be created as specified in `ray-cluster.pod-security.yaml`.
First, we log in to the head Pod, run an XGBoost example script, and check the job
status in the dashboard. Next, we use `pip` to install a Python dependency (i.e. `jsonpatch`), and the exit code of the `pip` command should be 0.
This document has been moved to the [Ray documentation](https://docs.ray.io/en/master/cluster/kubernetes/user-guides/pod-security.html).
65 changes: 1 addition & 64 deletions docs/guidance/profiling.md
@@ -1,64 +1 @@
# Profiling with KubeRay

## Stack trace and CPU profiling
[py-spy](https://github.com/benfred/py-spy/tree/master) is a sampling profiler for Python programs. It lets you visualize what your Python program is spending time on without restarting the program or modifying the code in any way. This section describes how to configure RayCluster YAML file to enable py-spy and see Stack Trace and CPU Flame Graph via Ray Dashboard.

### **Prerequisite**
py-spy requires the `SYS_PTRACE` capability to read process memory. However, Kubernetes omits this capability by default. To enable profiling, add the following to the `template.spec.containers` for both the head and worker Pods.

```bash
securityContext:
capabilities:
add:
- SYS_PTRACE
```
**Notes:**
- Adding `SYS_PTRACE` is forbidden under `baseline` and `restricted` Pod Security Standards. See [Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/) for more details.

### **Steps to deploy and test the RayCluster with `SYS_PTRACE` capability**

1. **Create a KinD cluster**:
```bash
kind create cluster
```

2. **Install the KubeRay operator**:

Follow the steps in [Installation Guide](https://github.com/ray-project/kuberay/blob/master/helm-chart/kuberay-operator/README.md#install-crds-and-kuberay-operator).

3. **Create a RayCluster with `SYS_PTRACE` capability**:
```bash
# Path: kuberay/ray-operator/config/samples
kubectl apply -f ray-cluster.py-spy.yaml
```

4. **Forward the dashboard port**:
```bash
kubectl port-forward --address 0.0.0.0 svc/raycluster-py-spy-head-svc 8265:8265
```

5. **Run a sample job within the head Pod**:
```bash
# Log in to the head Pod
kubectl exec -it ${YOUR_HEAD_POD} -- bash
# (Head Pod) Run a sample job in the Pod
# `long_running_task` includes a `while True` loop to ensure the task remains actively running indefinitely.
# This allows you ample time to view the Stack Trace and CPU Flame Graph via Ray Dashboard.
python3 samples/long_running_task.py
```
**Notes:**
- If you're running your own examples and encounter the error `Failed to write flamegraph: I/O error: No stack counts found` when viewing CPU Flame Graph, it might be due to the process being idle. Notably, using the `sleep` function can lead to this state. In such situations, py-spy filters out the idle stack traces. Refer to this [issue](https://github.com/benfred/py-spy/issues/321#issuecomment-731848950) for more information.
6. **Profile using Ray Dashboard**:
- Visit http://localhost:8265/#/cluster.
- Click `Stack Trace` for `ray::long_running_task`.
![StackTrace](../images/stack_trace.png)
- Click `CPU Flame Graph` for `ray::long_running_task`.
![FlameGraph](../images/cpu_flame_graph.png)
- For additional details on using the profiler, refer to the [Ray Observability Guide](https://docs.ray.io/en/latest/ray-observability/user-guides/debug-apps/optimize-performance.html#python-cpu-profiling-in-the-dashboard).
7. **Clean up the RayCluster**:
```bash
kubectl delete -f ray-cluster.py-spy.yaml
```
This document has been moved to the [Ray documentation](https://docs.ray.io/en/master/cluster/kubernetes/k8s-ecosystem/pyspy.html#kuberay-pyspy-integration).
353 changes: 1 addition & 352 deletions docs/guidance/prometheus-grafana.md

Large diffs are not rendered by default.

151 changes: 1 addition & 150 deletions docs/guidance/rayjob.md
@@ -1,150 +1 @@
# Ray Job (alpha)

> Note: This is the alpha version of Ray Job support in KubeRay. There will be ongoing improvements for Ray Job in future releases.
## Prerequisites

* Ray 1.10 or higher
* KubeRay v0.3.0+. (v0.6.0+ is recommended)

## What is a RayJob?

A RayJob manages 2 things:

* Ray Cluster: Manages resources in a Kubernetes cluster.
* Job: Manages jobs in a Ray Cluster.

### What does the RayJob provide?

* **Kubernetes-native support for Ray clusters and Ray Jobs.** You can use a Kubernetes config to define a Ray cluster and job, and use `kubectl` to create them. The cluster can be deleted automatically once the job is finished.

## Deploy KubeRay

Make sure your KubeRay operator version is at least v0.3.0.
The latest released KubeRay version is recommended.
For installation instructions, please follow [the documentation](../deploy/installation.md).

## Run an example Job

There is one example config file to deploy a RayJob included here:
[ray_v1alpha1_rayjob.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray_v1alpha1_rayjob.yaml)

```shell
# Create a RayJob.
$ kubectl apply -f config/samples/ray_v1alpha1_rayjob.yaml
```

```shell
# List running RayJobs.
$ kubectl get rayjob
NAME AGE
rayjob-sample 7s
```

```shell
# The RayJob sample will also create a RayCluster.
# The RayCluster creates a few resources, including pods and services. You can use the following commands to check them:
$ kubectl get rayclusters
$ kubectl get pod
```

## RayJob Configuration

* `entrypoint` - The shell command to run for this job.
* `rayClusterSpec` - The spec for the Ray cluster to run the job on.
* `jobId` - _(Optional)_ Job ID to specify for the job. If not provided, one will be generated.
* `metadata` - _(Optional)_ Arbitrary user-provided metadata for the job.
* `runtimeEnvYAML` - _(Optional)_ The runtime environment configuration provided as a multi-line YAML string. _(New in KubeRay version 1.0.)_
* `shutdownAfterJobFinishes` - _(Optional)_ whether to recycle the cluster after the job finishes. Defaults to false.
* `ttlSecondsAfterFinished` - _(Optional)_ TTL to clean up the cluster. This only works if `shutdownAfterJobFinishes` is set.
* `submitterPodTemplate` - _(Optional)_ Pod template spec for the pod that runs `ray job submit` against the Ray cluster.
* `runtimeEnv` - [DEPRECATED] _(Optional)_ base64-encoded string of the runtime env json string.
* `entrypointNumCpus` - _(Optional)_ Specifies the quantity of CPU cores to reserve for the entrypoint command.
* `entrypointNumGpus` - _(Optional)_ Specifies the number of GPUs to reserve for the entrypoint command.
* `entrypointResources` - _(Optional)_ A JSON-formatted dictionary specifying custom resources and their quantities.

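Below is a minimal sketch of how these fields fit together in a RayJob manifest. It is illustrative only; the field values are hypothetical, the head and worker group specs are elided, and [ray_v1alpha1_rayjob.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray_v1alpha1_rayjob.yaml) remains the complete, working sample.

```yaml
apiVersion: ray.io/v1alpha1
kind: RayJob
metadata:
  name: rayjob-sketch
spec:
  # Shell command to run for this job.
  entrypoint: python /home/ray/samples/sample_code.py
  # Delete the RayCluster 60 seconds after the job finishes.
  shutdownAfterJobFinishes: true
  ttlSecondsAfterFinished: 60
  # Runtime environment as a multi-line YAML string (KubeRay 1.0+).
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
  # Spec of the Ray cluster to run the job on.
  # headGroupSpec and workerGroupSpecs are elided for brevity.
  rayClusterSpec:
    rayVersion: "2.7.0"
```
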
## RayJob Observability

You can use `kubectl logs` to check the operator logs or the head/worker Pod logs.
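
For example, the following commands are a sketch of where to look; the operator name assumes the operator was installed through the Helm chart under the release name `kuberay-operator`:

```shell
# Operator logs.
kubectl logs deployment/kuberay-operator
# Logs from the head Pod of the RayCluster created for the RayJob.
kubectl logs -l ray.io/node-type=head
```
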
You can also use `kubectl describe rayjobs rayjob-sample` to check the states and event logs of your RayJob instance:

```text
Status:
Dashboard URL: rayjob-sample-raycluster-v6qcq-head-svc.default.svc.cluster.local:8265
End Time: 2023-07-11T17:39:56Z
Job Deployment Status: Complete
Job Id: rayjob-sample-66z5m
Job Status: SUCCEEDED
Message: Job finished successfully.
Observed Generation: 2
Ray Cluster Name: rayjob-sample-raycluster-v6qcq
Ray Cluster Status:
Available Worker Replicas: 1
Desired Worker Replicas: 1
Endpoints:
Client: 10001
Dashboard: 8265
Gcs - Server: 6379
Metrics: 8080
Serve: 8000
Head:
Pod IP: 10.244.0.6
Service IP: 10.96.31.68
Last Update Time: 2023-07-11T17:39:32Z
Max Worker Replicas: 5
Min Worker Replicas: 1
Observed Generation: 1
State: ready
Start Time: 2023-07-11T17:39:39Z
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Created 3m37s rayjob-controller Created cluster rayjob-sample-raycluster-v6qcq
Normal Created 2m11s rayjob-controller Created k8s job rayjob-sample
Normal Deleted 107s rayjob-controller Deleted cluster rayjob-sample-raycluster-v6qcq
```

If the job doesn't run successfully, the above `describe` command will provide information about that too:

```text
Status:
Dashboard URL: rayjob-sample-raycluster-2h7ds-head-svc.default.svc.cluster.local:8265
End Time: 2023-07-11T17:51:31Z
Job Deployment Status: Complete
Job Id: rayjob-sample-prbts
Job Status: FAILED
Message: Job failed due to an application error, last available logs (truncated to 20,000 chars):
python: can't open file '/home/ray/samples/sample_code.ppy': [Errno 2] No such file or directory
Observed Generation: 2
Ray Cluster Name: rayjob-sample-raycluster-2h7ds
Ray Cluster Status:
Available Worker Replicas: 1
Desired Worker Replicas: 1
Endpoints:
Client: 10001
Dashboard: 8265
Gcs - Server: 6379
Metrics: 8080
Serve: 8000
Head:
Pod IP: 10.244.0.7
Service IP: 10.96.24.232
Last Update Time: 2023-07-11T17:51:12Z
Max Worker Replicas: 5
Min Worker Replicas: 1
Observed Generation: 1
State: ready
Start Time: 2023-07-11T17:51:16Z
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Created 3m57s rayjob-controller Created cluster rayjob-sample-raycluster-2h7ds
Normal Created 2m31s rayjob-controller Created k8s job rayjob-sample
```

## Delete the RayJob instance

```shell
kubectl delete -f config/samples/ray_v1alpha1_rayjob.yaml
```
This document has been moved to the [Ray documentation](https://docs.ray.io/en/master/cluster/kubernetes/getting-started/rayjob-quick-start.html).
130 changes: 1 addition & 129 deletions docs/guidance/rayserve-dev-doc.md
@@ -1,129 +1 @@
# Developing Ray Serve Python scripts on a RayCluster

In this tutorial, you will learn how to effectively debug your Ray Serve scripts against a RayCluster, enabling enhanced observability and faster iteration speed compared to developing the script directly with a RayService.
Many RayService issues are related to the Ray Serve Python scripts, so it is important to ensure the correctness of the scripts before deploying them to a RayService.
This tutorial will show you how to develop a Ray Serve Python script for a MobileNet image classifier on a RayCluster.
You can deploy and serve the classifier on your local Kind cluster without requiring a GPU.
Please refer to [ray-service.mobilenet.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-service.mobilenet.yaml) and [mobilenet-rayservice.md](https://github.com/ray-project/kuberay/blob/master/docs/guidance/mobilenet-rayservice.md) for more details.


# Step 1: Install a KubeRay cluster

Follow this [document](../../helm-chart/kuberay-operator/README.md) to install the latest stable KubeRay operator via Helm repository.

# Step 2: Create a RayCluster CR

```sh
helm install raycluster kuberay/ray-cluster --version 0.6.0-rc.0
```

# Step 3: Log in to the head Pod

```sh
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
kubectl exec -it $HEAD_POD -- bash
```

# Step 4: Prepare your Ray Serve Python scripts and run the Ray Serve application

```sh
# Execute the following command in the head Pod
git clone https://github.com/ray-project/serve_config_examples.git
cd serve_config_examples

# Try to launch the Ray Serve application
serve run mobilenet.mobilenet:app
# [Error message]
# from tensorflow.keras.preprocessing import image
# ModuleNotFoundError: No module named 'tensorflow'
```

* `serve run mobilenet.mobilenet:app`: The first `mobilenet` is the name of a directory in `serve_config_examples/`,
the second `mobilenet` is the name of the Python file in that directory, and `app` is the variable representing the Ray Serve application within that file. See the section "import_path" in [rayservice-troubleshooting.md](https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayservice-troubleshooting.md) for more details.
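
In other words, the import path maps onto the repository layout roughly as follows (a simplified sketch showing only the relevant files):

```text
serve_config_examples/
└── mobilenet/
    └── mobilenet.py   # defines the variable `app`, the Ray Serve application
```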

# Step 5: Change the Ray image from `rayproject/ray:${RAY_VERSION}` to `rayproject/ray-ml:${RAY_VERSION}`

```sh
# Uninstall RayCluster
helm uninstall raycluster

# Install the RayCluster CR with the Ray image `rayproject/ray-ml:${RAY_VERSION}`
helm install raycluster kuberay/ray-cluster --version 0.6.0-rc.0 --set image.repository=rayproject/ray-ml
```

The error message in Step 4 indicates that the Ray image `rayproject/ray:${RAY_VERSION}` does not have the TensorFlow package.
Due to the significant size of TensorFlow, we have opted to use an image with TensorFlow as the base instead of installing it within the Ray [runtime environment](https://docs.ray.io/en/latest/ray-core/handling-dependencies.html#runtime-environments).
In this Step, we will change the Ray image from `rayproject/ray:${RAY_VERSION}` to `rayproject/ray-ml:${RAY_VERSION}`.

# Step 6: Repeat Step 3 and Step 4

```sh
# Repeat Step 3 and Step 4 to log in to the new head Pod and run the Ray Serve application.
# You should successfully launch the Ray Serve application this time.
serve run mobilenet.mobilenet:app

# [Example output]
# (ServeReplica:default_ImageClassifier pid=139, ip=10.244.0.8) Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/mobilenet_v2/mobilenet_v2_weights_tf_dim_ordering_tf_kernels_1.0_224.h5
# 8192/14536120 [..............................] - ETA: 0s)
# 4202496/14536120 [=======>......................] - ETA: 0s)
# 12902400/14536120 [=========================>....] - ETA: 0s)
# 14536120/14536120 [==============================] - 0s 0us/step
# 2023-07-17 14:04:43,737 SUCC scripts.py:424 -- Deployed Serve app successfully.
```

# Step 7: Submit a request to the Ray Serve application

```sh
# (On your local machine) Forward the serve port of the head Pod
kubectl port-forward --address 0.0.0.0 $HEAD_POD 8000

# Clone the repository on your local machine
git clone https://github.com/ray-project/serve_config_examples.git
cd serve_config_examples/mobilenet

# Prepare a sample image file. `stable_diffusion_example.png` is a cat image generated by the Stable Diffusion model.
curl -O https://raw.githubusercontent.com/ray-project/kuberay/master/docs/images/stable_diffusion_example.png

# Update `image_path` in `mobilenet_req.py` to the path of `stable_diffusion_example.png`
# Send a request to the Ray Serve application.
python3 mobilenet_req.py

# [Error message]
# Unexpected error, traceback: ray::ServeReplica:default_ImageClassifier.handle_request() (pid=139, ip=10.244.0.8)
# File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/_private/utils.py", line 254, in wrap_to_ray_error
# raise exception
# File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/_private/replica.py", line 550, in invoke_single
# result = await method_to_call(*args, **kwargs)
# File "./mobilenet/mobilenet.py", line 24, in __call__
# File "/home/ray/anaconda3/lib/python3.7/site-packages/starlette/requests.py", line 256, in _get_form
# ), "The `python-multipart` library must be installed to use form parsing."
# AssertionError: The `python-multipart` library must be installed to use form parsing..
```

`python-multipart` is required by the request parsing function `starlette.requests.form()`, which is why the error above is reported when we send a request to the Ray Serve application.

# Step 8: Restart the Ray Serve application with a runtime environment

```sh
# In the head Pod, stop the Ray Serve application
serve shutdown

# Check the Ray Serve application status
serve status
# [Example output]
# There are no applications running on this cluster.

# Launch the Ray Serve application with runtime environment.
serve run mobilenet.mobilenet:app --runtime-env-json='{"pip": ["python-multipart==0.0.6"]}'

# (On your local machine) Submit a request to the Ray Serve application again, and you should get the correct prediction.
python3 mobilenet_req.py
# [Example output]
# {"prediction": ["n02123159", "tiger_cat", 0.2994779646396637]}
```

# Step 9: Create a RayService YAML file

In the previous steps, we found that the Ray Serve application can be successfully launched using the Ray image `rayproject/ray-ml:${RAY_VERSION}` and the [runtime environment](https://docs.ray.io/en/latest/ray-core/handling-dependencies.html#runtime-environments) `python-multipart==0.0.6`.
Therefore, we can create a RayService YAML file with the same Ray image and runtime environment.
For more details, please refer to [ray-service.mobilenet.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-service.mobilenet.yaml) and [mobilenet-rayservice.md](https://github.com/ray-project/kuberay/blob/master/docs/guidance/mobilenet-rayservice.md).
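
A sketch of the pieces that matter here is shown below. It is not a complete manifest (the head and worker group specs are elided, and some values are illustrative); [ray-service.mobilenet.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-service.mobilenet.yaml) is the full working example.

```yaml
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-mobilenet
spec:
  serveConfigV2: |
    applications:
      - name: mobilenet
        import_path: mobilenet.mobilenet:app
        runtime_env:
          pip: ["python-multipart==0.0.6"]
          # The full sample also sets a working_dir pointing at the
          # serve_config_examples repository so the module can be imported.
  rayClusterConfig:
    # Use the ray-ml image so that TensorFlow is available.
    # headGroupSpec and workerGroupSpecs are elided for brevity.
    rayVersion: "2.7.0"
```
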
This document has been moved to the [Ray documentation](https://docs.ray.io/en/master/cluster/kubernetes/user-guides/rayserve-dev-doc.html#kuberay-dev-serve).
294 changes: 1 addition & 293 deletions docs/guidance/rayservice-troubleshooting.md

Large diffs are not rendered by default.

291 changes: 1 addition & 290 deletions docs/guidance/rayservice.md

Large diffs are not rendered by default.

65 changes: 1 addition & 64 deletions docs/guidance/stable-diffusion-rayservice.md
@@ -1,64 +1 @@
# Serve a StableDiffusion text-to-image model using RayService

> **Note:** The Python files for the Ray Serve application and its client are in the [ray-project/serve_config_examples](https://github.com/ray-project/serve_config_examples) repo
and [the Ray documentation](https://docs.ray.io/en/latest/serve/tutorials/stable-diffusion.html).

## Step 1: Create a Kubernetes cluster with GPUs

Follow [aws-eks-gpu-cluster.md](./aws-eks-gpu-cluster.md) or [gcp-gke-gpu-cluster.md](./gcp-gke-gpu-cluster.md) to create a Kubernetes cluster with 1 CPU node and 1 GPU node.

## Step 2: Install KubeRay operator

Follow [this document](../../helm-chart/kuberay-operator/README.md) to install the latest stable KubeRay operator via Helm repository.
Please note that the YAML file in this example uses `serveConfigV2`, which is supported starting from KubeRay v0.6.0.

## Step 3: Install a RayService

```sh
# path: ray-operator/config/samples/
kubectl apply -f ray-service.stable-diffusion.yaml
```

This RayService configuration contains some important settings:

* The `tolerations` for workers allow them to be scheduled on nodes without any taints or on nodes with specific taints. However, workers will only be scheduled on GPU nodes because we set `nvidia.com/gpu: 1` in the Pod's resource configurations.
  ```yaml
  # Please add the following taints to the GPU node.
  tolerations:
    - key: "ray.io/node-type"
      operator: "Equal"
      value: "worker"
      effect: "NoSchedule"
  ```
* It includes `diffusers` in `runtime_env` since this package is not included by default in the `ray-ml` image.
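
The comment in the `tolerations` snippet above assumes that the GPU node carries a matching taint. A sketch of how that taint might be applied is shown below; the node name is a placeholder for your actual GPU node (see `kubectl get nodes`):

```sh
kubectl taint nodes <gpu-node-name> ray.io/node-type=worker:NoSchedule
```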

## Step 4: Forward the port of Serve

First get the service name from this command.

```sh
kubectl get services
```

Then, port-forward to the Serve port.

```sh
kubectl port-forward svc/stable-diffusion-serve-svc 8000
```

Note that the RayService's Kubernetes service will be created after the Serve applications are ready and running. This process may take approximately 1 minute after all Pods in the RayCluster are running.

## Step 5: Send a request to the text-to-image model

```sh
# Step 5.1: Download `stable_diffusion_req.py`
curl -LO https://raw.githubusercontent.com/ray-project/serve_config_examples/master/stable_diffusion/stable_diffusion_req.py

# Step 5.2: Set your `prompt` in `stable_diffusion_req.py`.

# Step 5.3: Send a request to the Stable Diffusion model.
python stable_diffusion_req.py
# Check output.png
```

![image](../images/stable_diffusion_example.png)
This document has been moved to the [Ray documentation](https://docs.ray.io/en/master/cluster/kubernetes/examples/stable-diffusion-rayservice.html#kuberay-stable-diffusion-rayservice-example).
70 changes: 1 addition & 69 deletions docs/guidance/text-summarizer-rayservice.md
@@ -1,69 +1 @@
# Serve a text summarizer using RayService

> **Note:** The Python files for the Ray Serve application and its client are in the [ray-project/serve_config_examples](https://github.com/ray-project/serve_config_examples) repo.
## Step 1: Create a Kubernetes cluster with GPUs

Follow [aws-eks-gpu-cluster.md](./aws-eks-gpu-cluster.md) or [gcp-gke-gpu-cluster.md](./gcp-gke-gpu-cluster.md) to create a Kubernetes cluster with 1 CPU node and 1 GPU node.

## Step 2: Install KubeRay operator

Follow [this document](../../helm-chart/kuberay-operator/README.md) to install the latest stable KubeRay operator via Helm repository.
Please note that the YAML file in this example uses `serveConfigV2`, which is supported starting from KubeRay v0.6.0.

## Step 3: Install a RayService

```sh
# path: ray-operator/config/samples/
kubectl apply -f ray-service.text-sumarizer.yaml
```

This RayService configuration contains some important settings:

* The `tolerations` for workers allow them to be scheduled on nodes without any taints or on nodes with specific taints. However, workers will only be scheduled on GPU nodes because we set `nvidia.com/gpu: 1` in the Pod's resource configurations.
  ```yaml
  # Please add the following taints to the GPU node.
  tolerations:
    - key: "ray.io/node-type"
      operator: "Equal"
      value: "worker"
      effect: "NoSchedule"
  ```
## Step 4: Forward the port of Serve

First get the service name from this command.

```sh
kubectl get services
```

Then, port-forward to the Serve port.

```sh
kubectl port-forward svc/text-summarizer-serve-svc 8000
```

Note that the RayService's Kubernetes service will be created after the Serve applications are ready and running. This process may take approximately 1 minute after all Pods in the RayCluster are running.

## Step 5: Send a request to the text_summarizer model

```sh
# Step 5.1: Download `text_summarizer_req.py`
curl -LO https://raw.githubusercontent.com/ray-project/serve_config_examples/master/text_summarizer/text_summarizer_req.py

# Step 5.2: Send a request to the Summarizer model.
python text_summarizer_req.py
# Check printed to console
```

## Step 6: Delete your service

```sh
# path: ray-operator/config/samples/
kubectl delete -f ray-service.text-sumarizer.yaml
```

## Step 7: Uninstall the KubeRay operator

Follow [this document](../../helm-chart/kuberay-operator/README.md) to uninstall the KubeRay operator installed via the Helm repository.
This document has been moved to the [Ray documentation](https://docs.ray.io/en/master/cluster/kubernetes/examples/text-summarizer-rayservice.html).
178 changes: 1 addition & 177 deletions docs/guidance/tls.md
@@ -1,177 +1 @@
# TLS Authentication

Ray can be configured to use TLS on its gRPC channels. This means that
connecting to the Ray head will require an appropriate
set of credentials and also that data exchanged between various processes
(client, head, workers) will be encrypted ([Ray's document](https://docs.ray.io/en/latest/ray-core/configure.html?highlight=tls#tls-authentication)).

This document provides detailed instructions for generating a public-private
key pair and CA certificate for configuring KubeRay.

> Warning: Enabling TLS will cause a performance hit due to the extra
overhead of mutual authentication and encryption. Testing has shown that
this overhead is large for small workloads and becomes relatively smaller
for large workloads. The exact overhead will depend on the nature of your
workload.

# Prerequisites

To fully understand this document, it's highly recommended that you have a
solid understanding of the following concepts:

* private/public key
* CA (certificate authority)
* CSR (certificate signing request)
* self-signed certificate

This [YouTube video](https://youtu.be/T4Df5_cojAs) is a good start.

# TL;DR

> Please note that this document is designed to support KubeRay version 0.5.0 or later. If you are using an older version of KubeRay, some of the instructions or configurations may not apply or may require additional modifications.

> Warning: Please note that the `ray-cluster.tls.yaml` file is intended for demo purposes only. It is crucial that you **do not** store
your CA private key in a Kubernetes Secret in your production environment.

```sh
# Install v0.6.0 KubeRay operator
# `ray-cluster.tls.yaml` will cover from Step 1 to Step 3 (path: kuberay/)
kubectl apply -f ray-operator/config/samples/ray-cluster.tls.yaml

# Jump to Step 4 "Verify TLS authentication" to verify the connection.
```

`ray-cluster.tls.yaml` will create:

* A Kubernetes Secret containing the CA's private key (`ca.key`) and self-signed certificate (`ca.crt`) (**Step 1**)
* A Kubernetes ConfigMap containing the scripts `gencert_head.sh` and `gencert_worker.sh`, which allow Ray Pods to generate private keys
(`tls.key`) and self-signed certificates (`tls.crt`) (**Step 2**)
* A RayCluster with proper TLS environment variables configurations (**Step 3**)

The certificate (`tls.crt`) for a Ray Pod is signed using the CA's private key (`ca.key`). Additionally, all Ray Pods have the CA's certificate (`ca.crt`), which contains the CA's public key and allows them to verify the certificates presented by other Ray Pods.

# Step 1: Generate a private key and self-signed certificate for CA

In this document, a self-signed certificate is used, but users also have the
option to choose a publicly trusted certificate authority (CA) for their TLS
authentication.

```sh
# Step 1-1: Generate a self-signed certificate and a new private key file for CA.
openssl req -x509 \
-sha256 -days 3650 \
-nodes \
-newkey rsa:2048 \
-subj "/CN=*.kuberay.com/C=US/L=San Francisco" \
-keyout ca.key -out ca.crt

# Step 1-2: Check the CA's public key from the self-signed certificate.
openssl x509 -in ca.crt -noout -text

# Step 1-3
# Method 1: Use `cat $FILENAME | base64` to encode `ca.key` and `ca.crt`.
# Then, paste the encoding strings to the Kubernetes Secret in `ray-cluster.tls.yaml`.

# Method 2: Use kubectl to encode the certificate as a Kubernetes Secret automatically.
# (Note: You should comment out the Kubernetes Secret in `ray-cluster.tls.yaml`.)
kubectl create secret generic ca-tls --from-file=ca.key --from-file=ca.crt
```

* `ca.key`: CA's private key
* `ca.crt`: CA's self-signed certificate

This step is optional because the `ca.key` and `ca.crt` files have
already been included in the Kubernetes Secret specified in [ray-cluster.tls.yaml](../../ray-operator/config/samples/ray-cluster.tls.yaml).

# Step 2: Create separate private key and self-signed certificate for Ray Pods

In [ray-cluster.tls.yaml](../../ray-operator/config/samples/ray-cluster.tls.yaml), each Ray
Pod (both head and workers) generates its own private key file (`tls.key`) and self-signed
certificate file (`tls.crt`) in its init container. We generate separate files for each Pod
because worker Pods do not have deterministic DNS names, and we cannot use the same
certificate across different Pods.

In the YAML file, you'll find a ConfigMap named `tls` that contains two shell scripts:
`gencert_head.sh` and `gencert_worker.sh`. These scripts are used to generate the private key
and self-signed certificate files (`tls.key` and `tls.crt`) for the Ray head and worker Pods.
An alternative approach for users is to prebake the shell scripts directly into the docker image that's utilized
by the init containers, rather than relying on a ConfigMap.

Please find below a brief explanation of what happens in each of these scripts:
1. A 2048-bit RSA private key is generated and saved as `/etc/ray/tls/tls.key`.
2. A Certificate Signing Request (CSR) is generated using the private key file (`tls.key`)
and the `csr.conf` configuration file.
3. A self-signed certificate (`tls.crt`) is generated using the private key of the
Certificate Authority (`ca.key`) and the previously generated CSR.

The only difference between `gencert_head.sh` and `gencert_worker.sh` is the `[ alt_names ]`
section in `csr.conf` and `cert.conf`. The worker Pods use the fully qualified domain name
(FQDN) of the head Kubernetes Service to establish a connection with the head Pod.
Therefore, the `[alt_names]` section for the head Pod needs to include the FQDN of the head
Kubernetes Service. The head Pod, in turn, uses `$POD_IP` to communicate with worker Pods.

```sh
# gencert_head.sh
[alt_names]
DNS.1 = localhost
DNS.2 = $FQ_RAY_IP
IP.1 = 127.0.0.1
IP.2 = $POD_IP

# gencert_worker.sh
[alt_names]
DNS.1 = localhost
IP.1 = 127.0.0.1
IP.2 = $POD_IP
```
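
The three steps above correspond roughly to the following `openssl` invocations. This is a sketch, not the exact contents of the ConfigMap scripts; in particular, the output paths and the extension section name (`v3_ext`) are assumptions:

```sh
# 1. Generate a 2048-bit RSA private key.
openssl genrsa -out /etc/ray/tls/tls.key 2048

# 2. Create a certificate signing request (CSR) from the private key using csr.conf.
openssl req -new -key /etc/ray/tls/tls.key -out tls.csr -config csr.conf

# 3. Sign the CSR with the CA's private key to produce the Pod's certificate.
openssl x509 -req -in tls.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
  -out /etc/ray/tls/tls.crt -days 365 -extensions v3_ext -extfile cert.conf
```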

In [Kubernetes networking model](https://github.com/kubernetes/design-proposals-archive/blob/main/network/networking.md#pod-to-pod), the IP that a Pod sees itself as is the same IP that others see it as. That's why Ray Pods can self-register for the certificates.

# Step 3: Configure environment variables for Ray TLS authentication

To enable TLS authentication in your Ray cluster, set the following environment variables:

- `RAY_USE_TLS`: Either 1 or 0 to use/not-use TLS. If this is set to 1 then all of the environment variables below must be set. Default: 0.
- `RAY_TLS_SERVER_CERT`: Location of a certificate file which is presented to other endpoints so as to achieve mutual authentication (i.e. `tls.crt`).
- `RAY_TLS_SERVER_KEY`: Location of a private key file which is the cryptographic means to prove to other endpoints that you are the authorized user of a given certificate (i.e. `tls.key`).
- `RAY_TLS_CA_CERT`: Location of a CA certificate file which allows TLS to decide whether an endpoint’s certificate has been signed by the correct authority (i.e. `ca.crt`).
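
As a sketch, the corresponding container settings in a RayCluster Pod template might look like this, assuming the key, certificate, and CA certificate are mounted under `/etc/ray/tls/` (the mount path is an assumption; the sample YAML may use different locations):

```yaml
env:
  - name: RAY_USE_TLS
    value: "1"
  - name: RAY_TLS_SERVER_CERT
    value: /etc/ray/tls/tls.crt
  - name: RAY_TLS_SERVER_KEY
    value: /etc/ray/tls/tls.key
  - name: RAY_TLS_CA_CERT
    value: /etc/ray/tls/ca.crt
```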

For more information on how to configure Ray with TLS authentication, please refer to [Ray's document](https://docs.ray.io/en/latest/ray-core/configure.html#tls-authentication).

# Step 4: Verify TLS authentication

```sh
# Log in to the worker Pod
kubectl exec -it ${WORKER_POD} -- bash

# Since the head Pod has the certificate of $FQ_RAY_IP, the connection to the worker Pods
# will be established successfully, and the exit code of the ray health-check command
# should be 0.
ray health-check --address $FQ_RAY_IP:6379
echo $? # 0

# Since the head Pod has the certificate of $RAY_IP, the connection will fail and an error
# message similar to the following will be displayed: "Peer name raycluster-tls-head-svc is
# not in peer certificate".
ray health-check --address $RAY_IP:6379

# If you add `DNS.3 = $RAY_IP` to the [alt_names] section in `gencert_head.sh`,
# the head Pod will generate the certificate of $RAY_IP.
#
# For KubeRay versions prior to 0.5.0, this step is necessary because Ray workers in earlier
# versions use $RAY_IP to connect with Ray head.
```

# Step 5: Connect to the cluster with the Ray client using TLS for interactive development

Please refer to [interactive development](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/ray-client.html#ray-client-interactive-development) and [TLS authentication](https://docs.ray.io/en/latest/ray-core/configure.html?highlight=tls#tls-authentication) for more details.

To connect to the Ray cluster from a Pod:
```sh
# Create a client pod and connect to cluster
kubectl apply -f ray-operator/config/samples/ray-pod.tls.yaml
kubectl logs ray-client-tls
```
Verify that the output is similar to:
```
{'CPU': 2.0, 'node:10.254.20.20': 1.0, 'object_store_memory': 771128524.0, 'memory': 3000000000.0, 'node:10.254.16.25': 1.0}
```
This document has been moved to the [Ray documentation](https://docs.ray.io/en/master/cluster/kubernetes/user-guides/tls.html#kuberay-tls).
325 changes: 1 addition & 324 deletions docs/guidance/volcano-integration.md
@@ -1,324 +1 @@
# KubeRay integration with Volcano

[Volcano](https://github.com/volcano-sh/volcano) is a batch scheduling system built on Kubernetes. It provides a suite of mechanisms (gang scheduling, job queues, fair scheduling policies) currently missing from Kubernetes that are commonly required by many classes of batch and elastic workloads. KubeRay's Volcano integration enables more efficient scheduling of Ray pods in multi-tenant Kubernetes environments.

Note that this is a new feature. Feedback and contributions are welcome.

## Setup

### Step 1: Create a Kubernetes cluster with KinD
```shell
kind create cluster
```

### Step 2: Install Volcano

Volcano needs to be successfully installed in your Kubernetes cluster before enabling Volcano integration with KubeRay.
Refer to the [Quick Start Guide](https://github.com/volcano-sh/volcano#quick-start-guide) for Volcano installation instructions.

### Step 3: Install KubeRay Operator with Batch Scheduling

Deploy the KubeRay Operator with the `--enable-batch-scheduler` flag to enable Volcano batch scheduling support.

When installing KubeRay Operator via Helm, you should either set `batchScheduler.enabled` to `true` in your
[`values.yaml`](https://github.com/ray-project/kuberay/blob/753dc05dbed5f6fe61db3a43b34a1b350f26324c/helm-chart/kuberay-operator/values.yaml#L48)
file:
```yaml
# values.yaml file
batchScheduler:
  enabled: true
```

**or** pass the `--set batchScheduler.enabled=true` flag on the command line:
```shell
# Install Helm chart with --enable-batch-scheduler flag set to true
helm install kuberay-operator kuberay/kuberay-operator --version ${KUBERAY_VERSION} --set batchScheduler.enabled=true
```

Follow the [KubeRay installation documentation](https://github.com/ray-project/kuberay/blob/master/helm-chart/kuberay-operator/README.md) to install the latest stable KubeRay operator.

### Step 4: Install a RayCluster with Volcano scheduler

The RayCluster custom resource must include the label `ray.io/scheduler-name: volcano` so that its Pods are submitted to Volcano for scheduling.

```shell
# Path: kuberay/ray-operator/config/samples
# Includes label `ray.io/scheduler-name: volcano` in the metadata.labels
kubectl apply -f ray-cluster.volcano-scheduler.yaml

# Check RayCluster
kubectl get pod -l ray.io/cluster=test-cluster-0
# NAME READY STATUS RESTARTS AGE
# test-cluster-0-head-jj9bg 1/1 Running 0 36s
```
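
For reference, the relevant part of the RayCluster manifest looks roughly like the following sketch; see [ray-cluster.volcano-scheduler.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.volcano-scheduler.yaml) for the full sample:

```yaml
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: test-cluster-0
  labels:
    ray.io/scheduler-name: volcano
# spec (head and worker group specs) elided for brevity
```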

In addition, the following labels can also be provided in the RayCluster metadata:

- `ray.io/priority-class-name`: the cluster priority class as defined by Kubernetes [here](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#priorityclass).
  - This label will only work after the creation of a `PriorityClass` resource
  - ```yaml
    labels:
      ray.io/scheduler-name: volcano
      ray.io/priority-class-name: <replace with correct PriorityClass resource name>
    ```
- `volcano.sh/queue-name`: the Volcano [queue](https://volcano.sh/en/docs/queue/) name the cluster will be submitted to.
  - This label will only work after the creation of a `Queue` resource
  - ```yaml
    labels:
      ray.io/scheduler-name: volcano
      volcano.sh/queue-name: <replace with correct Queue resource name>
    ```

If autoscaling is enabled, `minReplicas` will be used for gang scheduling, otherwise the desired `replicas` will be used.

### Step 5: Use Volcano for batch scheduling

If you need some guidance, check out the available [examples](https://github.com/volcano-sh/volcano/tree/master/example).

## Example

Before going through the example, remove any running Ray clusters to ensure the example below runs successfully.
```shell
kubectl delete raycluster --all
```

### Gang scheduling

In this example, we'll walk through how gang scheduling works with Volcano and KubeRay.
First, let's create a queue with a capacity of 4 CPUs and 6Gi of RAM:

```shell
kubectl create -f - <<EOF
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: kuberay-test-queue
spec:
  weight: 1
  capability:
    cpu: 4
    memory: 6Gi
EOF
```

The **weight** in the definition above indicates the relative weight of a queue in cluster resource division. This is useful in cases where the total **capability** of all the queues in your cluster exceeds the total available resources, forcing the queues to share among themselves. Queues with higher weight will be allocated a proportionally larger share of the total resources; for example, two queues with weights 1 and 3 would be entitled to roughly 25% and 75% of the contended resources, respectively.

The **capability** is a hard constraint on the maximum resources the queue will support at any given time. It can be updated as needed to allow more or fewer workloads to run at a time.

Next we'll create a RayCluster with a head node (1 CPU + 2Gi of RAM) and two workers (1 CPU + 1Gi of RAM each), for a total of 3 CPU and 4Gi of RAM:
```shell
# Path: kuberay/ray-operator/config/samples
# Includes labels `ray.io/scheduler-name: volcano` and `volcano.sh/queue-name: kuberay-test-queue` in the metadata.labels
kubectl apply -f ray-cluster.volcano-scheduler-queue.yaml
```
Because our queue has a capacity of 4 CPU and 6Gi of RAM, this resource should schedule successfully without any issues. We can verify this by checking the status of our cluster's Volcano PodGroup to see that the phase is `Running` and the last status is `Scheduled`:

```shell
kubectl get podgroup ray-test-cluster-0-pg -o yaml
# apiVersion: scheduling.volcano.sh/v1beta1
# kind: PodGroup
# metadata:
# creationTimestamp: "2022-12-01T04:43:30Z"
# generation: 2
# name: ray-test-cluster-0-pg
# namespace: test
# ownerReferences:
# - apiVersion: ray.io/v1alpha1
# blockOwnerDeletion: true
# controller: true
# kind: RayCluster
# name: test-cluster-0
# uid: 7979b169-f0b0-42b7-8031-daef522d25cf
# resourceVersion: "4427347"
# uid: 78902d3d-b490-47eb-ba12-d6f8b721a579
# spec:
# minMember: 3
# minResources:
# cpu: "3"
# memory: 4Gi
# queue: kuberay-test-queue
# status:
# conditions:
# - lastTransitionTime: "2022-12-01T04:43:31Z"
# reason: tasks in gang are ready to be scheduled
# status: "True"
# transitionID: f89f3062-ebd7-486b-8763-18ccdba1d585
# type: Scheduled
# phase: Running
```

And checking the status of our queue to see that we have 1 running job:

```shell
kubectl get queue kuberay-test-queue -o yaml
# apiVersion: scheduling.volcano.sh/v1beta1
# kind: Queue
# metadata:
# creationTimestamp: "2022-12-01T04:43:21Z"
# generation: 1
# name: kuberay-test-queue
# resourceVersion: "4427348"
# uid: a6c4f9df-d58c-4da8-8a58-e01c93eca45a
# spec:
# capability:
# cpu: 4
# memory: 6Gi
# reclaimable: true
# weight: 1
# status:
# reservation: {}
# running: 1
# state: Open
```

Next, we'll add an additional RayCluster with the same configuration of head / worker nodes, but a different name:
```shell
# Path: kuberay/ray-operator/config/samples
# Includes labels `ray.io/scheduler-name: volcano` and `volcano.sh/queue-name: kuberay-test-queue` in the metadata.labels
# Replaces the name to test-cluster-1
sed 's/test-cluster-0/test-cluster-1/' ray-cluster.volcano-scheduler-queue.yaml | kubectl apply -f-
```
Now check the status of its PodGroup to see that its phase is `Pending` and the last status is `Unschedulable`:
```shell
kubectl get podgroup ray-test-cluster-1-pg -o yaml
# apiVersion: scheduling.volcano.sh/v1beta1
# kind: PodGroup
# metadata:
# creationTimestamp: "2022-12-01T04:48:18Z"
# generation: 2
# name: ray-test-cluster-1-pg
# namespace: test
# ownerReferences:
# - apiVersion: ray.io/v1alpha1
# blockOwnerDeletion: true
# controller: true
# kind: RayCluster
# name: test-cluster-1
# uid: b3cf83dc-ef3a-4bb1-9c42-7d2a39c53358
# resourceVersion: "4427976"
# uid: 9087dd08-8f48-4592-a62e-21e9345b0872
# spec:
# minMember: 3
# minResources:
# cpu: "3"
# memory: 4Gi
# queue: kuberay-test-queue
# status:
# conditions:
# - lastTransitionTime: "2022-12-01T04:48:19Z"
# message: '3/3 tasks in gang unschedulable: pod group is not ready, 3 Pending,
# 3 minAvailable; Pending: 3 Undetermined'
# reason: NotEnoughResources
# status: "True"
# transitionID: 3956b64f-fc52-4779-831e-d379648eecfc
# type: Unschedulable
# phase: Pending
```

Because our new cluster requires more CPU and RAM than our queue will allow, even though we could fit one of the pods with the remaining 1 CPU and 2Gi of RAM, none of the cluster's pods will be placed until there is enough room for all the pods. Without using Volcano for gang scheduling in this way, one of the pods would ordinarily be placed, leading to the cluster being partially allocated, and some jobs (like [Horovod](https://github.com/horovod/horovod) training) getting stuck waiting for resources to become available.

We can see the effect this has on scheduling the pods for our new RayCluster, which are listed as `Pending`:

```shell
kubectl get pods

# NAME READY STATUS RESTARTS AGE
# test-cluster-0-worker-worker-ddfbz 1/1 Running 0 7m
# test-cluster-0-head-vst5j 1/1 Running 0 7m
# test-cluster-0-worker-worker-57pc7 1/1 Running 0 6m59s
# test-cluster-1-worker-worker-6tzf7 0/1 Pending 0 2m12s
# test-cluster-1-head-6668q 0/1 Pending 0 2m12s
# test-cluster-1-worker-worker-n5g8k 0/1 Pending 0 2m12s
```

If we dig into the pod details, we'll see that this is indeed because Volcano cannot schedule the gang:

```shell
kubectl describe pod test-cluster-1-head-6668q | tail -n 3

# Type Reason Age From Message
# ---- ------ ---- ---- -------
# Warning FailedScheduling 4m5s volcano 3/3 tasks in gang unschedulable: pod group is not ready, 3 Pending, 3 minAvailable; Pending: 3 Undetermined
```

Let's go ahead and delete the first RayCluster to clear up space in the queue:

```shell
kubectl delete raycluster test-cluster-0
```

The PodGroup for the second cluster has moved to the `Running` state, as there are now enough resources available to schedule the entire set of pods:

```shell
kubectl get podgroup ray-test-cluster-1-pg -o yaml

# apiVersion: scheduling.volcano.sh/v1beta1
# kind: PodGroup
# metadata:
# creationTimestamp: "2022-12-01T04:48:18Z"
# generation: 9
# name: ray-test-cluster-1-pg
# namespace: test
# ownerReferences:
# - apiVersion: ray.io/v1alpha1
# blockOwnerDeletion: true
# controller: true
# kind: RayCluster
# name: test-cluster-1
# uid: b3cf83dc-ef3a-4bb1-9c42-7d2a39c53358
# resourceVersion: "4428864"
# uid: 9087dd08-8f48-4592-a62e-21e9345b0872
# spec:
# minMember: 3
# minResources:
# cpu: "3"
# memory: 4Gi
# queue: kuberay-test-queue
# status:
# conditions:
# - lastTransitionTime: "2022-12-01T04:54:04Z"
# message: '3/3 tasks in gang unschedulable: pod group is not ready, 3 Pending,
# 3 minAvailable; Pending: 3 Undetermined'
# reason: NotEnoughResources
# status: "True"
# transitionID: db90bbf0-6845-441b-8992-d0e85f78db77
# type: Unschedulable
# - lastTransitionTime: "2022-12-01T04:55:10Z"
# reason: tasks in gang are ready to be scheduled
# status: "True"
# transitionID: 72bbf1b3-d501-4528-a59d-479504f3eaf5
# type: Scheduled
# phase: Running
# running: 3
```

Checking the pods again, we see that the second cluster is now up and running:

```shell
kubectl get pods

# NAME READY STATUS RESTARTS AGE
# test-cluster-1-worker-worker-n5g8k 1/1 Running 0 9m4s
# test-cluster-1-head-6668q 1/1 Running 0 9m4s
# test-cluster-1-worker-worker-6tzf7 1/1 Running 0 9m4s
```

Finally, we'll clean up the remaining cluster and queue:

```shell
kubectl delete raycluster test-cluster-1
kubectl delete queue kuberay-test-queue
```

## Questions

Reach out to @tgaddair for questions regarding usage of this integration.
This document has been moved to the [Ray documentation](https://docs.ray.io/en/master/cluster/kubernetes/k8s-ecosystem/volcano.html).
