Commit 314494c
[release] Redirect users to Ray website
1 parent 9794249
Showing 20 changed files with 20 additions and 2,903 deletions.
@@ -1,60 +1 @@
# Frequently Asked Questions

Welcome to the Frequently Asked Questions page for KubeRay. This document addresses common inquiries.
If you don't find an answer to your question here, please don't hesitate to connect with us via our [community channels](https://github.com/ray-project/kuberay#getting-involved).

# Contents
- [Worker init container](#worker-init-container)
- [Cluster domain](#cluster-domain)
- [RayService](#rayservice)

## Worker init container
The KubeRay operator will inject a default [init container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/) into every worker Pod.
This init container is responsible for waiting until the Global Control Service (GCS) on the head Pod is ready before establishing a connection to the head.
The init container will use `ray health-check` to check the GCS server status continuously.
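
As a rough sketch (the container name, image, and environment variable names below are illustrative, not the literal spec the operator generates), the injected init container behaves roughly like this:

```yaml
# Hypothetical sketch of the injected init container. In practice the operator
# reuses the worker group's Ray image and environment for this container.
initContainers:
  - name: wait-gcs-ready          # illustrative name
    image: rayproject/ray:2.5.0   # in practice, the worker's Ray image
    command: ["/bin/bash", "-c"]
    args:
      - |
        # Poll the GCS on the head service until it reports healthy.
        until ray health-check --address $RAY_HEAD_SERVICE:$RAY_PORT > /dev/null 2>&1; do
          echo "Waiting for GCS to be ready..."
          sleep 1
        done
```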

The default worker init container may not work for all use cases, or users may want to customize the init container.

### 1. Init container troubleshooting
Some common causes for the worker init container to get stuck in `Init:0/1` status are:

* The GCS server process has failed in the head Pod. Please inspect the log directory `/tmp/ray/session_latest/logs/` in the head Pod for errors related to the GCS server.
* The `ray` executable is not included in the `$PATH` for the image, so the init container will fail to run `ray health-check`.
* The `CLUSTER_DOMAIN` environment variable is not set correctly. See the section [cluster domain](#cluster-domain) for more details.
* The worker init container shares the same ***ImagePullPolicy***, ***SecurityContext***, ***Env***, ***VolumeMounts***, and ***Resources*** as the worker Pod template. Sharing these settings can cause a deadlock. See [#1130](https://github.com/ray-project/kuberay/issues/1130) for more details.

If the init container remains stuck in `Init:0/1` status for 2 minutes, we will stop redirecting the output messages to `/dev/null` and instead print them to the worker Pod logs.
To troubleshoot further, you can inspect the logs using `kubectl logs`.
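
For example (the init container name below is an assumption; use `kubectl describe pod` to find the actual name):

```sh
# Show init container status and names on the stuck worker Pod.
kubectl describe pod $WORKER_POD

# Inspect the init container's output (container name is an assumption).
kubectl logs $WORKER_POD -c wait-gcs-ready

# Check the GCS-related logs on the head Pod as well.
kubectl exec -it $HEAD_POD -- ls /tmp/ray/session_latest/logs/
```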

### 2. Disable the init container injection

If you want to customize the worker init container, you can disable the init container injection and add your own.
To disable the injection, set the `ENABLE_INIT_CONTAINER_INJECTION` environment variable in the KubeRay operator to `false` (applicable from KubeRay v0.5.2).
Please refer to [#1069](https://github.com/ray-project/kuberay/pull/1069) and the [KubeRay Helm chart](https://github.com/ray-project/kuberay/blob/ddb5e528c29c2e1fb80994f05b1bd162ecbaf9f2/helm-chart/kuberay-operator/values.yaml#L83-L87) for instructions on how to set the environment variable.
Once disabled, you can add your custom init container to the worker Pod template.
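
For example, a minimal sketch of the Helm values change, assuming the chart exposes an `env` list for the operator Deployment as the linked `values.yaml` section suggests:

```yaml
# helm-chart/kuberay-operator/values.yaml (sketch)
env:
  - name: ENABLE_INIT_CONTAINER_INJECTION
    value: "false"
```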

## Cluster domain

In KubeRay, we use Fully Qualified Domain Names (FQDNs) to establish connections between workers and the head.
The FQDN of the head service is `${HEAD_SVC}.${NAMESPACE}.svc.${CLUSTER_DOMAIN}`.
The default [cluster domain](https://kubernetes.io/docs/tasks/administer-cluster/dns-custom-nameservers/#introduction) is `cluster.local`, which works for most Kubernetes clusters.
However, some clusters may use a different cluster domain.
You can check the cluster domain of your Kubernetes cluster by inspecting `/etc/resolv.conf` in a Pod.
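
For example:

```sh
# Print the DNS configuration of any running Pod; the "search" line typically
# ends with the cluster domain (cluster.local by default).
kubectl exec -it $ANY_POD -- cat /etc/resolv.conf
# search default.svc.cluster.local svc.cluster.local cluster.local ...
```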

To set a custom cluster domain, adjust the `CLUSTER_DOMAIN` environment variable in the KubeRay operator.
Helm chart users can make this modification [here](https://github.com/ray-project/kuberay/blob/ddb5e528c29c2e1fb80994f05b1bd162ecbaf9f2/helm-chart/kuberay-operator/values.yaml#L88-L91).
For more information, see [#951](https://github.com/ray-project/kuberay/pull/951) and [#938](https://github.com/ray-project/kuberay/pull/938).
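
A minimal sketch of the Helm values change, assuming your cluster domain is `example.com` and that the chart exposes an `env` list for the operator Deployment:

```yaml
# helm-chart/kuberay-operator/values.yaml (sketch)
env:
  - name: CLUSTER_DOMAIN
    value: "example.com"
```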

## RayService

RayService is a Custom Resource Definition (CRD) designed for Ray Serve. In KubeRay, creating a RayService will first create a RayCluster and then
create Ray Serve applications once the RayCluster is ready. If the issue pertains to the data plane, specifically your Ray Serve scripts
or Ray Serve configurations (`serveConfigV2`), troubleshooting may be challenging. See [rayservice-troubleshooting](rayservice-troubleshooting.md) for more details.
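
For orientation, here is a heavily abbreviated RayService skeleton; the metadata and application fields are illustrative, and `rayClusterConfig` is elided (see the KubeRay sample manifests for complete examples):

```yaml
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: my-rayservice            # illustrative name
spec:
  serveConfigV2: |               # Ray Serve applications (the data plane)
    applications:
      - name: my-app
        import_path: my_module:app
  rayClusterConfig: {}           # RayCluster spec (head and worker groups); elided here
```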

## Questions

### Why are my changes to RayCluster/RayJob CR not taking effect?

Currently, only modifications to the `replicas` field in `RayCluster/RayJob` CR are supported. Changes to other fields may not take effect or could lead to unexpected results.
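
For example, a supported change is updating a worker group's `replicas` in place; the cluster name and worker group index below are illustrative:

```sh
# Scale the first worker group to 3 replicas without touching other fields.
kubectl patch raycluster my-raycluster --type json \
  -p '[{"op": "replace", "path": "/spec/workerGroupSpecs/0/replicas", "value": 3}]'
```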
This document has been moved to the [Ray documentation](https://docs.ray.io/en/master/cluster/kubernetes/troubleshooting/troubleshooting.html#kuberay-troubleshootin-guides).
@@ -1,111 +1 @@
## Autoscaler (beta)

Ray Autoscaler integration has been beta since KubeRay 0.3.0 and Ray 2.0.0.
While autoscaling functionality is stable, the details of autoscaler behavior and configuration may change in future releases.

See the [official Ray documentation](https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/configuring-autoscaling.html) for more information about Ray autoscaling on Kubernetes.

### Prerequisite

* Follow this [document](https://github.com/ray-project/kuberay/blob/master/helm-chart/kuberay-operator/README.md) to install the latest stable KubeRay operator via the Helm repository.

### Deploy a cluster with autoscaling enabled

Next, to deploy a sample autoscaling Ray cluster, run:
```
kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/release-0.5/ray-operator/config/samples/ray-cluster.autoscaler.yaml
```

See the above config file for details on the autoscaling configuration.
!!! note

    Ray container resource requests and limits in the example configuration above are too small
    to be used in production. For typical use cases, you should use large Ray pods. If possible,
    each Ray pod should be sized to take up its entire K8s node. We don't recommend
    allocating less than 8 gigabytes of memory for Ray containers running in production.
    For an autoscaling configuration more suitable for production, see
    [ray-cluster.autoscaler.large.yaml](https://raw.githubusercontent.com/ray-project/kuberay/release-0.5/ray-operator/config/samples/ray-cluster.autoscaler.large.yaml).

The output of `kubectl get pods` should show a Ray head pod with two containers,
the Ray container and the autoscaler container.
You should also see a Ray worker pod with a single Ray container.

```
$ kubectl get pods
NAME                                             READY   STATUS    RESTARTS   AGE
raycluster-autoscaler-head-mgwwk                 2/2     Running   0          4m41s
raycluster-autoscaler-worker-small-group-fg4fv   1/1     Running   0          4m41s
```

Check the autoscaler container's logs to confirm that the autoscaler is healthy.
Here's an example of logs from a healthy autoscaler.
```
kubectl logs -f raycluster-autoscaler-head-mgwwk autoscaler
2022-03-10 07:51:22,616 INFO monitor.py:226 -- Starting autoscaler metrics server on port 44217
2022-03-10 07:51:22,621 INFO monitor.py:243 -- Monitor: Started
2022-03-10 07:51:22,824 INFO node_provider.py:143 -- Creating KuberayNodeProvider.
2022-03-10 07:51:22,825 INFO autoscaler.py:282 -- StandardAutoscaler: {'provider': {'type': 'kuberay', 'namespace': 'default', 'disable_node_updaters': True, 'disable_launch_config_check': True}, 'cluster_name': 'raycluster-autoscaler', 'head_node_type': 'head-group', 'available_node_types': {'head-group': {'min_workers': 0, 'max_workers': 0, 'node_config': {}, 'resources': {'CPU': 1}}, 'small-group': {'min_workers': 1, 'max_workers': 300, 'node_config': {}, 'resources': {'CPU': 1}}}, 'max_workers': 300, 'idle_timeout_minutes': 5, 'upscaling_speed': 1, 'file_mounts': {}, 'cluster_synced_files': [], 'file_mounts_sync_continuously': False, 'initialization_commands': [], 'setup_commands': [], 'head_setup_commands': [], 'worker_setup_commands': [], 'head_start_ray_commands': [], 'worker_start_ray_commands': [], 'auth': {}, 'head_node': {}, 'worker_nodes': {}}
2022-03-10 07:51:23,027 INFO autoscaler.py:327 --
======== Autoscaler status: 2022-03-10 07:51:23.027271 ========
Node status
---------------------------------------------------------------
Healthy:
 1 head-group
Pending:
 (no pending nodes)
Recent failures:
 (no failures)
Resources
---------------------------------------------------------------
Usage:
 0.0/1.0 CPU
 0.00/0.931 GiB memory
 0.00/0.200 GiB object_store_memory
Demands:
 (no resource demands)
```

#### Notes

1. To enable autoscaling, set your RayCluster CR's `spec.enableInTreeAutoscaling` field to `true`.
   The operator will then automatically inject a preconfigured autoscaler container into the head pod.
   The service account, role, and role binding needed by the autoscaler will be created by the operator out of the box.
   The operator will also configure an empty-dir logging volume for the Ray head pod. The volume will be mounted into the Ray and
   autoscaler containers; this is necessary to support the event logging introduced in [Ray PR #13434](https://github.com/ray-project/ray/pull/13434).

   ```
   spec:
     enableInTreeAutoscaling: true
   ```

2. If your RayCluster CR's `spec.rayVersion` field is at least `2.0.0`, the autoscaler container will use the same image as the Ray container.
   For Ray versions older than 2.0.0, the image `rayproject/ray:2.0.0` will be used to run the autoscaler.

3. Autoscaling functionality is supported only with Ray versions at least as new as 1.11.0. Autoscaler support
   is beta as of Ray 2.0.0 and KubeRay 0.3.0; while autoscaling functionality is stable, the details of autoscaler behavior and configuration may change in future releases.

### Test autoscaling

Let's now try out the autoscaler. Run the following commands to scale up the cluster:
```
export HEAD_POD=$(kubectl get pods -o custom-columns=POD:metadata.name | grep raycluster-autoscaler-head)
kubectl exec $HEAD_POD -it -c ray-head -- python -c "import ray;ray.init();ray.autoscaler.sdk.request_resources(num_cpus=4)"
```
You should then see two extra Ray nodes (pods) scale up to satisfy the 4 CPU demand.
```
$ kubectl get pods
NAME                                             READY   STATUS    RESTARTS   AGE
raycluster-autoscaler-head-mgwwk                 2/2     Running   0          4m41s
raycluster-autoscaler-worker-small-group-4d255   1/1     Running   0          40s
raycluster-autoscaler-worker-small-group-fg4fv   1/1     Running   0          4m41s
raycluster-autoscaler-worker-small-group-qzhvg   1/1     Running   0          40s
```
This document has been moved to the [Ray documentation](https://docs.ray.io/en/master/cluster/kubernetes/user-guides/configuring-autoscaling.html#kuberay-autoscaling).
@@ -1,70 +1 @@
# Start Amazon EKS Cluster with GPUs for KubeRay

## Step 1: Create a Kubernetes cluster on Amazon EKS

Follow the first two steps in [this AWS documentation](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-console.html#)
to: (1) create your Amazon EKS cluster and (2) configure your computer to communicate with your cluster.

## Step 2: Create node groups for the Amazon EKS cluster

Follow "Step 3: Create nodes" in [this AWS documentation](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-console.html#) to create node groups. The following section provides more detailed information.

### Create a CPU node group

Typically, avoid running GPU workloads on the Ray head. Create a CPU node group for all Pods except Ray GPU
workers, such as the KubeRay operator, Ray head, and CoreDNS Pods.

Here's a common configuration that works for most KubeRay examples in the docs (an illustrative `eksctl` command follows the list):
* Instance type: [**m5.xlarge**](https://aws.amazon.com/ec2/instance-types/m5/) (4 vCPU; 16 GB RAM)
* Disk size: 256 GB
* Desired size: 1, Min size: 0, Max size: 1
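
If you prefer the CLI to the console, a rough `eksctl` sketch of a matching node group looks like this (the cluster and node group names are placeholders, not prescribed by the AWS guide):

```sh
# Sketch: create a CPU node group matching the configuration above.
eksctl create nodegroup \
  --cluster ${YOUR_EKS_NAME} \
  --name cpu-node-group \
  --node-type m5.xlarge \
  --node-volume-size 256 \
  --nodes 1 --nodes-min 0 --nodes-max 1
```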

### Create a GPU node group

Create a GPU node group for Ray GPU workers.

1. Here's a common configuration that works for most KubeRay examples in the docs:
   * AMI type: Bottlerocket NVIDIA (BOTTLEROCKET_x86_64_NVIDIA)
   * Instance type: [**g5.xlarge**](https://aws.amazon.com/ec2/instance-types/g5/) (1 GPU; 24 GB GPU Memory; 4 vCPUs; 16 GB RAM)
   * Disk size: 1024 GB
   * Desired size: 1, Min size: 0, Max size: 1

   > **Note:** If you encounter permission issues with `kubectl`, follow "Step 2: Configure your computer to communicate with your cluster"
   > in the [AWS documentation](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-console.html#).

2. Install the NVIDIA device plugin. Note: you don't need this step if you used the `BOTTLEROCKET_x86_64_NVIDIA` AMI in the step above.
   * Install the NVIDIA device plugin DaemonSet to run GPU-enabled containers in your Amazon EKS cluster. You can refer to the [Amazon EKS optimized accelerated Amazon Linux AMIs](https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html#gpu-ami)
     or the [NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin) repository for more details.
   * If the GPU nodes have taints, add `tolerations` to `nvidia-device-plugin.yml` to enable the DaemonSet to schedule Pods on the GPU nodes.

   ```sh
   # Install the DaemonSet
   kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.9.0/nvidia-device-plugin.yml

   # Verify that your nodes have allocatable GPUs. If the GPU node fails to detect GPUs,
   # please verify whether the DaemonSet schedules the Pod on the GPU node.
   kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

   # Example output:
   # NAME                                GPU
   # ip-....us-west-2.compute.internal   4
   # ip-....us-west-2.compute.internal   <none>
   ```

3. Add a Kubernetes taint to prevent scheduling CPU Pods on this GPU node group. For KubeRay examples, add the following taint to the GPU nodes: `Key: ray.io/node-type, Value: worker, Effect: NoSchedule`, and include the corresponding `tolerations` for GPU Ray worker Pods (see the example below).
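
   The corresponding toleration on the GPU Ray worker Pods typically looks like this:

   ```yaml
   tolerations:
   - key: ray.io/node-type
     operator: Equal
     value: worker
     effect: NoSchedule
   ```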

> **Warning:** GPU nodes are extremely expensive. Please remember to delete the cluster if you no longer need it.

## Step 3: Verify the node groups

> **Note:** If you encounter permission issues with `eksctl`, navigate to your AWS account's webpage and copy the
> credential environment variables, including `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN`,
> from the "Command line or programmatic access" page.

```sh
eksctl get nodegroup --cluster ${YOUR_EKS_NAME}

# CLUSTER           NODEGROUP       STATUS  CREATED               MIN SIZE  MAX SIZE  DESIRED CAPACITY  INSTANCE TYPE  IMAGE ID                    ASG NAME                TYPE
# ${YOUR_EKS_NAME}  cpu-node-group  ACTIVE  2023-06-05T21:31:49Z  0         1         1                 m5.xlarge      AL2_x86_64                  eks-cpu-node-group-...  managed
# ${YOUR_EKS_NAME}  gpu-node-group  ACTIVE  2023-06-05T22:01:44Z  0         1         1                 g5.12xlarge    BOTTLEROCKET_x86_64_NVIDIA  eks-gpu-node-group-...  managed
```
This document has been moved to the [Ray documentation](https://docs.ray.io/en/master/cluster/kubernetes/user-guides/k8s-cluster-setup.html#kuberay-k8s-setup).
@@ -1,74 +1 @@
# Start Google Cloud GKE Cluster with GPUs for KubeRay

## Step 1: Create a Kubernetes cluster on GKE

Run this command and all following commands on your local machine or on the [Google Cloud Shell](https://cloud.google.com/shell). If running from your local machine, you will need to install the [Google Cloud SDK](https://cloud.google.com/sdk/docs/install). The following command creates a Kubernetes cluster named `kuberay-gpu-cluster` with 1 CPU node in the `us-west1-b` zone. In this example, we use the `e2-standard-4` machine type, which has 4 vCPUs and 16 GB RAM.

```sh
gcloud container clusters create kuberay-gpu-cluster \
    --num-nodes=1 --min-nodes 0 --max-nodes 1 --enable-autoscaling \
    --zone=us-west1-b --machine-type e2-standard-4
```

> Note: You can also create a cluster from the [Google Cloud Console](https://console.cloud.google.com/kubernetes/list).

## Step 2: Create a GPU node pool

Run the following command to create a GPU node pool for Ray GPU workers.
(You can also create it from the Google Cloud Console; see the [GKE documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/node-taints#create_a_node_pool_with_node_taints) for more details.)

```sh
gcloud container node-pools create gpu-node-pool \
  --accelerator type=nvidia-l4-vws,count=1 \
  --zone us-west1-b \
  --cluster kuberay-gpu-cluster \
  --num-nodes 1 \
  --min-nodes 0 \
  --max-nodes 1 \
  --enable-autoscaling \
  --machine-type g2-standard-4 \
  --node-taints=ray.io/node-type=worker:NoSchedule
```

The `--accelerator` flag specifies the type and number of GPUs for each node in the node pool. In this example, we use the [NVIDIA L4](https://cloud.google.com/compute/docs/gpus#l4-gpus) GPU. The machine type `g2-standard-4` has 1 GPU, 24 GB GPU Memory, 4 vCPUs and 16 GB RAM.

The taint `ray.io/node-type=worker:NoSchedule` prevents CPU-only Pods, such as the KubeRay operator, Ray head, and CoreDNS Pods, from being scheduled on this GPU node pool. GPUs are expensive, so we reserve this node pool for Ray GPU workers only.

Concretely, any Pod that does not have the following toleration will not be scheduled on this GPU node pool:

```yaml
tolerations:
- key: ray.io/node-type
  operator: Equal
  value: worker
  effect: NoSchedule
```

For more on taints and tolerations, see the [Kubernetes documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/).

## Step 3: Configure `kubectl` to connect to the cluster

Run the following command to download Google Cloud credentials and configure the Kubernetes CLI to use them.

```sh
gcloud container clusters get-credentials kuberay-gpu-cluster --zone us-west1-b
```

For more details, see the [GKE documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl).

## Step 4: Install NVIDIA GPU device drivers

This step is required for GPU support on GKE. See the [GKE documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers) for more details.

```sh
# Install NVIDIA GPU device driver
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml

# Verify that your nodes have allocatable GPUs
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

# Example output:
# NAME                                          GPU
# gke-kuberay-gpu-cluster-gpu-node-pool-xxxxx   1
# gke-kuberay-gpu-cluster-default-pool-xxxxx    <none>
```
This document has been moved to the [Ray documentation](https://docs.ray.io/en/master/cluster/kubernetes/user-guides/k8s-cluster-setup.html#kuberay-k8s-setup).