From 1c336defe7cc47158b1d8e2e491e492e62349ccf Mon Sep 17 00:00:00 2001
From: Jiaxin Shan
Date: Sat, 23 Jul 2022 21:48:13 -0700
Subject: [PATCH 1/2] Add RayJob docs and development docs

---
 README.md                       |  50 ++-----
 docs/deploy/installation.md     |  14 ++
 docs/development/development.md |  57 ++++++++
 .../README.md => development/release.md} | 0
 docs/guidance/gcs-ha.md         |   4 +-
 docs/guidance/observability.md  |  14 ++
 docs/guidance/rayjob.md         | 129 ++++++++++++++++++
 docs/index.md                   |   2 +-
 mkdocs.yml                      |   5 +-
 9 files changed, 231 insertions(+), 44 deletions(-)
 create mode 100644 docs/development/development.md
 rename docs/{release/README.md => development/release.md} (100%)
 create mode 100644 docs/guidance/observability.md
 create mode 100644 docs/guidance/rayjob.md

diff --git a/README.md b/README.md
index e18ae156af3..6b20cc9954f 100644
--- a/README.md
+++ b/README.md
@@ -3,15 +3,13 @@
[![Build Status](https://github.com/ray-project/kuberay/workflows/Go-build-and-test/badge.svg)](https://github.com/ray-project/kuberay/actions)
[![Go Report Card](https://goreportcard.com/badge/github.com/ray-project/kuberay)](https://goreportcard.com/report/github.com/ray-project/kuberay)

-KubeRay is an open source toolkit to run Ray applications on Kubernetes.
-
-KubeRay provides several tools to improve running and managing Ray's experience on Kubernetes.
+KubeRay is an open source toolkit to run Ray applications on Kubernetes. It provides several tools to improve running and managing Ray's experience on Kubernetes.

- Ray Operator
- Backend services to create/delete cluster resources
- Kubectl plugin/CLI to operate CRD objects
+- Native Job and Serving integration with Clusters
- Data Scientist centric workspace for fast prototyping (incubating)
-- Native Job and Serving integration with Clusters (incubating)
- Kubernetes event dumper for ray clusters/pod/services (future work)
- Operator Integration with Kubernetes node problem detector (future work)

@@ -23,55 +21,27 @@ You can view detailed documentation and guides at [https://ray-project.github.io

### Use Yaml

-#### Nightly version
-
-```
-kubectl create -k "github.com/ray-project/kuberay/manifests/cluster-scope-resources"
-kubectl apply -k "github.com/ray-project/kuberay/manifests/base"
-```
+Please choose the version you would like to install. We will use the nightly version (`master`) as an example.

-#### Stable version
+| Version | Stable | Suggested Kubernetes Version |
+|---------|:------:|-----------------------------:|
+| master  |   N    |              v1.23 and above |
+| v0.2.0  |  Yes   |                v1.19 - v1.22 |

```
-kubectl create -k "github.com/ray-project/kuberay/manifests/cluster-scope-resources?ref=v0.2.0"
-kubectl apply -k "github.com/ray-project/kuberay/manifests/base?ref=v0.2.0"
+export KUBERAY_VERSION=master
+kubectl create -k "github.com/ray-project/kuberay/manifests/cluster-scope-resources?ref=${KUBERAY_VERSION}"
+kubectl apply -k "github.com/ray-project/kuberay/manifests/base?ref=${KUBERAY_VERSION}"
```

> Observe that we must use `kubectl create` to install cluster-scoped resources.
> The corresponding `kubectl apply` command will not work. See [KubeRay issue #271](https://github.com/ray-project/kuberay/issues/271).
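To double-check that the install worked, you can list the operator resources. This is only a quick sanity check and assumes the default `ray-system` namespace created by the cluster-scoped manifests; adjust the namespace if you customized it:

```shell
# Sanity check after installation (assumes the default ray-system namespace).
kubectl get deployments -n ray-system
kubectl get pods -n ray-system
```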
-#### Single Namespace version
-
-It is possible that the user can only access one single namespace while deploying KubeRay. To deploy KubeRay in a single namespace, the user
-can use following commands.
-
-```
-# Nightly version
-export KUBERAY_NAMESPACE=
-# executed by cluster admin
-kustomize build "github.com/ray-project/kuberay/manifests/overlays/single-namespace-resources" | envsubst | kubectl create -f -
-# executed by user
-kustomize build "github.com/ray-project/kuberay/manifests/overlays/single-namespace" | envsubst | kubectl apply -f -
-```

### Use helm chart

A helm chart is a collection of files that describe a related set of Kubernetes resources. It can help users to deploy ray-operator and ray clusters conveniently. Please read [kuberay-operator](helm-chart/kuberay-operator/README.md) to deploy an operator and [ray-cluster](helm-chart/ray-cluster/README.md) to deploy a custom cluster.

-### Monitor
-
-We have add a parameter `--metrics-expose-port=8080` to open the port and expose metrics both for the ray cluster and our control plane. We also leverage the [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator) to start the whole monitoring system.
-
-You can quickly deploy one by the following on your own kubernetes cluster by using the scripts in install:
-```shell
-./install/prometheus/install.sh
-```
-It will set up the prometheus stack and deploy the related service monitor in `config/prometheus`
-
-Then you can also use the json in `config/grafana` to generate the dashboards.

## Development

Please read our [CONTRIBUTING](CONTRIBUTING.md) guide before making a pull request. Refer to our [DEVELOPMENT](./ray-operator/DEVELOPMENT.md) to build and run tests locally.

diff --git a/docs/deploy/installation.md b/docs/deploy/installation.md
index 5295acdf445..87b907edf90 100644
--- a/docs/deploy/installation.md
+++ b/docs/deploy/installation.md
@@ -16,3 +16,17 @@
kubectl apply -k "github.com/ray-project/kuberay/manifests/base?ref=v0.2.0"

> Observe that we must use `kubectl create` to install cluster-scoped resources.
> The corresponding `kubectl apply` command will not work. See [KubeRay issue #271](https://github.com/ray-project/kuberay/issues/271).
+
+#### Single Namespace version
+
+It is possible that the user can only access a single namespace while deploying KubeRay. To deploy KubeRay in a single namespace, the user
+can use the following commands.
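These commands render the overlays with the standalone `kustomize` CLI and substitute `${KUBERAY_NAMESPACE}` with `envsubst` (from GNU gettext), so both tools are assumed to be installed locally. You can confirm that first, then run the commands below:

```shell
# The single-namespace flow below assumes both tools are on your PATH.
kustomize version
envsubst --version
```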
+
+```
+# Nightly version
+export KUBERAY_NAMESPACE=
+# executed by cluster admin
+kustomize build "github.com/ray-project/kuberay/manifests/overlays/single-namespace-resources" | envsubst | kubectl create -f -
+# executed by user
+kustomize build "github.com/ray-project/kuberay/manifests/overlays/single-namespace" | envsubst | kubectl apply -f -
+```
diff --git a/docs/development/development.md b/docs/development/development.md
new file mode 100644
index 00000000000..a6abe3a76e3
--- /dev/null
+++ b/docs/development/development.md
@@ -0,0 +1,57 @@
## KubeRay Development Guide

Download this repo locally:

```
mkdir -p $GOPATH/src/github.com/ray-project
cd $GOPATH/src/github.com/ray-project
git clone https://github.com/ray-project/kuberay.git
```

### Develop proto and OpenAPI

Generate the Go clients and the swagger file:

```
make generate
```

### Develop KubeRay Operator

```
cd ray-operator

# Build the code
make build

# Run tests
make test

# Build the container image
make docker-build
```

### Develop KubeRay APIServer

```
cd apiserver

# Build the code
go build cmd/main.go
```

### Develop KubeRay CLI

```
cd cli
go build -o kuberay -a main.go
./kuberay help
```

### Deploy Docs locally

You don't need to set up a local `mkdocs` environment; to check the static website locally, run:

```
docker run --rm -it -p 8000:8000 -v ${PWD}:/docs squidfunk/mkdocs-material build
```
diff --git a/docs/release/README.md b/docs/development/release.md
similarity index 100%
rename from docs/release/README.md
rename to docs/development/release.md
diff --git a/docs/guidance/gcs-ha.md b/docs/guidance/gcs-ha.md
index 0b22ab7185a..67f046e6690 100644
--- a/docs/guidance/gcs-ha.md
+++ b/docs/guidance/gcs-ha.md
@@ -24,7 +24,7 @@ metadata:
    ray.io/external-storage-namespace: "my-raycluster-storage-namespace"  # <- optional, to specify the external storage namespace
...
```
-An example can be found at [ray-cluster.external-redis.yaml](../../ray-operator/config/samples/ray-cluster.external-redis.yaml)
+An example can be found at [ray-cluster.external-redis.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml)

When annotation `ray.io/ha-enabled` is added with a `true` value, KubeRay will enable Ray GCS HA feature. This feature contains several components:
@@ -65,7 +65,7 @@
you need to add `RAY_REDIS_ADDRESS` environment variable to the head node template.

Also, you can specify a storage namespace for your Ray cluster by using an annotation `ray.io/external-storage-namespace`
-An example can be found at [ray-cluster.external-redis.yaml](../../ray-operator/config/samples/ray-cluster.external-redis.yaml)
+An example can be found at [ray-cluster.external-redis.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml)

#### KubeRay Operator Controller
diff --git a/docs/guidance/observability.md b/docs/guidance/observability.md
new file mode 100644
index 00000000000..20dafb15e20
--- /dev/null
+++ b/docs/guidance/observability.md
@@ -0,0 +1,14 @@
# Observability

### Monitor

We have added a parameter `--metrics-expose-port=8080` to open the port and expose metrics for both the Ray cluster and our control plane. We also leverage the [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator) to start the whole monitoring system.
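As a quick check that the metrics endpoint is actually serving, you can port-forward to the operator and curl it. This is only a sketch that assumes the operator Deployment keeps its default `kuberay-operator` name in the `ray-system` namespace and serves Prometheus metrics on `/metrics`; adjust the names and port for your setup:

```shell
# Assumes the default kuberay-operator Deployment in the ray-system namespace
# started with --metrics-expose-port=8080; names and paths may differ in your cluster.
kubectl port-forward -n ray-system deployment/kuberay-operator 8080:8080 &
curl http://localhost:8080/metrics | head
```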
You can quickly deploy the whole monitoring stack on your own Kubernetes cluster by using the scripts in `install`:

```shell
./install/prometheus/install.sh
```
It will set up the Prometheus stack and deploy the related service monitors defined in `config/prometheus`.

Then you can also use the JSON files in `config/grafana` to generate the dashboards.
diff --git a/docs/guidance/rayjob.md b/docs/guidance/rayjob.md
new file mode 100644
index 00000000000..bbfd66d4e2f
--- /dev/null
+++ b/docs/guidance/rayjob.md
@@ -0,0 +1,129 @@
## Ray Job (alpha)

> Note: This is the alpha version of Ray Job support in KubeRay. There will be ongoing improvements for Ray Job in future releases.

### Prerequisites

* Ray 1.10 and above.
* KubeRay v0.3.0 or master

### What is a RayJob?

The RayService is a new custom resource (CR) supported by KubeRay in v0.3.0.

A RayService manages 2 things:
* RayCluster: Manages resources in kubernetes cluster.
* Ray Serve Deployment Graph: Manages users' serve deployment graph.

### What does the RayService provide?

* Kubernetes-native support for Ray cluster and Ray Serve deployment graphs. You can use a kubernetes config to define a ray cluster and its ray serve deployment graphs. Then you can use `kubectl` to create the cluster and its graphs.
* In-place update for ray serve deployment graph. Users can update the ray serve deployment graph config in the RayService CR config and use `kubectl apply` to update the serve deployment graph.
* Zero downtime upgrade for ray cluster. Users can update the ray cluster config in the RayService CR config and use `kubectl apply` to update the ray cluster. RayService will temporarily create a pending ray cluster, wait for the pending ray cluster ready, and then switch traffics to the new ray cluster, terminate the old cluster.
* Services HA. RayService will monitor the ray cluster and serve deployments health status. If RayService detects any unhealthy status lasting for a certain time, RayService will try to create a new ray cluster, and switch traffic to the new cluster when it is ready.

### Deploy the KubeRay

Make sure KubeRay v0.3.0 is deployed in your cluster.
For installation details, please check the [installation guidance](../deploy/installation.md).

### Run an example Job

An example config file to deploy a RayJob is included here:
[ray_v1alpha1_rayjob.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray_v1alpha1_rayjob.yaml)

```shell
# Create a ray job.
$ kubectl apply -f config/samples/ray_v1alpha1_rayjob.yaml
```

```shell
# List running RayServices.
$ kubectl get rayjob
NAME            AGE
rayjob-sample   7s
```

```shell
# Under the hood, the RayJob sample will create a RayCluster.
# The RayCluster in turn creates a few resources (pods, services); you can check them with the following commands.
$ kubectl get rayclusters
$ kubectl get pod
```

### RayJob Configuration

- `entrypoint` - The shell command to run for this job.
- `jobId` - Optional. Job ID to specify for the job. If not provided, one will be generated.
- `metadata` - Arbitrary user-provided metadata for the job.
- `runtimeEnv` - The base64-encoded string of the runtime environment JSON (see the sketch after this list for one way to produce it).
- `shutdownAfterJobFinishes` - Whether to recycle the cluster after the job finishes.
- `ttlSecondsAfterFinished` - TTL in seconds to clean up the cluster. This only takes effect if `shutdownAfterJobFinishes` is set.
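Since `runtimeEnv` is the base64 encoding of a Ray runtime environment JSON string, one way to produce the value is sketched below. The runtime environment contents are only illustrative, not something the sample job requires:

```shell
# Base64-encode a Ray runtime environment JSON document for the runtimeEnv field.
# The pip packages and env_vars below are illustrative values only.
echo -n '{"pip": ["requests==2.26.0"], "env_vars": {"counter_name": "test_counter"}}' | base64
```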
### RayJob Observability

You can use `kubectl logs` to check the operator logs or the head/worker node logs.
You can also use `kubectl describe rayjobs rayjob-sample` to check the state and event logs of your RayJob instance.

```
Status:
  Dashboard URL:          rayjob-sample-raycluster-vnl8w-head-svc.ray-system.svc.cluster.local:8265
  End Time:               2022-07-24T02:04:56Z
  Job Deployment Status:  Complete
  Job Id:                 test-hehe
  Job Status:             SUCCEEDED
  Message:                Job finished successfully.
  Ray Cluster Name:       rayjob-sample-raycluster-vnl8w
  Ray Cluster Status:
    Available Worker Replicas:  1
    Endpoints:
      Client:        32572
      Dashboard:     32276
      Gcs - Server:  30679
    Last Update Time:  2022-07-24T02:04:43Z
    State:             ready
  Start Time:  2022-07-24T02:04:49Z
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Created    90s   rayjob-controller  Created cluster rayjob-sample-raycluster-vnl8w
  Normal  Submitted  82s   rayjob-controller  Submit Job test-hehe
  Normal  Deleted    15s   rayjob-controller  Deleted cluster rayjob-sample-raycluster-vnl8w
```

If the job cannot run successfully, you can see that from the status as well.

```
Status:
  Dashboard URL:          rayjob-sample-raycluster-nrdm8-head-svc.ray-system.svc.cluster.local:8265
  End Time:               2022-07-24T02:01:39Z
  Job Deployment Status:  Complete
  Job Id:                 test-hehe
  Job Status:             FAILED
  Message:                Job failed due to an application error, last available logs:
python: can't open file '/tmp/code/script.ppy': [Errno 2] No such file or directory

  Ray Cluster Name:  rayjob-sample-raycluster-nrdm8
  Ray Cluster Status:
    Available Worker Replicas:  1
    Endpoints:
      Client:        31852
      Dashboard:     32606
      Gcs - Server:  32436
    Last Update Time:  2022-07-24T02:01:30Z
    State:             ready
  Start Time:  2022-07-24T02:01:38Z
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Created    2m9s  rayjob-controller  Created cluster rayjob-sample-raycluster-nrdm8
  Normal  Submitted  2m    rayjob-controller  Submit Job test-hehe
  Normal  Deleted    58s   rayjob-controller  Deleted cluster rayjob-sample-raycluster-nrdm8
```


### Delete the RayService instance

```shell
$ kubectl delete -f config/samples/ray_v1alpha1_rayjob.yaml
```
diff --git a/docs/index.md b/docs/index.md
index 6ca4c47960d..5d2a668d78d 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -31,8 +31,8 @@ KubeRay provides several tools to improve running and managing Ray's experience

- Ray Operator
- Backend services to create/delete cluster resources
- Kubectl plugin/CLI to operate CRD objects
+- Native Job and Serving integration with Clusters
- Data Scientist centric workspace for fast prototyping (incubating)
-- Native Job and Serving integration with Clusters (incubating)
- Kubernetes event dumper for ray clusters/pod/services (future work)
- Operator Integration with Kubernetes node problem detector (future work)

diff --git a/mkdocs.yml b/mkdocs.yml
index 1ceb6ccd3e3..a5d4e179a8c 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -29,9 +29,11 @@ nav:
    - KubeRay CLI: components/cli.md
  - Features:
    - RayService: guidance/rayservice.md
+    - RayJob: guidance/rayjob.md
    - Ray GCS HA: guidance/gcs-ha.md
    - Autoscaling: guidance/autoscaler.md
    - Ingress: guidance/ingress.md
+    - Observability: guidance/observability.md
  - Best Practice:
    - Worker reconnection: best-practice/worker-head-reconnection.md
  - Troubleshooting:
@@ -39,7 +41,8 @@
  - Designs:
    - Core API and Backend Service: design/protobuf-grpc-service.md
  - Development:
-    - Release: release/README.md
+    - Development: development/development.md
+    - Release: development/release.md

# Customization
extra:
From fb1606bef91c3e452dc5c7d0134766fa10f02142 Mon Sep 17 00:00:00 2001
From: Jiaxin Shan
Date: Sun, 24 Jul 2022 21:40:32 -0700
Subject: [PATCH 2/2] Address code review feedback

---
 README.md               |  2 +-
 docs/guidance/rayjob.md | 18 ++++++++----------
 2 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/README.md b/README.md
index 6b20cc9954f..cedd939006b 100644
--- a/README.md
+++ b/README.md
@@ -3,7 +3,7 @@
[![Build Status](https://github.com/ray-project/kuberay/workflows/Go-build-and-test/badge.svg)](https://github.com/ray-project/kuberay/actions)
[![Go Report Card](https://goreportcard.com/badge/github.com/ray-project/kuberay)](https://goreportcard.com/report/github.com/ray-project/kuberay)

-KubeRay is an open source toolkit to run Ray applications on Kubernetes. It provides several tools to improve running and managing Ray's experience on Kubernetes.
+KubeRay is an open source toolkit to run Ray applications on Kubernetes. It provides several tools to improve running and managing Ray on Kubernetes.

- Ray Operator
- Backend services to create/delete cluster resources
diff --git a/docs/guidance/rayjob.md b/docs/guidance/rayjob.md
index bbfd66d4e2f..a94b496041c 100644
--- a/docs/guidance/rayjob.md
+++ b/docs/guidance/rayjob.md
@@ -9,18 +9,16 @@

### What is a RayJob?

-The RayService is a new custom resource (CR) supported by KubeRay in v0.3.0.
+The RayJob is a new custom resource (CR) supported by KubeRay in v0.3.0.

-A RayService manages 2 things:
+A RayJob manages 2 things:
* RayCluster: Manages resources in kubernetes cluster.
-* Ray Serve Deployment Graph: Manages users' serve deployment graph.
+* Job: Manages the user's job in the Ray cluster.

-### What does the RayService provide?
+### What does the RayJob provide?
+
+* Kubernetes-native support for Ray clusters and Ray jobs. You can use a Kubernetes config to define a Ray cluster and the job to run on it, then use `kubectl` to create the cluster and its job. The cluster can be deleted automatically after the job is finished.

-* Kubernetes-native support for Ray cluster and Ray Serve deployment graphs. You can use a kubernetes config to define a ray cluster and its ray serve deployment graphs. Then you can use `kubectl` to create the cluster and its graphs.
-* In-place update for ray serve deployment graph. Users can update the ray serve deployment graph config in the RayService CR config and use `kubectl apply` to update the serve deployment graph.
-* Zero downtime upgrade for ray cluster. Users can update the ray cluster config in the RayService CR config and use `kubectl apply` to update the ray cluster. RayService will temporarily create a pending ray cluster, wait for the pending ray cluster ready, and then switch traffics to the new ray cluster, terminate the old cluster.
-* Services HA. RayService will monitor the ray cluster and serve deployments health status. If RayService detects any unhealthy status lasting for a certain time, RayService will try to create a new ray cluster, and switch traffic to the new cluster when it is ready.

### Deploy the KubeRay

Make sure KubeRay v0.3.0 is deployed in your cluster.
@@ -38,7 +36,7 @@
```

```shell
-# List running RayServices.
+# List running RayJobs.
$ kubectl get rayjob
NAME            AGE
rayjob-sample   7s
@@ -122,7 +120,7 @@
```


-### Delete the RayService instance
+### Delete the RayJob instance

```shell
$ kubectl delete -f config/samples/ray_v1alpha1_rayjob.yaml