# [autoscaler] Improve autoscaler auto-configuration, upstream recent improvements to Kuberay NodeProvider #274

**Merged**

**Commits** (38, all authored by DmitriGekhtman):
- `4245265` Update autoscaler image. (May 20, 2022)
- `8feae85` Trailing spaces. (May 20, 2022)
- `ad3e463` Add overlays. (May 20, 2022)
- `076fa54` Add to docs. (May 20, 2022)
- `ba25502` Remove redis in a couple of spots. (May 20, 2022)
- `420f4d6` Namespace selector came out of somewhere... (May 20, 2022)
- `77e77df` Remove scratch yaml. (May 20, 2022)
- `e63c2ce` Remove redis password logic from test. (May 20, 2022)
- `1643dee` Add namespaces. (May 20, 2022)
- `abdaac5` Fix kustomization. (May 20, 2022)
- `dd587bf` Log if the feature flag is enabled. (May 20, 2022)
- `1c7c6bd` Fix entrypoint. (May 20, 2022)
- `fe8619a` Autoscaler logs volume mount. (May 24, 2022)
- `ca422ae` fix-test (May 24, 2022)
- `cfea869` Add Ray log volume mount when autoscaling. (May 24, 2022)
- `16bafbd` Fix BuildPod. (May 24, 2022)
- `6658b8c` fix (May 24, 2022)
- `88848ff` Add an emptyDir volume functions. (May 24, 2022)
- `0dfa759` Add resources to test instance. (May 24, 2022)
- `91a04e1` Unit test. (May 24, 2022)
- `b571e25` Merge branch 'master' into dmitri/update-autoscaler-image (May 24, 2022)
- `5bacaef` Fix variable name. (May 24, 2022)
- `588e30f` apply -> create (May 24, 2022)
- `189b1bd` Doc typos. (May 24, 2022)
- `40e1579` Update example config. (May 25, 2022)
- `a54ddd0` Add a comment explaining what the log volume is for. (May 26, 2022)
- `23f631c` Document the volume. (May 26, 2022)
- `7cbdcff` container -> pod (May 27, 2022)
- `e970fab` Add volumes using the same method. (May 28, 2022)
- `fe6da32` Reuse function to add volume. (May 28, 2022)
- `26ff42c` Merge branch 'master' into dmitri/update-autoscaler-image (May 28, 2022)
- `d0f98ce` Remove print statements (May 28, 2022)
- `8f7d64e` raycluster -> ray (May 28, 2022)
- `c895130` explain (May 28, 2022)
- `f601cb8` Typo (May 28, 2022)
- `704519d` pods.go: Spaces (May 28, 2022)
- `05a18d2` Test Typo (May 28, 2022)
- `0cef0cc` Container indices: Log and panic. (May 28, 2022)
**Files changed:**
**README.md** (2 additions, 2 deletions)

````diff
@@ -26,14 +26,14 @@ You can view detailed documentation and guides at [https://ray-project.github.io
 #### Nightly version

 ```
-kubectl apply -k "github.com/ray-project/kuberay/manifests/cluster-scope-resources"
+kubectl create -k "github.com/ray-project/kuberay/manifests/cluster-scope-resources"
````
> **@Jeffwan** (Collaborator), May 25, 2022:
>
> Em, we can use `create` now. The downside of `create` here is that cluster-scope-resources includes the `ray-system` namespace as well. If the user already has that namespace in the cluster, the upgrade will fail because it already exists.
>
> We can definitely move the namespace yaml into a separate step, but let me check whether we can resolve the issue by limiting the CRD size, which would be much more elegant. This looks good to me at the moment.
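The failure mode described in the comment can be reproduced and worked around as follows (a sketch; the workaround shown is an assumption, not something this PR prescribes):

```sh
# If the ray-system namespace already exists, `kubectl create -k` fails with
# an error like: Error from server (AlreadyExists): namespaces "ray-system" already exists
# One option is to apply the namespace idempotently first, then tolerate the conflict
# (coarse: `|| true` ignores all create errors, not just AlreadyExists).
kubectl create namespace ray-system --dry-run=client -o yaml | kubectl apply -f -
kubectl create -k "github.com/ray-project/kuberay/manifests/cluster-scope-resources" || true
```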

````diff
 kubectl apply -k "github.com/ray-project/kuberay/manifests/base"
 ```

 #### Stable version

 ```
-kubectl apply -k "github.com/ray-project/kuberay/manifests/cluster-scope-resources?ref=v0.2.0"
+kubectl create -k "github.com/ray-project/kuberay/manifests/cluster-scope-resources?ref=v0.2.0"
 kubectl apply -k "github.com/ray-project/kuberay/manifests/base?ref=v0.2.0"
 ```
````
**docs/deploy/installation.md** (2 additions, 2 deletions)

````diff
@@ -3,13 +3,13 @@
 #### Nightly version

 ```
-kubectl apply -k "github.com/ray-project/kuberay/manifests/cluster-scope-resources"
+kubectl create -k "github.com/ray-project/kuberay/manifests/cluster-scope-resources"
 kubectl apply -k "github.com/ray-project/kuberay/manifests/base"
 ```

 #### Stable version

 ```
-kubectl apply -k "github.com/ray-project/kuberay/manifests/cluster-scope-resources?ref=v0.2.0"
+kubectl create -k "github.com/ray-project/kuberay/manifests/cluster-scope-resources?ref=v0.2.0"
 kubectl apply -k "github.com/ray-project/kuberay/manifests/base?ref=v0.2.0"
 ```
````
**docs/guidance/autoscaler.md** (13 additions, 8 deletions)

````diff
@@ -10,10 +10,14 @@ You can follow below steps for a quick deployment.
 ```
 git clone https://github.com/ray-project/kuberay.git
 cd kuberay
-kubectl apply -k manifests/cluster-scope-resources
-kubectl apply -k manifests/base
+kubectl create -k manifests/cluster-scope-resources
+kubectl apply -k manifests/overlays/autoscaling
 ```
+
+> Note: For compatibility with the Ray autoscaler, the KubeRay Operator's entrypoint
+> must include the flag `--prioritize-workers-to-delete`. The kustomization overlay
+> `manifests/overlays/autoscaling` provided in the last command above adds the necessary flag.
````

> **Collaborator comment** on the note above:
>
> We should make sure to have a plan to remove this flag and make it the default behavior.
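One quick way to confirm the flag landed on the deployed operator (a sketch; the `ray-system` namespace and the `kuberay-operator` deployment name are assumptions based on the manifests in this repo):

```sh
# Print the operator container's args; expect to see --prioritize-workers-to-delete.
kubectl get deployment kuberay-operator -n ray-system \
  -o jsonpath='{.spec.template.spec.containers[0].args}'
```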

````diff
 ### Deploy a cluster with autoscaling enabled

 ```
@@ -60,20 +64,21 @@ Demands:

 #### Known issues and limitations

-1. operator will recognize following setting and automatically inject preconfigured autoscaler container to head pod.
-The service account, role, role binding needed by autoscaler will be created by operator out-of-box.
+1. The operator will recognize the following setting and automatically inject a preconfigured autoscaler container to the head pod.
+The service account, role, and role binding needed by the autoscaler will be created by the operator out-of-box.
+The operator will also configure an empty-dir logging volume for the Ray head pod. The volume will be mounted into the Ray and
+autoscaler containers; this is necessary to support the event logging introduced in [Ray PR #13434](https://github.com/ray-project/ray/pull/13434).

 ```
 spec:
   rayVersion: 'nightly'
   enableInTreeAutoscaling: true
 ```
````
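For reference, the injected logging volume looks roughly like the following (a sketch; the volume name and mount path are illustrative assumptions, not taken verbatim from this PR):

```yaml
# Sketch of the pieces the operator injects into the head pod spec.
spec:
  volumes:
    - name: ray-logs            # assumed name
      emptyDir: {}
  containers:
    - name: ray-head            # the Ray container
      volumeMounts:
        - name: ray-logs
          mountPath: /tmp/ray   # Ray's default log directory
    - name: autoscaler          # the injected sidecar
      volumeMounts:
        - name: ray-logs
          mountPath: /tmp/ray
```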

```diff
-2. head and work images are `rayproject/ray:413fe0`. This image was built based on [commit](https://github.com/ray-project/ray/commit/413fe08f8744d50b439717564709bc0af2f778f1) from master branch.
-The reason we need to use a nightly version is because autoscaler needs to connect to Ray cluster. Due to ray [version requirements](https://docs.ray.io/en/latest/cluster/ray-client.html#versioning-requirements).
-We determine to use nightly version to make sure integration is working.
+2. The autoscaler image is `rayproject/ray:448f52` which reflects the latest changes from [Ray PR #24718](https://github.com/ray-project/ray/pull/24718/files) in the master branch.

-3. Autoscaler image is `kuberay/autoscaler:nightly` which is built from [commit](https://github.com/ray-project/ray/pull/22689/files).
+3. Autoscaling functionality is supported only with Ray versions at least as new as 1.11.0. The autoscaler image used
+is compatible with all Ray versions >= 1.11.0.
```
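Since the compatibility floor matters here, it can be worth checking which Ray version an image actually ships before enabling autoscaling (illustrative; substitute any image tag):

```sh
# Print the Ray version baked into an image.
docker run --rm rayproject/ray:1.12.1 python -c "import ray; print(ray.__version__)"
```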

```diff
 ### Test autoscaling
```
**docs/notebook/kuberay-on-kind.ipynb** (1 addition, 1 deletion)

```diff
@@ -140,7 +140,7 @@
 }
 ],
 "source": [
-    "!kubectl apply -k \"github.com/ray-project/kuberay/manifests/cluster-scope-resources\"\n",
+    "!kubectl create -k \"github.com/ray-project/kuberay/manifests/cluster-scope-resources\"\n",
     "!kubectl apply -k \"github.com/ray-project/kuberay/manifests/base\""
 ]
 },
```
**manifests/base/kustomization.yaml** (1 addition, 1 deletion; whitespace-only, drops a trailing space)

```diff
@@ -11,7 +11,7 @@ resources:
 images:
 - name: kuberay/apiserver
   newName: kuberay/apiserver
-  newTag: nightly 
+  newTag: nightly
 - name: kuberay/operator
   newName: kuberay/operator
   newTag: nightly
```
**manifests/overlays/autoscaling/kustomization.yaml** (new file, 14 additions)

```yaml
# This overlay patches in KubeRay operator configuration
# necessary for Ray Autoscaler support.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

bases:
- ../../base
patches:
- path: prioritize_workers_to_delete_patch.json
  target:
    group: apps
    version: v1
    kind: Deployment
    name: kuberay-operator
```
**manifests/overlays/autoscaling/prioritize_workers_to_delete_patch.json** (new file, 5 additions)

```json
[{
  "op": "replace",
  "path": "/spec/template/spec/containers/0/args",
  "value": ["--prioritize-workers-to-delete"]
}]
```
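To sanity-check the overlay end to end, the patched operator spec can be rendered locally without touching the cluster (assuming a kustomize-enabled `kubectl`, run from the repo root):

```sh
# Render the overlay and inspect the patched container args.
kubectl kustomize manifests/overlays/autoscaling | grep -B1 -A2 'args:'
```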
**ray-operator/config/samples/ray-cluster.autoscaler.yaml** (7 additions, 6 deletions)

```diff
@@ -8,7 +8,8 @@ metadata:
   # An unique identifier for the head node and workers of this cluster.
   name: raycluster-autoscaler
 spec:
-  rayVersion: 'nightly'
+  rayVersion: '1.12.1'
```
> **@DmitriGekhtman** (Collaborator, Author):
>
> The latest Ray release is compatible with the pinned autoscaler image.

```diff
+  # Ray autoscaler integration is supported only for Ray versions >= 1.11.0
   enableInTreeAutoscaling: true
   ######################headGroupSpecs#################################
   # head group template and specs, (perhaps 'group' is not needed in the name)
@@ -20,7 +21,7 @@ spec:
       # logical group name, for this called head-group, also can be functional
       # pod type head or worker
       # rayNodeType: head # Not needed since it is under the headgroup
-      # the following params are used to complete the ray start: ray start --head --block --redis-port=6379 ...
+      # the following params are used to complete the ray start: ray start --head --block --port=6379 ...
       rayStartParams:
         # Flag "no-monitor" must be set when running the autoscaler in
         # a sidecar container.
@@ -29,17 +30,17 @@
         node-ip-address: $MY_POD_IP # auto-completed as the head pod IP
         block: 'true'
         num-cpus: '1' # can be auto-completed from the limits
-        redis-password: 'LetMeInRay' # Deprecated since Ray 1.11 due to GCS bootstrapping enabled
         # Use `resources` to optionally specify custom resource annotations for the Ray node.
         # The value of `resources` is a string-integer mapping.
-        # Currently, `resources` must be provided in the unfortunate format demonstrated below.
+        # Currently, `resources` must be provided in the unfortunate format demonstrated below:
         # resources: '"{\"Custom1\": 1, \"Custom2\": 5}"'
       #pod template
       template:
         spec:
           containers:
           # The Ray head pod
           - name: ray-head
-            image: rayproject/ray:413fe0
+            image: rayproject/ray:1.12.1
             imagePullPolicy: Always
             env:
             - name: CPU_REQUEST
@@ -124,7 +125,7 @@ spec:
           command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
           containers:
           - name: machine-learning # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc'
-            image: rayproject/ray:413fe0
+            image: rayproject/ray:1.12.1
            # environment variables to set in the container. Optional.
            # Refer to https://kubernetes.io/docs/tasks/inject-data-application/define-environment-variable-container/
            env:
```
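To try the updated sample end to end (a sketch, assuming the operator was installed via the autoscaling overlay above):

```sh
# Create the sample cluster, then confirm the head pod carries two containers:
# the Ray head plus the injected autoscaler sidecar.
kubectl apply -f ray-operator/config/samples/ray-cluster.autoscaler.yaml
kubectl get pods -l ray.io/cluster=raycluster-autoscaler \
  -o jsonpath='{.items[*].spec.containers[*].name}'
```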