Documentation and example for running simple NLP service on kuberay #1340
Note that the RayService's Kubernetes service will be created after the Serve applications are ready and running. This process may take approximately 1 minute after all Pods in the RayCluster are running.

## Step 5: Send a request to the text-to-image model
text-to-image -> text summarization (?)
Co-authored-by: Kai-Hsun Chen <[email protected]> Signed-off-by: Praveen <[email protected]>
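For context on the step under review, here is a minimal sketch of what sending a request to the summarization model could look like. It assumes Ray Serve's default HTTP port (8000) has been port-forwarded to localhost; the `/summarize` route, the JSON-string payload, and the helper names `build_request` and `summarize` are hypothetical and not taken from this PR.

```python
import json
import urllib.request

# Assumptions: port 8000 (Ray Serve's default HTTP port) is forwarded to
# localhost, and the application is exposed under a hypothetical
# "/summarize" route.
SERVE_URL = "http://127.0.0.1:8000/summarize"


def build_request(text: str, url: str = SERVE_URL) -> urllib.request.Request:
    """Build a POST request that carries the input text as a JSON body."""
    body = json.dumps(text).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(text: str, url: str = SERVE_URL, timeout: float = 60.0) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(build_request(text, url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `summarize("some long text")` only succeeds once the RayService's Kubernetes service exists, which, per the note in the diff, can take about a minute after all Pods are running.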
This RayService configuration contains some important settings:

* Its `tolerations` for workers match the taints on the GPU node group (which has taints), so they can be scheduled on either a GPU or a CPU node. We don't add these to head nodes, to prevent the head node from being allocated to a GPU node.
The `tolerations` for workers allow them to be scheduled on nodes without any taints or on nodes with specific taints. However, workers will only be scheduled on GPU nodes because we set `nvidia.com/gpu: 1` in the Pod's resource configurations.
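The reasoning in this comment can be sketched as a toy scheduling check. This is a simplified model, not the Kubernetes scheduler: it only handles exact-match `NoSchedule` taints and a single GPU resource, and the taint key, node names, and class names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str = "NoSchedule"


@dataclass
class Node:
    name: str
    taints: list = field(default_factory=list)
    gpus: int = 0  # allocatable nvidia.com/gpu on this node


@dataclass
class Pod:
    name: str
    tolerations: list = field(default_factory=list)  # taints this Pod tolerates
    gpu_request: int = 0  # requested nvidia.com/gpu


def can_schedule(pod: Pod, node: Node) -> bool:
    # Every taint on the node must be tolerated by the Pod.
    taints_ok = all(taint in pod.tolerations for taint in node.taints)
    # The node must offer at least the requested number of GPUs.
    resources_ok = node.gpus >= pod.gpu_request
    return taints_ok and resources_ok


gpu_taint = Taint(key="example.com/gpu-group", value="true")  # hypothetical taint
cpu_node = Node("cpu-node")
gpu_node = Node("gpu-node", taints=[gpu_taint], gpus=1)

worker = Pod("worker", tolerations=[gpu_taint], gpu_request=1)
worker_without_gpu_request = Pod("worker-no-gpu", tolerations=[gpu_taint])
head = Pod("head")

print(can_schedule(worker_without_gpu_request, cpu_node))  # True: tolerations do not exclude untainted nodes
print(can_schedule(worker, cpu_node))                      # False: the GPU request cannot be met
print(can_schedule(worker, gpu_node))                      # True: taint tolerated and GPU available
print(can_schedule(head, gpu_node))                        # False: the head Pod does not tolerate the taint
```

In this model the tolerations only permit placement on the tainted node; it is the GPU request that rules out the CPU node.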
@@ -21,7 +21,7 @@ kubectl apply -f ray-service.stable-diffusion.yaml

This RayService configuration contains some important settings:

* Its `tolerations` for workers match the taints on the GPU node group. Without the tolerations, worker Pods won't be scheduled on GPU nodes.
* Its `tolerations` for workers match the taints on the GPU node group (which has taints), so they can be scheduled on either GPU or CPU node. We don't add these to `headGroupSpec` to make sure head Pod & KubeRay operator Pod are not allocated to GPU node group (which has taints).
The `tolerations` for workers allow them to be scheduled on nodes without any taints or on nodes with specific taints. However, workers will only be scheduled on GPU nodes because we set `nvidia.com/gpu: 1` in the Pod's resource configurations.
Signed-off-by: Kai-Hsun Chen <[email protected]>
LGTM
…ay-project#1340)

* add service yaml for nlp
* Documentation fixes
* Fix instructions
* Apply suggestions from code review

  Co-authored-by: Kai-Hsun Chen <[email protected]>
  Signed-off-by: Praveen <[email protected]>
* Fix tolerations comment
* review comments
* Update docs/guidance/stable-diffusion-rayservice.md

  Signed-off-by: Kai-Hsun Chen <[email protected]>

---------

Signed-off-by: Praveen <[email protected]>
Signed-off-by: Kai-Hsun Chen <[email protected]>
Co-authored-by: Kai-Hsun Chen <[email protected]>
Why are these changes needed?
This is needed for KubeRay CUJ (critical user journey) testing.
Related issue number
Checks
Manually tested