Provide an example config for TPU usage on GCP #16908
Comments
@Yard1 maybe you can take a look at this next sprint?
Sounds exciting, I'll assign it to myself @richardliaw!
Is there a draft of this anywhere I could take a look at?
Hey @nickjm, you can find the example config here - https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/gcp/tpu.yaml
I just want to say, this is fucking epic. Sorry to swear, but I really didn't expect y'all to move this quickly on getting a TPU example set up. Thank you so much for your hard work. I'm gonna brag to everyone I meet about how amazing the Ray team is, and point them to this as proof. :) Hopefully I'll have some time to try it out and point out any pain spots, but glancing over the config, it looks super reasonable. Love that it's using swarm-jax as a baseline.
Hey @shawwn, thank you so much for your kind words! It's been a lot of fun to work on TPU support. Let me know if you have any questions or feedback!
Seems to currently fail, first with
on the head node, but doesn't stop, then actually fails on
As for jax, I see the conda-forge option is provided; is there a small tweak to it that could fix this problem?
Hey @nickjm, thanks for bringing this to our attention! Looks like Google changed something on their side. I'll try to update the config as soon as I figure out what that was!
Left a comment on the PR about the failure I'm still seeing.
Describe your feature request
We're seeing a good amount of usage for Ray with jax/TPU VMs.
It'd be great to provide an example Ray cluster configuration for accessing TPU VMs. Note that you'll probably need to have access to the TRC before then.
cc @shawwn from our correspondence.