Provide an example config for TPU usage on GCP #16908

Closed
richardliaw opened this issue Jul 6, 2021 · 10 comments

Comments

@richardliaw
Contributor

Describe your feature request

We're seeing a good amount of usage for Ray with jax/TPU VMs.

It'd be great to provide an example Ray cluster configuration for accessing TPU VMs. Note that you'll probably need access to the TRC (TPU Research Cloud) first.

cc @shawwn from our correspondence.

@richardliaw richardliaw added the enhancement Request for new feature and/or capability label Jul 6, 2021
@richardliaw
Contributor Author

@Yard1 maybe you can take a look at this next sprint?

@Yard1
Member

Yard1 commented Jul 6, 2021

Sounds exciting, I'll assign it to myself @richardliaw!

@Yard1 Yard1 self-assigned this Jul 6, 2021
@nickjm

nickjm commented Aug 29, 2021

Is there a draft of this anywhere I could take a look at?

@Yard1
Member

Yard1 commented Aug 29, 2021

Hey @nickjm, you can find the example config here: https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/gcp/tpu.yaml

@shawwn

shawwn commented Aug 30, 2021

I just want to say, this is fucking epic. Sorry to swear, but I really didn't expect y'all to move this quickly on getting a TPU example set up.

Thank you so much for your hard work. I'm gonna brag to everyone I meet about how amazing the Ray team is, and point them to this as proof. :)

Hopefully I'll have some time to try it out and point out any pain spots, but glancing over the config, it looks super reasonable. Love that it's using swarm-jax as a baseline.

@Yard1
Member

Yard1 commented Aug 30, 2021

Hey @shawwn, thank you so much for your kind words! It's been a lot of fun to work on TPU support. Let me know if you have any questions or feedback!

@nickjm

nickjm commented Sep 14, 2021

Seems to currently fail, first with

ModuleNotFoundError: No module named 'conda'

on the head node (setup doesn't stop there), and then it actually fails on

No matching distribution found for jax[cpu]==0.2.14

As for jax, I see the conda-forge option is provided. Is there a small tweak to it that could fix this problem?

@Yard1
Member

Yard1 commented Sep 14, 2021

Hey @nickjm, thanks for bringing this to our attention! Looks like Google changed something on their side. I'll try to update the config as soon as I figure out what that was!

@Yard1
Member

Yard1 commented Sep 15, 2021

Hey @nickjm, this PR should solve the issue: #18634

@nickjm

nickjm commented Sep 15, 2021

Left a comment on the PR about the failure I'm still seeing.

@Yard1 Yard1 closed this as completed Oct 7, 2021