[GCP] Update GCP TPU config #18634
LGTM!
By the way, while we're at it... ray/python/ray/autoscaler/gcp/tpu.yaml Lines 35 to 37 in ce5f162
Would it be possible to uncomment
@shawwn Sure, good call, let me do that
I haven't had the chance to try this demo yet, but one thing that confused me: is the TPU VMs cost $0 to run, whereas firing up a non-TPU When I ran swarm-jax, I was able to fire up 8 TPUs, and use one of the TPUs as the
The head is not a TPU. We are using a service account to spin up new nodes from the head, and that doesn't work if a TPU instance is the head node, for some reason on Google's side. Therefore we are unfortunately limited to using a normal compute instance as the head for the time being.
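To make the constraint above concrete, here is a minimal sketch of a check that the head node type is an ordinary compute instance while a worker type is a TPU VM. The dict layout and key names (`head_node_type`, `available_node_types`, `node_config`, `machineType`, `acceleratorType`), the node type names, and the helper functions are assumptions modeled on Ray's GCP autoscaler YAML, not taken from this PR's diff.

```python
# Hypothetical, trimmed-down cluster config expressed as a Python dict.
# Key names and values are assumptions modeled on Ray's GCP autoscaler YAML.
cluster_config = {
    "head_node_type": "ray_head_default",
    "available_node_types": {
        "ray_head_default": {
            # Ordinary Compute Engine VM (assumed machine type).
            "node_config": {"machineType": "n2-standard-2"},
        },
        "ray_tpu": {
            # TPU VM worker (assumed accelerator type).
            "node_config": {"acceleratorType": "v2-8"},
        },
    },
}


def is_tpu_node(node_config: dict) -> bool:
    """Treat a node as a TPU VM if it declares an accelerator type."""
    return "acceleratorType" in node_config


def check_head_is_not_tpu(config: dict) -> None:
    """Raise ValueError if the head node type is configured as a TPU VM."""
    head_type = config["head_node_type"]
    head_cfg = config["available_node_types"][head_type]["node_config"]
    if is_tpu_node(head_cfg):
        raise ValueError(f"head node type {head_type!r} must not be a TPU VM")


# Passes silently for the config above, since the head is a regular VM.
check_head_is_not_tpu(cluster_config)
```

A check like this could run before launching the cluster, so that a misconfigured head node is caught locally rather than during node provisioning.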
It does -- you have to grant these IAM roles to your default service worker account: https://twitter.com/theshawwn/status/1431349299586211842
The steps to get this working would be:
Here's how ours is configured: (Apparently we don't actually grant the IAM roles suck. But I use this in all my GCP projects with TPU VMs.
Huh, interesting. Thanks! I'll take a look at whether it's possible to use that here (though that'd be another PR). Would you mind creating a feature request issue for it?
I tried running the new config and it failed. As early as the file mounts step I see this error
However, it actually goes until the second head setup command before ultimately failing with the same error, which at this point has been printed out about 5 times.
@nickjm Thanks! Can you try this:
while running
Hm, I added the two sleep commands and passed the no cache flag, but the same thing happens: it fails in the same way :(
@nickjm does running ray up with
the verbosity level doesn't seem to illuminate anything, but here is the end of my output for your reference
I guess the one thing to note is that the conda failure happens twice on the final command, which is an invocation of the conda CLI, so it makes sense that it would actually fail there. There must be some non-critical code path trying to use conda at every step that is gracefully failing up until that point?
Hey @nickjm, would you like to schedule a meeting so we can try and debug this?
Windows test failure is unrelated (because this is autoscaler only).
Why are these changes needed?
Updates the GCP TPU config to fix issues introduced by an update on Google's side, and also makes the setup commands cleaner.
Related issue number
#16908 (comment)
Checks
I've run scripts/format.sh to lint the changes in this PR.