
[GCP] Update GCP TPU config #18634

Merged 6 commits into ray-project:master on Sep 29, 2021
Conversation

@Yard1 (Member) commented Sep 15, 2021

Why are these changes needed?

Updates the GCP TPU config to fix issues introduced by an update on Google's side, and also makes the setup commands cleaner.

Related issue number

#16908 (comment)

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ijrsvt (Contributor) left a comment

LGTM!

@Yard1 changed the title from "[autoscaler] Update GCP TPU config" to "[GCP] Update GCP TPU config" on Sep 15, 2021
@shawwn commented Sep 15, 2021

By the way, while we're at it...

# Uncomment to use preemptible TPUs
# schedulingConfig:
#     preemptible: true

Would it be possible to uncomment preemptible: true? New TFRC members only have preemptible v2-8s in us-central1-f. So if a TFRC member wants to run this demo, they'll always need to uncomment it anyway:

            # Only v2-8 and v3-8 accelerator types are currently supported.
            # Support for TPU pods will be added in the future.
            acceleratorType: v2-8
            runtimeVersion: v2-alpha
            schedulingConfig:
                preemptible: true

@Yard1 (Member, Author) commented Sep 15, 2021

@shawwn Sure, good call, let me do that

@shawwn commented Sep 15, 2021

I haven't had the chance to try this demo yet, but one thing that confused me: is the head also a TPU?

TPU VMs cost $0 to run, whereas firing up a non-TPU n2-standard-2 costs >> $0.

When I ran swarm-jax, I was able to fire up 8 TPUs, and use one of the TPUs as the head. Out of curiosity, would it even be possible to write a configuration like that? (I'm mostly just wondering whether ray's design can support that scenario...)

@Yard1 (Member, Author) commented Sep 15, 2021

The head is not a TPU - we are using a service account to spin up new nodes from the head, and that doesn't work if a TPU instance is the head node, for some reason on Google's side. Therefore we are unfortunately limited to using a normal compute instance as the head for the time being.

@shawwn commented Sep 15, 2021

that doesn't work if a TPU instance is a head node

It does -- you have to grant these IAM roles to your default service account:

https://twitter.com/theshawwn/status/1431349299586211842

  • TPU Admin
  • Service Account User
  • Compute Viewer

The steps to get this working would be:

  1. create a new TPU and SSH into it
  2. run gcloud auth list
  3. copy the service account email address
  4. go to IAM & Admin
  5. Grant TPU Admin, Service Account User, and Compute Viewer roles to that email address (a gcloud sketch follows this list)
  6. At that point, you'll be able to create TPUs while SSH'ed into another TPU.
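
For reference, a rough gcloud sketch of steps 2 and 5 (the project ID and service account email below are placeholders, and I'm assuming the standard role IDs for TPU Admin, Service Account User, and Compute Viewer):

# Step 2, on the TPU VM: list the active account and copy the service account email
gcloud auth list

# Step 5, from any machine where you have project-level IAM permissions
# (PROJECT_ID and SA_EMAIL are placeholders - substitute your own values)
PROJECT_ID=my-gcp-project
SA_EMAIL=1234567890-compute@developer.gserviceaccount.com
gcloud projects add-iam-policy-binding "$PROJECT_ID" --member="serviceAccount:$SA_EMAIL" --role="roles/tpu.admin"
gcloud projects add-iam-policy-binding "$PROJECT_ID" --member="serviceAccount:$SA_EMAIL" --role="roles/iam.serviceAccountUser"
gcloud projects add-iam-policy-binding "$PROJECT_ID" --member="serviceAccount:$SA_EMAIL" --role="roles/compute.viewer"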

Here's how ours is configured:

[screenshot: IAM role bindings for our default service account]

(Apparently we don't actually grant the Compute Viewer role to our default service account -- I think it's only necessary when the target account wants to SSH into TPUs, e.g. inviting an external person into your GCP project to use your TPU VMs.)

IAM roles suck. But I use this in all my GCP projects with TPU VMs.

@Yard1 (Member, Author) commented Sep 15, 2021

Huh, interesting. Thanks! I'll take a look at whether it's possible to use that here (though that'd be another PR). Would you mind creating a feature request issue for it?

@shawwn commented Sep 15, 2021

Huh, interesting. Thanks! I'll take a look at whether it's possible to use that here (though that'd be another PR). Would you mind creating a feature request issue for it?

Okay, I took a stab at it: #18645

(Apologies that it's not too coherent -- I'm rather tired at the moment. :) )

@nickjm commented Sep 15, 2021

I tried running the new config and it failed. As early as the file mounts step I see this error:

New status: syncing-files
  [2/7] Processing file mounts
Traceback (most recent call last):
  File "/opt/conda/bin/conda", line 12, in <module>
    from conda.cli import main
ModuleNotFoundError: No module named 'conda'
Shared connection to 35.232.38.169 closed.
    /home/ubuntu/server/ from /my/path/here

However, it actually gets to the second head setup command before ultimately failing with the same error, which by that point has been printed about 5 times.

(1/21) conda create -y -n "ray" pytho...
Traceback (most recent call last):
  File "/opt/conda/bin/conda", line 12, in <module>
    from conda.cli import main
ModuleNotFoundError: No module named 'conda'
Traceback (most recent call last):
  File "/opt/conda/bin/conda", line 12, in <module>
    from conda.cli import main
ModuleNotFoundError: No module named 'conda'
Shared connection to 35.232.38.169 closed.
2021-09-15 15:02:12,405	INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1631732531807-5cc0d55ed43cd-8ad50cd6-3fb67274 to finish...
2021-09-15 15:02:17,871	INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1631732531807-5cc0d55ed43cd-8ad50cd6-3fb67274 finished.
  New status: update-failed
  !!!
  SSH command failed.
  !!!

  Failed to setup head node.

@Yard1 (Member, Author) commented Sep 15, 2021

@nickjm Thanks! Can you try this:

head_setup_commands:
  - sleep 2
  - sleep 2
  - sudo chown -R $(whoami) /opt/conda/*
  - conda create -y -n "ray" python=3.8.5
  - conda activate ray && echo 'conda activate ray' >> ~/.bashrc
  - python -m pip install --upgrade pip
  - python -m pip install --upgrade "jax[cpu]==0.2.14"
  - python -m pip install --upgrade fabric dataclasses optax==0.0.6 git+https://github.com/deepmind/dm-haiku google-api-python-client cryptography tensorboardX ray[default]
  - python -m pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl
  - git clone https://github.com/Yard1/swarm-jax.git && cd swarm-jax && python -m pip install .

while running ray up with the --no-config-cache argument?
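
i.e. something like this (tpu.yaml below is just a placeholder for whichever config file you're using):

ray up tpu.yaml --no-config-cache   # tpu.yaml is a placeholder for your config file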

@nickjm commented Sep 15, 2021

Hm, I added the two sleep commands and passed the no-config-cache flag, but the same thing happens; it fails in the same way :(

@Yard1 (Member, Author) commented Sep 15, 2021

@nickjm does running ray up with the -v flag give any more information?

@nickjm commented Sep 20, 2021

The verbosity level doesn't seem to illuminate anything, but here is the end of my output for your reference:

(2/23) sudo chown -R $(whoami) /opt/conda/*
    Running `sudo chown -R $(whoami) /opt/conda/*`
Traceback (most recent call last):
  File "/opt/conda/bin/conda", line 12, in <module>
    from conda.cli import main
ModuleNotFoundError: No module named 'conda'
Shared connection to 35.232.38.169 closed.
    (3/23) conda create -y -n "ray" python=3.8.5
    Running `conda create -y -n "ray" python=3.8.5`
Traceback (most recent call last):
  File "/opt/conda/bin/conda", line 12, in <module>
    from conda.cli import main
ModuleNotFoundError: No module named 'conda'
Traceback (most recent call last):
  File "/opt/conda/bin/conda", line 12, in <module>
    from conda.cli import main
ModuleNotFoundError: No module named 'conda'
Shared connection to 35.232.38.169 closed.
2021-09-20 00:17:14,665	INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1632111434198-5cc658e44ecc5-38a84370-fabbbf76 to finish...
2021-09-20 00:17:20,107	INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1632111434198-5cc658e44ecc5-38a84370-fabbbf76 finished.
  New status: update-failed
  !!!
  {'message': 'SSH command failed.'}
  SSH command failed.
  !!!

  Failed to setup head node.

I guess the one thing to note is that the conda failure happens twice on the final command, which is an invocation of the conda CLI, so it makes sense that it would actually fail there. There must be some non-critical code path trying to use conda at every step that fails gracefully up until that point?
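
For what it's worth, this is roughly how I plan to poke at the conda install on the head node (just a sketch; the config filename is a placeholder, and the /opt/conda path comes from the traceback above):

# Open a shell on the head node (tpu.yaml is a placeholder for your config file)
ray attach tpu.yaml

# On the head node: check which interpreter the conda entry point uses,
# and whether the 'conda' package is importable from it
head -n 1 /opt/conda/bin/conda
/opt/conda/bin/python -c "import conda; print(conda.__version__)"
which -a conda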

@Yard1 (Member, Author) commented Sep 20, 2021

Hey @nickjm, would you like to schedule a meeting so we can try and debug this?

@Yard1 (Member, Author) commented Sep 21, 2021

@ijrsvt Hey, this can be merged. The issue @nickjm is having seems to be unrelated.

@ijrsvt (Contributor) commented Sep 29, 2021

The Windows test failure is unrelated (because this PR touches the autoscaler only).

@ijrsvt ijrsvt merged commit 573c66a into ray-project:master Sep 29, 2021