
[GCP] Update GCP TPU config #18634

Merged 6 commits into ray-project:master on Sep 29, 2021
Conversation

@Yard1 (Member) commented Sep 15, 2021

Why are these changes needed?

Updates the GCP TPU config to fix issues introduced by an update on Google's side, and also makes the setup commands cleaner.

Related issue number

#16908 (comment)

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ijrsvt (Contributor) left a comment

LGTM!

@Yard1 changed the title from "[autoscaler] Update GCP TPU config" to "[GCP] Update GCP TPU config" on Sep 15, 2021
@shawwn commented Sep 15, 2021

By the way, while we're at it...

# Uncomment to use preemptible TPUs
# schedulingConfig:
#     preemptible: true

Would it be possible to uncomment preemptible: true? New TFRC members only have preemptible v2-8s in us-central1-f. So if a TFRC member wants to run this demo, they'll always need to uncomment it anyway:

            # Only v2-8 and v3-8 accelerator types are currently supported.
            # Support for TPU pods will be added in the future.
            acceleratorType: v2-8
            runtimeVersion: v2-alpha
            schedulingConfig:
                preemptible: true

@Yard1 (Member, Author) commented Sep 15, 2021

@shawwn Sure, good call, let me do that

@shawwn commented Sep 15, 2021

I haven't had the chance to try this demo yet, but one thing that confused me: is the head also a TPU?

TPU VMs cost $0 to run, whereas firing up a non-TPU n2-standard-2 costs >> $0.

When I ran swarm-jax, I was able to fire up 8 TPUs, and use one of the TPUs as the head. Out of curiosity, would it even be possible to write a configuration like that? (I'm mostly just wondering whether ray's design can support that scenario...)

@Yard1 (Member, Author) commented Sep 15, 2021

The head is not a TPU - we are using a service account to spin up new nodes from the head, and that doesn't work if a TPU instance is the head node, for some reason on Google's side. Therefore we are unfortunately limited to using a normal compute instance as the head for the time being.

@shawwn commented Sep 15, 2021

that doesn't work if a TPU instance is a head node

It does -- you have to grant these IAM roles to your default service account:

https://twitter.com/theshawwn/status/1431349299586211842

  • TPU Admin
  • Service Account User
  • Compute Viewer

The steps to get this working would be:

  1. create a new TPU and SSH into it
  2. run gcloud auth list
  3. copy the service account email address
  4. go to IAM & Admin
  5. Grant TPU Admin, Service Account User, and Compute Viewer roles to that email address (a gcloud sketch follows this list)
  6. At that point, you'll be able to create TPUs while SSH'ed into another TPU.
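
For reference, a rough gcloud sketch of steps 2 and 5 (the project ID and service account email below are placeholders, and I'm assuming the standard role IDs for TPU Admin, Service Account User, and Compute Viewer):

# Step 2, on the TPU VM: list the active account and copy the service account email
gcloud auth list

# Step 5, from any machine where you have project-level IAM permissions
# (PROJECT_ID and SA_EMAIL are placeholders - substitute your own values)
PROJECT_ID=my-gcp-project
SA_EMAIL=1234567890-compute@developer.gserviceaccount.com
gcloud projects add-iam-policy-binding "$PROJECT_ID" --member="serviceAccount:$SA_EMAIL" --role="roles/tpu.admin"
gcloud projects add-iam-policy-binding "$PROJECT_ID" --member="serviceAccount:$SA_EMAIL" --role="roles/iam.serviceAccountUser"
gcloud projects add-iam-policy-binding "$PROJECT_ID" --member="serviceAccount:$SA_EMAIL" --role="roles/compute.viewer"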

Here's how ours is configured:

[screenshot: IAM role bindings for our default service account]

(Apparently we don't actually grant the Compute Viewer role to our default service account -- I think it's only necessary when the target account wants to SSH into TPUs, e.g. inviting an external person into your GCP project to use your TPU VMs.)

IAM roles suck. But I use this in all my GCP projects with TPU VMs.

@Yard1 (Member, Author) commented Sep 15, 2021

Huh, interesting. Thanks! I'll take a look at whether it's possible to use that here (though that'd be another PR). Would you mind creating a feature request issue for it?

@shawwn commented Sep 15, 2021

Huh, interesting. Thanks! I'll take a look at whether it's possible to use that here (though that'd be another PR). Would you mind creating a feature request issue for it?

Okay, I took a stab at it: #18645

(Apologies that it's not too coherent -- I'm rather tired at the moment. :) )

@nickjm commented Sep 15, 2021

I tried running the new config and it failed. As early as the file mounts step I see this error:

New status: syncing-files
  [2/7] Processing file mounts
Traceback (most recent call last):
  File "/opt/conda/bin/conda", line 12, in <module>
    from conda.cli import main
ModuleNotFoundError: No module named 'conda'
Shared connection to 35.232.38.169 closed.
    /home/ubuntu/server/ from /my/path/here

However, it actually gets to the second head setup command before ultimately failing with the same error, which by that point has been printed about 5 times.

(1/21) conda create -y -n "ray" pytho...
Traceback (most recent call last):
  File "/opt/conda/bin/conda", line 12, in <module>
    from conda.cli import main
ModuleNotFoundError: No module named 'conda'
Traceback (most recent call last):
  File "/opt/conda/bin/conda", line 12, in <module>
    from conda.cli import main
ModuleNotFoundError: No module named 'conda'
Shared connection to 35.232.38.169 closed.
2021-09-15 15:02:12,405	INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1631732531807-5cc0d55ed43cd-8ad50cd6-3fb67274 to finish...
2021-09-15 15:02:17,871	INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1631732531807-5cc0d55ed43cd-8ad50cd6-3fb67274 finished.
  New status: update-failed
  !!!
  SSH command failed.
  !!!

  Failed to setup head node.

@Yard1 (Member, Author) commented Sep 15, 2021

@nickjm Thanks! Can you try this:

head_setup_commands:
  - sleep 2
  - sleep 2
  - sudo chown -R $(whoami) /opt/conda/*
  - conda create -y -n "ray" python=3.8.5
  - conda activate ray && echo 'conda activate ray' >> ~/.bashrc
  - python -m pip install --upgrade pip
  - python -m pip install --upgrade "jax[cpu]==0.2.14"
  - python -m pip install --upgrade fabric dataclasses optax==0.0.6 git+https://github.com/deepmind/dm-haiku google-api-python-client cryptography tensorboardX ray[default]
  - python -m pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl
  - git clone https://github.com/Yard1/swarm-jax.git && cd swarm-jax && python -m pip install .

while running ray up with the --no-config-cache argument?
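
i.e. something like this (tpu.yaml below is just a placeholder for whichever config file you're using):

ray up tpu.yaml --no-config-cache   # tpu.yaml is a placeholder for your config file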

@nickjm commented Sep 15, 2021

Hm, I added the two sleep commands and passed the no-config-cache flag, but the same thing happens; it fails in the same way :(

@Yard1 (Member, Author) commented Sep 15, 2021

@nickjm does running ray up with the -v flag give any more information?

@nickjm commented Sep 20, 2021

The verbosity level doesn't seem to illuminate anything, but here is the end of my output for your reference:

(2/23) sudo chown -R $(whoami) /opt/conda/*
    Running `sudo chown -R $(whoami) /opt/conda/*`
Traceback (most recent call last):
  File "/opt/conda/bin/conda", line 12, in <module>
    from conda.cli import main
ModuleNotFoundError: No module named 'conda'
Shared connection to 35.232.38.169 closed.
    (3/23) conda create -y -n "ray" python=3.8.5
    Running `conda create -y -n "ray" python=3.8.5`
Traceback (most recent call last):
  File "/opt/conda/bin/conda", line 12, in <module>
    from conda.cli import main
ModuleNotFoundError: No module named 'conda'
Traceback (most recent call last):
  File "/opt/conda/bin/conda", line 12, in <module>
    from conda.cli import main
ModuleNotFoundError: No module named 'conda'
Shared connection to 35.232.38.169 closed.
2021-09-20 00:17:14,665	INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1632111434198-5cc658e44ecc5-38a84370-fabbbf76 to finish...
2021-09-20 00:17:20,107	INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1632111434198-5cc658e44ecc5-38a84370-fabbbf76 finished.
  New status: update-failed
  !!!
  {'message': 'SSH command failed.'}
  SSH command failed.
  !!!

  Failed to setup head node.

I guess the one thing to note is that the conda failure happens twice on the final command, which is an invocation of the conda CLI, so it makes sense that it would actually fail there. There must be some non-critical code path trying to use conda at every step that fails gracefully up until that point?
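
For what it's worth, this is roughly how I plan to poke at the conda install on the head node (just a sketch; the config filename is a placeholder, and the /opt/conda path comes from the traceback above):

# Open a shell on the head node (tpu.yaml is a placeholder for your config file)
ray attach tpu.yaml

# On the head node: check which interpreter the conda entry point uses,
# and whether the 'conda' package is importable from it
head -n 1 /opt/conda/bin/conda
/opt/conda/bin/python -c "import conda; print(conda.__version__)"
which -a conda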

@Yard1 (Member, Author) commented Sep 20, 2021

Hey @nickjm, would you like to schedule a meeting so we can try and debug this?

@Yard1 (Member, Author) commented Sep 21, 2021

@ijrsvt Hey, this can be merged. The issue @nickjm is having seems to be unrelated.

@ijrsvt (Contributor) commented Sep 29, 2021

The Windows test failure is unrelated (because this PR touches the autoscaler only).

@ijrsvt ijrsvt merged commit 573c66a into ray-project:master Sep 29, 2021