Commit

Update
Co-authored-by: Ayaka <[email protected]>
yixiaoer and ayaka14732 committed Jun 16, 2024
1 parent 386b41d commit 1b40997
Showing 3 changed files with 98 additions and 15 deletions.
81 changes: 74 additions & 7 deletions README.md
@@ -1,22 +1,89 @@
# tpux
# tpux: Enhance Your Google Cloud TPU Experience

Welcome to tpux, your essential toolkit designed to revolutionize the way you use Google Cloud TPUs. This suite of tools is tailored to simplify and streamline your TPU setup and operation processes, ensuring you maximize your productivity with minimal effort.

## Pronunciation

To pronounce "tpux", first say "TPU" as you would in English, followed by "X" pronounced as /iks/ in French.

## Why You Need tpux

Setting up Google Cloud TPU instances traditionally involves initializing empty VM instances, a process that can be tedious and repetitive. With tpux, this setup is greatly simplified, allowing you to focus on what truly matters—your work.

## Features

- `tpux`: A user-friendly setup script that automates the configuration of your Google Cloud TPUs. This tool ensures that you are equipped with the latest practices and optimizations, keeping your operations cutting-edge.
- `podrun`: Seamlessly execute commands across all nodes in your TPU pods. Ideal for scaling applications and managing large-scale machine learning tasks, it enhances efficiency and effectiveness across your deployments.

Inspired by the comprehensive guide [ayaka14732/tpu-starter](https://github.com/ayaka14732/tpu-starter), tpux incorporates best practices for TPU usage in open-source environments.

## Setting Up Your TPU VM or TPU Pod with tpux

### When Creating TPU VM or TPU Pod Instances

When creating a TPU VM instance, make sure to select the latest `tpu-ubuntu2204-base` software version so that you get the most up-to-date system and software packages.

Besides using the web UI to create TPUs, you can also use the Google Cloud Shell. Here, your `--version` option should specify `tpu-ubuntu2204-base`. For example:

```sh
until gcloud alpha compute tpus tpu-vm create node-2 --zone us-central2-b --accelerator-type v4-32 --version tpu-ubuntu2204-base ; do : ; done
```

### Using the `tpux` Command to Execute the Setup Script

After you SSH into one of the hosts of your TPU VM or TPU Pod, run the setup as follows:

```sh
pip install tpux
export PATH="$HOME/.local/bin:$PATH"
tpux
```

Simply follow the on-screen prompts to complete the setup of your TPU VM or TPU Pod.

### Executing Commands Across All Hosts with the `podrun` Command

After setting up with the `tpux` command, you can use the `podrun` command to execute specified commands across all TPU hosts.

`podrun` reads the command to be executed from stdin, for example:

Use `podrun` to make all hosts purr like a kitty:
```sh
echo echo meow | podrun -i
```

## TODO
This command outputs "meow" on all hosts.

Using the `-i` parameter executes the command on all machines, while omitting `-i` executes on all hosts except the local one:

```sh
echo echo meow | podrun
```

- [ ] `env['DEBIAN_FRONTEND'] = 'noninteractive'` for pod
- [ ] Create venv
This command outputs "meow" on all hosts except the local machine.
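
The difference between the two invocations can be modeled with a small Python sketch. This is only an illustration of the behavior described above, not the actual `podrun` implementation; the function name `select_target_hosts` and the sample addresses are hypothetical.

```python
from typing import List


def select_target_hosts(all_hosts: List[str], local_host: str, include_local: bool) -> List[str]:
    """Model which hosts a command runs on, with or without `-i`.

    Hypothetical helper: it mirrors the documented behavior only.
    """
    if include_local:
        # With `-i`: every host, including the local one.
        return list(all_hosts)
    # Without `-i`: every host except the local one.
    return [host for host in all_hosts if host != local_host]


hosts = ['10.0.0.1', '10.0.0.2', '10.0.0.3']  # made-up addresses
print(select_target_hosts(hosts, '10.0.0.1', include_local=True))
print(select_target_hosts(hosts, '10.0.0.1', include_local=False))
```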

## Example
For more information on how to use the `podrun` command, type:

```sh
podrun -h
```

### Verifying Successful Configuration of Your TPU Pod

Given the complexity of configuring a TPU Pod, after executing the `tpux` setup command, you may want to ensure it was successful. You can verify this by:

```sh
echo echo meow | podrun -i
```

If the TPU Pod is configured correctly, the above command should output multiple lines of "meow," where the number of lines corresponds to the number of TPU Pod hosts.

```sh
touch ~/nfs_share/meow
echo ls ~/nfs_share/meow
echo ls -l ~/nfs_share/meow | podrun -i
```

If configured correctly, the above commands should display the results of `ls -l ~/nfs_share/meow` on multiple lines, with the number of lines equaling the number of TPU Pod hosts.
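
Both checks come down to counting output lines, which could be automated roughly as follows. This is a minimal Python sketch that assumes the captured output holds exactly one non-empty line per host and nothing else; the helper names are hypothetical, and how the output is captured is left out.

```python
def count_host_replies(output: str) -> int:
    """Count the non-empty lines in captured command output."""
    return sum(1 for line in output.splitlines() if line.strip())


def pod_looks_configured(output: str, expected_hosts: int) -> bool:
    """True if the number of reply lines equals the number of TPU Pod hosts."""
    return count_host_replies(output) == expected_hosts


# Example with made-up output from a 4-host pod.
sample = 'meow\nmeow\nmeow\nmeow\n'
print(pod_looks_configured(sample, expected_hosts=4))
```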

## Disclaimer

This is not an officially supported Google product.
2 changes: 1 addition & 1 deletion src/tpux/__init__.py
@@ -1 +1 @@
-__version__ = '0.0.3'
+__version__ = '0.1.0'
30 changes: 23 additions & 7 deletions src/tpux/cli.py
@@ -73,7 +73,13 @@ def input_priv_ipv4_addrs(prompt: str, default: Optional[List[IPv4Address]] = No
             return ip_addrs
         except AddressValueError:
             pass
-        print('Please input a list of valid private IPv4 addresses (comma-separated).')
+        ip_host0 = get_priv_ipv4_addr()
+        print(f'''Please input a list of valid private IPv4 addresses (comma-separated).
+{YELLOW_START}To find the IPv4 addresses:
+1. Open https://console.cloud.google.com/compute/tpus
+2. Click on the node name of the TPU pod you're using in the current project
+3. In the details, find the External IP addresses
+4. Do NOT include the IP address of the current host: {ip_host0}{COLOR_RESET}''')

 def get_priv_ipv4_addr(*, interface_prefix: str = 'ens') -> IPv4Address:
     addrs = psutil.net_if_addrs()
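
The validation that this prompt supports could be sketched as below. This is a self-contained illustration using the standard `ipaddress` module, not the body of `input_priv_ipv4_addrs`. The function name `parse_priv_ipv4_addrs` is hypothetical, and rejecting the current host's address is an assumption based on the new hint text.

```python
from ipaddress import AddressValueError, IPv4Address
from typing import List, Optional


def parse_priv_ipv4_addrs(text: str, current_host: Optional[IPv4Address] = None) -> Optional[List[IPv4Address]]:
    """Parse a comma-separated list of private IPv4 addresses.

    Returns None when the input should be rejected, so the caller can re-prompt.
    """
    try:
        addrs = [IPv4Address(part.strip()) for part in text.split(',')]
    except AddressValueError:
        return None  # at least one entry is not a valid IPv4 address
    if not all(addr.is_private for addr in addrs):
        return None  # only private addresses are accepted
    if current_host is not None and current_host in addrs:
        return None  # assumption: the current host must not be listed
    return addrs


print(parse_priv_ipv4_addrs('10.0.0.2, 10.0.0.3', IPv4Address('10.0.0.1')))
```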
@@ -147,7 +153,8 @@ def generate_ssh_key() -> None:
     input('Please press enter to continue...')

     while True:
-        authorized_key_data = Path(authorized_key_path).read_text()
+        authorized_key_file = Path(authorized_key_path)
+        authorized_key_data = '' if not authorized_key_file.exists() else authorized_key_file.read_text()
         if public_key in authorized_key_data:
             break
         input('The key has not been propagated to host machines. Please wait for a while, and then press enter to continue...')
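
The change above guards against a missing `authorized_keys` file, which would otherwise raise `FileNotFoundError` from `read_text()`. A minimal standalone sketch of the same pattern follows; the helper names are hypothetical and the path is passed in rather than taken from the real script.

```python
from pathlib import Path


def read_text_or_empty(path: Path) -> str:
    """Return the file contents, or an empty string if the file does not exist."""
    return path.read_text() if path.exists() else ''


def key_is_authorized(public_key: str, authorized_keys: Path) -> bool:
    """Check whether the public key already appears in the authorized keys file."""
    return public_key in read_text_or_empty(authorized_keys)
```

With this shape, a file that has not been created yet counts as "key not propagated", so the polling loop keeps waiting instead of crashing.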
@@ -174,7 +181,9 @@ def check_is_not_root() -> None:
 def check_tpu_chip_exists() -> None:
     tpu_chip_exists = len(glob.glob('/dev/accel*')) > 0
     if not tpu_chip_exists:
-        print('TPU chips not detected, exiting...')
+        print('TPU chips not detected. Please check your TPU setup, create a new TPU VM or turn to the Cloud TPU documentation for further assistance.')
+        print('Exiting...')
+
         exit(-1)

 update_apt_commands = [
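
For reference, the device check can be separated from the exit so that the detection logic is testable on its own. The sketch below is a variation, not the code in this commit: the `pattern` parameter and the `require_tpu_chip` name are hypothetical, and it uses `sys.exit(1)` with a message on stderr instead of `exit(-1)`, whose status is typically reported as 255 on POSIX systems.

```python
import glob
import sys


def tpu_chip_exists(pattern: str = '/dev/accel*') -> bool:
    """True if at least one path matches the TPU device glob."""
    return bool(glob.glob(pattern))


def require_tpu_chip(pattern: str = '/dev/accel*') -> None:
    """Exit with a non-zero status when no TPU device file is found."""
    if not tpu_chip_exists(pattern):
        print('TPU chips not detected.', file=sys.stderr)
        sys.exit(1)
```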
@@ -238,17 +247,24 @@ def insert_exports_config():
     export_file = '' if not export_file_path.exists() else export_file_path.read_text()

     new_entries = '\n'.join(f'/nfs_share {ip}(rw,sync,no_subtree_check)' for ip in hosts)
-    export_file_new = f'''
-{export_file}
+    export_new_entries = f'''
 {block_start}
 {new_entries}
 {block_end}
 '''

+    if not block_pattern.search(export_file):
+        if not export_file or export_file.endswith('\n\n'):
+            export_file_new = f'{export_file}{export_new_entries}'
+        else:
+            export_file_new = f'{export_file}\n{export_new_entries}'
+    else:
+        export_file_new = block_pattern.sub(export_new_entries, export_file)
+
     with tempfile.TemporaryDirectory() as name:
         tmp_file = Path(name) / 'exports'
         tmp_file.write_text(export_file_new)

         subprocess.run(['sudo', 'cp', str(tmp_file), export_file_name], check=True)

 def clear_exports_config() -> None:
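
The new branch makes the exports update idempotent: an existing managed block is replaced rather than appended again. A self-contained sketch of that idea is below. The marker strings and the regular expression are assumptions, since the real `block_start`, `block_end`, and `block_pattern` are defined outside this hunk, and `upsert_block` is a hypothetical name.

```python
import re

# Hypothetical markers; the real values are not shown in the diff.
BLOCK_START = '# BEGIN tpux'
BLOCK_END = '# END tpux'
BLOCK_PATTERN = re.compile(
    rf'\n{re.escape(BLOCK_START)}\n.*?\n{re.escape(BLOCK_END)}\n',
    re.DOTALL,
)


def upsert_block(existing: str, entries: str) -> str:
    """Insert the managed block, or replace it if one is already present."""
    block = f'\n{BLOCK_START}\n{entries}\n{BLOCK_END}\n'
    if BLOCK_PATTERN.search(existing):
        # A function replacement avoids backslash interpretation in `entries`.
        return BLOCK_PATTERN.sub(lambda _match: block, existing)
    if not existing or existing.endswith('\n\n'):
        return existing + block
    return existing + '\n' + block
```

Applying `upsert_block` twice with the same entries returns the same text, so the setup can be rerun without stacking duplicate blocks.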
@@ -274,7 +290,7 @@ def setup_single_host() -> None:

 def setup_tpu_pod() -> None:
     check_is_not_root()
-    # check_tpu_chip_exists()
+    check_tpu_chip_exists()

     config_podips()
     generate_ssh_key()