[Onprem] Support for Different Type of GPUs + Small Bugfix #1356

Merged (3 commits) on Nov 4, 2022
6 changes: 5 additions & 1 deletion sky/backends/backend_utils.py
@@ -157,6 +157,7 @@ def fill_template(template_name: str,
     cluster_name = variables.get('cluster_name')
     output_path = _get_yaml_path_from_cluster_name(cluster_name,
                                                    output_prefix)
+    output_path = os.path.abspath(output_path)

     # Add yaml file path to the template variables.
@@ -758,7 +759,10 @@ def write_cluster_config(
     yaml_path = _get_yaml_path_from_cluster_name(cluster_name)

     # Use a tmp file path to avoid incomplete YAML file being re-used in the future.
-    tmp_yaml_path = yaml_path + '.tmp'
+    if isinstance(cloud, clouds.Local):
+        tmp_yaml_path = yaml_path
+    else:
+        tmp_yaml_path = yaml_path + '.tmp'
     tmp_yaml_path = fill_template(
         cluster_config_template,
         dict(
13 changes: 10 additions & 3 deletions sky/backends/onprem_utils.py
@@ -266,7 +266,11 @@ def get_local_cluster_accelerators(
         'T4',
         'P4',
         'K80',
-        'A100',]
+        'A100',
+        '1080',
+        '2080',
+        'A5000',
+        'A6000']
Member:

Do we want to add TITAN Xp (used in a RISE server)?

Also, do you think nvidia-smi --list-gpus is more robust for detecting the number of GPUs, since it's agnostic to GPU names?

weichiang@freddie:~$ nvidia-smi --list-gpus
GPU 0: NVIDIA TITAN Xp (UUID: GPU-67995609-af7b-27e6-2024-0f5f3c837c1a)
GPU 1: NVIDIA TITAN Xp (UUID: GPU-7dc58de3-9a7a-3e5f-6231-c1ac72af9e0d)
GPU 2: NVIDIA TITAN Xp (UUID: GPU-6dcf1597-36aa-b71c-5e4d-713931d79f9b)
GPU 3: NVIDIA TITAN Xp (UUID: GPU-a35c7b48-c7a8-016a-aa46-f82a654bc9a4)
GPU 4: NVIDIA TITAN Xp (UUID: GPU-72bfeabf-998e-ae6e-7ab2-4b0c066869ec)
GPU 5: NVIDIA TITAN Xp (UUID: GPU-6b3b0b2a-89db-24b4-46b5-aab244bcaad8)

weichiang@blaze:~$ nvidia-smi --list-gpus
GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-7942cb79-543a-c8fe-03e8-172ed446f612)
GPU 1: Tesla P100-PCIE-16GB (UUID: GPU-09e76941-9a87-b1b4-8ef7-864c8d98c4af)
GPU 2: Tesla P100-PCIE-16GB (UUID: GPU-a9edaeaa-5c02-2b26-2ad8-7a5d51f0d1ac)
GPU 3: Tesla P100-PCIE-16GB (UUID: GPU-49a19b7d-5a7e-7ee5-8168-c4e77d648998)
GPU 4: Tesla P100-PCIE-16GB (UUID: GPU-b3b998a7-b941-040e-68f6-5f533bcf5636)
GPU 5: Tesla P100-PCIE-16GB (UUID: GPU-8fc7187f-350a-1b62-2ba2-51466153dd06)
GPU 6: Tesla P100-PCIE-16GB (UUID: GPU-7c8c9f76-d06e-2d3d-2f63-873fc9925ecf)
GPU 7: Tesla P100-PCIE-16GB (UUID: GPU-90773d83-1578-d5fb-ccc0-a30b7c0b4e66)
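
A minimal sketch of this nvidia-smi-based detection, for discussion only (the helper below is hypothetical and not part of this PR):

# Hypothetical sketch: count GPUs per model by parsing `nvidia-smi --list-gpus`,
# which avoids maintaining an accelerator whitelist.
import collections
import subprocess
from typing import Dict


def detect_gpus_with_nvidia_smi() -> Dict[str, int]:
    """Returns a mapping such as {'NVIDIA TITAN Xp': 6}."""
    try:
        output = subprocess.run(['nvidia-smi', '--list-gpus'],
                                capture_output=True,
                                text=True,
                                check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return {}  # nvidia-smi is missing or the driver is not loaded.
    counts: Dict[str, int] = collections.defaultdict(int)
    for line in output.splitlines():
        # Each line looks like: "GPU 0: NVIDIA TITAN Xp (UUID: GPU-...)".
        if line.startswith('GPU '):
            model = line.split(':', 1)[1].split('(UUID')[0].strip()
            counts[model] += 1
    return dict(counts)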

Collaborator:

+1, we should think about how we can do away with this list and just use nvidia-smi to parse the available resources. If this PR is time-sensitive, we can do it in a different PR.

Collaborator (Author):

We can do this in a separate PR, imo. There has to be a better way to detect GPUs, even on an on-prem cluster without nvidia-smi installed (which was this code's assumption; it works on all Linux/Unix machines). Very interested in discussing this.

Member:

I think it should be a safe assumption? If nvidia-smi is not installed, Ray will very likely also fail to schedule CUDA jobs.

If we want fewer assumptions, Ray's approach of listing /proc/driver/nvidia/gpus should be better:
https://github.com/ray-project/ray/blob/f39d323ed5916b67042407231f5b91851e8fa0b5/python/ray/_private/resource_spec.py#L281
It looks pretty robust (similar to lspci; it only assumes the NVIDIA driver is installed) and name-agnostic.

weichiang@blaze:~$ ls /proc/driver/nvidia/gpus
0000:04:00.0  0000:06:00.0  0000:07:00.0  0000:08:00.0  0000:0c:00.0  0000:0d:00.0  0000:0e:00.0  0000:0f:00.0
weichiang@blaze:~$ cat /proc/driver/nvidia/gpus/0000\:04\:00.0/information
Model: 		 Tesla P100-PCIE-16GB
IRQ:   		 722
GPU UUID: 	 GPU-7942cb79-543a-c8fe-03e8-172ed446f612
Video BIOS: 	 86.00.41.00.06
Bus Type: 	 PCIe
DMA Size: 	 47 bits
DMA Mask: 	 0x7fffffffffff
Bus Location: 	 0000:04:00.0
Device Minor: 	 0
GPU Excluded:	 No

This would automatically cover TITAN Xp and TITAN V, which are currently not in the list but are used in RISE's and Woosuk's servers. Does that make sense?
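
A minimal sketch of that /proc-based detection, for discussion only (the helper below is hypothetical and not part of this PR; it assumes a Linux host with the NVIDIA driver loaded):

# Hypothetical sketch: enumerate GPUs via /proc/driver/nvidia/gpus, similar in
# spirit to Ray's detection; it only assumes the NVIDIA driver is installed.
import collections
import os
from typing import Dict


def detect_gpus_from_proc() -> Dict[str, int]:
    """Returns a mapping such as {'Tesla P100-PCIE-16GB': 8}."""
    proc_gpus_path = '/proc/driver/nvidia/gpus'
    if not os.path.isdir(proc_gpus_path):
        return {}  # No NVIDIA driver, hence no NVIDIA GPUs to report.
    counts: Dict[str, int] = collections.defaultdict(int)
    for bus_id in os.listdir(proc_gpus_path):
        info_path = os.path.join(proc_gpus_path, bus_id, 'information')
        model = 'unknown'
        with open(info_path) as f:
            for line in f:
                # Lines look like: "Model:           Tesla P100-PCIE-16GB".
                if line.startswith('Model:'):
                    model = line.split(':', 1)[1].strip()
                    break
        counts[model] += 1
    return dict(counts)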

Collaborator (Author):

Ray seems to handle this a bit differently on Windows, which is the OS used by one of my family members (where nvidia-smi is not available despite the GPU being usable). I'll look deeper into it for a future PR.

    accelerators_dict = {}
    for acc in all_accelerators:
        output_str = os.popen(f'lspci | grep \'{acc}\'').read()
@@ -358,9 +362,10 @@ def _stop_ray_workers(runner: command_runner.SSHCommandRunner):

    # Launching Ray on the head node.
    head_resources = json.dumps(custom_resources[0], separators=(',', ':'))
+    head_gpu_count = sum(list(custom_resources[0].values()))
    head_cmd = ('ray start --head --port=6379 '
                '--object-manager-port=8076 --dashboard-port 8265 '
-                f'--resources={head_resources!r}')
+                f'--resources={head_resources!r} --num-gpus={head_gpu_count}')

    with console.status('[bold cyan]Launching ray cluster on head'):
        backend_utils.run_command_and_handle_ssh_failure(
@@ -399,9 +404,11 @@ def _start_ray_workers(

        worker_resources = json.dumps(custom_resources[idx + 1],
                                      separators=(',', ':'))
+        worker_gpu_count = sum(list(custom_resources[idx + 1].values()))
        worker_cmd = (f'ray start --address={head_ip}:6379 '
                      '--object-manager-port=8076 --dashboard-port 8265 '
-                      f'--resources={worker_resources!r}')
+                      f'--resources={worker_resources!r} '
+                      f'--num-gpus={worker_gpu_count}')
        backend_utils.run_command_and_handle_ssh_failure(
            runner,
            worker_cmd,
7 changes: 6 additions & 1 deletion sky/resources.py
@@ -204,7 +204,12 @@ def _set_accelerators(
            except ValueError:
                with ux_utils.print_exception_no_traceback():
                    raise ValueError(parse_error) from None
-            assert len(accelerators) == 1, accelerators
+
+            # Ignore check for the local cloud case.
+            # It is possible the accelerators dict can contain multiple
+            # types of accelerators for some on-prem clusters.
+            if not isinstance(self._cloud, clouds.Local):
+                assert len(accelerators) == 1, accelerators

        # Canonicalize the accelerator names.
        accelerators = {
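
For context on the relaxed assertion above, a hypothetical illustration (values made up, not part of this PR) of why an on-prem accelerators dict can hold more than one key:

# Hypothetical illustration (not in this PR): a heterogeneous on-prem cluster
# can report several accelerator types at once, e.g. via the lspci detection
# in onprem_utils.py, so only non-Local clouds keep the single-type assertion.
local_cluster_accelerators = {'V100': 4, '1080': 2}
assert len(local_cluster_accelerators) >= 1  # May legitimately exceed 1 here.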