
[GCP] Add L4 support #2212

Merged: 17 commits into skypilot-org:master on Jul 18, 2023

Conversation

@hzeng-0 (Contributor) commented on Jul 11, 2023:

Adds support for GCP's L4 GPU (fixes #2048). This accompanies a pull request for the catalog: link.

In GCP, L4 GPUs can only be used with G2 instances; this is similar to how A100 GPUs can only be used with A2 instances. Thus, much of the code/behavior for L4 is similar to what was already implemented for A100/A2.
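
For reference, the pairing can be expressed as a GPU-count to G2-machine-type mapping, mirroring the existing A100/A2 mapping. The values below are a sketch based on GCP's documented G2 shapes (https://cloud.google.com/compute/docs/gpus#l4-gpus), not necessarily the exact contents of this diff:

# Sketch only: GPU count -> compatible G2 VM types for L4.
# Values follow GCP's published G2 shapes and may differ from the PR's catalog.
_L4_INSTANCE_TYPE_DICT = {
    1: ['g2-standard-4', 'g2-standard-8', 'g2-standard-12',
        'g2-standard-16', 'g2-standard-32'],
    2: ['g2-standard-24'],
    4: ['g2-standard-48'],
    8: ['g2-standard-96'],
}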

When testing, be sure to have the catalog updates as well: we changed v5/gcp/images.csv and v5/gcp/vms.csv.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky show-gpus -a
    • sky launch --gpus l4
    • sky launch --gpus l4:2 --cloud gcp --region asia-southeast1 --env MY_ENV=1
    • sky launch --gpus a100:2 --cloud gcp --region europe-west4 --env MY_ENV=1 (checking A100 still works)
    • sky launch -t g2-standard-12 and sky launch --gpus l4 -t n1-highcpu-16 (each raises an exception, as expected)
    • Ran some LLM models on the GPU (@infwinston)
  • All smoke tests: pytest tests/test_smoke.py

@fozziethebeat (Contributor) commented:

I ran this and confirmed it worked. I also tested use of --memory 32 and confirmed it picks g2-standard-8 as expected.

@infwinston (Member) left a review:

Thanks @hzeng-0! Some comments plus an image issue, but overall this looks good to merge.

Review threads on sky/clouds/service_catalog/gcp_catalog.py (outdated, resolved)
@infwinston (Member) commented on Jul 13, 2023:

I can also confirm that L4 runs a 7B LLM model smoothly. I tested it with vicuna-7b-v1.3.


Review thread on sky/clouds/gcp.py (outdated, resolved)
@infwinston (Member) commented:

@hzeng-0 could you resolve the conflict in sky/clouds/service_catalog/gcp_catalog.py? Other than that, I think we are good to go.

# https://cloud.google.com/compute/docs/gpus#l4-gpus
if acc_name in _A100_INSTANCE_TYPE_DICTS:
    return True, [_A100_INSTANCE_TYPE_DICTS[acc_name][acc_count]]
if acc_name == 'L4':
@infwinston (Member):

This if won't be needed after we merge the two type dicts
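
A minimal sketch of that simplification, using the merged name proposed below (_ACC_INSTANCE_TYPE_DICTS); the surrounding code and return shape are assumptions, not the exact diff:

# Sketch only: a single membership check covers both A100 (A2 hosts)
# and L4 (G2 hosts) once the two dicts are merged.
if acc_name in _ACC_INSTANCE_TYPE_DICTS:
    # The dict maps accelerator count -> list of compatible VM types.
    return True, _ACC_INSTANCE_TYPE_DICTS[acc_name][acc_count]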

@@ -75,6 +75,20 @@
    }
}

# gpu count -> [vm types]
_L4_INSTANCE_TYPE_DICT = {
@infwinston (Member):

Actually, can we merge _A100_INSTANCE_TYPE_DICTS and _L4_INSTANCE_TYPE_DICT and rename the result to _ACC_INSTANCE_TYPE_DICTS? That would let us simplify the code quite a bit.
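
For example, the merged structure might look like this (the A2/G2 machine-type lists are illustrative, drawn from GCP's documented shapes rather than from this diff):

# Sketch only: accelerator name -> accelerator count -> compatible VM types.
_ACC_INSTANCE_TYPE_DICTS = {
    'A100': {
        1: ['a2-highgpu-1g'],
        2: ['a2-highgpu-2g'],
        4: ['a2-highgpu-4g'],
        8: ['a2-highgpu-8g'],
        16: ['a2-megagpu-16g'],
    },
    'L4': {
        # Same count -> G2 VM type mapping as sketched in the PR description above.
        1: ['g2-standard-4', 'g2-standard-8', 'g2-standard-12',
            'g2-standard-16', 'g2-standard-32'],
        2: ['g2-standard-24'],
        4: ['g2-standard-48'],
        8: ['g2-standard-96'],
    },
}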

@infwinston (Member) left a review:

Almost there! Just two places left to change.

-    if instance_type != a100_instance_type:
+    if acc_name in _ACC_INSTANCE_TYPE_DICTS:
+        matching_types: List[str] = sum(
+            _ACC_INSTANCE_TYPE_DICTS[acc_name].values(), [])
@infwinston (Member):

This should be _ACC_INSTANCE_TYPE_DICTS[acc_name][acc_count], right?

@infwinston (Member):

Otherwise instance types for other accelerator counts will be included, which they shouldn't be.
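
For example, with hypothetical contents:

# Hypothetical per-count mapping, for illustration only.
counts_to_types = {1: ['g2-standard-4'], 2: ['g2-standard-24'], 4: ['g2-standard-48']}

sum(counts_to_types.values(), [])  # ['g2-standard-4', 'g2-standard-24', 'g2-standard-48']
counts_to_types[2]                 # ['g2-standard-24']  <- only the types valid for 2 GPUs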

            'accelerators as A100.')
    for acc_name, val in _ACC_INSTANCE_TYPE_DICTS.items():
        if instance_type in sum(val.values(), []):
            # NOTE: While it is allowed to use A2 VMs as CPU-only nodes,
@infwinston (Member):

Also mention G2 VMs in this comment?

                f'{acc_name} GPUs cannot be attached to {instance_type}. '
                f'Use one of {matching_types} instead. Please refer to '
                'https://cloud.google.com/compute/docs/gpus')
        return
@infwinston (Member):

We need to remove this return, right? Otherwise the later code won't run.

@infwinston (Member) left a review:

LGTM now, thanks a lot for all the effort @hzeng-0!

@infwinston infwinston merged commit a184687 into skypilot-org:master Jul 18, 2023

Linked issue: Support L4 GPUs in GCP (#2048)
3 participants