[GCP] Add L4 support #2212
I ran this and confirmed it worked. I also tested use of |
@hzeng-0 could you resolve the conflict in |
# https://cloud.google.com/compute/docs/gpus#l4-gpus
if acc_name in _A100_INSTANCE_TYPE_DICTS:
    return True, [_A100_INSTANCE_TYPE_DICTS[acc_name][acc_count]]
if acc_name == 'L4':
This if won't be needed after we merge the two type dicts.
@@ -75,6 +75,20 @@
}
}

# gpu count -> [vm types]
_L4_INSTANCE_TYPE_DICT = {
Actually, can we merge _A100_INSTANCE_TYPE_DICTS and _L4_INSTANCE_TYPE_DICT and rename the result to _ACC_INSTANCE_TYPE_DICTS, so we can simplify the code quite a bit?
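A minimal sketch of what the suggested merge could look like. The entries and the helper name host_vm_types are hypothetical, chosen for illustration and not copied from this PR; the sketch also assumes the A100 values become lists so both accelerators share the "gpu count -> [vm types]" shape.

```python
from typing import Dict, List

# Hypothetical merged table: accelerator name -> {gpu count -> [vm types]}.
# The entries are illustrative, not the full list used in the PR.
_ACC_INSTANCE_TYPE_DICTS: Dict[str, Dict[int, List[str]]] = {
    'A100': {1: ['a2-highgpu-1g'], 2: ['a2-highgpu-2g']},
    'L4': {1: ['g2-standard-4', 'g2-standard-8'], 2: ['g2-standard-24']},
}


def host_vm_types(acc_name: str, acc_count: int) -> List[str]:
    """Return the VM types tied to `acc_count` x `acc_name`, or [] if none."""
    return _ACC_INSTANCE_TYPE_DICTS.get(acc_name, {}).get(acc_count, [])
```

With a single table, a per-accelerator branch such as the acc_name == 'L4' check is no longer needed, because the same lookup serves every accelerator that is tied to a VM family.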
Almost there! Just two places left to change.
if instance_type != a100_instance_type:
if acc_name in _ACC_INSTANCE_TYPE_DICTS:
    matching_types: List[str] = sum(
        _ACC_INSTANCE_TYPE_DICTS[acc_name].values(), [])
This should be _ACC_INSTANCE_TYPE_DICTS[acc_name][acc_count], right?
Otherwise, instance types for other accelerator counts will be included, which they shouldn't be.
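To make the difference concrete, here is a small sketch comparing the two lookups. The table contents and both function names are hypothetical and only illustrate the point raised in the review.

```python
from typing import Dict, List

# Hypothetical table for illustration: accelerator -> {gpu count -> [vm types]}.
ACC_TYPES: Dict[str, Dict[int, List[str]]] = {
    'L4': {1: ['g2-standard-4', 'g2-standard-8'], 2: ['g2-standard-24']},
}


def all_host_types(acc_name: str) -> List[str]:
    """Flatten the lists for every count, as sum(..., []) does."""
    return sum(ACC_TYPES[acc_name].values(), [])


def host_types_for_count(acc_name: str, acc_count: int) -> List[str]:
    """Return only the VM types that carry exactly `acc_count` accelerators."""
    return ACC_TYPES[acc_name][acc_count]
```

In this sketch, all_host_types('L4') also contains the single-GPU VM types, so a request for two L4s would wrongly accept g2-standard-4, while host_types_for_count('L4', 2) only accepts g2-standard-24.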
        'accelerators as A100.')
for acc_name, val in _ACC_INSTANCE_TYPE_DICTS.items():
    if instance_type in sum(val.values(), []):
        # NOTE: While it is allowed to use A2 VMs as CPU-only nodes,
Also mention G2 VMs in this comment?
        f'{acc_name} GPUs cannot be attached to {instance_type}. '
        f'Use one of {matching_types} instead. Please refer to '
        'https://cloud.google.com/compute/docs/gpus')
return
We need to remove this return, right? Otherwise later code won't be run?
LGTM now, thanks a lot for all the effort @hzeng-0 !
Adds support for GCP's L4 GPU. (Fixes #2048) This is together with a pull request for the catalog: link.
In GCP, L4 GPUs can only be used with G2 instances; this is similar to how A100 GPUs can only be used with A2 instances. Thus, much of the code/behavior for L4 is similar to what was already implemented for A100/A2.
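The pairing rule described above can be sketched as a small check. This is an illustration under stated assumptions, not the code from this PR: HOST_TYPES, check_attachable, the table entries, and the use of ValueError are all hypothetical, and the real behavior for a tied VM type requested without accelerators may differ.

```python
from typing import Dict, List, Optional

# Hypothetical table for illustration: accelerator -> {gpu count -> [vm types]}.
HOST_TYPES: Dict[str, Dict[int, List[str]]] = {
    'A100': {1: ['a2-highgpu-1g'], 2: ['a2-highgpu-2g']},
    'L4': {1: ['g2-standard-4', 'g2-standard-12'], 2: ['g2-standard-24']},
}


def check_attachable(instance_type: str,
                     acc_name: Optional[str] = None,
                     acc_count: int = 0) -> None:
    """Raise ValueError if the VM type / accelerator pairing is not allowed."""
    if acc_name in HOST_TYPES:
        # Tied accelerator: the host must match the requested count exactly.
        allowed = HOST_TYPES[acc_name].get(acc_count, [])
        if instance_type not in allowed:
            raise ValueError(
                f'{acc_name} GPUs cannot be attached to {instance_type}. '
                f'Use one of {allowed} instead.')
        return
    # Other (or no) accelerator: a tied VM family must not be used as host.
    for name, by_count in HOST_TYPES.items():
        if instance_type in sum(by_count.values(), []):
            raise ValueError(
                f'{instance_type} can only be used with {name} accelerators.')
```

For example, check_attachable('g2-standard-24', 'L4', 2) passes silently, while check_attachable('n1-highcpu-16', 'L4', 1) raises, which mirrors the exception case listed in the test commands.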
When testing, be sure to have the catalog updates as well: we changed v5/gcp/images.csv and v5/gcp/vms.csv.

Tested (run the relevant ones):
bash format.sh
sky show-gpus -a
sky launch --gpus l4
sky launch --gpus l4:2 --cloud gcp --region asia-southeast1 --env MY_ENV=1
sky launch --gpus a100:2 --cloud gcp --region europe-west4 --env MY_ENV=1 (checking A100 still works)
sky launch -t g2-standard-12, sky launch --gpus l4 -t n1-highcpu-16 (give exception as expected)
pytest tests/test_smoke.py