-
Notifications
You must be signed in to change notification settings - Fork 549
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GCP] Make sure the chosen VPC has subnet enabled for the region required #2772
Conversation
Blocked by #2764 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the quick fix @Michaelvll. Quick comments.
sky/skylet/providers/gcp/config.py
Outdated
subnets = _list_subnets(config, compute, filter=f'(name="{usable_vpc_name}")') | ||
if not subnets: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about changing get_usable_vpc() into get_usable_vpc_and_subnets(), i.e., avoid an extra query?
skypilot/sky/provision/aws/config.py
Lines 72 to 75 in 31aa76c
subnets, vpc_id = _get_subnet_and_vpc_id( | |
ec2, | |
security_group_ids, | |
config.provider_config['region'], |
sky/skylet/providers/gcp/config.py
Outdated
) | ||
continue | ||
|
||
if not _list_subnets(config, compute, filter=f'(name="{vpc["name"]}")'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In AWS, we get a usable VPC first, then get its usable subnets:
skypilot/sky/provision/aws/config.py
Lines 399 to 414 in 31aa76c
if vpc_name is not None: | |
vpc_id_of_sg = _get_vpc_id_by_name(ec2, vpc_name, region) | |
elif security_group_ids: | |
vpc_id_of_sg = _vpc_id_from_security_group_ids(ec2, security_group_ids) | |
else: | |
vpc_id_of_sg = None | |
all_subnets = list(ec2.subnets.all()) | |
subnets, vpc_id = _usable_subnets( | |
None, | |
all_subnets, | |
azs=availability_zone, | |
vpc_id_of_sg=vpc_id_of_sg, | |
use_internal_ips=use_internal_ips, | |
) | |
return subnets, vpc_id |
Any pros/cons in terms of doing it like that vs. the nested loop here? Maybe the former makes it easier to support things like private IP-enabled subnets in the future ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for mentioning this @concretevitamin! We do need to sort the VPCs in this GCP case as well.
While looking at the code for AWS, it seems that we did not sort the subnets by the vpc id, which means when the user is creating multiple clusters without specifying a VPC name, it is possible that those clusters can locate in different VPCs. It might be fine for now as there is no interconnection between clusters for us right now, but it does not seem very clean. Should we sort the subnets by the (availability_zone, vpc_id) pair?
For GCP here, yes, we can do similar thing for getting the VPC. The current implementation was just inherited from the previous implementation from ray. I will start modifying it once #2764 is merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not too sure I understand the parts about sorting. Some questions:
which means when the user is creating multiple clusters without specifying a VPC name, it is possible that those clusters can locate in different VPCs
This seems okay? E.g., users may change the VPC name setting in ~/.sky/config.yaml across launches, and SkyPilot does faithfully change the VPC to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I mean If the user does not have ~/.sky/config.yaml
, it is possible that multiple clusters may be located on different VPCs depends on the order of VPC returned by AWS, which might be a bit surprising to the user seeing their cluster spreaded across multiple VPCs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like the aws.py snippet quoted above will return 0 or 1 unique VPC identified by name / security group, so this scenario won't happen? Shall we talk more offline?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, good point! Since we will always specify the security group name, the subnets got later should always be in the same VPC. In that case, it should be fine.
@@ -805,7 +814,15 @@ def get_usable_vpc(config) -> str: | |||
if len(vpcnets_all) == 1: | |||
# Skip checking any firewall rules if the user has specified a VPC. | |||
logger.info(f"Using user-specified VPC {specific_vpc_to_use!r}.") | |||
return specific_vpc_to_use | |||
subnets = _list_subnets( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Doesn't feel very clean calling _list_subnets multiple times.
Is it possible to use the pattern in AWS's config.py? Call _list_subnets() once to get all [(subnet, vpc, ...)]
and directly sort + filter this list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Just changed the implementation to list subnets first, which will save at most #VPC times gcp API call.
Tested:
-
sky launch --cpus 2 --cloud gcp
-
sky launch --gpus L4 --cloud gcp
(default VPC does not have the first region to be tried for L4 GPU, i.e. us-east-4, and it will correctly use skypilot-vpc instead)
Co-authored-by: Zongheng Yang <[email protected]>
Co-authored-by: Zongheng Yang <[email protected]>
Co-authored-by: Zongheng Yang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM @Michaelvll
sky/skylet/providers/gcp/config.py
Outdated
Raises: | ||
RuntimeError: if the user has specified a VPC name but the VPC is not found. | ||
""" | ||
_, _, compute, _ = construct_clients_from_provider_config(config["provider"]) | ||
# For backward compatibility, reuse the VPC if the VM is launched. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Q: it seems like sky/provision/aws/config.py doesn't have such back-compat code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question! I just tried to create a new AWS VPC and after a cluster is provisioned, and run the sky launch
again on that cluster. It seems config.py
choose a different VPC, but the launching of that cluster still succeeded and the VPC of that cluster is still the old one. Probably, AWS's API ignores the different VPC for an existing VM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Maybe worth a test/comment here on whether GCP has the same property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Actually, just tried the following and it seems GCP will also ignore the wrong vpc for existing cluster.
sky launch -c test --cloud gcp
the cluster is launched onskypilot-vpc
- create another VPC which will be used by SkyPilot by default to create VMs
sky launch -c test
: the function chooses the new VPC, but the cluster is still on the originalskypilot-vpc
and no error will occur.
However, there is a corner case: when the cluster with multiple nodes is partially launched, and if we do sky launch
again on that cluster, the different nodes can be launched on different VPC. This issue can also happen to AWS multiple node cluster.
It feels to me that we don't need to sacrifise the performance for all launch to handle this corner case, so I removed the existing VM check. Wdyt? @concretevitamin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds good.
Q: before step 3, is the single-node cluster stopped, or is it live? Wondering if the behavior applies to relaunching(starting) a stopped VM as wel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried both stopped and UP clusters, and it seems the steps above work correctly for both.
sky/skylet/providers/gcp/config.py
Outdated
Raises: | ||
RuntimeError: if the user has specified a VPC name but the VPC is not found. | ||
""" | ||
_, _, compute, _ = construct_clients_from_provider_config(config["provider"]) | ||
# For backward compatibility, reuse the VPC if the VM is launched. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds good.
Q: before step 3, is the single-node cluster stopped, or is it live? Wondering if the behavior applies to relaunching(starting) a stopped VM as wel.
Co-authored-by: Zongheng Yang <[email protected]>
Co-authored-by: Zongheng Yang <[email protected]>
A user met an issue when SkyPilot selects a VPC without subnet enabled for the region required (us-central2 for TPU-v4). This PR enforce the provisioner to select the VPC that has the subnet enabled.
This PR also fixes a corner case, when the user without TPU-v4 quota is using
n1-highmem-8
and the failover goes tous-central2
. This will fail before, as the user will not have subnet inus-central2
Tested (run the relevant ones):
bash format.sh
sky tpunode --tpus tpu-v4-8 --tpu-vm
on project without us-central2 enabled. It should fail on the region and normally trying to failover to other region.sky launch -c test --cloud gcp
on a project with 2 VPCs, one only with us-central1 subnet enabled, another with all subnets.sky launch
an existing cluster.pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh