Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GCP] Make sure the chosen VPC has subnet enabled for the region required #2772

Merged
merged 26 commits into from
Nov 15, 2023

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Nov 11, 2023

A user met an issue when SkyPilot selects a VPC without subnet enabled for the region required (us-central2 for TPU-v4). This PR enforce the provisioner to select the VPC that has the subnet enabled.

This PR also fixes a corner case, when the user without TPU-v4 quota is using n1-highmem-8 and the failover goes to us-central2. This will fail before, as the user will not have subnet in us-central2

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky tpunode --tpus tpu-v4-8 --tpu-vm on project without us-central2 enabled. It should fail on the region and normally trying to failover to other region.
    • sky launch -c test --cloud gcp on a project with 2 VPCs, one only with us-central1 subnet enabled, another with all subnets.
    • sky launch an existing cluster.
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@Michaelvll Michaelvll added the blocked PR blocked by other issues label Nov 12, 2023
@Michaelvll
Copy link
Collaborator Author

Blocked by #2764

@Michaelvll Michaelvll mentioned this pull request Nov 12, 2023
6 tasks
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick fix @Michaelvll. Quick comments.

Comment on lines 848 to 849
subnets = _list_subnets(config, compute, filter=f'(name="{usable_vpc_name}")')
if not subnets:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about changing get_usable_vpc() into get_usable_vpc_and_subnets(), i.e., avoid an extra query?

subnets, vpc_id = _get_subnet_and_vpc_id(
ec2,
security_group_ids,
config.provider_config['region'],

)
continue

if not _list_subnets(config, compute, filter=f'(name="{vpc["name"]}")'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In AWS, we get a usable VPC first, then get its usable subnets:

if vpc_name is not None:
vpc_id_of_sg = _get_vpc_id_by_name(ec2, vpc_name, region)
elif security_group_ids:
vpc_id_of_sg = _vpc_id_from_security_group_ids(ec2, security_group_ids)
else:
vpc_id_of_sg = None
all_subnets = list(ec2.subnets.all())
subnets, vpc_id = _usable_subnets(
None,
all_subnets,
azs=availability_zone,
vpc_id_of_sg=vpc_id_of_sg,
use_internal_ips=use_internal_ips,
)
return subnets, vpc_id

Any pros/cons in terms of doing it like that vs. the nested loop here? Maybe the former makes it easier to support things like private IP-enabled subnets in the future ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for mentioning this @concretevitamin! We do need to sort the VPCs in this GCP case as well.

While looking at the code for AWS, it seems that we did not sort the subnets by the vpc id, which means when the user is creating multiple clusters without specifying a VPC name, it is possible that those clusters can locate in different VPCs. It might be fine for now as there is no interconnection between clusters for us right now, but it does not seem very clean. Should we sort the subnets by the (availability_zone, vpc_id) pair?

For GCP here, yes, we can do similar thing for getting the VPC. The current implementation was just inherited from the previous implementation from ray. I will start modifying it once #2764 is merged.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not too sure I understand the parts about sorting. Some questions:

which means when the user is creating multiple clusters without specifying a VPC name, it is possible that those clusters can locate in different VPCs

This seems okay? E.g., users may change the VPC name setting in ~/.sky/config.yaml across launches, and SkyPilot does faithfully change the VPC to use.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I mean If the user does not have ~/.sky/config.yaml, it is possible that multiple clusters may be located on different VPCs depends on the order of VPC returned by AWS, which might be a bit surprising to the user seeing their cluster spreaded across multiple VPCs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the aws.py snippet quoted above will return 0 or 1 unique VPC identified by name / security group, so this scenario won't happen? Shall we talk more offline?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, good point! Since we will always specify the security group name, the subnets got later should always be in the same VPC. In that case, it should be fine.

@Michaelvll Michaelvll removed the blocked PR blocked by other issues label Nov 13, 2023
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/skylet/providers/gcp/config.py Outdated Show resolved Hide resolved
sky/skylet/providers/gcp/config.py Outdated Show resolved Hide resolved
@@ -805,7 +814,15 @@ def get_usable_vpc(config) -> str:
if len(vpcnets_all) == 1:
# Skip checking any firewall rules if the user has specified a VPC.
logger.info(f"Using user-specified VPC {specific_vpc_to_use!r}.")
return specific_vpc_to_use
subnets = _list_subnets(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Doesn't feel very clean calling _list_subnets multiple times.

Is it possible to use the pattern in AWS's config.py? Call _list_subnets() once to get all [(subnet, vpc, ...)] and directly sort + filter this list.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Just changed the implementation to list subnets first, which will save at most #VPC times gcp API call.

Tested:

  • sky launch --cpus 2 --cloud gcp
  • sky launch --gpus L4 --cloud gcp (default VPC does not have the first region to be tried for L4 GPU, i.e. us-east-4, and it will correctly use skypilot-vpc instead)

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raises:
RuntimeError: if the user has specified a VPC name but the VPC is not found.
"""
_, _, compute, _ = construct_clients_from_provider_config(config["provider"])
# For backward compatibility, reuse the VPC if the VM is launched.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Q: it seems like sky/provision/aws/config.py doesn't have such back-compat code?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! I just tried to create a new AWS VPC and after a cluster is provisioned, and run the sky launch again on that cluster. It seems config.py choose a different VPC, but the launching of that cluster still succeeded and the VPC of that cluster is still the old one. Probably, AWS's API ignores the different VPC for an existing VM.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Maybe worth a test/comment here on whether GCP has the same property.

Copy link
Collaborator Author

@Michaelvll Michaelvll Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Actually, just tried the following and it seems GCP will also ignore the wrong vpc for existing cluster.

  1. sky launch -c test --cloud gcp the cluster is launched on skypilot-vpc
  2. create another VPC which will be used by SkyPilot by default to create VMs
  3. sky launch -c test: the function chooses the new VPC, but the cluster is still on the original skypilot-vpc and no error will occur.

However, there is a corner case: when the cluster with multiple nodes is partially launched, and if we do sky launch again on that cluster, the different nodes can be launched on different VPC. This issue can also happen to AWS multiple node cluster.
It feels to me that we don't need to sacrifise the performance for all launch to handle this corner case, so I removed the existing VM check. Wdyt? @concretevitamin

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good.

Q: before step 3, is the single-node cluster stopped, or is it live? Wondering if the behavior applies to relaunching(starting) a stopped VM as wel.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried both stopped and UP clusters, and it seems the steps above work correctly for both.

sky/skylet/providers/gcp/config.py Outdated Show resolved Hide resolved
sky/skylet/providers/gcp/config.py Show resolved Hide resolved
sky/provision/aws/config.py Outdated Show resolved Hide resolved
Raises:
RuntimeError: if the user has specified a VPC name but the VPC is not found.
"""
_, _, compute, _ = construct_clients_from_provider_config(config["provider"])
# For backward compatibility, reuse the VPC if the VM is launched.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good.

Q: before step 3, is the single-node cluster stopped, or is it live? Wondering if the behavior applies to relaunching(starting) a stopped VM as wel.

@Michaelvll Michaelvll merged commit e81cbaa into master Nov 15, 2023
19 checks passed
@Michaelvll Michaelvll deleted the vpc-with-subnet branch November 15, 2023 22:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants