Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Launching existing cluster in the same zone to avoid leakage #1700

Merged
merged 18 commits into from
Feb 22, 2023

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Feb 15, 2023

  1. Fixes sky start may unexpectedly launch a new VM #1054
  2. Refactors the cloud vm ray backend to make it more efficient.
  3. Fix the AWS catalog fetching by excluding availability zones that do not contain the instance type: some instance types are not available on all the availability zones within a region even if it is on-demand (e.g., p4d.24xlarge is only supported in use1-az1, use-az2, use-az6, but our previous catalog fetcher assumed that all the zones contain the instance type).

A potential leakage that may not be solved by this PR:

  1. The user Ctrl-C during the launching process of a new cluster sky launch -c new-cluster --cloud aws and the actual node was successfully launched on us-east-1a.
  2. The user manually stopped the node on the AWS console
  3. After that, the user do sky launch -c new-cluster again, the availability zone us-east-1a now has no capacity for the instance, but us-east-1b does.
  4. Since the ray yaml has us-east-1a, us-east-1b,... in the section provider.availability_zone section, it is possible that ray will keep the original node stopped and create a new one on us-east-1a. (need to confirm).

Potential solution:

  1. Fetch zone info directly from cloud provider
  2. Do not combine zones (This only happens for AWS single node) -- let the failover go through the zones one by one

Tested (run the relevant ones):

  • Any manual or new tests for this PR (please specify below)
    • sky launch -c test-on-demand --gpus A100:8 --cloud aws only try zones that contain the instance type p4d.24xlarge (For fix 3 above) and try to launch again during the INIT status
    • sky launch -c test-spot --gpus A100:8 --use-spot --cloud aws only try zones that contain the instance type p4d.24xlarge (For fix 3 above)
    • sky launch -c test-spot --cloud aws --use-spot
    • sky launch -c test-spot --cloud aws --use-spot INIT cluster
    • sky launch -c test-on-demand --cloud aws
    • sky launch -c test-on-demand --num-nodes 2 --cloud aws: failover through single zones.
    • sky launch -c test-on-demand --cloud aws INIT cluster
    • sky launch -c test-on-demand --cloud aws STOPPED cluster
    • sky launch -c test-spot --cloud gcp --use-spot
    • sky launch -c test-spot --cloud gcp --use-spot INIT cluster
    • sky launch -c test-on-demand --num-nodes 2 --cloud gcp: failover through single zones.
    • sky launch -c test-spot --cloud azure --use-spot
    • sky launch -c test-spot --cloud azure --use-spot INIT cluster
    • sky launch -c test-spot --cloud azure --use-spot STOPPED cluster
  • All smoke tests: pytest tests/test_smoke.py
  • All smoke tests: pytest tests/test_smoke.py --aws
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh with GCP
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh with AWS
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh with Azure

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some preliminary comments.

sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/backend_utils.py Outdated Show resolved Hide resolved
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Michaelvll - some comments, read all except cloud_vm_ray_backend.py and backend_utils.py.

sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/clouds/cloud.py Show resolved Hide resolved
sky/clouds/cloud.py Outdated Show resolved Hide resolved
@@ -182,27 +183,24 @@ def regions_with_offering(cls, instance_type: Optional[str],

if region is not None:
regions = [r for r in regions if r.name == region]
if zone is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @WoosukKwon to double check - should be ok?

@@ -31,7 +31,7 @@ class CloudImplementationFeatures(enum.Enum):
class Region(collections.namedtuple('Region', ['name'])):
"""A region."""
name: str
zones: List['Zone'] = []
zones: Optional[List['Zone']] = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thought for the future: in a few cloud impl's, their regions() method explicitly hard code the regions. We could change those impls to reading from their respective catalogs to be more consistent. (Rationale is was confused for a bit where we define a region's zones. We do it both in that method & in catalog reader methods, IIUC.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are actually using the region/zones from the catalog and those hardcoded regions are just fallbacks. It should be fine to remove them. We can do it in another PR. : )

sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
@Michaelvll Michaelvll marked this pull request as ready for review February 19, 2023 09:11
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Michaelvll, some comments.

sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/clouds/cloud.py Outdated Show resolved Hide resolved
sky/clouds/service_catalog/data_fetchers/fetch_aws.py Outdated Show resolved Hide resolved
sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Michaelvll - some final comments.

sky/clouds/aws.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
Comment on lines 736 to 738
if len(zones) > 1:
# Multiple zones means the upper-level retry loop is trying the
# whole region, so we should print the region name.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling

        regions_with_zones = clouds.AWS.regions_with_offering(
            launchable_resources.instance_type,
            launchable_resources.accelerators,
            launchable_resources.use_spot,
            region.name,
            zone=None)

inside the failover handler is surprising, if it's just for filling in the zones. Per my previous comment, "It seems like in master branch, we do set region.zones?" If so, can we do set(zones) == set(region.zones)?

sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome @Michaelvll, LGTM.

Comment on lines 736 to 738
if len(zones) > 1:
# Multiple zones means the upper-level retry loop is trying the
# whole region, so we should print the region name.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Maybe add a comment like "# Fill in the zones field in the Region."

@Michaelvll Michaelvll merged commit c2a5178 into master Feb 22, 2023
@Michaelvll Michaelvll deleted the fix-start-leakage branch February 22, 2023 06:59
sumanthgenz pushed a commit to sumanthgenz/skypilot that referenced this pull request Feb 22, 2023
…kypilot-org#1700)

* launching existing cluster in the same zone

* fix

* Fix restore

* pass cluster status through

* fix region.zones

* fix

* fix

* Address comments

* format

* Fix zones for AWS

* fix failover issue

* format

* fix prev_cluster_status

* address comments

* minor

* partially address the comments

* add comments
sumanthgenz pushed a commit to sumanthgenz/skypilot that referenced this pull request Mar 15, 2023
…kypilot-org#1700)

* launching existing cluster in the same zone

* fix

* Fix restore

* pass cluster status through

* fix region.zones

* fix

* fix

* Address comments

* format

* Fix zones for AWS

* fix failover issue

* format

* fix prev_cluster_status

* address comments

* minor

* partially address the comments

* add comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sky start may unexpectedly launch a new VM
2 participants