Add zone support in YAML #1014

Merged
merged 41 commits into from
Aug 9, 2022

Conversation

infwinston
Member

@infwinston infwinston commented Jul 28, 2022

This PR adds support for specifying a zone in the YAML, which avoids the need to modify the catalog manually. This is useful for TPU users, who often get quota only in specific zones (for example, us-central1-c but not us-central1-a).
It also adds accelerator validation when a region or zone is specified.
I'm not sure whether a CLI override (--zone) is useful, so I removed it in the latest commit. We can bring it back if needed.

Usage example:

resources:
  cloud: gcp
  zone: europe-west4-a
  accelerators: tpu-v3-8
  accelerator_args:
    runtime_version: tpu-vm-base
    tpu_vm: True

Hint example:

(sky) weichiang@mbp sky % sky launch examples/tpu/tpuvm_mnist.yaml
Task from YAML spec: examples/tpu/tpuvm_mnist.yaml
ValueError: Accelerator "tpu-v2-8" is not available in "us-central1-a" region/zone.

TODO:

  • add test function
  • smoke test passed
  • show "did you mean" hint

@concretevitamin
Member

Nice @infwinston. Some high-level UX comments.

resources:
  cloud: gcp
  region: us-west1-a

yields ValueError: Invalid region 'us-west1-a' for cloud GCP. This is probably a common user "mistake". Can we additionally print, using difflib, something like "Do you mean region X"? Or perhaps difflib is not the right solution here, and we should do some pattern matching and print "Do you mean zone 'us-west1-a' rather than region"?

Similar comment for

  zone: us-west1

which yields ValueError: Invalid zone 'us-west1' for cloud GCP.

It'd be great to add some unit tests as well. I tried

resources:
  cloud: gcp
  zone: us-west1-a
  region: us-west2

manually, and it correctly caught this mismatch: ValueError: Zone 'us-west1-a' is not in region 'us-west2'.
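The three error cases discussed in this comment could be covered by a small validation helper and matching unit tests. The sketch below is only an illustration: `CATALOG` and `validate_region_zone` are hypothetical names with made-up catalog data, not the PR's actual code, and the messages simply mirror the ones quoted above.

```python
from typing import Dict, List, Optional

# Hypothetical catalog (assumption): region name -> zones in that region.
CATALOG: Dict[str, List[str]] = {
    'us-west1': ['us-west1-a', 'us-west1-b'],
    'us-west2': ['us-west2-a', 'us-west2-b'],
}


def validate_region_zone(region: Optional[str], zone: Optional[str]) -> None:
    """Raises ValueError if the region/zone is unknown or inconsistent."""
    if region is not None and region not in CATALOG:
        raise ValueError(f'Invalid region {region!r} for cloud GCP.')
    if zone is None:
        return
    # Invert the catalog so each zone maps back to its region.
    zone_to_region = {z: r for r, zs in CATALOG.items() for z in zs}
    if zone not in zone_to_region:
        raise ValueError(f'Invalid zone {zone!r} for cloud GCP.')
    if region is not None and zone_to_region[zone] != region:
        raise ValueError(f'Zone {zone!r} is not in region {region!r}.')
```

A unit test would call the helper once per case (a zone passed as a region, a region passed as a zone, and a zone that is not in the given region) and check the raised message.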

Collaborator

@Michaelvll Michaelvll left a comment

Thank you for adding zone support, @infwinston! It will be very useful for TPU users. I left several comments.

Another thing to fix: when a user tries to launch or exec with a zone specified on an existing cluster, we need to fail the program on a mismatch, similar to what we do for regions:

if (task_resources.region is not None and
        task_resources.region != launched_resources.region):
    with ux_utils.print_exception_no_traceback():
        raise exceptions.ResourcesMismatchError(
            'Task requested resources in region '
            f'{task_resources.region!r}, but the existing cluster '
            f'is in region {launched_resources.region!r}.')

We fill the region field of launched_resources when creating the handle. The zone should be handled carefully:

  1. Single-node cluster: we need to fill the zone field of launched_resources.
  2. Multi-node cluster: fill the zone field only when all the instances are in the same zone.
  3. Check fitness: task_resources.zone should be less demanding than launched_resources.zone, i.e. if task_resources.zone is None -> True; if task_resources.zone is not None and launched_resources.zone is None -> False; otherwise -> task_resources.zone == launched_resources.zone.
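The three rules above can be sketched as two small functions. This is a minimal illustration under the assumption that zones are plain strings; `zone_fits` and `launched_zone_of` are hypothetical names, not functions from the codebase.

```python
from typing import List, Optional


def launched_zone_of(instance_zones: List[str]) -> Optional[str]:
    """Zone to record on the handle (rules 1 and 2).

    Returns the zone only when every instance is in the same zone,
    which is trivially the case for a single-node cluster.
    """
    unique = set(instance_zones)
    return unique.pop() if len(unique) == 1 else None


def zone_fits(task_zone: Optional[str], launched_zone: Optional[str]) -> bool:
    """Fitness check (rule 3): the task must be no more demanding."""
    if task_zone is None:
        # The task has no zone requirement, so any cluster fits.
        return True
    if launched_zone is None:
        # The task wants a zone, but the cluster's zone is unknown.
        return False
    return task_zone == launched_zone
```

Returning False when the launched zone is unknown is the conservative choice: the task asked for a guarantee that the handle cannot confirm.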

sky/resources.py Outdated
if self._cloud is None:
    with ux_utils.print_exception_no_traceback():
        raise ValueError('Cloud must be specified together with zone.')
elif self._cloud.is_same_cloud(sky.Azure()):
Collaborator

nit: better to use clouds.Azure to be consistent with other places.

Member Author

Looks like we also use sky.Azure() in a bunch of other places, like

elif cloud.is_same_cloud(sky.Azure()):

Should we also replace them?

Collaborator

Yes, we can replace them. : )

Member Author

Fixed!

Collaborator

@Michaelvll Michaelvll left a comment

For backward compatibility, we need to update the zone information in the handle for a cluster launched before this PR.

def _update_cluster_region(self):
    if self.launched_resources.region is not None:
        return
    config = common_utils.read_yaml(self.cluster_yaml)
    provider = config['provider']
    cloud = self.launched_resources.cloud
    if cloud.is_same_cloud(sky.Azure()):
        region = provider['location']
    elif cloud.is_same_cloud(sky.GCP()) or cloud.is_same_cloud(
            sky.AWS()):
        region = provider['region']
    elif cloud.is_same_cloud(sky.Local()):
        # There is only 1 region for Local cluster, 'Local'.
        local_regions = clouds.Local.regions()
        region = local_regions[0].name
    self.launched_resources = self.launched_resources.copy(
        region=region)
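An analogous backfill for the zone might look like the sketch below. It is written as a standalone function over the provider section of the cluster YAML rather than as a handle method. The `availability_zone` key is the one this thread mentions for AWS; that it holds a comma-separated string, and that other clouds use the same key, are assumptions. `zone_from_provider_config` is a hypothetical name.

```python
from typing import Any, Dict, Optional


def zone_from_provider_config(provider: Dict[str, Any]) -> Optional[str]:
    """Best-effort zone for a cluster launched before zones were tracked.

    Returns a zone only when the config names exactly one, because a
    list of several zones says what was requested, not where the
    cluster actually landed.
    """
    # Assumption: the key holds a comma-separated string of zone names.
    raw = provider.get('availability_zone')
    if not raw:
        return None
    zones = [z.strip() for z in raw.split(',') if z.strip()]
    return zones[0] if len(zones) == 1 else None
```

The result could then be passed to `launched_resources.copy(...)` in the same way the region is above, leaving the zone unset when it cannot be determined.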

Comment on lines 702 to 706
if zones != prev_resources.zones:
    raise ValueError(f'Zones mismatch. The zones in '
                     f'{handle.cluster_yaml} '
                     'have been changed from '
                     f'{prev_resources.zones} to {zones}.')
Collaborator

Please refer to #1014 (review). We need to add the zone information to handle.launched_resources when creating the handle during provisioning, i.e. in the following line, but the new changes do not seem to set the zones in handle.launched_resources.

launched_resources=to_provision.copy(region=region.name),

Also, prev_resources.zones seems to be undefined. Should it be zone instead?

Also, the comparison is a bit tricky here, as zones can be a list of zones, but prev_resources.zones can be just a single zone. Consider something like the following:

if prev_resources.zone in zones:
    zones = [prev_resources.zone]
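Spelled out with the mismatch case included, the suggestion could be sketched as follows. `reconcile_zones` is a hypothetical helper, and treating a missing previous zone as "no restriction" is an assumption rather than something settled in this review.

```python
from typing import List, Optional


def reconcile_zones(zones: List[str], prev_zone: Optional[str]) -> List[str]:
    """Narrows the requested zones to the zone of the existing cluster."""
    if prev_zone is None:
        # Assumption: no zone was recorded, so nothing to compare against.
        return zones
    if prev_zone in zones:
        # Reuse the cluster by pinning the request to its zone.
        return [prev_zone]
    raise ValueError(
        f'Zones mismatch: the cluster was launched in {prev_zone!r}, '
        f'but the requested zones are {zones}.')
```

Comparing membership instead of equality avoids a false mismatch when the request lists several zones and the cluster sits in one of them.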

Member Author

Thanks for catching this. It's fixed now. But I wonder what the reason is for checking the zone field in the Ray YAML against the zone in the handle. Is it to guard against the user accidentally modifying the Ray YAML file?

@infwinston
Member Author

infwinston commented Aug 1, 2022

@Michaelvll Sorry for the confusion. I should have left comments on why I haven't included zone info in the handle and why I named the column REQUESTED_ZONE.

This is actually a bit tricky and may need some discussion.
Right now we lack a mechanism to identify the final zone of a launched cluster. On AWS, we call ray up with a list of zones set in availability_zone in the Ray YAML, but we don't check afterwards which zone the cluster was actually launched in.

Hence, if we store a list such as us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e,us-east-1f in our handle, it only records the zones requested by the user. If, for example, the user later changes the YAML to specify only us-east-1b, we can't rely on this field to decide whether to raise a mismatch error or reuse the cluster.

There are probably two ways:

  1. We add code to query the actual zone of the cluster during or after ray up and record it. We can then handle the above case.
  2. We treat this field as the zone requested by the user. If it's None, that means no restriction, and we raise an error if the user tries to sky launch --zone us-east-1b on a cluster that was previously launched with no restriction.

The pros and cons may need further discussion. We can also do a Zoom chat on this :)

@infwinston
Member Author

@concretevitamin I've added the hint support! Now an incorrect region or zone produces a list of candidates if any are close enough. Let me know if this looks good :)

(sky) weichiang@mbp sky % sky launch examples/minimal.yaml --region us-west1
Task from YAML spec: examples/minimal.yaml
ValueError: Invalid region 'us-west1' for cloud AWS.
Did you mean one of these: us-west-1?

(sky) weichiang@mbp sky % sky launch examples/minimal.yaml --region us-west-1-a
Task from YAML spec: examples/minimal.yaml
ValueError: Invalid region 'us-west-1-a' for cloud AWS.
Did you mean one of these: us-west-1?

(sky) weichiang@mbp sky % sky launch examples/zone_test.yaml
Task from YAML spec: examples/zone_test.yaml
ValueError: Invalid zone 'us-central1' for cloud GCP.
Did you mean one of these: 'us-central1-a, us-central1-b, us-central1-c, us-central1-f'?

Also, the smoke test has passed and the PR should be ready for review. Please take a look if you have time :)
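A hint like the ones above can be produced with difflib.get_close_matches from the standard library. The sketch below is an illustration, not necessarily how this PR implements it: `location_hint` is a hypothetical helper, and the `n=5` and `cutoff=0.7` values are assumptions, so the exact candidates may differ from the output shown.

```python
import difflib
from typing import List


def location_hint(name: str, candidates: List[str],
                  cutoff: float = 0.7) -> str:
    """Returns a 'Did you mean' line, or '' if nothing is close enough."""
    # Candidates come back ordered from most to least similar.
    matches = difflib.get_close_matches(name, candidates, n=5, cutoff=cutoff)
    if not matches:
        return ''
    return f'Did you mean one of these: {", ".join(matches)}?'


aws_regions = ['us-east-1', 'us-east-2', 'us-west-1', 'us-west-2']
print(location_hint('us-west1', aws_regions))
```

A higher cutoff gives fewer but closer suggestions; a lower one is more forgiving of typos at the cost of noisier hints.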

@Michaelvll Michaelvll self-requested a review August 5, 2022 17:05
Collaborator

@Michaelvll Michaelvll left a comment

Thank you for the fix, @infwinston! The logic looks good to me. I left several comments, mostly about the cleanliness of the code. ; )

# if zone is not specified because head and worker nodes
# can be launched in different zones.
if (task.num_nodes > 1 and
        handle.launched_resources.zone is None):
Collaborator

Use task.num_nodes == 1 or handle.launched_resources.zone is not None, and remove the else.

Collaborator

Why do we need the second condition? If launched_resources has the zone specified, there's no need to query again, right?

Member Author

Done. For the second comment: right, but the query can serve as a sanity check for ray up. I think it's still useful?

Collaborator

@Michaelvll Michaelvll Aug 8, 2022

The sanity check may increase the overhead of sky launch on an existing cluster, but it may be useful for catching problems. Up to you.

If the zones mismatch, that probably means an error occurred, right? Should we fail directly instead?
One possible situation where the error may occur is:

  1. The user has a stopped cluster in us-west-1a.
  2. She tries to sky start the cluster, a resources-unavailable error occurs in that zone, and ray up starts a new cluster in us-west-1b.

I think, in that case, we should consider terminating the newly launched instance (not sure if the old stopped instance will be terminated by ray up), but preserve the cluster entry in our database so that the user can retry and launch the instance in the previous zone again.

Member Author

Oh, I just realized I missed this comment. For 2, I thought sky start would just fail with the resources-unavailable error? It shouldn't fail over to another zone, right, or am I misunderstanding?

Member Author

Had an offline discussion with @Michaelvll. The above case can also happen on the current master and should be considered a bug to fix. Will file an issue.

Member Author

Filed an issue here: #1054. Will continue tracking this in the future.

Collaborator

@Michaelvll Michaelvll left a comment

Thank you for the fix @infwinston! It looks good to me. Left some nits.

@infwinston
Member Author

@Michaelvll Thanks a lot for the detailed reviews! I've addressed all the comments and re-run the smoke test, which passed. Merging this PR now.
