-
Notifications
You must be signed in to change notification settings - Fork 541
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add zone support in YAML #1014
Add zone support in YAML #1014
Conversation
Nice @infwinston. Some high-level UX comments.
yields Similar comment for
which yields It'd be great to add some unit test as well. I tried
manually and it correctly caught this mismatch: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for adding the support for zone @infwinston ! It would be very useful for the TPU users. Left several comments.
Another thing that should be fixed is that when user trying to launch or exec with zone specified on an existing cluster, we need to fail the program, similar as we did in the following:
skypilot/sky/backends/cloud_vm_ray_backend.py
Lines 1473 to 1479 in 9724b8f
if (task_resources.region is not None and | |
task_resources.region != launched_resources.region): | |
with ux_utils.print_exception_no_traceback(): | |
raise exceptions.ResourcesMismatchError( | |
'Task requested resources in region ' | |
f'{task_resources.region!r}, but the existing cluster ' | |
f'is in region {launched_resources.region!r}.') |
We fill the
region
field of the launched_resources
, when creating the handle. zone
should be handled carefully:
- Single node cluster: we need to fill the
zone
field oflaunched_resources
. - Multi-node cluster: fill the
zone
field, only when all the instances are in the same zone. - Check fitness:
task.resources.zone
should be less demanding thanlaunched_resources.zone
, i.e.task.resources.zone
is None -> True;task_resources.zone
is not None andlaunched_resources.zone
is None -> False;task.resources.zone == launched.resources.zone
.
sky/resources.py
Outdated
if self._cloud is None: | ||
with ux_utils.print_exception_no_traceback(): | ||
raise ValueError('Cloud must be specified together with zone.') | ||
elif self._cloud.is_same_cloud(sky.Azure()): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: better to use clouds.Azure
to be consistent with other places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we also use sky.Azure()
in a bunch of places. like
skypilot/sky/backends/cloud_vm_ray_backend.py
Line 687 in 9724b8f
elif cloud.is_same_cloud(sky.Azure()): |
Should we also replace them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can replace them. : )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For backward compatibility, we need to update the zone information in the handle for a cluster launched before this PR.
skypilot/sky/backends/cloud_vm_ray_backend.py
Lines 1389 to 1407 in 661264a
def _update_cluster_region(self): | |
if self.launched_resources.region is not None: | |
return | |
config = common_utils.read_yaml(self.cluster_yaml) | |
provider = config['provider'] | |
cloud = self.launched_resources.cloud | |
if cloud.is_same_cloud(sky.Azure()): | |
region = provider['location'] | |
elif cloud.is_same_cloud(sky.GCP()) or cloud.is_same_cloud( | |
sky.AWS()): | |
region = provider['region'] | |
elif cloud.is_same_cloud(sky.Local()): | |
# There is only 1 region for Local cluster, 'Local'. | |
local_regions = clouds.Local.regions() | |
region = local_regions[0].name | |
self.launched_resources = self.launched_resources.copy( | |
region=region) |
sky/backends/cloud_vm_ray_backend.py
Outdated
if zones != prev_resources.zones: | ||
raise ValueError(f'Zones mismatch. The zones in ' | ||
f'{handle.cluster_yaml} ' | ||
'have been changed from ' | ||
f'{prev_resources.zones} to {zones}.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please refer to #1014 (review), we need to add the zone information to the handle.launched_resources
when creating the handle during provision
, i.e. the following line, but the new changes seem do not set the zones in the handle.launched_resources
.
skypilot/sky/backends/cloud_vm_ray_backend.py
Line 903 in 661264a
launched_resources=to_provision.copy(region=region.name), |
Also, prev_resources.zones
seems undefined, is it zone
instead?
Also, the comparison is a bit tricky here, as zones
can be a list of zone, but prev_resources.zones
can be just a single zone. Consider something similar as the following:
if prev_resources.zone in zones:
zones = [prev_resources.zone]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this. It's fixed now. But I wonder the reason behind checking between Ray YAML zone field and the zone in the handle. Is this because we want to avoid user accidentally modifying the Ray YAML file?
@Michaelvll Sorry for the confusion. I should have left comments on why I haven't included zone info to the handle and also why I named the column This is actually a bit tricky and may need some discussion. Hence if we store such list There are probably two ways:
Pros and cons may need to be further discussed. We can also do a zoom chat on this :) |
@concretevitamin I've added the hint support! Now inputing incorrect region or zone will be given a list of candidates if close enough. Let me know if this looks good :)
Also, smoke test has passed and the PR should be ready for review. Please take a look if you have time :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the fix @infwinston! The logic looks good to me. Left several comments, mostly about the cleanness of the code. ; )
sky/backends/cloud_vm_ray_backend.py
Outdated
# if zone is not specified because head and worker nodes | ||
# can be launched in different zones. | ||
if (task.num_nodes > 1 and | ||
handle.launched_resources.zone is None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use task.num_nodes == 1 or handle.launched.resources.zone is not None
and remove else
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need the second conditition? If the launched_resources
has zone
specified, no need to query again, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. for the second comment, right but the query can be a sanity check for ray up
. I think it's still useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sanity check may increase the overhead of the sky launch
on an existing cluster, but checking it may be useful for catching problems, up to you.
If the zone mismatch, that probably means an error occurs, right? Should we fail directly instead?
One possible situation the error may occur is:
- the user has a cluster stopped in
us-west-1a
- she tries to
sky start
the cluster, a resources unavailable error occurs on that zone, andray up
starts a new cluster onus-west-1b
.
I think, in that case, we should consider terminating the newly launched instance (not sure if the old stopped instance will be terminated by ray up), but preserve the cluster entry in our database so that the user can retry and launch the instance in the previous zone again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh just realized I missed this comment.. for 2, I thought sky start
would just fail due to resources unavailable error? it shouldn't failover to other zone right or I misunderstand it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had offline discussion with @Michaelvll , the above case can happen also with the current master and should be considered a bug to fix. Will file an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed an issue here. #1054 Will continue tracking this in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the fix @infwinston! It looks good to me. Left some nits.
@Michaelvll Thanks a lot for detailed reviews! I've fixed all the comments and re-run/passed the smoke test. Merging this PR now. |
This PR adds support for specifying zone in YAML which avoids the need to modify catalog manually. This feature can be useful for TPU users who often get quota only in specific zones (such as
us-central1-c
but notus-central1-a
).Also, some accelerator validation is added when region/zone is specified.
I'm not sure whether CLI argument overwrite (
--zone
) is useful so I remove it in the latest commit. But we can bring it back if needed.Usage example:
Hint example:
TODO: