[core] Launching existing cluster in the same zone to avoid leakage #1700
Some preliminary comments.
Thanks @Michaelvll - some comments, read all except cloud_vm_ray_backend.py and backend_utils.py.
@@ -182,27 +183,24 @@ def regions_with_offering(cls, instance_type: Optional[str],

        if region is not None:
            regions = [r for r in regions if r.name == region]
        if zone is not None:
cc @WoosukKwon to double check - should be ok?
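For readers following the hunk above, here is a minimal sketch of how a region filter followed by a zone filter could behave. The hunk is cut off after `if zone is not None:`, so the zone branch below is an assumption rather than the PR's code, and `filter_regions`, `Zone`, and `Region` are hypothetical stand-ins for the real classes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Zone:
    name: str


@dataclass
class Region:
    name: str
    zones: Optional[List[Zone]] = None


def filter_regions(regions: List[Region],
                   region: Optional[str] = None,
                   zone: Optional[str] = None) -> List[Region]:
    """Hypothetical helper: keep only regions/zones matching the request."""
    if region is not None:
        regions = [r for r in regions if r.name == region]
    if zone is not None:
        narrowed = []
        for r in regions:
            matching = [z for z in (r.zones or []) if z.name == zone]
            if matching:
                # Build a new Region so the original catalog entry is untouched.
                narrowed.append(Region(r.name, matching))
        regions = narrowed
    return regions


catalog = [
    Region('us-east-1', [Zone('us-east-1a'), Zone('us-east-1b')]),
    Region('us-west-2', [Zone('us-west-2a')]),
]
only_1b = filter_regions(catalog, zone='us-east-1b')
```

Building a fresh `Region` for the narrowed result is a design choice in this sketch: it keeps the caller's list of regions unchanged when a single zone is requested.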
@@ -31,7 +31,7 @@ class CloudImplementationFeatures(enum.Enum):
 class Region(collections.namedtuple('Region', ['name'])):
     """A region."""
     name: str
-    zones: List['Zone'] = []
+    zones: Optional[List['Zone']] = None
One thought for the future: in a few cloud impls, the regions() method explicitly hard-codes the regions. We could change those impls to read from their respective catalogs to be more consistent. (Rationale: I was confused for a bit about where we define a region's zones. We do it both in that method and in the catalog reader methods, IIUC.)
We are actually using the region/zones from the catalog and those hardcoded regions are just fallbacks. It should be fine to remove them. We can do it in another PR. : )
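As a side note on the `zones` default changed in the hunk above: the thread does not state why `[]` became `None`, so the following is only a sketch of one general Python pitfall that such a change avoids. A class-level list is a single object shared by every instance. The class names and the `set_zones` helper here are hypothetical.

```python
import collections
from typing import List, Optional


class SharedDefaultRegion(collections.namedtuple('Region', ['name'])):
    """Mirrors the old line: one list object shared through the class."""
    zones: List[str] = []


class NoneDefaultRegion(collections.namedtuple('Region', ['name'])):
    """Mirrors the new line: None means the zones are not filled in yet."""
    zones: Optional[List[str]] = None

    def set_zones(self, zones: List[str]) -> 'NoneDefaultRegion':
        # Sets an instance attribute; allowed because this subclass
        # does not declare __slots__.
        self.zones = list(zones)
        return self


east = SharedDefaultRegion('us-east-1')
west = SharedDefaultRegion('us-west-2')
east.zones.append('us-east-1a')  # Mutates the class-level list.
leaked = west.zones  # The same list object, so it also holds 'us-east-1a'.

fixed_east = NoneDefaultRegion('us-east-1').set_zones(['us-east-1a'])
fixed_west = NoneDefaultRegion('us-west-2')  # zones stays None.
```

With `None` as the default, "zones not set" is also distinguishable from "region has no zones", which an empty list cannot express.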
Thanks @Michaelvll, some comments.
Thanks @Michaelvll - some final comments.
sky/backends/cloud_vm_ray_backend.py
Outdated
        if len(zones) > 1:
            # Multiple zones means the upper-level retry loop is trying the
            # whole region, so we should print the region name.
Calling

    regions_with_zones = clouds.AWS.regions_with_offering(
        launchable_resources.instance_type,
        launchable_resources.accelerators,
        launchable_resources.use_spot,
        region.name,
        zone=None)

inside the failover handler is surprising, if it's just for filling in the zones. Per my previous comment, "It seems like in master branch, we do set region.zones?" If so, can we do `set(zones) == set(region.zones)`?
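A minimal sketch of the comparison suggested in this comment, assuming `region.zones` is already populated and that zones are hashable. `failover_target_name`, `Zone`, and `Region` are hypothetical names for illustration, not the backend's code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Zone:
    name: str


@dataclass
class Region:
    name: str
    zones: Optional[List[Zone]] = None


def failover_target_name(region: Region, zones: List[Zone]) -> str:
    """Hypothetical helper: name what the current launch attempt covers."""
    if region.zones is not None and set(zones) == set(region.zones):
        # Every zone of the region is being tried: report the region name.
        return region.name
    # Otherwise report the specific zone(s) being tried.
    return ','.join(z.name for z in zones)


zone_a, zone_b = Zone('us-east-1a'), Zone('us-east-1b')
us_east = Region('us-east-1', [zone_a, zone_b])
whole_region = failover_target_name(us_east, [zone_b, zone_a])
single_zone = failover_target_name(us_east, [zone_a])
```

Comparing sets rather than checking `len(zones) > 1` ignores ordering and avoids a second catalog lookup inside the failover handler, at the cost of relying on `region.zones` being filled in beforehand.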
This is awesome @Michaelvll, LGTM.
Sounds good. Maybe add a comment like "# Fill in the zones field in the Region."
…kypilot-org#1700)
* launching existing cluster in the same zone
* fix
* Fix restore
* pass cluster status through
* fix region.zones
* fix
* fix
* Address comments
* format
* Fix zones for AWS
* fix failover issue
* format
* fix prev_cluster_status
* address comments
* minor
* partially address the comments
* add comments
`use1-az1`, `use-az2`, `use-az6`, but our previous catalog fetcher assumed that all the zones contain the instance type).

A potential leakage that may not be solved by this PR:
* `sky launch -c new-cluster --cloud aws` and the actual node was successfully launched on `us-east-1a`.
* `sky launch -c new-cluster` again, the availability zone `us-east-1a` now has no capacity for the instance, but `us-east-1b` does.
* `us-east-1a, us-east-1b,...` in the `provider.availability_zone` section, it is possible that ray will keep the original node stopped and create a new one on `us-east-1a`. (need to confirm).

Potential solution:
Tested (run the relevant ones):
* `sky launch -c test-on-demand --gpus A100:8 --cloud aws` only try zones that contain the instance type `p4d.24xlarge` (For fix 3 above) and try to launch again during the INIT status
* `sky launch -c test-spot --gpus A100:8 --use-spot --cloud aws` only try zones that contain the instance type `p4d.24xlarge` (For fix 3 above)
* `sky launch -c test-spot --cloud aws --use-spot`
* `sky launch -c test-spot --cloud aws --use-spot` INIT cluster
* `sky launch -c test-on-demand --cloud aws`
* `sky launch -c test-on-demand --num-nodes 2 --cloud aws`: failover through single zones.
* `sky launch -c test-on-demand --cloud aws` INIT cluster
* `sky launch -c test-on-demand --cloud aws` STOPPED cluster
* `sky launch -c test-spot --cloud gcp --use-spot`
* `sky launch -c test-spot --cloud gcp --use-spot` INIT cluster
* `sky launch -c test-on-demand --num-nodes 2 --cloud gcp`: failover through single zones.
* `sky launch -c test-spot --cloud azure --use-spot`
* `sky launch -c test-spot --cloud azure --use-spot` INIT cluster
* `sky launch -c test-spot --cloud azure --use-spot` STOPPED cluster
* `pytest tests/test_smoke.py`
* `pytest tests/test_smoke.py --aws`
* `bash tests/backward_comaptibility_tests.sh` with GCP
* `bash tests/backward_comaptibility_tests.sh` with AWS
* `bash tests/backward_comaptibility_tests.sh` with Azure