Optimizing & Provisioning Retries at the granularity of regions/zones #975

Merged
merged 108 commits into from
Dec 29, 2022
Changes from 98 commits
25af691
Region/zone-based optimizer & provisioner
WoosukKwon Jul 15, 2022
7071bb3
yapf
WoosukKwon Jul 15, 2022
e47a934
Fix messages
WoosukKwon Jul 15, 2022
926a781
Fix comments
WoosukKwon Jul 15, 2022
05029aa
Fix a comment
WoosukKwon Jul 15, 2022
6fd509f
Fix version for backward compatibility
WoosukKwon Jul 15, 2022
7f32764
yapf
WoosukKwon Jul 15, 2022
09c2572
Merge branch 'master' into fine-grained-optimizer
WoosukKwon Aug 14, 2022
784bf92
Minor fix
WoosukKwon Aug 14, 2022
f05350b
Merge branch 'master' into fine-grained-optimizer
WoosukKwon Aug 14, 2022
fa5ba13
yapf
WoosukKwon Aug 14, 2022
4b11e37
Merge branch 'master' into fine-grained-optimizer
WoosukKwon Aug 31, 2022
89181cd
yapf
WoosukKwon Aug 31, 2022
8a71fa5
Add blanks back to docstrings
WoosukKwon Aug 31, 2022
9d4839a
Add more detailed docstrings for get_hourly_cost
WoosukKwon Aug 31, 2022
7ff8976
Remove docstrings
WoosukKwon Aug 31, 2022
2ebe120
Address review & Allow region == None
WoosukKwon Sep 1, 2022
55ecfcf
Minor fix
WoosukKwon Sep 1, 2022
c14c175
Add type annotations
WoosukKwon Sep 1, 2022
c77ccbf
Minor fix
WoosukKwon Sep 1, 2022
bbd9a9b
Merge branch 'master' into fine-grained-optimizer
WoosukKwon Sep 1, 2022
5150275
Merge branch 'master' into fine-grained-optimizer
WoosukKwon Sep 14, 2022
ee90945
Merge branch 'master' into fine-grained-optimizer
WoosukKwon Oct 10, 2022
2ae3043
Merge branch 'master' into fine-grained-optimizer
WoosukKwon Nov 21, 2022
82cf40c
Fix TPU error
WoosukKwon Nov 22, 2022
7caccf0
Fix optimizer printer format
WoosukKwon Nov 22, 2022
74c04f5
yapf
WoosukKwon Nov 22, 2022
c027334
Remove resource validations
WoosukKwon Nov 23, 2022
6908d51
yapf
WoosukKwon Nov 23, 2022
d7bf962
Fix a bug in gcp region zones
WoosukKwon Nov 23, 2022
531215a
Indentation
WoosukKwon Nov 27, 2022
082c0c1
Minor refactoring
WoosukKwon Nov 27, 2022
0f1ca3f
Fix comment
WoosukKwon Nov 27, 2022
ead9f0d
Remove redundant check
WoosukKwon Nov 27, 2022
f645f4e
Add more detailed comments
WoosukKwon Nov 27, 2022
aa894cd
Move the comments
WoosukKwon Nov 27, 2022
3549c8a
Add a comment on batching
WoosukKwon Nov 27, 2022
ff83990
Fix comment
WoosukKwon Nov 27, 2022
fb59095
resources -> launchable_resources
WoosukKwon Nov 27, 2022
fda9a04
Fix comment
WoosukKwon Nov 27, 2022
cadd67d
Fix method name
WoosukKwon Nov 27, 2022
1d63eea
_generate_launchables_with_region_zones -> _make_launchables_for_vali…
WoosukKwon Nov 27, 2022
9456793
Add comment on knowledge leakage
WoosukKwon Nov 27, 2022
63a3b3e
Add a comment
WoosukKwon Nov 27, 2022
bb5854c
Move region-zone filtering from optimizer to resources
WoosukKwon Nov 27, 2022
1eeecee
yapf
WoosukKwon Nov 27, 2022
2552f64
Minor
WoosukKwon Nov 28, 2022
c9fe1c8
Minor
WoosukKwon Nov 28, 2022
9e3f465
Add pointer
WoosukKwon Nov 28, 2022
3d34df2
Merge branch 'master' into fine-grained-optimizer
WoosukKwon Nov 28, 2022
15ee85d
Merge branch 'master' into fine-grained-optimizer
WoosukKwon Dec 7, 2022
844a5c9
Address comments
WoosukKwon Dec 7, 2022
bf6b357
Merge branch 'master' into fine-grained-optimizer
WoosukKwon Dec 9, 2022
6ecef54
fix indentation
WoosukKwon Dec 9, 2022
d1a8abf
Fix typo
WoosukKwon Dec 9, 2022
92ffb01
Rename method
WoosukKwon Dec 9, 2022
f0723fd
Fix comment
WoosukKwon Dec 9, 2022
a685c62
Add detailed comments
WoosukKwon Dec 9, 2022
3efa68b
Minor
WoosukKwon Dec 9, 2022
6b53f17
Minor fix
WoosukKwon Dec 9, 2022
7ad4e2f
Address comments
WoosukKwon Dec 9, 2022
9665c74
Address comments
WoosukKwon Dec 9, 2022
5b69f81
Fix bug in filtering blocked resources
WoosukKwon Dec 9, 2022
4896de0
Minor
WoosukKwon Dec 10, 2022
8c60481
Fix warning msg
WoosukKwon Dec 10, 2022
be98931
Merge branch 'master' into fine-grained-optimizer
WoosukKwon Dec 10, 2022
556ef04
yapf
WoosukKwon Dec 10, 2022
8f2d9a8
Display region or zone in the optimizer msg
WoosukKwon Dec 10, 2022
c344643
yapf
WoosukKwon Dec 10, 2022
3339529
Add a comment
WoosukKwon Dec 12, 2022
914ce89
Merge branch 'master' into fine-grained-optimizer
WoosukKwon Dec 14, 2022
4879ce5
Refactor region_zones_provision_loop
WoosukKwon Dec 14, 2022
d95bc69
Fix minor bug
WoosukKwon Dec 14, 2022
4ddee1d
REGION (ZONE) to REGION/ZONE
WoosukKwon Dec 14, 2022
2c79e67
Fix style
WoosukKwon Dec 14, 2022
618866e
Rename
WoosukKwon Dec 14, 2022
cc6153b
Fix docstring
WoosukKwon Dec 14, 2022
aaa3b19
Fix docstring
WoosukKwon Dec 14, 2022
b1232e3
use_spot before region and zone
WoosukKwon Dec 14, 2022
1718875
yapf
WoosukKwon Dec 14, 2022
23ff576
Fix docstring
WoosukKwon Dec 14, 2022
42f5741
Fix minor bug
WoosukKwon Dec 14, 2022
2b84590
Trying other launchables -> locations
WoosukKwon Dec 15, 2022
114c8d0
yapf
WoosukKwon Dec 15, 2022
56ec033
Fix the argument order
WoosukKwon Dec 16, 2022
deb6a1d
Add region and zone args
WoosukKwon Dec 17, 2022
07a7cdb
Fix comment
WoosukKwon Dec 17, 2022
6e611c4
Fix comment
WoosukKwon Dec 17, 2022
88f5932
Ensure that zone is None for Azure
WoosukKwon Dec 17, 2022
ee54ee6
provision loop -> regions_with_offering
WoosukKwon Dec 17, 2022
a33e4e8
Check if instance_type is None
WoosukKwon Dec 17, 2022
7da4d82
Fix a bug
WoosukKwon Dec 17, 2022
056eefd
yapf
WoosukKwon Dec 17, 2022
57e4ca5
Add back the accelerator region check
WoosukKwon Dec 17, 2022
984c8c2
Merge branch 'master' into fine-grained-optimizer
WoosukKwon Dec 17, 2022
d0c5b24
Fix err msg
WoosukKwon Dec 17, 2022
3b572fd
Roll back the change
WoosukKwon Dec 17, 2022
5b8e844
Make sets
WoosukKwon Dec 17, 2022
45997e7
Merge branch 'master' into fine-grained-optimizer
WoosukKwon Dec 22, 2022
8a64db3
Merge branch 'master' into fine-grained-optimizer
WoosukKwon Dec 26, 2022
3be9e4c
Minor
WoosukKwon Dec 26, 2022
9ed5d85
Set[Region] -> List[Region] for deterministic tie-breaking
WoosukKwon Dec 26, 2022
f49694f
Enable the validation check for fractional accelerators
WoosukKwon Dec 26, 2022
3bfdd5d
yapf
WoosukKwon Dec 26, 2022
ea79e31
Fix provisioner msg
WoosukKwon Dec 26, 2022
f1a41cb
Add a comment
WoosukKwon Dec 26, 2022
5398b34
Merge branch 'master' into fine-grained-optimizer
WoosukKwon Dec 29, 2022
e3fa0f2
yapf
WoosukKwon Dec 29, 2022
59 changes: 41 additions & 18 deletions sky/backends/cloud_vm_ray_backend.py
@@ -606,7 +606,12 @@ def _update_blocklist_on_gcp_error(self, region, zones, stdout, stderr):
# This skip is only correct if we implement "first
# retry the region/zone of an existing cluster with the
# same name" correctly.
for r, _ in clouds.GCP.region_zones_provision_loop():
for r in clouds.GCP.regions_with_offering(
instance_type=None,
accelerators=None,
use_spot=False,
region=None,
zone=None):
self._blocked_regions.add(r.name)
else:
# Per region. Ex: Quota 'CPUS' exceeded. Limit: 24.0
@@ -661,7 +666,6 @@ def _update_blocklist_on_gcp_error(self, region, zones, stdout, stderr):
'check logs above.')

def _update_blocklist_on_aws_error(self, region, zones, stdout, stderr):
del zones # Unused.
style = colorama.Style
stdout_splits = stdout.split('\n')
stderr_splits = stderr.split('\n')
@@ -692,9 +696,13 @@ def _update_blocklist_on_aws_error(self, region, zones, stdout, stderr):
with ux_utils.print_exception_no_traceback():
raise RuntimeError('Errors occurred during provision; '
'check logs above.')
# The underlying ray autoscaler / boto3 will try all zones of a region
# at once.
logger.warning(f'Got error(s) in all zones of {region.name}:')
if set(zones) == set(region.zones):
# The underlying ray autoscaler / boto3 will try all zones of a
# region at once.
logger.warning(f'Got error(s) in all zones of {region.name}:')
else:
zones_str = ', '.join(z.name for z in zones)
logger.warning(f'Got error(s) in {zones_str}:')
messages = '\n\t'.join(errors)
logger.warning(f'{style.DIM}\t{messages}{style.RESET_ALL}')
self._blocked_regions.add(region.name)
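The new warning logic in this hunk distinguishes a failure that covers every zone of a region from one limited to specific zones. Below is a minimal, self-contained sketch of that decision. The `Zone` and `Region` dataclasses and the `failure_scope` helper are hypothetical stand-ins for illustration, not the `sky.clouds` classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Zone:
    """Hypothetical stand-in for a cloud zone (hashable, compared by name)."""
    name: str


@dataclass
class Region:
    """Hypothetical stand-in for a cloud region and its zones."""
    name: str
    zones: List[Zone] = field(default_factory=list)


def failure_scope(region: Region, failed_zones: List[Zone]) -> str:
    """Describe where provisioning failed, following the branch in the hunk."""
    if set(failed_zones) == set(region.zones):
        # Every zone of the region was attempted and failed.
        return f'all zones of {region.name}'
    # Only a subset of zones was attempted; list them by name.
    return ', '.join(z.name for z in failed_zones)


us_east_1 = Region('us-east-1', [Zone('us-east-1a'), Zone('us-east-1b')])
whole_region = failure_scope(us_east_1, us_east_1.zones)
single_zone = failure_scope(us_east_1, [Zone('us-east-1a')])
```

Comparing sets rather than lists makes the check insensitive to the order in which zones were tried.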
@@ -1148,9 +1156,14 @@ def _retry_region_zones(self,
CloudVmRayBackend().teardown_no_lock(handle,
terminate=need_terminate)

message = ('Failed to acquire resources in all regions/zones of '
f'{to_provision.cloud}. '
'Try changing resource requirements or use another cloud.')
if to_provision.zone is None:
message = ('Failed to acquire resources in all zones in '
f'{to_provision.region}. Try changing resource '
'requirements or use another region.')
Collaborator

Do we assume that to_provision.region is never None?

Collaborator Author

@Michaelvll Good and tough question. One possible exception is local clusters, which I didn't consider.

Do you see any other possible exceptional case?

Collaborator

I was wondering whether the optimizer always fills in the region of to_provision for public clouds. If so, I think it is fine to keep the current approach (I could not think of an exception).
nit: It would be nice to add a comment somewhere noting that the optimizer always fills in the region for public clouds.

Collaborator Author

Fixed. PTAL.

else:
message = (
f'Failed to acquire resources in {to_provision.zone}. '
'Try changing resource requirements or use another zone.')
# Do not failover to other clouds if the cluster was previously
# UP or STOPPED, since the user can have some data on the cluster.
raise exceptions.ResourcesUnavailableError(
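The review thread in this hunk asks what happens when `to_provision.region` is `None`. As a sketch only, the message selection can be written so that all three cases are explicit. The function name `unavailable_message` and its string arguments are assumptions made for illustration; the third branch reuses the pre-change message from the diff as a fallback.

```python
from typing import Optional


def unavailable_message(cloud: str, region: Optional[str],
                        zone: Optional[str]) -> str:
    """Pick a failure message at the finest granularity that is known."""
    if zone is not None:
        return (f'Failed to acquire resources in {zone}. '
                'Try changing resource requirements or use another zone.')
    if region is not None:
        return (f'Failed to acquire resources in all zones in {region}. '
                'Try changing resource requirements or use another region.')
    # Assumed fallback for the case raised in review (for example, local
    # clusters), where neither a region nor a zone has been filled in.
    return (f'Failed to acquire resources in all regions/zones of {cloud}. '
            'Try changing resource requirements or use another cloud.')


zone_msg = unavailable_message('AWS', 'us-west-2', 'us-west-2a')
region_msg = unavailable_message('AWS', 'us-west-2', None)
cloud_msg = unavailable_message('AWS', None, None)
```

Handling the `None` region explicitly avoids printing "all zones in None" if the optimizer ever leaves the region unset.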
@@ -1517,10 +1530,14 @@ def provision_with_retries(

logger.warning(e)
provision_failed = True
if to_provision.zone is None:
region_or_zone_str = str(to_provision.region)
else:
region_or_zone_str = str(to_provision.zone)
logger.warning(
f'\n{style.BRIGHT}Provision failed for {num_nodes}x '
f'{to_provision}. Trying other launchable resources '
f'(if any).{style.RESET_ALL}')
f'{to_provision} in {region_or_zone_str}. '
f'Trying other locations (if any).{style.RESET_ALL}')
if not cluster_exists:
# Add failed resources to the blocklist, only when it
# is in fallback mode.
@@ -1899,14 +1916,20 @@ def _provision(self,
backoff = common_utils.Backoff(_RETRY_UNTIL_UP_INIT_GAP_SECONDS)
attempt_cnt = 1
while True:
# RetryingVmProvisioner will retry within a cloud's regions
# first (if a region is not explicitly requested), then
# optionally retry on all other clouds (if
# backend.register_info() has been called). After this "round"
# of optimization across clouds, provisioning may still have
# not succeeded. This while loop will then kick in if
# retry_until_up is set, which will kick off new "rounds" of
# optimization infinitely.
# For on-demand instances, RetryingVmProvisioner will retry
# within the given region first, then optionally retry on all
# other clouds and regions (if backend.register_info()
# has been called).
# For spot instances, each provisioning request is made for a
# single zone and the provisioner will retry on all other
# clouds, regions, and zones.
# See optimizer.py#_make_launchables_for_valid_region_zones()
# for detailed reasons.

# After this "round" of optimization across clouds, provisioning
# may still have not succeeded. This while loop will then kick
# in if retry_until_up is set, which will kick off new "rounds"
# of optimization infinitely.
try:
provisioner = RetryingVmProvisioner(self.log_dir, self._dag,
self._optimize_target,
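The comment in this hunk describes repeated "rounds" of optimization and provisioning when `retry_until_up` is set. The following is a rough sketch of such an outer loop with exponential backoff between rounds. Everything here is hypothetical: `backoff_gaps`, `provision_until_up`, `run_round`, and `max_rounds` are illustrative names and do not reflect the signature of `common_utils.Backoff` or `RetryingVmProvisioner`.

```python
from typing import Callable, Iterator, Optional


def backoff_gaps(initial: float, factor: float = 2.0,
                 cap: float = 300.0) -> Iterator[float]:
    """Yield exponentially growing wait times, never exceeding `cap`."""
    gap = initial
    while True:
        yield min(gap, cap)
        gap *= factor


def provision_until_up(run_round: Callable[[], bool],
                       retry_until_up: bool,
                       max_rounds: Optional[int] = None,
                       sleep: Callable[[float], None] = lambda _: None) -> int:
    """Run provisioning rounds until one succeeds; return the round count.

    `run_round` stands for one full pass over clouds, regions, and zones.
    `max_rounds` is an assumption added so the sketch can terminate; the
    loop described in the diff retries indefinitely.
    """
    gaps = backoff_gaps(1.0)
    attempt = 1
    while True:
        if run_round():
            return attempt
        out_of_rounds = max_rounds is not None and attempt >= max_rounds
        if not retry_until_up or out_of_rounds:
            raise RuntimeError(f'Provisioning failed after {attempt} round(s).')
        sleep(next(gaps))  # Wait before starting the next round.
        attempt += 1


# Two failed rounds followed by a successful one.
outcomes = iter([False, False, True])
rounds = provision_until_up(lambda: next(outcomes), retry_until_up=True)
```

The default `sleep` is a no-op so the sketch runs instantly; passing `time.sleep` would make the waits real.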
64 changes: 48 additions & 16 deletions sky/clouds/aws.py
@@ -6,7 +6,7 @@
import os
import subprocess
import typing
from typing import Dict, Iterator, List, Optional, Tuple
from typing import Dict, Iterator, List, Optional, Set, Tuple

from sky import clouds
from sky import exceptions
@@ -71,6 +71,28 @@ def regions(cls):
]
return cls._regions

@classmethod
def regions_with_offering(cls, instance_type: Optional[str],
accelerators: Optional[Dict[str, int]],
use_spot: bool, region: Optional[str],
zone: Optional[str]) -> Set[clouds.Region]:
del accelerators # unused
if instance_type is None:
# Fall back to default regions
regions = cls.regions()
else:
regions = service_catalog.get_region_zones_for_instance_type(
instance_type, use_spot, 'aws')

if region is not None:
regions = [r for r in regions if r.name == region]
if zone is not None:
for r in regions:
r.set_zones([z for z in r.zones if z.name == zone])
regions = [r for r in regions if r.zones]

return set(regions)

@classmethod
def region_zones_provision_loop(
cls,
@@ -81,14 +103,11 @@ def region_zones_provision_loop(
) -> Iterator[Tuple[clouds.Region, List[clouds.Zone]]]:
# AWS provisioner can handle batched requests, so yield all zones under
# each region.
del accelerators # unused

if instance_type is None:
# fallback to manually specified region/zones
regions = cls.regions()
else:
regions = service_catalog.get_region_zones_for_instance_type(
instance_type, use_spot, 'aws')
regions = cls.regions_with_offering(instance_type,
accelerators,
use_spot,
region=None,
zone=None)
for region in regions:
yield region, region.zones
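The comment in this hunk notes that the AWS provisioner handles batched requests, so the loop yields all zones of a region at once. To make the contrast with single-zone requests concrete (the backend comment in this PR says spot requests are made per zone), here is a small sketch of both iteration styles. `Zone`, `Region`, `batched_loop`, and `per_zone_loop` are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Zone:
    """Hypothetical stand-in for a cloud zone."""
    name: str


@dataclass
class Region:
    """Hypothetical stand-in for a cloud region and its zones."""
    name: str
    zones: List[Zone] = field(default_factory=list)


def batched_loop(
        regions: Iterable[Region]) -> Iterator[Tuple[Region, List[Zone]]]:
    """One provisioning request per region, covering all of its zones."""
    for region in regions:
        yield region, region.zones


def per_zone_loop(
        regions: Iterable[Region]) -> Iterator[Tuple[Region, List[Zone]]]:
    """One provisioning request per zone, for provisioners without batching."""
    for region in regions:
        for zone in region.zones:
            yield region, [zone]


regions = [
    Region('r1', [Zone('r1-a'), Zone('r1-b')]),
    Region('r2', [Zone('r2-a')]),
]
batched = [(r.name, [z.name for z in zs]) for r, zs in batched_loop(regions)]
per_zone = [(r.name, [z.name for z in zs]) for r, zs in per_zone_loop(regions)]
```

Both generators yield the same `(region, zones)` shape, so a caller can treat batched and per-zone requests uniformly.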

@@ -148,14 +167,23 @@ def get_zone_shell_cmd(cls) -> Optional[str]:

#### Normal methods ####

def instance_type_to_hourly_cost(self, instance_type: str, use_spot: bool):
def instance_type_to_hourly_cost(self,
instance_type: str,
use_spot: bool,
region: Optional[str] = None,
zone: Optional[str] = None) -> float:
return service_catalog.get_hourly_cost(instance_type,
region=None,
use_spot=use_spot,
region=region,
zone=zone,
clouds='aws')

def accelerators_to_hourly_cost(self, accelerators,
use_spot: bool) -> float:
def accelerators_to_hourly_cost(self,
accelerators: Dict[str, int],
use_spot: bool,
region: Optional[str] = None,
zone: Optional[str] = None) -> float:
del accelerators, use_spot, region, zone # unused
# AWS includes accelerators as part of the instance type. Implementing
# this is also necessary for e.g., the instance may have 4 GPUs, while
# the task specifies to use 1 GPU.
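This hunk threads `region` and `zone` through the hourly-cost lookup. A minimal sketch of a lookup with optional region and zone filters follows. The catalog rows, prices, and the name `cheapest_hourly_cost` are all made up for illustration, and returning the cheapest matching row when a filter is omitted is an assumption, not a statement about how `service_catalog.get_hourly_cost` behaves.

```python
from typing import List, NamedTuple, Optional


class Offering(NamedTuple):
    """One hypothetical catalog row; all values below are invented."""
    instance_type: str
    region: str
    zone: str
    price: float
    spot_price: float


CATALOG: List[Offering] = [
    Offering('example-gpu', 'region-a', 'region-a-1', 3.00, 0.95),
    Offering('example-gpu', 'region-a', 'region-a-2', 3.00, 0.92),
    Offering('example-gpu', 'region-b', 'region-b-1', 3.20, 1.10),
]


def cheapest_hourly_cost(instance_type: str,
                         use_spot: bool,
                         region: Optional[str] = None,
                         zone: Optional[str] = None) -> float:
    """Return the lowest hourly price among rows matching the filters."""
    rows = [
        o for o in CATALOG
        if o.instance_type == instance_type and
        (region is None or o.region == region) and
        (zone is None or o.zone == zone)
    ]
    if not rows:
        raise ValueError(
            f'No offering for {instance_type} in region={region}, zone={zone}.')
    return min(o.spot_price if use_spot else o.price for o in rows)


any_spot = cheapest_hourly_cost('example-gpu', use_spot=True)
zone_spot = cheapest_hourly_cost('example-gpu', use_spot=True,
                                 zone='region-a-1')
region_on_demand = cheapest_hourly_cost('example-gpu', use_spot=False,
                                        region='region-b')
```

Spot prices often differ per zone while on-demand prices tend to differ only per region, which is one reason a zone argument is useful for spot lookups.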
@@ -277,9 +305,13 @@ def _make(instance_list):
assert len(accelerators) == 1, resources
acc, acc_count = list(accelerators.items())[0]
(instance_list, fuzzy_candidate_list
) = service_catalog.get_instance_type_for_accelerator(acc,
acc_count,
clouds='aws')
) = service_catalog.get_instance_type_for_accelerator(
acc,
acc_count,
use_spot=resources.use_spot,
region=resources.region,
zone=resources.zone,
clouds='aws')
if instance_list is None:
return ([], fuzzy_candidate_list)
return (_make(instance_list), fuzzy_candidate_list)
63 changes: 48 additions & 15 deletions sky/clouds/azure.py
@@ -3,7 +3,7 @@
import os
import subprocess
import typing
from typing import Dict, Iterator, List, Optional, Tuple
from typing import Dict, Iterator, List, Optional, Set, Tuple

from sky import clouds
from sky.adaptors import azure
@@ -38,13 +38,23 @@ class Azure(clouds.Cloud):
_REPR = 'Azure'
_regions: List[clouds.Region] = []

def instance_type_to_hourly_cost(self, instance_type, use_spot):
def instance_type_to_hourly_cost(self,
instance_type: str,
use_spot: bool,
region: Optional[str] = None,
zone: Optional[str] = None) -> float:
return service_catalog.get_hourly_cost(instance_type,
region=None,
use_spot=use_spot,
region=region,
zone=zone,
clouds='azure')

def accelerators_to_hourly_cost(self, accelerators, use_spot):
def accelerators_to_hourly_cost(self,
accelerators: Dict[str, int],
use_spot: bool,
region: Optional[str] = None,
zone: Optional[str] = None) -> float:
del accelerators, use_spot, region, zone # unused
# Azure includes accelerators as part of the instance type.
# Implementing this is also necessary for e.g., the instance may have 4
# GPUs, while the task specifies to use 1 GPU.
@@ -133,6 +143,28 @@ def regions(cls) -> List[clouds.Region]:
]
return cls._regions

@classmethod
def regions_with_offering(cls, instance_type: Optional[str],
accelerators: Optional[Dict[str, int]],
use_spot: bool, region: Optional[str],
zone: Optional[str]) -> Set[clouds.Region]:
del accelerators # unused
if instance_type is None:
# Fall back to default regions
regions = cls.regions()
else:
regions = service_catalog.get_region_zones_for_instance_type(
instance_type, use_spot, 'azure')

if region is not None:
regions = [r for r in regions if r.name == region]
if zone is not None:
for r in regions:
r.set_zones([z for z in r.zones if z.name == zone])
regions = [r for r in regions if r.zones]

return set(regions)

@classmethod
def region_zones_provision_loop(
cls,
@@ -141,14 +173,11 @@ def region_zones_provision_loop(
accelerators: Optional[Dict[str, int]] = None,
use_spot: bool,
) -> Iterator[Tuple[clouds.Region, List[clouds.Zone]]]:
del accelerators # unused

if instance_type is None:
# fallback to manually specified region/zones
regions = cls.regions()
else:
regions = service_catalog.get_region_zones_for_instance_type(
instance_type, use_spot, clouds='azure')
regions = cls.regions_with_offering(instance_type,
accelerators,
use_spot,
region=None,
zone=None)
for region in regions:
yield region, region.zones

@@ -244,9 +273,13 @@ def _make(instance_list):
assert len(accelerators) == 1, resources
acc, acc_count = list(accelerators.items())[0]
(instance_list, fuzzy_candidate_list
) = service_catalog.get_instance_type_for_accelerator(acc,
acc_count,
clouds='azure')
) = service_catalog.get_instance_type_for_accelerator(
acc,
acc_count,
use_spot=resources.use_spot,
region=resources.region,
zone=resources.zone,
clouds='azure')
if instance_list is None:
return ([], fuzzy_candidate_list)
return (_make(instance_list), fuzzy_candidate_list)
30 changes: 26 additions & 4 deletions sky/clouds/cloud.py
@@ -1,7 +1,7 @@
"""Interfaces: clouds, regions, and zones."""
import collections
import typing
from typing import Dict, Iterator, List, Optional, Tuple
from typing import Dict, Iterator, List, Optional, Set, Tuple

from sky.clouds import service_catalog
from sky.utils import ux_utils
@@ -61,6 +61,25 @@ class Cloud:
def regions(cls) -> List[Region]:
raise NotImplementedError

@classmethod
def regions_with_offering(cls, instance_type: Optional[str],
accelerators: Optional[Dict[str, int]],
use_spot: bool, region: Optional[str],
zone: Optional[str]) -> Set[Region]:
"""Returns the regions that offer the specified resources.

When region or zone is not None, the returned value will be limited to
the specified region/zone.

Returns:
A set of `Region`s that have the offerings for the specified
resources.
For each `Region` in the set, `region.zones` is the list of `Zone`s
which have the offerings. For the clouds that do not expose `Zone`s,
`region.zones` is an empty list.
"""
raise NotImplementedError

@classmethod
def region_zones_provision_loop(
cls,
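The `regions_with_offering` docstring in this hunk says the result is narrowed to the requested region or zone, and that each returned region lists only the zones that have the offering. One way to sketch that filtering is shown below. It deliberately builds new `Region` objects instead of calling `set_zones` on the inputs, which is a design choice of this sketch (to keep any cached region list intact), not what the diff does. It also returns a list, in line with the later commit that switched `Set[Region]` to `List[Region]` for deterministic tie-breaking. `Zone`, `Region`, and `filter_regions` are hypothetical names.

```python
import dataclasses
from typing import List, Optional, Sequence, Tuple


@dataclasses.dataclass(frozen=True)
class Zone:
    """Hypothetical stand-in for a cloud zone."""
    name: str


@dataclasses.dataclass(frozen=True)
class Region:
    """Hypothetical immutable region; zones are stored as a tuple."""
    name: str
    zones: Tuple[Zone, ...] = ()


def filter_regions(regions: Sequence[Region],
                   region: Optional[str] = None,
                   zone: Optional[str] = None) -> List[Region]:
    """Narrow `regions` to the requested region/zone without mutating them."""
    selected = list(regions)
    if region is not None:
        selected = [r for r in selected if r.name == region]
    if zone is not None:
        narrowed = []
        for r in selected:
            zones = tuple(z for z in r.zones if z.name == zone)
            if zones:
                # Copy the region with only the matching zone attached.
                narrowed.append(dataclasses.replace(r, zones=zones))
        selected = narrowed
    return selected


catalog = [
    Region('us-east-1', (Zone('us-east-1a'), Zone('us-east-1b'))),
    Region('us-west-2', (Zone('us-west-2a'),)),
]
only_zone = filter_regions(catalog, zone='us-east-1b')
only_region = filter_regions(catalog, region='us-west-2')
```

Because the inputs are never modified, calling `filter_regions` with one zone does not affect a later call with a different zone.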
@@ -103,12 +122,15 @@ def get_zone_shell_cmd(cls) -> Optional[str]:

#### Normal methods ####

# TODO: incorporate region/zone into the API.
def instance_type_to_hourly_cost(self, instance_type, use_spot):
def instance_type_to_hourly_cost(self, instance_type: str, use_spot: bool,
region: Optional[str],
zone: Optional[str]) -> float:
"""Returns the hourly on-demand/spot price for an instance type."""
raise NotImplementedError

def accelerators_to_hourly_cost(self, accelerators, use_spot):
def accelerators_to_hourly_cost(self, accelerators: Dict[str, int],
use_spot: bool, region: Optional[str],
zone: Optional[str]) -> float:
"""Returns the hourly on-demand price for accelerators."""
raise NotImplementedError
