Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AWS] Failover Optimization from Quota #1953

Merged
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
Show all changes
34 commits
Select commit Hold shift + click to select a range
2d1bf15
initial commit
Apr 18, 2023
d674726
more quick changes
Apr 19, 2023
f91a3f1
try push
Apr 21, 2023
164b490
remove test lines
Apr 21, 2023
a647263
quick file add
Apr 21, 2023
8053c9f
things finally working
Apr 21, 2023
c29da94
Merge branch 'master' into aws-quota-optimization
shethhriday29 Apr 21, 2023
63fd297
refactoring
Apr 26, 2023
bc77487
quick string fix
Apr 26, 2023
b858a94
removed test cases
May 12, 2023
6bc102a
Merge branch 'master' of github.com:shethhriday29/skypilot into aws-q…
Jun 17, 2023
959dae4
Merge branch 'master' of github.com:shethhriday29/skypilot into aws-q…
Jun 17, 2023
fe2c0ff
formatting changes, comments, and docstring
Jun 17, 2023
45dd0ff
added test for aws zero quota failover
Jun 19, 2023
f472bf4
Update sky/clouds/cloud.py
shethhriday29 Jun 19, 2023
8d04c00
merge with master
Jun 19, 2023
4e843b8
edits based on romil review
Jun 19, 2023
2ffa694
ran formatter
Jun 19, 2023
669af4c
more romil comments
Jun 20, 2023
f6b8b69
formatting
Jun 20, 2023
1ee1d89
zhanghao edits
Jun 22, 2023
91f22f8
Update tests/test_smoke.py
shethhriday29 Jun 22, 2023
1b8bcc2
Update sky/backends/cloud_vm_ray_backend.py
shethhriday29 Jun 22, 2023
8aa8722
formatting
Jun 22, 2023
93d5f1d
Update tests/test_smoke.py
shethhriday29 Jun 23, 2023
4ae2214
Update sky/clouds/aws.py
shethhriday29 Jun 23, 2023
8943ef1
Update sky/clouds/aws.py
shethhriday29 Jun 23, 2023
e2db07b
took out imports
Jun 23, 2023
268255e
format
Jun 23, 2023
b8b371c
better logging
Jun 24, 2023
f471738
Merge branch 'master' of github.com:shethhriday29/skypilot into aws-q…
Jun 24, 2023
ac54096
formatting
Jun 24, 2023
b5706eb
Merge branch 'master' of github.com:shethhriday29/skypilot into aws-q…
shethhriday29 Jun 26, 2023
0b0a76d
zongheng comments
shethhriday29 Jun 26, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 9 additions & 8 deletions sky/backends/cloud_vm_ray_backend.py
Original file line number Diff line number Diff line change
Expand Up @@ -1335,23 +1335,24 @@ def _retry_zones(
# the instance type in the target region. If not, fail early
# instead of trying to provision and failing later.
try:
shethhriday29 marked this conversation as resolved.
Show resolved Hide resolved
has_non_zero_quota = to_provision.cloud.check_quota_not_zero(
need_provision = to_provision.cloud.check_quota_available(
to_provision.region, to_provision.instance_type,
to_provision.use_spot)

except Exception as e: # pylint: disable=broad-except
has_non_zero_quota = True
logger.info(
f'Error occurred when trying to check quota. '
f'Proceeding assuming quotas are available. Error: {str(e)}')
need_provision = True
logger.info(f'Error occurred when trying to check quota. '
f'Proceeding assuming quotas are available. Error: '
f'{common_utils.format_exception(e, use_bracket=True)}')

if not has_non_zero_quota:
if not need_provision:
# if quota is found to be zero, raise exception and skip to
# the next region
raise exceptions.ResourcesUnavailableError(
f'Found no quota for {to_provision.instance_type} in region '
f'{to_provision.region}. To request quotas in this region, '
f'visit http://aws.amazon.com/contact-us/ec2-request.')
f'{to_provision.region}. To request quotas, check the instruction: ' # pylint: disable=line-too-long
f'https://skypilot.readthedocs.io/en/latest/cloud-setup/quota.html.' # pylint: disable=line-too-long
)

for zones in self._yield_zones(to_provision, num_nodes, cluster_name,
prev_cluster_status):
Expand Down
30 changes: 10 additions & 20 deletions sky/clouds/aws.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,9 @@
import typing
from typing import Dict, Iterator, List, Optional, Tuple, Any

import boto3
import botocore
shethhriday29 marked this conversation as resolved.
Show resolved Hide resolved

from sky import clouds
from sky import exceptions
from sky import sky_logging
Expand Down Expand Up @@ -416,13 +419,6 @@ def _make(instance_list):
@functools.lru_cache(maxsize=1) # Cache since getting identity is slow.
def check_credentials(cls) -> Tuple[bool, Optional[str]]:
"""Checks if the user has access credentials to this cloud."""
try:
# pylint: disable=import-outside-toplevel,unused-import
import boto3
import botocore
except ImportError:
raise ImportError('Fail to import dependencies for AWS.'
'Try pip install "skypilot[aws]"') from None

# Checks if the AWS CLI is installed properly
proc = subprocess.run('aws --version',
Expand Down Expand Up @@ -696,12 +692,14 @@ def _get_disk_specs(cls, disk_tier: Optional[str]) -> Dict[str, Any]:
}

@classmethod
def check_quota_not_zero(cls,
region: str,
instance_type: str,
use_spot: bool = False) -> bool:
def check_quota_available(cls,
region: str,
instance_type: str,
use_spot: bool = False) -> bool:
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style nit: Having a single line description of this function in the first line.

Suggested change
"""
"""Check if the quota is available for the requested instance_type

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Google styleguide: First line should be on the same line as the triple quotes. https://google.github.io/styleguide/pyguide.html#383-functions-and-methods

AWS-specific implmentation of check_quota_not_zero. The function works by
Check if the quota is available for the requested instance_type

AWS-specific implmentation of check_quota_available. The function works by
matching the instance_type to the corresponding AWS quota code, and then using
the boto3 Python API to query the region for the specific quota code.

Expand All @@ -712,14 +710,6 @@ def check_quota_not_zero(cls,
botocore.exceptions.ClientError: error in Boto3 client request.
"""

try:
# pylint: disable=import-outside-toplevel
import boto3
import botocore
except ImportError:
raise ImportError('Fail to import dependencies for AWS.'
'Try pip install "skypilot[aws]"') from None

from sky.clouds.service_catalog import aws_catalog # pylint: disable=import-outside-toplevel,unused-import
shethhriday29 marked this conversation as resolved.
Show resolved Hide resolved

quota_code = aws_catalog.get_quota_code(instance_type, use_spot)
Expand Down
12 changes: 6 additions & 6 deletions sky/clouds/cloud.py
Original file line number Diff line number Diff line change
Expand Up @@ -470,15 +470,15 @@ def check_disk_tier_enabled(cls, instance_type: str,

@classmethod
# pylint: disable=unused-argument
def check_quota_not_zero(cls,
region: str,
instance_type: str,
use_spot: bool = False) -> bool:
def check_quota_available(cls,
region: str,
instance_type: str,
use_spot: bool = False) -> bool:
"""
Checks to ensure that a particular accelerator has a nonzero quota
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto (styleguide)

in a region.

(Currently, check_quota_not_zero is only implemented for AWS.)
(Currently, check_quota_available is only implemented for AWS.)

The _retry_zones funtion in cloud_vm_ray_backend goes through different
candidate regions and attempts to provision the requested instance_type
Expand Down Expand Up @@ -509,7 +509,7 @@ def check_quota_not_zero(cls,
have an adequate quota.

Design choice: We chose a just-in-time approach where
check_quota_not_zero is called immediately before a potential
check_quota_available is called immediately before a potential
attempted provision, rather than checking all region quotas
beforehand, storing them, and using those values on-demand. This is
because, for example, _retry_zones may only need to go through one or
Expand Down
4 changes: 2 additions & 2 deletions tests/test_smoke.py
Original file line number Diff line number Diff line change
Expand Up @@ -190,7 +190,7 @@ def get_aws_region_for_quota_failover() -> Optional[str]:
zone=None)

for region in candidate_regions:
if not AWS.check_quota_not_zero(
if not AWS.check_quota_available(
region=region.name, instance_type="p3.16xlarge", use_spot=True):
shethhriday29 marked this conversation as resolved.
Show resolved Hide resolved
return region.name

Expand Down Expand Up @@ -2227,7 +2227,7 @@ def test_aws_zero_quota_failover():

test = Test('aws-zero-quota-failover', [
f'sky launch -y -c {name} --cloud aws --region {region} --gpus V100:8 --use-spot | grep "Found no quota"',
])
], f'sky down -y {name}')
shethhriday29 marked this conversation as resolved.
Show resolved Hide resolved
run_one_test(test)


Expand Down