Add support for Lambda Labs #1557

Merged: 47 commits, Jan 30, 2023
Changes from 14 commits
54b4dea
Apply gmittal's lambda lab PR (#1136) on top of commit ad37a47
ewzeng Dec 9, 2022
eefa6c6
Basic working Lambda Labs support
ewzeng Dec 16, 2022
2c4d72a
Add error handling for Lambda Labs API and small lambda-ray.yml bugfix
ewzeng Dec 19, 2022
652064b
Add automatic key generation, improve sky check, and resolve import bug
ewzeng Dec 19, 2022
1679e5f
Improve Lambda Labs launch code and error handling
ewzeng Dec 20, 2022
962a9c6
Remove bootstrap_config, change metadata file design, and resolve
ewzeng Dec 21, 2022
6333ed7
Make autodown work on Lambda Labs
ewzeng Dec 22, 2022
ac0a336
Add basic tests and improve lambda-ray.yml.j2 bugfix
ewzeng Dec 22, 2022
b711534
Add sky cancel test and do not allow Lambda nodes to stop
ewzeng Dec 23, 2022
9e47585
Polish provider code and change local metadata path to avoid clutter
ewzeng Dec 23, 2022
3191847
Update and move catalog out of repo
ewzeng Dec 24, 2022
fc0b771
Clean up code
ewzeng Dec 24, 2022
a9a8df7
Cleanup and add CLI logs test
ewzeng Dec 26, 2022
8fda3a7
Disallow --num-nodes > 1 and rename some variables
ewzeng Dec 27, 2022
61f3ccd
Do not let optimizer consider Lambda Labs when launching spot
ewzeng Jan 3, 2023
e7a3cb7
Merge branch 'master' into lambda-labs-v3
ewzeng Jan 4, 2023
9e95dd3
Fix issues arising from merge
ewzeng Jan 4, 2023
245a0c5
Address Michaelvll comments
ewzeng Jan 6, 2023
786fcb6
Address infwinston comments
ewzeng Jan 6, 2023
5344a48
Update Lambda Labs help string
ewzeng Jan 7, 2023
59b93a9
Move Lambda Lab tests into smoke tests and change local tag file
ewzeng Jan 8, 2023
aa37002
Improve remote node detection
ewzeng Jan 11, 2023
ddca707
Change tag file scheme
ewzeng Jan 17, 2023
3af5de7
Add comments and change region_zone lookup
ewzeng Jan 17, 2023
a0ec422
Use same tag file path for local and remote
ewzeng Jan 17, 2023
08869d2
Merge branch 'master' into lambda-labs-v3
ewzeng Jan 18, 2023
fcb6466
Remove is_remote file
ewzeng Jan 18, 2023
750b1a5
Clean up imports in Lambda Labs node_provider
ewzeng Jan 19, 2023
2564baf
Make optimizer skip clouds that do not implement requested_features
ewzeng Jan 19, 2023
2934040
Rename Lambda Labs client functions, nits
ewzeng Jan 19, 2023
0661757
Improve requested_features implementation, nits
ewzeng Jan 23, 2023
aa119fc
Add type annotations, nits
ewzeng Jan 24, 2023
eedbc3e
Merge branch 'master' into lambda-labs-v3, update Lambda Labs testing
ewzeng Jan 25, 2023
0da8638
Improve pytest serialization logic
ewzeng Jan 25, 2023
29c31ac
Improve requested_features, introduce CloudImplementationFeatures enums
ewzeng Jan 26, 2023
88d02c0
Update lambda_utils.Metadata, address nits
ewzeng Jan 26, 2023
f4bcef9
Fix conftest.py bug introduced in previous commit
ewzeng Jan 26, 2023
8b83718
Update test comment
ewzeng Jan 27, 2023
5831d71
Rename Lambda Labs -> Lambda Cloud
ewzeng Jan 27, 2023
14992ae
Fix tag file reuse bug
ewzeng Jan 27, 2023
131f8d3
Testing nit
ewzeng Jan 27, 2023
dc2f2f1
Fix auth bug and address nits
ewzeng Jan 28, 2023
65375c5
Address final nits
ewzeng Jan 30, 2023
e37c7b2
Merge branch 'master' into lambda-labs-v3
ewzeng Jan 30, 2023
bd79715
Fix typing issues from merge
ewzeng Jan 30, 2023
30ea3d7
Provide basic support for cpus in resource specification
ewzeng Jan 30, 2023
8570d62
Improve 'cpu' resource specification for Lambda Cloud
ewzeng Jan 30, 2023
2 changes: 2 additions & 0 deletions sky/__init__.py
@@ -27,6 +27,7 @@
 AWS = clouds.AWS
 Azure = clouds.Azure
 GCP = clouds.GCP
+Lambda = clouds.Lambda
 Local = clouds.Local
 optimize = Optimizer.optimize

@@ -35,6 +36,7 @@
     'AWS',
     'Azure',
     'GCP',
+    'Lambda',
     'Local',
     'Optimizer',
     'OptimizeTarget',
24 changes: 24 additions & 0 deletions sky/authentication.py
@@ -21,6 +21,7 @@
 from sky.utils import common_utils
 from sky.utils import subprocess_utils
 from sky.utils import ux_utils
+from sky.skylet.providers.lambda_labs.lambda_utils import LambdaLabsClient

 logger = sky_logging.init_logger(__name__)

@@ -299,3 +300,26 @@ def setup_azure_authentication(config: Dict[str, Any]) -> Dict[str, Any]:
     config['file_mounts'] = file_mounts

     return config
+
+
+def setup_lambda_authentication(config: Dict[str, Any]) -> Dict[str, Any]:
+    get_or_generate_keys()
+
+    # Ensure ssh key is registered with Lambda Labs
+    lambda_client = LambdaLabsClient()
+    if lambda_client.ssh_key_name is None:
+        public_key_path = os.path.expanduser(PUBLIC_SSH_KEY_PATH)
+        with open(public_key_path, 'r') as f:
+            public_key = f.read()
+        name = f'{common_utils.get_user_hash()}-sky-key'
+        lambda_client.set_ssh_key(name, public_key)
+
+    # Need to use ~ relative path because Ray uses the same
+    # path for finding the public key path on both local and head node.
+    config['auth']['ssh_public_key'] = PUBLIC_SSH_KEY_PATH
+
+    file_mounts = config['file_mounts']
+    file_mounts[PUBLIC_SSH_KEY_PATH] = PUBLIC_SSH_KEY_PATH
+    config['file_mounts'] = file_mounts
+
+    return config
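Reviewer-style note: the key registration step in `setup_lambda_authentication` is meant to be idempotent (register only when no key is recorded). The sketch below isolates that pattern. It is an illustration under assumptions, not the PR's code: `FakeKeyClient` and `ensure_key_registered` are hypothetical names, the client is an in-memory stand-in that never contacts Lambda, and only the `ssh_key_name` attribute and the `set_ssh_key(name, public_key)` call shape are taken from the diff above.

```python
from typing import Dict, Optional


class FakeKeyClient:
    """Hypothetical in-memory stand-in for LambdaLabsClient (no network)."""

    def __init__(self) -> None:
        self.ssh_key_name: Optional[str] = None
        self.registered: Dict[str, str] = {}

    def set_ssh_key(self, name: str, public_key: str) -> None:
        # Record the key and remember its name, as a real client might.
        self.registered[name] = public_key
        self.ssh_key_name = name


def ensure_key_registered(client: FakeKeyClient, user_hash: str,
                          public_key: str) -> str:
    """Register the key only if the client has none recorded yet."""
    if client.ssh_key_name is None:
        client.set_ssh_key(f'{user_hash}-sky-key', public_key)
    assert client.ssh_key_name is not None
    return client.ssh_key_name


# Placeholder hash and key material, for illustration only.
client = FakeKeyClient()
first = ensure_key_registered(client, 'abc123', 'ssh-rsa AAAA... user@host')
second = ensure_key_registered(client, 'abc123', 'ssh-rsa BBBB... user@host')
```

The second call is a no-op: the name already recorded on the client wins, so a changed local key would not be re-uploaded under this scheme.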
20 changes: 20 additions & 0 deletions sky/backends/backend_utils.py
@@ -43,6 +43,7 @@
 from sky.backends import onprem_utils
 from sky.skylet import constants
 from sky.skylet import log_lib
+from sky.skylet.providers.lambda_labs.lambda_utils import LambdaLabsClient
 from sky.utils import common_utils
 from sky.utils import command_runner
 from sky.utils import env_options
@@ -891,6 +892,8 @@ def _add_auth_to_cluster_config(cloud: clouds.Cloud, cluster_config_file: str):
         config = auth.setup_gcp_authentication(config)
     elif isinstance(cloud, clouds.Azure):
         config = auth.setup_azure_authentication(config)
+    elif isinstance(cloud, clouds.Lambda):
+        config = auth.setup_lambda_authentication(config)
     else:
         assert isinstance(cloud, clouds.Local), cloud
         # Local cluster case, authentication is already filled by the user
@@ -1651,10 +1654,27 @@ def _query_status_azure(
     return _process_cli_query('Azure', cluster, query_cmd, '\t', status_map)


+def _query_status_lambda(
+    cluster: str,
+    ray_config: Dict[str, Any],  # pylint: disable=unused-argument
+) -> List[global_user_state.ClusterStatus]:
+    status_map = {
+        'booting': global_user_state.ClusterStatus.INIT,
+        'active': global_user_state.ClusterStatus.UP,
+    }
+    # TODO(ewzeng): filter by hash_filter_string to be safe
+    vms = LambdaLabsClient().ls().get('data', [])
+    for node in vms:
+        if node['name'] == cluster:
+            return [status_map[node['status']]]
+    return []
+
+
 _QUERY_STATUS_FUNCS = {
     'AWS': _query_status_aws,
     'GCP': _query_status_gcp,
     'Azure': _query_status_azure,
+    'Lambda': _query_status_lambda,
 }
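Reviewer-style note: in `_query_status_lambda`, `status_map[node['status']]` raises `KeyError` for any status string other than `'booting'` or `'active'`. Whether the Lambda API can return other values is an assumption here, not something this diff shows. A tolerant lookup could be sketched as follows; `query_status`, the local `ClusterStatus` enum, and the sample `'terminated'` status are all hypothetical, and the function takes an already-fetched list of instance dicts instead of calling `LambdaLabsClient`.

```python
import enum
from typing import Any, Dict, List


class ClusterStatus(enum.Enum):
    """Stand-in for global_user_state.ClusterStatus (illustration only)."""
    INIT = 'INIT'
    UP = 'UP'


_STATUS_MAP = {
    'booting': ClusterStatus.INIT,
    'active': ClusterStatus.UP,
}


def query_status(cluster: str,
                 vms: List[Dict[str, Any]]) -> List[ClusterStatus]:
    """Return the status of the first node named `cluster`.

    An unmapped status yields an empty list instead of raising KeyError.
    """
    for node in vms:
        if node.get('name') == cluster:
            status = _STATUS_MAP.get(node.get('status'))
            return [] if status is None else [status]
    return []


vms = [
    {'name': 'train', 'status': 'active'},
    {'name': 'other', 'status': 'booting'},
    {'name': 'gone', 'status': 'terminated'},  # assumed, unmapped status
]
```

Returning an empty list treats an unmapped status the same as a missing node; failing loudly, as the PR's code does, is also a defensible choice if unknown statuses should never be silently ignored.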
40 changes: 37 additions & 3 deletions sky/backends/cloud_vm_ray_backend.py
@@ -104,6 +104,7 @@ def _get_cluster_config_template(cloud):
         clouds.AWS: 'aws-ray.yml.j2',
         clouds.Azure: 'azure-ray.yml.j2',
         clouds.GCP: 'gcp-ray.yml.j2',
+        clouds.Lambda: 'lambda-ray.yml.j2',
         clouds.Local: 'local-ray.yml.j2',
     }
     return cloud_to_template[type(cloud)]
@@ -557,9 +558,9 @@ def __init__(self, log_dir: str, dag: 'dag.Dag',
     def _in_blocklist(self, cloud, region, zones):
         if region.name in self._blocked_regions:
             return True
-        # We do not keep track of zones in Azure and Local,
-        # as both clouds do not have zones.
-        if isinstance(cloud, (clouds.Azure, clouds.Local)):
+        # We do not keep track of zones in Azure, Lambda, and Local,
+        # as these clouds do not have zones.
+        if isinstance(cloud, (clouds.Azure, clouds.Local, clouds.Lambda)):
             return False
         assert zones, (cloud, region, zones)
         for zone in zones:
@@ -737,6 +738,32 @@ def _update_blocklist_on_azure_error(self, region, zones, stdout, stderr):
         else:
             self._blocked_regions.add(region.name)

+    def _update_blocklist_on_lambda_error(self, region, zones, stdout, stderr):
+        del zones  # Unused.
+        style = colorama.Style
+        stdout_splits = stdout.split('\n')
+        stderr_splits = stderr.split('\n')
+        errors = [
+            s.strip()
+            for s in stdout_splits + stderr_splits
+            if 'LambdaLabsError:' in s.strip()
+        ]
+        if not errors:
+            logger.info('====== stdout ======')
+            for s in stdout_splits:
+                print(s)
+            logger.info('====== stderr ======')
+            for s in stderr_splits:
+                print(s)
+            with ux_utils.print_exception_no_traceback():
+                raise RuntimeError('Errors occurred during provision; '
+                                   'check logs above.')
+
+        logger.warning(f'Got error(s) in {region.name}:')
+        messages = '\n\t'.join(errors)
+        logger.warning(f'{style.DIM}\t{messages}{style.RESET_ALL}')
+        self._blocked_regions.add(region.name)
+
     def _update_blocklist_on_local_error(self, region, zones, stdout, stderr):
         del zones  # Unused.
         style = colorama.Style
@@ -789,6 +816,10 @@ def _update_blocklist_on_error(self, cloud, region, zones, stdout,
             return self._update_blocklist_on_azure_error(
                 region, zones, stdout, stderr)

+        if isinstance(cloud, clouds.Lambda):
+            return self._update_blocklist_on_lambda_error(
+                region, zones, stdout, stderr)
+
         if isinstance(cloud, clouds.Local):
             return self._update_blocklist_on_local_error(
                 region, zones, stdout, stderr)
@@ -818,6 +849,9 @@ def _yield_region_zones(self, to_provision: resources_lib.Resources,
                     elif cloud.is_same_cloud(clouds.Azure()):
                         region = config['provider']['location']
                         zones = None
+                    elif cloud.is_same_cloud(clouds.Lambda()):
+                        region = config['provider']['region']
+                        zones = None
                     elif cloud.is_same_cloud(clouds.Local()):
                         local_regions = clouds.Local.regions()
                         region = local_regions[0].name
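Reviewer-style note: the failover logic added in `_update_blocklist_on_lambda_error` comes down to two steps: collect output lines that contain `LambdaLabsError:`, then block the whole region (Lambda has no zones to block individually). A standalone sketch of that flow, without logging or colorama, might look like this. The helper names, the region string, and the error message text are hypothetical samples, not output from a real provisioning run.

```python
from typing import List, Set

ERROR_MARKER = 'LambdaLabsError:'


def extract_errors(stdout: str, stderr: str,
                   marker: str = ERROR_MARKER) -> List[str]:
    """Collect stripped output lines that contain the error marker."""
    lines = stdout.split('\n') + stderr.split('\n')
    return [s.strip() for s in lines if marker in s]


def block_region_on_error(blocked_regions: Set[str], region_name: str,
                          stdout: str, stderr: str) -> List[str]:
    """Block the region if a recognized error is found, else raise."""
    errors = extract_errors(stdout, stderr)
    if not errors:
        # No recognized error: surface the failure instead of retrying.
        raise RuntimeError('Errors occurred during provision; '
                           'check logs above.')
    blocked_regions.add(region_name)
    return errors


blocked: Set[str] = set()
sample_stderr = ('Traceback (most recent call last):\n'
                 '  LambdaLabsError: example capacity message\n')
errors = block_region_on_error(blocked, 'us-east-1', '', sample_stderr)
```

Matching on a marker string keeps the provisioner decoupled from the node provider's exception types, at the cost of depending on how the error is printed.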
15 changes: 15 additions & 0 deletions sky/cli.py
@@ -710,6 +710,21 @@ def _launch_with_confirm(
             confirm_shown = True
             click.confirm(prompt, default=True, abort=True, show_default=True)

+    # Lambda Labs does not support autostop or multiple nodes.
+    # If task.resources is None, cannot be Lambda Labs.
+    if task.resources:
+        for resource in task.resources:
+            if resource.cloud.is_same_cloud(sky.Lambda()):
+                if not down and idle_minutes_to_autostop is not None:
+                    with ux_utils.print_exception_no_traceback():
+                        raise exceptions.NotSupportedError(
+                            ('Lambda Labs does not support stopping '
+                             'instances.'))
+                elif task.num_nodes > 1:
+                    with ux_utils.print_exception_no_traceback():
+                        raise exceptions.NotSupportedError(
+                            ('Lambda Labs does not support --num-nodes > 1.'))
+
     if node_type is not None:
         if maybe_status != global_user_state.ClusterStatus.UP:
             click.secho(f'Setting up interactive node {cluster}...',
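Reviewer-style note: the two guards in this hunk reject autostop without `--down` and multi-node launches when a resource targets Lambda. The sketch below restates them as a pure function so the accepted and rejected combinations are easy to see. `check_lambda_launch`, `rejected`, and the local `NotSupportedError` class are hypothetical stand-ins and do not import SkyPilot.

```python
from typing import Optional


class NotSupportedError(Exception):
    """Stand-in for sky.exceptions.NotSupportedError (illustration only)."""


def check_lambda_launch(down: bool, idle_minutes_to_autostop: Optional[int],
                        num_nodes: int) -> None:
    """Mirror the two guards from the hunk for a Lambda resource."""
    if not down and idle_minutes_to_autostop is not None:
        raise NotSupportedError(
            'Lambda Labs does not support stopping instances.')
    if num_nodes > 1:
        raise NotSupportedError(
            'Lambda Labs does not support --num-nodes > 1.')


def rejected(down: bool, idle_minutes_to_autostop: Optional[int],
             num_nodes: int) -> bool:
    """Return True if the launch options would be refused."""
    try:
        check_lambda_launch(down, idle_minutes_to_autostop, num_nodes)
    except NotSupportedError:
        return True
    return False
```

Under these guards, autostop combined with `down=True` (autodown) passes, which matches the earlier commit "Make autodown work on Lambda Labs".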
2 changes: 2 additions & 0 deletions sky/clouds/__init__.py
@@ -6,13 +6,15 @@
 from sky.clouds.aws import AWS
 from sky.clouds.azure import Azure
 from sky.clouds.gcp import GCP
+from sky.clouds.lambda_labs import Lambda
 from sky.clouds.local import Local

 __all__ = [
     'AWS',
     'Azure',
     'Cloud',
     'GCP',
+    'Lambda',
     'Local',
     'Region',
     'Zone',