[Core] Auto mapping for cluster name #2403

Merged: 71 commits, Aug 26, 2023
Changes from 45 commits

Commits (71)
b843b33
wip
Michaelvll Aug 11, 2023
e06c3b5
wip
Michaelvll Aug 12, 2023
575b327
working
Michaelvll Aug 12, 2023
7d7963a
explicitly update cluster ips
Michaelvll Aug 14, 2023
4edc463
optimize
Michaelvll Aug 14, 2023
7411f53
Fix comment
Michaelvll Aug 14, 2023
d65bc3c
Add comments
Michaelvll Aug 14, 2023
30d5e21
format
Michaelvll Aug 14, 2023
fb422cf
linter
Michaelvll Aug 14, 2023
15a6c8c
Automatically map the cluster name on the cloud to avoid conflict and…
Michaelvll Aug 14, 2023
ae7778d
short user hash in the cluster name
Michaelvll Aug 14, 2023
a8a1126
fix user hash in cluster_name
Michaelvll Aug 14, 2023
5d1e499
Fix head ip fetching in run_on_head
Michaelvll Aug 14, 2023
f52f35f
Merge branch 'optimize-head-ip' of github.com:skypilot-org/skypilot i…
Michaelvll Aug 14, 2023
8e68d44
fix head ip fetching
Michaelvll Aug 14, 2023
252b528
Merge branch 'optimize-head-ip' of github.com:skypilot-org/skypilot i…
Michaelvll Aug 14, 2023
6af9a6f
fix status refresh
Michaelvll Aug 14, 2023
35e5823
format
Michaelvll Aug 14, 2023
d450f78
Use cached external ips instead
Michaelvll Aug 14, 2023
2b64383
format
Michaelvll Aug 15, 2023
b35aec6
Update sky/backends/cloud_vm_ray_backend.py
Michaelvll Aug 15, 2023
c97f37d
Fix ports
Michaelvll Aug 15, 2023
0efbf65
Merge branch 'optimize-head-ip' of github.com:skypilot-org/skypilot i…
Michaelvll Aug 15, 2023
0e7ca13
use retry
Michaelvll Aug 15, 2023
1c19a82
format
Michaelvll Aug 15, 2023
5f2890e
typo
Michaelvll Aug 15, 2023
037c332
minor fix
Michaelvll Aug 15, 2023
d69507d
Merge branch 'master' of github.com:skypilot-org/skypilot into optimi…
Michaelvll Aug 15, 2023
947ee7b
Merge branch 'optimize-head-ip' of github.com:skypilot-org/skypilot i…
Michaelvll Aug 15, 2023
b700b54
Merge branch 'master' of github.com:skypilot-org/skypilot into auto-m…
Michaelvll Aug 18, 2023
bd8f5b6
Better naming
Michaelvll Aug 18, 2023
9f73d58
use base36 instead
Michaelvll Aug 18, 2023
6c84013
format
Michaelvll Aug 18, 2023
2baf0de
truncate the spot cluster name to avoid hiding job id
Michaelvll Aug 18, 2023
2af8e6f
Fix clone disk cluster name
Michaelvll Aug 18, 2023
fef2586
Fix tests
Michaelvll Aug 18, 2023
f961e66
Fix tests
Michaelvll Aug 18, 2023
2ab3721
Fix user hash
Michaelvll Aug 18, 2023
3737e95
use correct cluster name on cloud for clone
Michaelvll Aug 18, 2023
2a8c32d
fix clone disk
Michaelvll Aug 18, 2023
7a7073c
Merge branch 'master' of github.com:skypilot-org/skypilot into auto-m…
Michaelvll Aug 22, 2023
8ef7741
format
Michaelvll Aug 22, 2023
c3a0623
Address comments
Michaelvll Aug 22, 2023
193fc6c
Add tag for original cluster name
Michaelvll Aug 22, 2023
62bcef4
format
Michaelvll Aug 22, 2023
3fbf47d
Partially address the comments
Michaelvll Aug 23, 2023
8b7d52e
format
Michaelvll Aug 23, 2023
88207c7
partial fixes
Michaelvll Aug 23, 2023
342f72d
Merge branch 'master' of github.com:skypilot-org/skypilot into auto-m…
Michaelvll Aug 23, 2023
9a13119
fixes
Michaelvll Aug 23, 2023
9c050d0
Fix comments
Michaelvll Aug 23, 2023
1d005a2
better logging
Michaelvll Aug 23, 2023
c6b1166
format
Michaelvll Aug 23, 2023
c84c1ff
Add a readme
Michaelvll Aug 23, 2023
9d5aea0
fix test smoke
Michaelvll Aug 24, 2023
1570851
fix name for spot cancellation
Michaelvll Aug 24, 2023
2aa871b
Merge branch 'master' of github.com:skypilot-org/skypilot into auto-m…
Michaelvll Aug 24, 2023
9a4293a
format
Michaelvll Aug 24, 2023
5b56eee
remove cluster name
Michaelvll Aug 24, 2023
e027719
Update sky/design_docs/cluster_name.md
Michaelvll Aug 24, 2023
086e3e9
Update sky/design_docs/cluster_name.md
Michaelvll Aug 24, 2023
fa098be
add comments
Michaelvll Aug 24, 2023
d48b15d
rename to cluster_name_on_cloud for provision lib
Michaelvll Aug 24, 2023
1a00d07
format
Michaelvll Aug 24, 2023
8d4c87f
fix gcp ports cleaning
Michaelvll Aug 24, 2023
d783c09
fix autostop
Michaelvll Aug 24, 2023
bacb92e
further fix for ports of GCP
Michaelvll Aug 24, 2023
118965b
Merge branch 'master' of github.com:skypilot-org/skypilot into auto-m…
Michaelvll Aug 25, 2023
00d7250
Update sky/skylet/constants.py
Michaelvll Aug 25, 2023
57ff88f
update readme
Michaelvll Aug 25, 2023
2c7f159
update
Michaelvll Aug 25, 2023
32 changes: 23 additions & 9 deletions sky/backends/backend_utils.py
@@ -891,6 +891,7 @@ def write_cluster_config(
# task.best_resources may not be equal to to_provision if the user
# is running a job with less resources than the cluster has.
cloud = to_provision.cloud
assert cloud is not None, to_provision
# This can raise a ResourcesUnavailableError, when the region/zones
# requested does not appear in the catalog. It can be triggered when the
# user changed the catalog file, while there is a cluster in the removed
@@ -982,10 +983,13 @@ def write_cluster_config(
f'open(os.path.expanduser("{constants.SKY_REMOTE_RAY_PORT_FILE}"), "w"))\''
)

cluster_name_on_cloud = cloud.truncate_and_hash_cluster_name(cluster_name)

# Only using new security group names for clusters with ports specified.
default_aws_sg_name = f'sky-sg-{common_utils.user_and_hostname_hash()}'
if ports is not None:
default_aws_sg_name += f'-{common_utils.truncate_and_hash_cluster_name(cluster_name)}'
default_aws_sg_name = f'sky-sg-{cluster_name_on_cloud}'
Collaborator: Shall we keep the common_utils.user_and_hostname_hash() for the case when multiple users use the same cluster name?

Collaborator (author): The cluster_name_on_cloud already contains the user hash, so it should be fine not to include user_and_hostname_hash()?

Member: See above. This knowledge is leaked. Calling it make_cluster_name_user_specific() may mitigate it somewhat.

Collaborator (author): There is an argument that decides whether to add the user hash to the name. I guess it is OK to call it make_cluster_name_on_cloud, without mentioning the user hash in the name?
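
Note: to make the naming discussion above concrete, here is a minimal sketch of a truncate-and-append-user-hash helper in the spirit of make_cluster_name_on_cloud. The separator, suffix lengths, MD5-based disambiguation, and the placeholder user hash are illustrative assumptions; the actual logic lives in common_utils.truncate_and_hash_cluster_name and the Cloud classes further down in this diff.

import hashlib

def make_cluster_name_on_cloud(display_name: str,
                               max_length: int = 35,
                               add_user_hash: bool = True,
                               user_hash: str = 'ab12') -> str:
    # Sketch only: append a short per-user hash so two users with the same
    # display name do not collide on the cloud.
    suffix = f'-{user_hash}' if add_user_hash else ''
    name = display_name + suffix
    if len(name) <= max_length:
        return name
    # Too long: keep a readable prefix and disambiguate the truncation with a
    # short hash of the full display name.
    digest = hashlib.md5(display_name.encode()).hexdigest()[:4]
    prefix_len = max(1, max_length - len(suffix) - len(digest) - 1)
    return f'{display_name[:prefix_len]}-{digest}{suffix}'

# make_cluster_name_on_cloud('my-very-long-experiment-cluster-name')
# -> a 35-char name such as 'my-very-long-experiment-c-9f3a-ab12' (illustrative).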


# Use a tmp file path to avoid incomplete YAML file being re-used in the
# future.
tmp_yaml_path = yaml_path + '.tmp'
@@ -994,7 +998,8 @@
dict(
resources_vars,
**{
'cluster_name': cluster_name,
'cluster_name': cluster_name_on_cloud,
'user_cluster_name': cluster_name,
'num_nodes': num_nodes,
'ports': ports,
'disk_size': to_provision.disk_size,
@@ -1083,6 +1088,11 @@ def write_cluster_config(
with open(tmp_yaml_path, 'w') as f:
f.write(restored_yaml_content)

# Read the cluster name from the tmp yaml file, to take the backward
# compatibility restoration above into account.
yaml_config = common_utils.read_yaml(tmp_yaml_path)
config_dict['cluster_name_on_cloud'] = yaml_config['cluster_name']

# Optimization: copy the contents of source files in file_mounts to a
# special dir, and upload that as the only file_mount instead. Delay
# calling this optimization until now, when all source files have been
@@ -1809,7 +1819,7 @@ def _query_cluster_status_via_cloud_api(
exceptions.ClusterStatusFetchingError: the cluster status cannot be
fetched from the cloud provider.
"""
cluster_name = handle.cluster_name
cluster_name_on_cloud = handle.cluster_name_on_cloud
# Use region and zone from the cluster config, instead of the
# handle.launched_resources, because the latter may not be set
# correctly yet.
@@ -1827,26 +1837,30 @@
cloud_name = repr(handle.launched_resources.cloud)
try:
node_status_dict = provision_lib.query_instances(
Member: It may be pretty easy for new code/contributors to call provision_lib functions and pass cluster_name=cluster_name. That is hard to catch if we don't rename the args in the provision_lib interface. Is this too intrusive? Any ideas to mitigate it?

One possibility/bandage is to document this clearly at the top of provision_lib.

Member: Another potential way is to create a lightweight class and use it in the provision_lib interfaces:

class ClusterIdentity:  # or better name
    cluster_name: str
    cluster_name_on_cloud: str
    cloud: str

Collaborator (author, @Michaelvll), Aug 23, 2023: This is a great point! I could not think of a better way to mitigate this. I would personally prefer keeping cluster_name in provision_lib, and adding a comment at the top of provision_lib plus a design doc. I feel the mapping is backend-specific and may not need to be reflected in the provision APIs.

Another alternative (though more involved) is to rename handle.cluster_name to handle.display_name and rename handle.cluster_name_on_cloud to handle.cluster_name.

Collaborator (author): After offline discussion, we now changed provision_lib's APIs to take cluster_name_on_cloud as an argument.
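
Note: for reference, a slightly fleshed-out sketch of the lightweight class suggested above. This alternative was not adopted (the PR renamed the provision_lib arguments instead); the field names come from the comment, everything else is illustrative.

import dataclasses

@dataclasses.dataclass(frozen=True)
class ClusterIdentity:
    # Bundles the user-facing name with the name actually used on the cloud.
    cluster_name: str            # user-facing name, e.g. shown by `sky status`
    cluster_name_on_cloud: str   # truncated/hashed name used for cloud resources
    cloud: str                   # e.g. 'AWS', 'GCP'

# Hypothetical provision_lib-style signature using it:
# def query_instances(identity: ClusterIdentity, provider_config: dict) -> dict: ...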

cloud_name, cluster_name, provider_config)
logger.debug(f'Querying {cloud_name} cluster {cluster_name!r} '
cloud_name, cluster_name_on_cloud, provider_config)
logger.debug(f'Querying {cloud_name} cluster '
f'{cluster_name_on_cloud!r} '
f'status:\n{pprint.pformat(node_status_dict)}')
node_statuses = list(node_status_dict.values())
except Exception as e: # pylint: disable=broad-except
with ux_utils.print_exception_no_traceback():
raise exceptions.ClusterStatusFetchingError(
f'Failed to query {cloud_name} cluster {cluster_name!r} '
f'status: {e}')
f'Failed to query {cloud_name} cluster {cluster_name_on_cloud!r} '
f'status: {common_utils.format_exception(e, use_bracket=True)}'
)
else:
node_statuses = handle.launched_resources.cloud.query_status(
cluster_name, tag_filter_for_cluster(cluster_name), region, zone,
cluster_name_on_cloud,
tag_filter_for_cluster(cluster_name_on_cloud), region, zone,
**kwargs)
# GCP does not clean up preempted TPU VMs. We remove it ourselves.
# TODO(wei-lin): handle multi-node cases.
# TODO(zhwu): this should be moved into the GCP class, after we refactor
# the cluster termination, as the preempted TPU VM should always be
# removed.
if kwargs.get('use_tpu_vm', False) and len(node_statuses) == 0:
logger.debug(f'Terminating preempted TPU VM cluster {cluster_name}')
logger.debug(
f'Terminating preempted TPU VM cluster {cluster_name_on_cloud}')
backend = backends.CloudVmRayBackend()
# Do not use refresh cluster status during teardown, as that will
# cause infinite recursion by calling cluster status refresh
55 changes: 42 additions & 13 deletions sky/backends/cloud_vm_ray_backend.py
@@ -45,8 +45,6 @@
from sky.skylet import constants
from sky.skylet import job_lib
from sky.skylet import log_lib
from sky.skylet.providers.scp.node_provider import SCPError
from sky.skylet.providers.scp.node_provider import SCPNodeProvider
from sky.usage import usage_lib
from sky.utils import command_runner
from sky.utils import common_utils
@@ -1533,6 +1531,11 @@ def _retry_zones(
# means a second 'sky launch -c <name>' will attempt to reuse.
handle = CloudVmRayResourceHandle(
cluster_name=cluster_name,
# Backward compatibility will be guaranteed by the underlying
# backend_utils.write_cluster_config, which gets the cluster
# name on cloud from the ray yaml file, if the previous cluster
# exists.
cluster_name_on_cloud=config_dict['cluster_name_on_cloud'],
cluster_yaml=cluster_config_file,
launched_nodes=num_nodes,
# OK for this to be shown in CLI as status == INIT.
@@ -2179,9 +2182,14 @@ def provision_with_retries(


class CloudVmRayResourceHandle(backends.backend.ResourceHandle):
"""A pickle-able tuple of:
"""A pickle-able handle for the cluster created by CloudVmRayBackend.

The handle object will last for the whole lifecycle of the cluster.

- (required) Cluster name.
- (required) Cluster name on cloud (different from the cluster name, as we
append the user hash to avoid conflicts across multiple accounts in the
same organization, and truncate the name to meet cloud length limits).
- (required) Path to a cluster.yaml file.
- (optional) A cached head node public IP. Filled in after a
successful provision().
@@ -2192,11 +2200,12 @@ class CloudVmRayResourceHandle(backends.backend.ResourceHandle):
- (optional) Docker user name
- (optional) If TPU(s) are managed, a path to a deletion script.
"""
_VERSION = 5
_VERSION = 6

def __init__(self,
*,
cluster_name: str,
cluster_name_on_cloud: str,
cluster_yaml: str,
launched_nodes: int,
launched_resources: resources_lib.Resources,
@@ -2207,6 +2216,9 @@ def __init__(self,
tpu_delete_script: Optional[str] = None) -> None:
self._version = self._VERSION
self.cluster_name = cluster_name
# self._cluster_name_on_cloud will only be None for clusters created
# before #2403.
self._cluster_name_on_cloud: Optional[str] = cluster_name_on_cloud
self._cluster_yaml = cluster_yaml.replace(os.path.expanduser('~'), '~',
1)
# List of (internal_ip, external_ip) tuples for all the nodes
@@ -2223,6 +2235,7 @@ def __repr__(self):
def __repr__(self):
return (f'ResourceHandle('
f'\n\tcluster_name={self.cluster_name},'
f'\n\tcluster_name_on_cloud={self.cluster_name_on_cloud},'
f'\n\thead_ip={self.head_ip},'
'\n\tstable_internal_external_ips='
f'{self.stable_internal_external_ips},'
@@ -2239,6 +2252,12 @@ def __repr__(self):
def get_cluster_name(self):
return self.cluster_name

@property
def cluster_name_on_cloud(self):
if self._cluster_name_on_cloud is None:
return self.cluster_name
return self._cluster_name_on_cloud
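
Note: a small self-contained illustration of the fallback encoded by this property together with the __setstate__ change below; handles pickled before this PR unpickle with _cluster_name_on_cloud set to None and therefore keep using their original name on the cloud. The names here are hypothetical.

# Pre-#2403 handle: version < 6, so __setstate__ fills the new field with None.
legacy_state = {'cluster_name': 'my-cluster', '_cluster_name_on_cloud': None}
# Handle created after this PR, with a truncated/hashed name (hypothetical).
new_state = {'cluster_name': 'my-cluster',
             '_cluster_name_on_cloud': 'my-cluster-2ea4'}

def name_on_cloud(state):
    # Mirrors the property above.
    if state['_cluster_name_on_cloud'] is None:
        return state['cluster_name']
    return state['_cluster_name_on_cloud']

assert name_on_cloud(legacy_state) == 'my-cluster'
assert name_on_cloud(new_state) == 'my-cluster-2ea4'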

def _maybe_make_local_handle(self):
"""Adds local handle for the local cloud case.

@@ -2518,6 +2537,9 @@ def __setstate__(self, state):
if version < 5:
state['docker_user'] = None

if version < 6:
state['_cluster_name_on_cloud'] = None

self.__dict__.update(state)

# Because the update_cluster_ips and update_ssh_ports
@@ -3661,6 +3683,7 @@ def teardown_no_lock(self,
cloud = handle.launched_resources.cloud
config = common_utils.read_yaml(handle.cluster_yaml)
cluster_name = handle.cluster_name
cluster_name_on_cloud = handle.cluster_name_on_cloud

# Avoid possibly unbound warnings. Code below must overwrite these vars:
returncode = 0
Expand Down Expand Up @@ -3691,7 +3714,7 @@ def teardown_no_lock(self,
operation_fn = provision_lib.terminate_instances
try:
operation_fn(repr(cloud),
cluster_name,
cluster_name_on_cloud,
provider_config=config['provider'])
except Exception as e: # pylint: disable=broad-except
if purge:
@@ -3730,12 +3753,12 @@ def teardown_no_lock(self,
config_provider = common_utils.read_yaml(
handle.cluster_yaml)['provider']
region = config_provider['region']
cluster_name = handle.cluster_name
search_client = ibm.search_client()
vpc_found = False
# pylint: disable=unsubscriptable-object
vpcs_filtered_by_tags_and_region = search_client.search(
query=f'type:vpc AND tags:{cluster_name} AND region:{region}',
query=(f'type:vpc AND tags:{cluster_name_on_cloud} '
f'AND region:{region}'),
fields=['tags', 'region', 'type'],
limit=1000).get_result()['items']
vpc_id = None
@@ -3752,14 +3775,19 @@ def teardown_no_lock(self,
# pylint: disable=line-too-long E1136
# Delete VPC and it's associated resources
vpc_provider = IBMVPCProvider(
config_provider['resource_group_id'], region, cluster_name)
config_provider['resource_group_id'], region,
cluster_name_on_cloud)
vpc_provider.delete_vpc(vpc_id, region)
# successfully removed cluster as no exception was raised
returncode = 0

elif terminate and isinstance(cloud, clouds.SCP):
# pylint: disable=import-outside-toplevel
from sky.skylet.providers.scp.node_provider import SCPError
from sky.skylet.providers.scp.node_provider import SCPNodeProvider
config['provider']['cache_stopped_nodes'] = not terminate
provider = SCPNodeProvider(config['provider'], handle.cluster_name)
provider = SCPNodeProvider(config['provider'],
cluster_name_on_cloud)
try:
if not os.path.exists(provider.metadata.path):
raise SCPError('SKYPILOT_ERROR_NO_NODES_LAUNCHED: '
@@ -3791,7 +3819,7 @@ def teardown_no_lock(self,

# 0: All terminated successfully, failed count otherwise
returncode = oci_query_helper.terminate_instances_by_tags(
{TAG_RAY_CLUSTER_NAME: cluster_name}, region)
{TAG_RAY_CLUSTER_NAME: cluster_name_on_cloud}, region)

# To avoid undefined local variables error.
stdout = stderr = ''
@@ -3846,7 +3874,7 @@ def teardown_no_lock(self,
raise RuntimeError(
_TEARDOWN_FAILURE_MESSAGE.format(
extra_reason='',
cluster_name=handle.cluster_name,
cluster_name=cluster_name_on_cloud,
stdout=stdout,
stderr=stderr))

@@ -3874,6 +3902,7 @@ def post_teardown_cleanup(self,
log_path = os.path.join(os.path.expanduser(self.log_dir),
'teardown.log')
log_abs_path = os.path.abspath(log_path)
cluster_name_on_cloud = handle.cluster_name_on_cloud

if (handle.tpu_delete_script is not None and
os.path.exists(handle.tpu_delete_script)):
@@ -3896,7 +3925,7 @@ def post_teardown_cleanup(self,
raise RuntimeError(
_TEARDOWN_FAILURE_MESSAGE.format(
extra_reason='It is caused by TPU failure.',
cluster_name=handle.cluster_name,
cluster_name=cluster_name_on_cloud,
stdout=tpu_stdout,
stderr=tpu_stderr))
if (terminate and handle.launched_resources.is_image_managed is True):
@@ -3926,7 +3955,7 @@ def post_teardown_cleanup(self,
# our sky node provider.
# TODO(tian): Adding a no-op cleanup_ports API after #2286
# merged.
provision_lib.cleanup_ports(repr(cloud), handle.cluster_name,
provision_lib.cleanup_ports(repr(cloud), cluster_name_on_cloud,
config['provider'])

# The cluster file must exist because the cluster_yaml will only
13 changes: 7 additions & 6 deletions sky/clouds/aws.py
@@ -768,20 +768,21 @@ def query_status(cls, name: str, tag_filters: Dict[str, str],

@classmethod
def create_image_from_cluster(cls, cluster_name: str,
tag_filters: Dict[str,
str], region: Optional[str],
cluster_name_on_cloud: str,
region: Optional[str],
zone: Optional[str]) -> str:
assert region is not None, (tag_filters, region)
del tag_filters, zone # unused
assert region is not None, (cluster_name, cluster_name_on_cloud, region)
del zone # unused

image_name = f'skypilot-{cluster_name}-{int(time.time())}'

status = provision_lib.query_instances('AWS', cluster_name,
status = provision_lib.query_instances('AWS', cluster_name_on_cloud,
{'region': region})
instance_ids = list(status.keys())
if not instance_ids:
with ux_utils.print_exception_no_traceback():
raise RuntimeError('Failed to find the source cluster on AWS.')
raise RuntimeError(
f'Failed to find the source cluster {cluster_name} on AWS.')

if len(instance_ids) != 1:
with ux_utils.print_exception_no_traceback():
22 changes: 10 additions & 12 deletions sky/clouds/cloud.py
@@ -8,6 +8,7 @@
from sky import exceptions
from sky import skypilot_config
from sky.clouds import service_catalog
from sky.utils import common_utils
from sky.utils import log_utils
from sky.utils import ux_utils

@@ -84,6 +85,12 @@ def _max_cluster_name_length(cls) -> Optional[int]:
"""
return None

@classmethod
def truncate_and_hash_cluster_name(cls, cluster_name: str) -> str:
"""Truncates/hashes the cluster name to avoid exceeding the limit."""
return common_utils.truncate_and_hash_cluster_name(
cluster_name, max_length=cls._max_cluster_name_length())
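
Note: a usage sketch of how a cloud with a strict name-length limit plugs into this classmethod; the per-cloud hook supplies the limit and the shared helper does the truncation, so check_cluster_name_is_valid (below) no longer has to reject long names. The subclass and limit value are illustrative, assuming Cloud can be subclassed directly like this.

from sky.clouds.cloud import Cloud

class ToyCloud(Cloud):
    _REPR = 'ToyCloud'

    @classmethod
    def _max_cluster_name_length(cls):
        return 24  # hypothetical provider limit; None would mean no limit

# ToyCloud.truncate_and_hash_cluster_name('a-very-long-user-chosen-cluster-name')
# returns a name capped at 24 characters; the exact format is decided by
# common_utils.truncate_and_hash_cluster_name.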

#### Regions/Zones ####

@classmethod
@@ -478,7 +485,6 @@ def check_cluster_name_is_valid(cls, cluster_name: str) -> None:
"""
if cluster_name is None:
return
max_cluster_name_len_limit = cls._max_cluster_name_length()
valid_regex = '[a-z]([-a-z0-9]*[a-z0-9])?'
if re.fullmatch(valid_regex, cluster_name) is None:
with ux_utils.print_exception_no_traceback():
@@ -487,14 +493,6 @@
'ensure it is fully matched by regex (e.g., '
'only contains lower letters, numbers and dash): '
f'{valid_regex}')
if (max_cluster_name_len_limit is not None and
len(cluster_name) > max_cluster_name_len_limit):
cloud_name = '' if cls is Cloud else f' on {cls._REPR}'
with ux_utils.print_exception_no_traceback():
raise exceptions.InvalidClusterNameError(
f'Cluster name {cluster_name!r} has {len(cluster_name)} '
'chars; maximum length is '
f'{max_cluster_name_len_limit} chars{cloud_name}.')

@classmethod
def check_disk_tier_enabled(cls, instance_type: str,
Expand Down Expand Up @@ -636,8 +634,8 @@ def query_status(cls, name: str, tag_filters: Dict[str, str],

@classmethod
def create_image_from_cluster(cls, cluster_name: str,
tag_filters: Dict[str,
str], region: Optional[str],
cluster_name_on_cloud: str,
region: Optional[str],
zone: Optional[str]) -> str:
"""Creates an image from the cluster.

@@ -646,7 +644,7 @@ def create_image_from_cluster(cls, cluster_name: str,
raise NotImplementedError

@classmethod
def maybe_move_image(cls, image_name: str, source_region: str,
def maybe_move_image(cls, image_id: str, source_region: str,
target_region: str, source_zone: Optional[str],
target_zone: Optional[str]) -> str:
"""Move an image if required.
9 changes: 7 additions & 2 deletions sky/clouds/gcp.py
@@ -1105,11 +1105,16 @@ def query_status(cls, name: str, tag_filters: Dict[str, str],

@classmethod
def create_image_from_cluster(cls, cluster_name: str,
tag_filters: Dict[str,
str], region: Optional[str],
cluster_name_on_cloud: str,
region: Optional[str],
zone: Optional[str]) -> str:
del region # unused
assert zone is not None
# TODO(zhwu): This assumes the cluster is created with the
# `ray-cluster-name` tag, which is guaranteed by the current `ray`
# backend. Once the `provision.query_instances` is implemented for GCP,
# we should be able to get rid of this assumption.
tag_filters = {'ray-cluster-name': cluster_name_on_cloud}
Collaborator: Use backend_utils.tag_filter_for_cluster?

Collaborator (author): I would prefer to keep it here. We will soon replace this with provision_lib, and backend_utils depends on the cloud module, so importing it here sounds like overkill.
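
Note: for readers less familiar with the GCP path, one plausible way such a tag/label filter could be composed into the gcloud --filter expression used just below. The real _label_filter_str helper may build the string differently, and the cluster name is hypothetical.

tag_filters = {'ray-cluster-name': 'sky-2ea4-ab12'}  # hypothetical name on cloud
label_filter_str = ' AND '.join(
    f'labels.{key}={value}' for key, value in tag_filters.items())
# -> 'labels.ray-cluster-name=sky-2ea4-ab12', interpolated into
#    `gcloud compute instances list --filter="(...)"` as in the diff below.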

label_filter_str = cls._label_filter_str(tag_filters)
instance_name_cmd = ('gcloud compute instances list '
f'--filter="({label_filter_str})" '