# [Core] Auto mapping for cluster name (#2403)

Changes from 54 commits.
```diff
@@ -891,6 +891,7 @@ def write_cluster_config(
     # task.best_resources may not be equal to to_provision if the user
     # is running a job with less resources than the cluster has.
     cloud = to_provision.cloud
+    assert cloud is not None, to_provision
     # This can raise a ResourcesUnavailableError, when the region/zones
     # requested does not appear in the catalog. It can be triggered when the
     # user changed the catalog file, while there is a cluster in the removed
```
```diff
@@ -982,10 +983,14 @@ def write_cluster_config(
         f'open(os.path.expanduser("{constants.SKY_REMOTE_RAY_PORT_FILE}"), "w"))\''
     )

+    cluster_name_on_cloud = common_utils.make_cluster_name_on_cloud(
+        cluster_name, max_length=cloud.max_cluster_name_length())
+
-    # Only using new security group names for clusters with ports specified.
-    default_aws_sg_name = f'sky-sg-{common_utils.user_and_hostname_hash()}'
-    if ports is not None:
-        default_aws_sg_name += f'-{common_utils.truncate_and_hash_cluster_name(cluster_name)}'
+    default_aws_sg_name = f'sky-sg-{cluster_name_on_cloud}'

     # Use a tmp file path to avoid incomplete YAML file being re-used in the
     # future.
     tmp_yaml_path = yaml_path + '.tmp'
```
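The hunk above delegates the name mapping to `common_utils.make_cluster_name_on_cloud`, whose body is not shown in this PR excerpt. A minimal sketch of the truncate-and-hash idea it relies on (the hashing scheme and prefix policy here are illustrative assumptions, not SkyPilot's exact implementation, which also handles user hashes and per-cloud character restrictions):

```python
import hashlib


def make_cluster_name_on_cloud(cluster_name: str, max_length: int) -> str:
    """Fit a cluster name within a cloud provider's length limit.

    Illustrative sketch: names short enough pass through unchanged;
    longer names are truncated and disambiguated with a hash suffix.
    """
    if len(cluster_name) <= max_length:
        return cluster_name
    # Append a short, deterministic hash so two long names that share a
    # prefix do not collide after truncation.
    digest = hashlib.md5(cluster_name.encode()).hexdigest()[:4]
    prefix_len = max_length - len(digest) - 1
    return f'{cluster_name[:prefix_len]}-{digest}'
```

Because the hash is derived from the full original name, the mapping is stable across invocations, which matters when the on-cloud name is later used to query or tear down instances.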
```diff
@@ -994,7 +999,7 @@ def write_cluster_config(
         dict(
             resources_vars,
             **{
-                'cluster_name': cluster_name,
+                'cluster_name_on_cloud': cluster_name_on_cloud,
                 'num_nodes': num_nodes,
                 'ports': ports,
                 'disk_size': to_provision.disk_size,
```
```diff
@@ -1083,6 +1088,12 @@ def write_cluster_config(
     with open(tmp_yaml_path, 'w') as f:
         f.write(restored_yaml_content)

+    # Read the cluster name from the tmp yaml file, to take the backward
+    # compatibility restoration above into account.
+    # TODO: remove this after 2 minor releases, 0.5.0.
+    yaml_config = common_utils.read_yaml(tmp_yaml_path)
+    config_dict['cluster_name_on_cloud'] = yaml_config['cluster_name']
+
     # Optimization: copy the contents of source files in file_mounts to a
     # special dir, and upload that as the only file_mount instead. Delay
     # calling this optimization until now, when all source files have been
```
```diff
@@ -1809,7 +1820,9 @@ def _query_cluster_status_via_cloud_api(
         exceptions.ClusterStatusFetchingError: the cluster status cannot be
             fetched from the cloud provider.
     """
-    cluster_name = handle.cluster_name
+    cluster_name_on_cloud = handle.cluster_name_on_cloud
+    cluster_name_in_hint = common_utils.cluster_name_in_hint(
+        handle.cluster_name, cluster_name_on_cloud)
     # Use region and zone from the cluster config, instead of the
     # handle.launched_resources, because the latter may not be set
     # correctly yet.
```
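The hunk above introduces `cluster_name_in_hint`, used so that log and error messages remain traceable when the on-cloud name diverges from the user-facing one. A hedged sketch of what such a helper could look like (the exact formatting in SkyPilot may differ):

```python
def cluster_name_in_hint(display_name: str, name_on_cloud: str) -> str:
    """Format a cluster name for user-facing messages.

    Hypothetical helper mirroring the PR: when the on-cloud name was
    truncated or hashed, show both names so the user can correlate
    log output with resources visible in the cloud console.
    """
    if name_on_cloud == display_name:
        return repr(display_name)
    return f'{display_name!r} (name on cloud: {name_on_cloud!r})'
```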
```diff
@@ -1827,26 +1840,31 @@ def _query_cluster_status_via_cloud_api(
         cloud_name = repr(handle.launched_resources.cloud)
         try:
             node_status_dict = provision_lib.query_instances(
-                cloud_name, cluster_name, provider_config)
-            logger.debug(f'Querying {cloud_name} cluster {cluster_name!r} '
+                cloud_name, cluster_name_on_cloud, provider_config)
+            logger.debug(f'Querying {cloud_name} cluster '
+                         f'{cluster_name_in_hint} '
                          f'status:\n{pprint.pformat(node_status_dict)}')
             node_statuses = list(node_status_dict.values())
         except Exception as e:  # pylint: disable=broad-except
             with ux_utils.print_exception_no_traceback():
                 raise exceptions.ClusterStatusFetchingError(
-                    f'Failed to query {cloud_name} cluster {cluster_name!r} '
-                    f'status: {e}')
+                    f'Failed to query {cloud_name} cluster '
+                    f'{cluster_name_in_hint} '
+                    f'status: {common_utils.format_exception(e, use_bracket=True)}'
+                )
     else:
         node_statuses = handle.launched_resources.cloud.query_status(
-            cluster_name, tag_filter_for_cluster(cluster_name), region, zone,
+            cluster_name_on_cloud,
+            tag_filter_for_cluster(cluster_name_on_cloud), region, zone,
             **kwargs)
     # GCP does not clean up preempted TPU VMs. We remove it ourselves.
     # TODO(wei-lin): handle multi-node cases.
     # TODO(zhwu): this should be moved into the GCP class, after we refactor
     # the cluster termination, as the preempted TPU VM should always be
     # removed.
     if kwargs.get('use_tpu_vm', False) and len(node_statuses) == 0:
-        logger.debug(f'Terminating preempted TPU VM cluster {cluster_name}')
+        logger.debug(
+            f'Terminating preempted TPU VM cluster {cluster_name_in_hint}')
         backend = backends.CloudVmRayBackend()
         # Do not use refresh cluster status during teardown, as that will
         # cause infinite recursion by calling cluster status refresh
```

Review thread on the `provision_lib.query_instances` call:

- It may be pretty easy for new code/contributors to call `provision_lib` funcs and pass […]. One possibility/bandage is to document this clearly at the top of `provision_lib`.
- Another potential way is to create a lightweight class and use it in `provision_lib` interfaces: […]
- This is a great point! I could not think of a better way to mitigate this. I would personally prefer keeping the […]. Another alternative (though more involved) is to rename the […].
- After offline discussion, we now changed the […].
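The review thread above suggests a lightweight class for `provision_lib` interfaces, so callers cannot accidentally pass the user-facing name where the on-cloud name is expected. A sketch of that suggestion (class and field names are assumptions, not the PR's final API):

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class ClusterName:
    """Pairs the user-facing cluster name with its on-cloud counterpart.

    Sketch of the reviewer's suggestion: by requiring this type in
    provision interfaces, a bare string can no longer be passed
    ambiguously, and each call site picks the field it needs explicitly.
    """
    display_name: str   # The name the user sees in CLI output and logs.
    name_on_cloud: str  # The (possibly truncated/hashed) provider-side name.

    def __str__(self) -> str:
        # Default to the user-facing name when formatted into messages.
        return self.display_name
```

A provision function would then take `cluster_name: ClusterName` and use `cluster_name.name_on_cloud` for cloud API calls while logging `str(cluster_name)`.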
Review thread on the AWS security group name change:

- Shall we keep the `common_utils.user_and_hostname_hash()` for the case when multiple users use the same cluster name?
- The `cluster_name_on_cloud` already contains the user hash, so it should be fine to not have the `user_and_hostname_hash()`?
- See above. This knowledge is leaked. Calling it `make_cluster_name_user_specific()` may mitigate it somewhat.
- There is an argument that decides whether to add the user hash to the name. I guess it is ok to call it `make_cluster_name_on_cloud`, without mentioning the user hash in the name?
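The thread above mentions an argument on the helper that decides whether the user hash is appended. A hedged sketch of that shape (the flag name, hash source, and hash length are assumptions for illustration):

```python
import hashlib
import os


def make_cluster_name_on_cloud(cluster_name: str,
                               add_user_hash: bool = True) -> str:
    """Optionally append a per-user hash to keep names unique across users.

    Hypothetical signature sketching the discussion: the user hash is
    derived here from the local username purely for illustration.
    """
    if not add_user_hash:
        return cluster_name
    user = os.environ.get('USER', 'unknown')
    user_hash = hashlib.md5(user.encode()).hexdigest()[:4]
    return f'{cluster_name}-{user_hash}'
```

With this shape, two users who both launch a cluster named `dev` get distinct on-cloud names, while resources that must be shared can opt out via `add_user_hash=False`.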