[Failover] Fix leakage of existing cluster when failed to start #1497

Merged: 14 commits merged on Dec 9, 2022
49 changes: 30 additions & 19 deletions sky/backends/cloud_vm_ray_backend.py
@@ -887,8 +887,8 @@ def _yield_region_zones(self, to_provision: resources_lib.Resources,
            if cluster_status == global_user_state.ClusterStatus.STOPPED:
                message = (
                    'Failed to acquire resources to restart the stopped '
-                   f'cluster {cluster_name} on {region}. Please retry again '
-                   'later.')
+                   f'cluster {cluster_name} on {region.name}. Please retry '
+                   'again later.')

                # Reset to STOPPED (rather than keeping it at INIT), because
                # (1) the cluster is not up (2) it ensures future `sky start`
@@ -1004,6 +1004,10 @@ def _retry_region_zones(self,
        # Get previous cluster status
        prev_cluster_status = backend_utils.refresh_cluster_status_handle(
            cluster_name, acquire_per_cluster_status_lock=False)[0]
+       prev_cluster_exists = prev_cluster_status in [
Member
Q: is it possible that a cluster can partially exist and be in INIT? E.g., launched 2 nodes, manually terminate 1, then after a refresh it's in INIT?

How does this differ from the cluster_exists arg?

Collaborator Author
@Michaelvll Dec 8, 2022

I think this follows the logic in the previous code, where any cluster in INIT status is allowed to fail over to other regions.

However, after thinking more about this, here are the pros and cons of adding INIT to the list as well:

Environment

A cluster exists and is in the INIT state.

sky start -c cluster

Current behavior:

  1. Try to launch the cluster with the same spec.
  2. If ray up fails, terminate the cluster.
  3. Relaunch with the same resources as the existing cluster (same region/zone/accelerators).

New behavior (adding INIT to the list, or using cluster_exists directly):

  1. Try to launch the cluster with the same spec.
  2. If ray up fails, try to stop the cluster.
  3. Do not fail over; print that the resources could not be launched.

sky launch -c cluster

Current behavior:

  1. Try to launch the cluster with the same spec.
  2. If ray up fails, terminate the cluster.
  3. Relaunch and fail over with empty resources, i.e. a CPU instance, no matter what the previous cluster had.

New behavior (adding INIT to the list):

  1. Try to launch the cluster with the same spec.
  2. If ray up fails, try to stop the cluster.
  3. Do not fail over; print that the resources could not be launched.

sky launch -c cluster task.yaml

Current behavior:

  1. Try to launch the cluster with the same spec.
  2. If ray up fails, terminate the cluster.
  3. Relaunch and fail over with the task.resources.

New behavior (adding INIT to the list):

  1. Try to launch the cluster with the same spec.
  2. If ray up fails, try to stop the cluster.
  3. Do not fail over; print that the resources could not be launched.

Pro for adding INIT: the behavior is more consistent across the three commands.
Con for adding INIT: any INIT cluster in the status table must be explicitly sky down'ed before failover can work.
For example, if a user Ctrl+C's the sky launch during failover, she will have to run sky down before sky launch can fail over to other regions/clouds.

Based on the discussion above, I would prefer to add INIT to the list as well, to make the failover more conservative and avoid accidentally terminating the user's cluster.

Wdyt?
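For concreteness, a minimal sketch of the gating being discussed, with INIT included. The enum and the failure handler below are simplified stand-ins for global_user_state.ClusterStatus and the provisioner's failure path; this illustrates the proposal, not the PR's actual diff.

```python
from enum import Enum


class ClusterStatus(Enum):
    # Simplified stand-in for global_user_state.ClusterStatus.
    INIT = 'INIT'
    STOPPED = 'STOPPED'
    UP = 'UP'


def on_provision_failure(prev_cluster_status):
    """Decide what to do after `ray up` fails for a cluster name.

    Proposal: treat INIT like STOPPED/UP, so a pre-existing (even
    partially provisioned) cluster is stopped rather than terminated,
    and no failover to other regions/clouds is attempted.
    """
    prev_cluster_exists = prev_cluster_status in (
        ClusterStatus.INIT,
        ClusterStatus.STOPPED,
        ClusterStatus.UP,
    )
    if prev_cluster_exists:
        # Stop the cluster and surface the failure to the user.
        return {'terminate': False, 'failover': False}
    # No previous cluster: safe to terminate the half-created resources
    # and retry in other regions/clouds.
    return {'terminate': True, 'failover': True}
```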

Member
Agreed with going with the new behavior, with some asterisks to discuss in the future.

For example, it's a bit unintuitive to me that a start/launch may transition INIT to STOPPED. I see why that may be desired, but it may make the state transitions more complex. TBD.

+           global_user_state.ClusterStatus.STOPPED,
+           global_user_state.ClusterStatus.UP
+       ]

        self._clear_blocklist()
        for region, zones in self._yield_region_zones(to_provision,
@@ -1013,18 +1017,24 @@ def _retry_region_zones(self,
                continue
            zone_str = ','.join(
                z.name for z in zones) if zones is not None else 'all zones'
-           config_dict = backend_utils.write_cluster_config(
-               to_provision,
-               num_nodes,
-               _get_cluster_config_template(to_provision.cloud),
-               cluster_name,
-               self._local_wheel_path,
-               self._wheel_hash,
-               region=region,
-               zones=zones,
-               dryrun=dryrun,
-               keep_launch_fields_in_existing_config=prev_cluster_status
-               is not None)
+           try:
+               config_dict = backend_utils.write_cluster_config(
+                   to_provision,
+                   num_nodes,
+                   _get_cluster_config_template(to_provision.cloud),
+                   cluster_name,
+                   self._local_wheel_path,
+                   self._wheel_hash,
+                   region=region,
+                   zones=zones,
+                   dryrun=dryrun,
+                   keep_launch_fields_in_existing_config=prev_cluster_status
+                   is not None)
+           except exceptions.ResourcesUnavailableError as e:
+               # Failed due to catalog issue, e.g. image not found.
+               logger.info(
+                   f'Failed to find catalog in region {region.name}: {e}')
+               continue
            if dryrun:
                return
            cluster_config_file = config_dict['ray']
@@ -1104,10 +1114,7 @@ def _retry_region_zones(self,
            # FIXME(zongheng): terminating a potentially live cluster is
            # scary. Say: users have an existing cluster that got into INIT, do
            # sky launch, somehow failed, then we may be terminating it here.
-           need_terminate = prev_cluster_status not in [
-               global_user_state.ClusterStatus.STOPPED,
-               global_user_state.ClusterStatus.UP
-           ]
+           need_terminate = not prev_cluster_exists
            if status == self.GangSchedulingStatus.HEAD_FAILED:
                # ray up failed for the head node.
                self._update_blocklist_on_error(to_provision.cloud, region,
@@ -1145,7 +1152,11 @@ def _retry_region_zones(self,
        message = ('Failed to acquire resources in all regions/zones of '
                   f'{to_provision.cloud}. '
                   'Try changing resource requirements or use another cloud.')
-       raise exceptions.ResourcesUnavailableError(message)
+       # Do not failover to other clouds if the cluster was previously
+       # UP or STOPPED.
+       e = exceptions.ResourcesUnavailableError(
+           message, no_failover=prev_cluster_exists)
+       return e

    def _tpu_pod_setup(self, cluster_yaml: str,
                       cluster_handle: 'backends.Backend.ResourceHandle'):
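For context, a hedged sketch of how the no_failover flag introduced in the last hunk could be consumed by a retry loop higher up the stack. Only the ResourcesUnavailableError name and its no_failover attribute come from the diff above; the class body and the loop below are simplified, hypothetical stand-ins rather than SkyPilot's actual provisioning code path.

```python
# Hypothetical consumer of the no_failover flag (illustration only).
class ResourcesUnavailableError(Exception):

    def __init__(self, message: str, no_failover: bool = False) -> None:
        super().__init__(message)
        self.no_failover = no_failover


def provision_with_failover(candidate_clouds, try_provision):
    """Try each cloud in turn.

    `try_provision` returns None on success, or a ResourcesUnavailableError
    describing why provisioning failed in that cloud.
    """
    for cloud in candidate_clouds:
        error = try_provision(cloud)
        if error is None:
            return cloud
        if error.no_failover:
            # The cluster previously existed (e.g. UP or STOPPED), so do
            # not touch other clouds; surface the error to the user.
            raise error
    raise ResourcesUnavailableError(
        'Failed to provision in all candidate clouds.')
```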