
[Failover] Fix leakage of existing cluster when failed to start #1497

Merged: 14 commits into master on Dec 9, 2022

Conversation

@Michaelvll (Collaborator) commented Dec 7, 2022

Closes #1496.

Tested:

@Michaelvll Michaelvll added the P0 label Dec 7, 2022
@concretevitamin (Member) left a comment:

Thanks @Michaelvll; some questions.

sky/backends/cloud_vm_ray_backend.py:
@@ -1004,6 +1004,10 @@ def _retry_region_zones(self,
# Get previous cluster status
prev_cluster_status = backend_utils.refresh_cluster_status_handle(
cluster_name, acquire_per_cluster_status_lock=False)[0]
prev_cluster_exists = prev_cluster_status in [
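The truncated check above gates failover on whether a previous cluster exists. A minimal sketch of the idea (the `ClusterStatus` values and helper name are illustrative, not SkyPilot's exact API; per the discussion below, `INIT` ends up included in the list):

```python
from enum import Enum
from typing import Optional


class ClusterStatus(Enum):
    INIT = 'INIT'
    UP = 'UP'
    STOPPED = 'STOPPED'


# Statuses under which a previous cluster is considered to exist; failover
# must not terminate a cluster in any of these states.
_EXISTING_STATUSES = (ClusterStatus.INIT, ClusterStatus.UP,
                      ClusterStatus.STOPPED)


def cluster_exists(prev_cluster_status: Optional[ClusterStatus]) -> bool:
    """Return True if a previous cluster exists and should be preserved."""
    return prev_cluster_status in _EXISTING_STATUSES
```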
@concretevitamin (Member) commented:

Q: is it possible that a cluster can partially exist and be in INIT? E.g., launched 2 nodes, manually terminate 1, then after a refresh it's in INIT?

How does this differ from the cluster_exists arg?

@Michaelvll (Collaborator, Author) commented Dec 8, 2022:

I think this follows the logic of the previous code: any cluster in INIT status is allowed to fail over to other regions.

However, after thinking more about this, here are the pros and cons of adding INIT to the list as well:

Environment

A cluster exists and is in the INIT state.

sky start -c cluster

Current behavior:

  1. try to launch the cluster with the same spec.
  2. If ray up fails, we terminate the cluster
  3. Relaunch with the same resources as cluster (same region/zone/accelerators)

New behavior

Adding INIT to the list (or using cluster_exists directly):

  1. try to launch the cluster with the same spec.
  2. If ray up fails, we try to stop the cluster
  3. Do not fail over; print that the cluster could not be launched

sky launch -c cluster

Current behavior:

  1. try to launch the cluster with the same spec.
  2. If ray up fails, we terminate the cluster
  3. Relaunch and fail over with empty resources (i.e., a CPU instance), regardless of what the previous cluster had

New behavior

Adding INIT to the list

  1. try to launch the cluster with the same spec.
  2. If ray up fails, we try to stop the cluster
  3. Do not fail over; print that the cluster could not be launched

sky launch -c cluster task.yaml

Current behavior

  1. try to launch the cluster with the same spec.
  2. If ray up fails, we terminate the cluster
  3. Relaunch and failover with the task.resources

New behavior

Adding INIT to the list

  1. try to launch the cluster with the same spec.
  2. If ray up fails, we try to stop the cluster
  3. Do not fail over; print that the cluster could not be launched

Pro for adding INIT: more consistent behavior across the three commands.
Con for adding INIT: any INIT cluster in the status table needs an explicit sky down before failover can work.
For example, if a user Ctrl+C's the sky launch during failover, they will have to run sky down before sky launch can fail over to other regions/clouds.

Based on the discussion above, I would prefer to add INIT to the list as well, to make the failover more conservative and avoid accidental termination of the user's cluster.

Wdyt?
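The new behavior described for all three commands above can be sketched as follows (a minimal sketch; the function and exception names are hypothetical, and `stop`/`terminate` stand in for the backend's actual teardown paths):

```python
class ProvisionError(Exception):
    """Raised when `ray up` fails and failover is not allowed."""


def handle_ray_up_failure(prev_cluster_exists: bool, stop, terminate) -> bool:
    """Decide what to do after `ray up` fails in a region.

    Returns True if the caller should fail over to another region/cloud.
    """
    if prev_cluster_exists:
        # New behavior: a pre-existing cluster (UP/STOPPED/INIT) is stopped,
        # never terminated, and no failover is attempted.
        stop()
        raise ProvisionError('Failed to launch the existing cluster; '
                             'not failing over. Run `sky down` to remove it.')
    # No previous cluster: clean up the partially created resources and
    # let the caller retry elsewhere.
    terminate()
    return True
```

This keeps the teardown decision in one place, so `sky start` and both forms of `sky launch` behave consistently once a previous cluster is detected.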

@concretevitamin (Member) replied:

Agreed with going with the new behavior, with some asterisks to discuss in the future.

For example, it's a bit unintuitive to me that a start/launch may transition INIT to STOPPED. I see why that may be desired, but it may make the state transitions more complex. TBD.

@concretevitamin (Member) left a comment:

LGTM, thanks @Michaelvll.

sky/backends/cloud_vm_ray_backend.py:
"""
# task.best_resources may not be equal to to_provision if the user
# is running a job with less resources than the cluster has.
cloud = to_provision.cloud
# This can raise a ResourceUnavailableError, when the region/zones requested
# does not appear in the catalog. It can be triggered when the user changed
# the catalog file, while there is a cluster in the removed region/zone.
@concretevitamin (Member) commented:

I feel we should add a FIXME here to make Cloud.make_deploy_resources_variables() throw a different error (the fact that it throws an error is somewhat surprising; maybe we can consider removing such an error?). ResourceUnavailableError should be and is currently heavily used only for physical resource unavailable errors from the cloud provider. Wdyt?

@Michaelvll (Collaborator, Author) replied:

That is a good point! I agree that we should probably have a different exception for resources not found in the catalog. Also, it would be better if we could check this before the function is called, since the function name does not indicate that it performs any checks; it just translates the requirements into the ray YAML. Added a TODO for it.
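One way to separate the two failure modes, as suggested above (the class and function names here are hypothetical illustrations, not SkyPilot's actual ones):

```python
class ResourcesUnavailableError(Exception):
    """The cloud provider has no capacity for the requested resources."""


class ResourcesMismatchError(Exception):
    """The requested region/zone is not present in the local catalog."""


def check_region_in_catalog(region: str, catalog_regions: set) -> None:
    # Hypothetical pre-check: validate the request against the catalog up
    # front, so the YAML-translation function never needs to raise itself.
    if region not in catalog_regions:
        raise ResourcesMismatchError(
            f'Region {region!r} not found in the catalog; the catalog may '
            'have changed since the cluster was created.')
```

With a distinct exception type, the failover loop can keep treating `ResourcesUnavailableError` as retryable while surfacing catalog mismatches as user-facing errors.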

@concretevitamin (Member) replied:

Totally agreed that this func name does not suggest it'd do anything that can trigger exceptions. The TODO looks good to me.

"""
# task.best_resources may not be equal to to_provision if the user
# is running a job with less resources than the cluster has.
cloud = to_provision.cloud
# This can raise a ResourceUnavailableError, when the region/zones requested
# does not appear in the catalog. It can be triggered when the user changed
# the catalog file, while there is a cluster in the removed region/zone.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally agreed that this func name does not suggest it'd do anything that can trigger exceptions. The TODO looks good to me.

@Michaelvll Michaelvll merged commit d7e777f into master Dec 9, 2022
@Michaelvll Michaelvll deleted the failover-leakage branch December 9, 2022 06:05
iojw pushed a commit that referenced this pull request Feb 18, 2023
* enable requested ap region

* Fix failover leakage bug

* revert fetch aws

* fix

* format

* fix

* partially address

* drop nan

* Use inner join

* rename variable

* address comments

* use cluster_exists
Successfully merging this pull request may close these issues.

[Failover] The stopped cluster will be removed from the cluster table if failed to restart