[Failover] Fix leakage of existing cluster when failed to start #1497
Conversation
Thanks @Michaelvll; some questions.
sky/backends/cloud_vm_ray_backend.py (outdated)

```
@@ -1004,6 +1004,10 @@ def _retry_region_zones(self,
        # Get previous cluster status
        prev_cluster_status = backend_utils.refresh_cluster_status_handle(
            cluster_name, acquire_per_cluster_status_lock=False)[0]
        prev_cluster_exists = prev_cluster_status in [
```
Q: is it possible that a cluster can partially exist and be in INIT? E.g., launched 2 nodes, manually terminated 1, then after a refresh it's in INIT?

How does this differ from the `cluster_exists` arg?
I think this follows the logic in the previous code, where any cluster in INIT status is able to fail over to other regions.
However, after thinking more about this, here are the pros and cons of adding INIT to the list as well:
Environment: a cluster exists and is in INIT state.

`sky start -c cluster`

Current behavior:
- Try to launch the cluster with the same spec.
- If `ray up` fails, we terminate the cluster.
- Relaunch with the same resources as `cluster` (same region/zone/accelerators).

New behavior (adding INIT to the list, or using `cluster_exists` directly):
- Try to launch the cluster with the same spec.
- If `ray up` fails, we try to stop the cluster.
- Do not fail over; print that the cluster could not be launched.
`sky launch -c cluster`

Current behavior:
- Try to launch the cluster with the same spec.
- If `ray up` fails, we terminate the cluster.
- Relaunch and fail over with empty resources (i.e., a CPU instance), no matter what the previous cluster had.

New behavior (adding INIT to the list):
- Try to launch the cluster with the same spec.
- If `ray up` fails, we try to stop the cluster.
- Do not fail over; print that the cluster could not be launched.
`sky launch -c cluster task.yaml`

Current behavior:
- Try to launch the cluster with the same spec.
- If `ray up` fails, we terminate the cluster.
- Relaunch and fail over with `task.resources`.

New behavior (adding INIT to the list):
- Try to launch the cluster with the same spec.
- If `ray up` fails, we try to stop the cluster.
- Do not fail over; print that the cluster could not be launched.
Pro for adding INIT: more consistent behavior across the three commands.
Con for adding INIT: any INIT cluster in the status table needs an explicit `sky down` before failover can work. For example, if a user Ctrl+C's `sky launch` during failover, she will have to run `sky down` first before `sky launch` can fail over to other regions/clouds.

Based on the discussion above, I would prefer to add INIT to the list as well, to make the failover more conservative and avoid accidentally terminating the user's cluster.
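Concretely, the failure handling under the new behavior would look roughly like the sketch below; only `prev_cluster_status`, `prev_cluster_exists`, and the INIT/UP/STOPPED statuses come from this discussion, while the enum stand-in and helper function are placeholders for illustration:

```python
import enum
from typing import Optional


class ClusterStatus(enum.Enum):
    """Minimal stand-in for the cluster status enum (sketch only)."""
    INIT = 'INIT'
    UP = 'UP'
    STOPPED = 'STOPPED'


def handle_ray_up_failure(prev_cluster_status: Optional[ClusterStatus]) -> str:
    """Placeholder for the decision taken when `ray up` fails."""
    prev_cluster_exists = prev_cluster_status in [
        ClusterStatus.STOPPED,
        ClusterStatus.UP,
        ClusterStatus.INIT,  # the proposed addition
    ]
    if prev_cluster_exists:
        # Conservative path: stop (never terminate) the user's existing
        # cluster, do not fail over, and surface the launch failure.
        return 'stop cluster; no failover; report the launch failure'
    # No previous cluster: terminate the partially provisioned nodes and
    # keep failing over to the next candidate region/zone.
    return 'terminate partial cluster; fail over to the next region/zone'


# An INIT cluster is now treated as existing, so it is stopped, not terminated.
print(handle_ray_up_failure(ClusterStatus.INIT))
```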
Wdyt?
Agreed with going with the new behavior, with some asterisks to discuss in the future. For example, it's a bit unintuitive to me that a `start`/`launch` may transition INIT to STOPPED. I see why that may be desired, but it may make the state transitions more complex. TBD.
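For reference, the transition in question, summarized from the scenarios above (a sketch for comparison, not an authoritative state machine):

```python
# Outcome for an existing INIT cluster whose relaunch (`ray up`) fails.
OLD_BEHAVIOR = {'INIT': 'terminated, then relaunched via failover'}
NEW_BEHAVIOR = {'INIT': 'STOPPED; launch failure reported, no failover'}
```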
LGTM, thanks @Michaelvll.
""" | ||
# task.best_resources may not be equal to to_provision if the user | ||
# is running a job with less resources than the cluster has. | ||
cloud = to_provision.cloud | ||
# This can raise a ResourceUnavailableError, when the region/zones requested | ||
# does not appear in the catalog. It can be triggered when the user changed | ||
# the catalog file, while there is a cluster in the removed region/zone. |
I feel we should add a FIXME here to make `Cloud.make_deploy_resources_variables()` throw a different error (the fact that it throws an error is somewhat surprising; maybe we can consider removing such an error?). `ResourceUnavailableError` should be, and currently is, used only for physical resource-unavailability errors from the cloud provider. Wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good point! I agree that we should probably have a different exception for resources not found in the catalog. Also, it would be better if we could check for this issue before this function, as the function name does not indicate any checks; it just translates the requirements into the Ray YAML. Added a TODO for it.
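A hypothetical shape for the dedicated exception and pre-check being discussed (the names below are made up for illustration; this PR only adds a TODO rather than introducing them):

```python
from typing import Dict, Optional, Set


class ResourcesNotFoundInCatalogError(Exception):
    """Hypothetical error: the requested region/zone is absent from the catalog."""


def check_region_zone_in_catalog(region: str, zone: Optional[str],
                                 catalog: Dict[str, Set[str]]) -> None:
    """Hypothetical pre-check, run before translating resources to Ray YAML."""
    zones = catalog.get(region)
    if zones is None or (zone is not None and zone not in zones):
        raise ResourcesNotFoundInCatalogError(
            f'{region}/{zone} is not in the catalog; the catalog may have '
            'changed after the cluster was created.')
```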
Totally agreed that this func name does not suggest it'd do anything that can trigger exceptions. The TODO looks good to me.
""" | ||
# task.best_resources may not be equal to to_provision if the user | ||
# is running a job with less resources than the cluster has. | ||
cloud = to_provision.cloud | ||
# This can raise a ResourceUnavailableError, when the region/zones requested | ||
# does not appear in the catalog. It can be triggered when the user changed | ||
# the catalog file, while there is a cluster in the removed region/zone. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally agreed that this func name does not suggest it'd do anything that can trigger exceptions. The TODO looks good to me.
Commits:
- enable requested ap region
- Fix failover leakage bug
- revert fetch aws
- fix
- format
- fix
- partially address
- drop nan
- Use inner join
- rename variable
- address comments
- use cluster_exists
Closes #1496.
Tested:
tests/run_smoke_tests.sh