[core] Avoid terminating cluster for resources unavailability #2170

Michaelvll · 2023-07-02T20:29:16Z

Fixes #2169

Tested (run the relevant ones):

Code formatting: bash format.sh
Any manual or new tests for this PR (please specify below)
- Reproducible code in [core] Unexpected termination of user's previous cluster when resource capacity issue happens #2169
All smoke tests: pytest tests/test_smoke.py

concretevitamin · 2023-07-02T22:55:29Z

sky/skylet/log_lib.py

Is this change for the same issue?

Oops, it is from #2166. We can merge that PR first, as otherwise debugging is quite hard.

I reverted the changes to make it easier to review.

concretevitamin · 2023-07-02T22:56:59Z

sky/backends/cloud_vm_ray_backend.py

+                # This is important for the case, where an existing is
+                # transitioned into INIT state due to key interruption during
+                # launching, with the following steps:
+                # (1) launch, after answering prompt immediately ctrl-c;


What happens if we do these two steps for a new cluster name? I imagine with this PR, at step 2 we should not set it to STOPPED and we should do the provisioning loop as usual.

We won't set it to STOPPED for a new cluster, because a new cluster can only be in one of the following two cases:

- The cluster is provisioned but not yet correctly set up. Then the cluster will be in INIT state, and our failover will still be triggered.
- The cluster is not provisioned. Then the cluster will be removed from the cluster table when we refresh the status, so the failover will be correctly triggered as well.
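
The two cases above can be sketched roughly as follows. This is only an illustration of the reasoning, not SkyPilot's implementation: `ClusterStatus`, `refreshed_status`, and `new_cluster_triggers_failover` are hypothetical names, and the status refresh is simulated by a plain function rather than a query to the cloud.

```python
import enum
from typing import Optional


class ClusterStatus(enum.Enum):
    """Hypothetical stand-in for the cluster states named in this thread."""
    INIT = 'INIT'
    UP = 'UP'
    STOPPED = 'STOPPED'


def refreshed_status(provisioned_on_cloud: bool,
                     setup_finished: bool) -> Optional[ClusterStatus]:
    """Simulates a status refresh for a brand-new cluster name.

    Returns None when nothing was provisioned, modelling the record being
    removed from the cluster table during the refresh.
    """
    if not provisioned_on_cloud:
        return None
    return ClusterStatus.UP if setup_finished else ClusterStatus.INIT


def new_cluster_triggers_failover(status: Optional[ClusterStatus]) -> bool:
    """For a new cluster name, a missing or INIT cluster still goes through
    the provisioning (failover) loop; it is never marked STOPPED here."""
    return status is None or status is ClusterStatus.INIT
```

Under these assumptions, both `refreshed_status(True, False)` (provisioned, setup interrupted) and `refreshed_status(False, False)` (never provisioned) lead to `new_cluster_triggers_failover(...)` returning `True`.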

This repro gave

Running task on cluster dbg2...
I 07-03 09:45:21 cloud_vm_ray_backend.py:3788] The cluster 'dbg2' was autodowned or manually terminated on the cloud console. Using the same resources as the previously terminated one to provision a new cluster.
I 07-03 09:45:21 cloud_vm_ray_backend.py:3813] Creating a new cluster: "dbg2" [1x AWS(m6i.large)].

Maybe we should change L3788's logging to the following (or something clearer):

The cluster 'dbg2' (status: XXX) was not found on the cloud: it may be autodowned, manually terminated, or its launch never succeeded. Provisioning a new cluster by using the same resources as its original launch.

Good point! Updated the logging. Tested again with:

- sky launch -c min --cloud gcp --cpus 2
- Manually terminate the cluster on the console.
- Run python -c 'import sky; sky.launch(sky.Task(), cluster_name="min")' again.

concretevitamin

Thanks @Michaelvll ! Some questions.

sky/backends/backend_utils.py

concretevitamin · 2023-07-03T04:14:19Z

sky/spot/spot_utils.py

@@ -749,13 +749,13 @@ def is_spot_controller_up(
          identity.
    """
    try:
-        # Set force_refresh=False to make sure the refresh only happens when the
+        # Set force_refresh=None to make sure the refresh only happens when the


Q: why is setting it to None the same as “refresh only when the controller is init/up”?

This is because the spot controller always has autostop set, which triggers the refresh in both the INIT and UP cases for the controller.
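
A minimal sketch of that reasoning, under stated assumptions: `needs_refresh` and `ClusterStatus` are hypothetical names, `force_refresh` is modelled as an optional set of statuses (following the docstring wording in this PR), and the default rule "refresh when autostop is set and the cluster is INIT or UP" is inferred from the answer above rather than taken from the actual code.

```python
import enum
from typing import Optional, Set


class ClusterStatus(enum.Enum):
    """Hypothetical stand-in for the cluster states named in this thread."""
    INIT = 'INIT'
    UP = 'UP'
    STOPPED = 'STOPPED'


def needs_refresh(status: ClusterStatus,
                  autostop_set: bool,
                  force_refresh: Optional[Set[ClusterStatus]] = None) -> bool:
    """Decides whether a cached cluster status should be refreshed."""
    # Forced path: the caller explicitly listed the cached status.
    if force_refresh is not None and status in force_refresh:
        return True
    # Default path (assumed): a cluster with autostop set may have stopped
    # on its own, so an INIT or UP record cannot be trusted without a refresh.
    return autostop_set and status in (ClusterStatus.INIT, ClusterStatus.UP)
```

With a controller that always has autostop set, `needs_refresh(status, autostop_set=True)` is true for INIT and UP even when `force_refresh` is left as `None`, which is the behavior the comment in the diff describes.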

sky/backends/cloud_vm_ray_backend.py

concretevitamin · 2023-07-03T04:23:23Z

sky/backends/backend_utils.py

-        force_refresh: if True, refresh the cluster status even if it may be
-            skipped. Otherwise (the default), only refresh if the cluster:
+        force_refresh: if specified, refresh the cluster in the specified status
+            even if it may be skipped. Otherwise (the default), only refresh if


Additionally, always refresh in either of these cases:

Co-authored-by: Zongheng Yang <[email protected]>

concretevitamin

LGTM, thanks for identifying and fixing the critical issue @Michaelvll!

Michaelvll requested a review from concretevitamin July 2, 2023 20:43

concretevitamin reviewed Jul 2, 2023

concretevitamin reviewed Jul 3, 2023

Michaelvll and others added 7 commits July 2, 2023 22:27

fix key interruption

ff7272e

fix

460ceaf

format

1d10976

fix spot

c485ffe

fix comment

98bb90a

Address comments

148bad0

Update sky/backends/cloud_vm_ray_backend.py

318c114

Co-authored-by: Zongheng Yang <[email protected]>

Michaelvll force-pushed the avoid-terminate-cluster-for-resources-unavailability branch from 7264631 to 318c114 July 3, 2023 05:27

Michaelvll added 3 commits July 2, 2023 22:28

revert log lib

fd09434

revert log_lib

1194bbd

Merge branch 'master' of github.com:skypilot-org/skypilot into avoid-terminate-cluster-for-resources-unavailability

a8d2840

concretevitamin approved these changes Jul 3, 2023

Michaelvll and others added 3 commits July 3, 2023 10:24

Update sky/backends/backend_utils.py

7908c32

Co-authored-by: Zongheng Yang <[email protected]>

Merge branch 'avoid-terminate-cluster-for-resources-unavailability' of github.com:skypilot-org/skypilot into avoid-terminate-cluster-for-resources-unavailability

b1f638e

better logging

949c4fb

Michaelvll merged commit 484617a into master Jul 3, 2023

Michaelvll deleted the avoid-terminate-cluster-for-resources-unavailability branch July 3, 2023 19:01
