[Failover] Fix leakage of existing cluster when failed to start #1497
Conversation
Thanks @Michaelvll; some questions.
sky/backends/cloud_vm_ray_backend.py (outdated)

```
@@ -1004,6 +1004,10 @@ def _retry_region_zones(self,
        # Get previous cluster status
        prev_cluster_status = backend_utils.refresh_cluster_status_handle(
            cluster_name, acquire_per_cluster_status_lock=False)[0]
        prev_cluster_exists = prev_cluster_status in [
```
Q: is it possible that a cluster can partially exist and be in INIT? E.g., launched 2 nodes, manually terminated 1, then after a refresh it's in INIT?

How does this differ from the `cluster_exists` arg?
I think this follows the logic in the previous code, where any cluster in INIT status is able to fail over to other regions.
However, after thinking more about this, here are the pros and cons of adding INIT to the list as well:
Environment: a cluster exists and is in INIT state.

`sky start -c cluster`

Current behavior:
- Try to launch the cluster with the same spec.
- If `ray up` fails, we terminate the cluster.
- Relaunch with the same resources as `cluster` (same region/zone/accelerators).

New behavior (adding INIT to the list, or using `cluster_exists` directly):
- Try to launch the cluster with the same spec.
- If `ray up` fails, we try to stop the cluster.
- Do not fail over; print that the cluster could not be launched.
`sky launch -c cluster`

Current behavior:
- Try to launch the cluster with the same spec.
- If `ray up` fails, we terminate the cluster.
- Relaunch and fail over with empty resources (i.e., a CPU instance), no matter what the previous cluster had.

New behavior (adding INIT to the list):
- Try to launch the cluster with the same spec.
- If `ray up` fails, we try to stop the cluster.
- Do not fail over; print that the cluster could not be launched.
`sky launch -c cluster task.yaml`

Current behavior:
- Try to launch the cluster with the same spec.
- If `ray up` fails, we terminate the cluster.
- Relaunch and fail over with `task.resources`.

New behavior (adding INIT to the list):
- Try to launch the cluster with the same spec.
- If `ray up` fails, we try to stop the cluster.
- Do not fail over; print that the cluster could not be launched.
Pro for adding INIT: more consistent behavior across the three commands.
Con for adding INIT: any INIT cluster in the status table needs an explicit `sky down` before failover can work. For example, if a user Ctrl+C's `sky launch` during failover, she will have to run `sky down` first before `sky launch` can fail over to other regions/clouds.

Based on the discussion above, I would prefer to add INIT to the list as well, to make the failover more conservative and avoid accidentally terminating the user's cluster.
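Concretely, the failure handling under the new behavior would look roughly like the sketch below; only `prev_cluster_status`, `prev_cluster_exists`, and the INIT/UP/STOPPED statuses come from this discussion, while the enum stand-in and helper function are placeholders for illustration:

```python
import enum
from typing import Optional


class ClusterStatus(enum.Enum):
    """Minimal stand-in for the cluster status enum (sketch only)."""
    INIT = 'INIT'
    UP = 'UP'
    STOPPED = 'STOPPED'


def handle_ray_up_failure(prev_cluster_status: Optional[ClusterStatus]) -> str:
    """Placeholder for the decision taken when `ray up` fails."""
    prev_cluster_exists = prev_cluster_status in [
        ClusterStatus.STOPPED,
        ClusterStatus.UP,
        ClusterStatus.INIT,  # the proposed addition
    ]
    if prev_cluster_exists:
        # Conservative path: stop (never terminate) the user's existing
        # cluster, do not fail over, and surface the launch failure.
        return 'stop cluster; no failover; report the launch failure'
    # No previous cluster: terminate the partially provisioned nodes and
    # keep failing over to the next candidate region/zone.
    return 'terminate partial cluster; fail over to the next region/zone'


# An INIT cluster is now treated as existing, so it is stopped, not terminated.
print(handle_ray_up_failure(ClusterStatus.INIT))
```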
Wdyt?
Agreed with going with the new behavior, with some asterisks to discuss in the future. For example, it's a bit unintuitive to me that a `start`/`launch` may transition INIT to STOPPED. I see why that may be desired, but it may make the state transitions more complex. TBD.
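For reference, the transition in question, summarized from the scenarios above (a sketch for comparison, not an authoritative state machine):

```python
# Outcome for an existing INIT cluster whose relaunch (`ray up`) fails.
OLD_BEHAVIOR = {'INIT': 'terminated, then relaunched via failover'}
NEW_BEHAVIOR = {'INIT': 'STOPPED; launch failure reported, no failover'}
```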
LGTM, thanks @Michaelvll.
""" | ||
# task.best_resources may not be equal to to_provision if the user | ||
# is running a job with less resources than the cluster has. | ||
cloud = to_provision.cloud | ||
# This can raise a ResourceUnavailableError, when the region/zones requested | ||
# does not appear in the catalog. It can be triggered when the user changed | ||
# the catalog file, while there is a cluster in the removed region/zone. |
I feel we should add a FIXME here to make `Cloud.make_deploy_resources_variables()` throw a different error (the fact that it throws an error is somewhat surprising; maybe we can consider removing such an error?). `ResourceUnavailableError` should be, and currently is, used only for physical resource-unavailability errors from the cloud provider. Wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good point! I agree that we should probably have a different exception for resources not found in the catalog. Also, it would be better if we could check for this issue before this function, as the function name does not indicate any checks; it just translates the requirements into the Ray YAML. Added a TODO for it.
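A hypothetical shape for the dedicated exception and pre-check being discussed (the names below are made up for illustration; this PR only adds a TODO rather than introducing them):

```python
from typing import Dict, Optional, Set


class ResourcesNotFoundInCatalogError(Exception):
    """Hypothetical error: the requested region/zone is absent from the catalog."""


def check_region_zone_in_catalog(region: str, zone: Optional[str],
                                 catalog: Dict[str, Set[str]]) -> None:
    """Hypothetical pre-check, run before translating resources to Ray YAML."""
    zones = catalog.get(region)
    if zones is None or (zone is not None and zone not in zones):
        raise ResourcesNotFoundInCatalogError(
            f'{region}/{zone} is not in the catalog; the catalog may have '
            'changed after the cluster was created.')
```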
Totally agreed that this func name does not suggest it'd do anything that can trigger exceptions. The TODO looks good to me.
""" | ||
# task.best_resources may not be equal to to_provision if the user | ||
# is running a job with less resources than the cluster has. | ||
cloud = to_provision.cloud | ||
# This can raise a ResourceUnavailableError, when the region/zones requested | ||
# does not appear in the catalog. It can be triggered when the user changed | ||
# the catalog file, while there is a cluster in the removed region/zone. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally agreed that this func name does not suggest it'd do anything that can trigger exceptions. The TODO looks good to me.
Commits:
- enable requested ap region
- Fix failover leakage bug
- revert fetch aws
- fix
- format
- fix
- partially address
- drop nan
- Use inner join
- rename variable
- address comments
- use cluster_exists
Closes #1496.
Tested:
tests/run_smoke_tests.sh