Clean up preempted resources for TPU #1483
Conversation
sky/spot/recovery_strategy.py (outdated)

```python
# Note: Preempted TPU VM cannot be reused and needs to be
# cleaned up. Otherwise, it will occupy the quota.
is_tpuvm = tpu_utils.is_tpu_vm(new_resources)
if is_tpuvm:
    self.terminate_cluster()
```
This will only be effective for a managed spot VM. What will happen if `sky status -r` is called when there is a preempted spot TPU VM in the status table? What status do we show for that cluster?
I am leaning towards having the termination in `refresh_cluster_status` in backend_utils.py, so that a spot TPU VM launched with `--use-spot` can be handled correctly as well. @infwinston
Looks like we clean up the preempted spot TPU VM during `sky status -r`.
skypilot/sky/backends/backend_utils.py, lines 1551 to 1556 in 1955bee:

```python
# GCP does not clean up preempted TPU VMs. We remove it ourselves.
# TODO(wei-lin): handle multi-node cases.
if use_tpu_vm and len(status_list) == 0:
    backend = backends.CloudVmRayBackend()
    handle = global_user_state.get_handle_from_cluster_name(cluster)
    backend.teardown_no_lock(handle,
```
Do you mean we should just call `refresh_cluster_status` here instead of `self.terminate_cluster()`?
Discussed offline. We decided to move the termination of the cluster into `refresh_cluster_status`.
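For context, here is a minimal sketch of what that refactor could look like inside the status-refresh path, based on the backend_utils.py snippet quoted above. The helper name, its signature, and the `terminate=True` keyword argument are assumptions for illustration, not the actual SkyPilot implementation; it also assumes it lives in backend_utils.py, where `backends` and `global_user_state` are already imported.

```python
# Illustrative sketch only; names and kwargs below are assumed.
def _cleanup_preempted_tpu_vm(cluster: str, use_tpu_vm: bool,
                              status_list: list) -> bool:
    """Terminate a preempted TPU VM found during a status refresh.

    Returns True if the cluster was torn down (and should be removed from
    the status table), False otherwise.
    """
    if not use_tpu_vm or len(status_list) > 0:
        return False
    # GCP does not clean up preempted TPU VMs; the dead VM still occupies
    # quota, so remove it ourselves. Doing this in the refresh path covers
    # both managed spot jobs and clusters launched with `--use-spot`.
    backend = backends.CloudVmRayBackend()
    handle = global_user_state.get_handle_from_cluster_name(cluster)
    backend.teardown_no_lock(handle, terminate=True)  # kwargs assumed
    return True
```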
Thank you for the fix and the refactoring @infwinston! Left several comments, mostly about the code comments. We need to run the smoke tests for managed spot to make sure the changes do not cause issues. : )
Smoke test passed. Still waiting for a real preemption to happen.
Just confirmed that
(Haven't read the PR) Is there a strong reason to make
The main reason is that currently there's no suitable state for a preempted TPU VM. Now we have
None of them applies to a preempted instance, and there seems to be no benefit in creating a new state just for it, because a preempted TPU VM only occupies quota and has no use. The user probably has no reason to keep it. This behavior also matches AWS, which removes the preempted instance automatically. Hence
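To ground the state discussion, here is a small sketch of the kind of status set being referred to. INIT and STOPPED are named later in this thread; UP and the enum layout are my assumptions, not SkyPilot's actual ClusterStatus definition.

```python
import enum


class ClusterStatus(enum.Enum):
    """Sketch of the cluster states under discussion (names partly assumed)."""
    INIT = 'INIT'        # being provisioned, or in an indeterminate state
    UP = 'UP'            # running and usable
    STOPPED = 'STOPPED'  # stopped, but can be restarted later

# A preempted TPU VM fits none of these: it cannot be restarted or reused and
# only holds on to quota, which is why it is terminated and dropped from the
# status table rather than given a new PREEMPTED state.
```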
Just got a preemption and my spot TPU Pod successfully recovered. Also, the preempted one is cleaned up.
Thanks for the details! Questions:
======= (wei-lin) Sorry, I accidentally edited your reply.
I think it's also okay to map it to
I think a preempted AWS VM will be removed after
I think if a preempted VM is set to stop, then
To me, a preempted TPU VM is a garbage resource occupying quota. It needs to be cleaned up ASAP for the user's convenience. But both options are okay to me. I'll let @Michaelvll add more.
Please correct me if I am wrong @infwinston: the main reason we can safely delete the VM after the preemption is that the preempted TPU VM cannot be operated on in any way except termination, i.e. it is a "garbage resource" as @infwinston mentioned. In that case, it should be fine to clean it up with
In my mind, the
Yes, for AWS, the preempted spot cluster (single-node) will be removed from our cluster table, since the cloud will set it to terminate.
If the VM is STOPPED on preemption, I think the cluster should be set to STOPPED instead of INIT. We may want to stay aligned with the status shown by the cloud provider.
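A small sketch of the mapping being proposed in this exchange. The function and its inputs are hypothetical, not SkyPilot code; it only encodes the rules stated above: terminate-on-preempt TPU VMs are cleaned up and removed from the table, and stop-on-preempt VMs show up as STOPPED.

```python
from typing import Optional


def status_after_preemption(is_tpu_vm: bool,
                            stopped_on_preemption: bool) -> Optional[str]:
    """Hypothetical helper encoding the policy discussed above.

    Returns the status to record for a preempted spot cluster, or None if
    the cluster should be terminated and dropped from the status table.
    """
    if is_tpu_vm:
        # A preempted TPU VM cannot be restarted or reused; it only holds
        # quota, so it is terminated and removed rather than given a status.
        return None
    if stopped_on_preemption:
        # Match the cloud provider's view: a spot VM configured to stop on
        # preemption shows up as STOPPED, not INIT.
        return 'STOPPED'
    # Terminate-on-preempt VMs disappear on the cloud side, so the cluster
    # is removed from the table as well (as with AWS spot instances above).
    return None
```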
Doing it in either case seems to break "each command does one thing". Given that a preempted TPU VM cannot be acted on in any way except being terminated (please document this rationale in code), the former seems slightly better now. I think it may be ok to not update the
Thanks @infwinston @Michaelvll for clarifying.
Thanks for the fix @infwinston! After another pass, I think our code can be improved a bit more. : )
@Michaelvll thanks for the detailed reviews. I fixed the comments; let me know if it looks good enough to ship :)
Thanks for fixing it @infwinston! LGTM.
Thanks. Smoke test passed again. Merging.
* fix in controller
* remove debug msg
* msg
* handle job_status == None case and refactor
* space
* update
* comments
* comments
This PR implements a better fix for #1468.
The preempted VM has to be cleaned up; otherwise it would occupy the quota.
Another choice is to modify the Autoscaler node provider, but I think that might complicate the semantics of the `create_node` function, because for a normal compute VM we instead want to keep and reuse the preempted resources.
skypilot/sky/skylet/providers/gcp/node_provider.py, line 178 in 1955bee
Explicitly calling `terminate_cluster` at a higher level might be better. @Michaelvll wdyt?
skypilot/sky/spot/recovery_strategy.py, line 99 in 1955bee
TODO
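To illustrate the trade-off described in this PR description, here is a rough sketch of the "explicit termination at a higher level" option, based on the recovery_strategy.py diff quoted earlier in the thread. The class name, the `recover` method, and the surrounding structure are assumptions; only `tpu_utils.is_tpu_vm` and `terminate_cluster` come from the quoted code, and `tpu_utils` is assumed to be imported as in recovery_strategy.py. As discussed above, the thread later settled on doing this inside `refresh_cluster_status` instead, so this reflects the option originally proposed here rather than the final implementation.

```python
class StrategyExecutor:
    """Sketch of a spot recovery strategy (structure assumed, not actual code)."""

    def recover(self, new_resources):
        # Keep create_node's semantics simple (normal compute VMs reuse
        # preempted resources) and handle the TPU-specific case explicitly
        # at this higher level instead of inside the node provider.
        if tpu_utils.is_tpu_vm(new_resources):
            # A preempted TPU VM cannot be reused and would keep occupying
            # quota, so tear it down before relaunching.
            self.terminate_cluster()
        # ... relaunch the cluster with new_resources ...
```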