
Clean up preempted resources for TPU #1483

Merged: 9 commits merged into master from spot-tpu-bug-fix on Dec 6, 2022
Conversation

@infwinston (Member) commented Dec 1, 2022

This PR implements a better fix for #1468.
The preempted VM has to be cleaned up, otherwise it keeps occupying the quota.

Another option is to modify the Autoscaler node provider, but I think that might complicate the semantics of the create_node function, because for normal compute VMs we instead want to keep and reuse the preempted resources.

def create_node(self, base_config: dict, tags: dict, count: int) -> Dict[str, dict]:

Explicitly calling terminate_cluster at a higher level might be better. @Michaelvll wdyt?

def terminate_cluster(self, max_retry: int = 3) -> None:

TODO

  • Spot-related smoke tests
  • Test the recovery from real preemption (in progress)

Comment on lines 310 to 314
# Note: Preempted TPU VM cannot be reused and needs to be
# cleaned up. Otherwise, it will occupy the quota.
is_tpuvm = tpu_utils.is_tpu_vm(new_resources)
if is_tpuvm:
self.terminate_cluster()
@Michaelvll (Collaborator) commented Dec 1, 2022

This will only be effective for a managed spot VM. What happens if sky status -r is called when there is a preempted spot TPU VM in the status table? What status do we show for that cluster?

I am leaning towards having the termination in refresh_cluster_status in backend_utils.py, so that a spot TPU VM launched with --use-spot can be handled correctly as well. @infwinston

@infwinston (Member, Author) replied:

Looks like we already clean up a preempted spot TPU VM during sky status -r:

# GCP does not clean up preempted TPU VMs. We remove it ourselves.
# TODO(wei-lin): handle multi-node cases.
if use_tpu_vm and len(status_list) == 0:
backend = backends.CloudVmRayBackend()
handle = global_user_state.get_handle_from_cluster_name(cluster)
backend.teardown_no_lock(handle,

Do you mean we should just call refresh_cluster_status here instead of self.terminate_cluster()?

@Michaelvll (Collaborator) replied:
Discussed offline. We decided to move the termination of the cluster into refresh_cluster_status.
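
As a rough sketch (not the PR's actual diff; the helper name, the tpu_utils import path, the launched_resources attribute, and the teardown_no_lock arguments are assumptions based on the snippets quoted above), the cleanup in the refresh path could look like:

# Hypothetical helper for the `sky status -r` refresh path in backend_utils.py.
# Names and argument values below are assumptions, not the merged code.
from sky import backends, global_user_state
from sky.utils import tpu_utils  # import path assumed

def _maybe_clean_up_preempted_tpu_vm(cluster: str, status_list: list) -> bool:
    """Terminate a preempted TPU VM found while refreshing cluster status.

    GCP does not remove preempted TPU VMs, and they cannot be restarted,
    so a TPU VM cluster with no live nodes is torn down and treated as gone.
    """
    handle = global_user_state.get_handle_from_cluster_name(cluster)
    if handle is None:
        return False
    # `launched_resources` is assumed to be the resources stored on the handle.
    if tpu_utils.is_tpu_vm(handle.launched_resources) and len(status_list) == 0:
        backend = backends.CloudVmRayBackend()
        # `terminate=True` is an assumed keyword argument of teardown_no_lock.
        backend.teardown_no_lock(handle, terminate=True)
        return True
    return False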

@Michaelvll (Collaborator) left a review:

Thank you for the fix and the refactoring @infwinston! Left several comments, mostly about the code comments. We need to run the smoke tests for managed spot to make sure the changes do not cause issues. : )

sky/spot/controller.py — 8 review threads (resolved)
@infwinston (Member, Author):

Smoke test passed. Still waiting for a real preemption to happen.

@infwinston (Member, Author) commented Dec 3, 2022

Just confirmed that status -r will clean up a preempted TPU VM!

@Michaelvll Michaelvll self-requested a review December 3, 2022 01:00
@concretevitamin (Member):
(Haven't read the PR.) Is there a strong reason to make status -r do a cleanup action? So far, users' mental model of this flag is to reconcile the cached cluster status with the cloud's status, which doesn't involve an active "cleanup".

@infwinston (Member, Author):
The main reason is that there's currently no suitable state for a preempted TPU VM. Right now we have:

global_user_state.ClusterStatus.INIT
global_user_state.ClusterStatus.UP
global_user_state.ClusterStatus.STOPPED

None of them applies to a preempted instance, and there seems to be no benefit in creating a new state just for it, because a preempted TPU VM only occupies quota and has no use. Users probably have no reason to keep it. This behavior also matches AWS, which removes the preempted instance automatically.

Hence status -r treats a preempted TPU VM as gone and removes it. Does that make sense to you?

@infwinston (Member, Author):
Just got a preemption and my spot TPU Pod successfully recovered. Also, the preempted one was cleaned up.

@concretevitamin (Member) commented Dec 3, 2022

Thanks for the details! Questions:

  1. What about using INIT? It already refers to the cluster being possibly up or down; basically any abnormal state can be mapped to INIT.

  2. What do we do in today's main branch for an AWS unmanaged spot instance getting preempted; what would status -r before this PR show?

  3. In AWS, the preemption behavior can be explicitly set to stop or terminate. So it's plausible to refresh a preempted node to INIT if it's stopped on preemption.

(wei-lin) Sorry, I accidentally edited your reply.

@infwinston (Member, Author):
  1. What about using INIT? It already refers to the cluster being possibly up or down; basically any abnormal state can be mapped to INIT.

I think it's also okay to map it to INIT, but I don't think there's much difference from removing it, because the VM/disk is gone and there's no way to recover. Users would have to manually remove it, which can be a hassle. @Michaelvll wdyt?

  2. What do we do in today's main branch for an AWS unmanaged spot instance getting preempted; what would status -r before this PR show?

I think a preempted AWS VM will be removed after status -r since it's terminated? @Michaelvll is that true?

  3. In AWS, the preemption behavior can be explicitly set to stop or terminate. So it's plausible to refresh a preempted node to INIT if it's stopped on preemption.

I think if a preempted VM is set to stop, then status -r will already show STOPPED for it?
This also happens to a normal GCP spot VM, which will be set to stopped if preempted; our status -r reflects this as well.
However, for a TPU VM there's no such option, and preemption always means termination.

To me, a preempted TPU VM is a garbage resource occupying quota. It needs to be cleaned up asap for the user's convenience. But both options are okay to me. I'll let @Michaelvll add more.
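
To make the distinction concrete, here is a small illustrative sketch (my own, not code from this PR; ClusterStatus below merely mirrors the global_user_state.ClusterStatus values listed earlier) of how preemption behavior maps to the refreshed status:

# Illustrative only: how preemption behavior could map to a refreshed status.
from enum import Enum
from typing import Optional

class ClusterStatus(Enum):  # mirrors global_user_state.ClusterStatus
    INIT = 'INIT'
    UP = 'UP'
    STOPPED = 'STOPPED'

def status_after_preemption(is_tpu_vm: bool,
                            stops_on_preemption: bool) -> Optional[ClusterStatus]:
    """Return the refreshed status, or None if the cluster record is removed."""
    if is_tpu_vm:
        # TPU VMs are always terminated on preemption and cannot be reused;
        # the VM is cleaned up and the record is removed.
        return None
    if stops_on_preemption:
        # e.g. spot VMs configured (or defaulting) to stop on preemption.
        return ClusterStatus.STOPPED
    # Instances terminated on preemption disappear from the cloud,
    # so the record is removed as well.
    return None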

@Michaelvll (Collaborator) commented Dec 4, 2022

Please correct me if I am wrong @infwinston: the main reason we can safely delete the VM after the preemption is that the preempted TPU VM cannot be operated on in any way except termination, i.e. it is a "garbage resource" as @infwinston mentioned. In that case, it should be fine to clean it up with sky status -r.

I think it's also okay to map it to INIT, but I don't think there's much difference from removing it, because the VM/disk is gone and there's no way to recover. Users would have to manually remove it, which can be a hassle. @Michaelvll wdyt?

In my mind, a cluster in INIT can be restarted with a simple sky launch. However, a preempted TPU VM cannot be correctly restarted with ray up. In this case, we need to either terminate the cluster during sky status -r or make sky launch automatically delete the instance before restarting the cluster when it finds the cluster has been preempted.

I think a preempted AWS VM will be removed after status -r since it's terminated? @Michaelvll is that true?

Yes, for AWS, the preempted spot cluster (single-node) will be removed from our cluster table, since the cloud will set it to terminate.

In AWS, the preemption behavior can be explicitly set to stop or terminate. So it's plausible to refresh a preempted node to INIT if it's stopped on preemption.

If STOPPED on preemption, I think the cluster should be set to STOPPED instead of INIT. We may want to stay aligned with the status shown by the cloud provider.
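
As a concrete (hedged) illustration of "the preempted TPU VM cannot be operated on in any way except termination": one could query the TPU VM's state via gcloud and check for PREEMPTED before tearing it down. The state field and value are my reading of the TPU API, not something taken from this PR.

# Hypothetical check, not part of the PR: detect a preempted TPU VM via gcloud.
import json
import subprocess

def tpu_vm_is_preempted(name: str, zone: str) -> bool:
    """Return True if the TPU VM's reported state is PREEMPTED.

    A PREEMPTED TPU VM can only be deleted; it cannot be stopped or restarted,
    which is why the refresh path terminates it outright.
    """
    out = subprocess.run(
        ['gcloud', 'compute', 'tpus', 'tpu-vm', 'describe', name,
         '--zone', zone, '--format', 'json'],
        capture_output=True, text=True, check=True)
    return json.loads(out.stdout).get('state') == 'PREEMPTED'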

@concretevitamin (Member):
In this case, we need to either terminate the cluster during sky status -r or make sky launch automatically delete the instance before restarting the cluster when it finds the cluster has been preempted.

Doing it in either place seems to break "each command does one thing". Given that a preempted TPU VM cannot have any action taken on it except termination (please document this rationale in code), the former seems slightly better for now. I think it may be OK not to update the --refresh help string, since this is a corner case.

Thanks @infwinston @Michaelvll for clarifying.

@Michaelvll (Collaborator) left a review:

Thanks for the fix @infwinston! After another pass, I think our code can be improved a bit more. : )

sky/spot/controller.py — 3 review threads (resolved)
@infwinston (Member, Author):
@Michaelvll thanks for the detailed reviews. I addressed the comments; let me know if it looks good enough to ship :)

@Michaelvll (Collaborator) left a review:

Thanks for fixing it @infwinston! LGTM.

@infwinston (Member, Author):
Thanks. Smoke test passed again. Merging.

@infwinston infwinston merged commit ee73e7d into master Dec 6, 2022
@infwinston infwinston deleted the spot-tpu-bug-fix branch December 6, 2022 01:20
iojw pushed a commit that referenced this pull request Feb 18, 2023
* fix in controller

* remove debug msg

* msg

* handle job_status == None case and refactor

* space

* update

* comments

* comments