
Clean up preempted resources for TPU #1483

Merged: 9 commits merged into master from spot-tpu-bug-fix on Dec 6, 2022
Conversation

@infwinston (Member) commented Dec 1, 2022

This PR implements a better fix for #1468.
The preempted VM has to be cleaned up, otherwise it keeps occupying the quota.

Another option is to modify the Autoscaler node provider, but I think that might complicate the semantics of the create_node function, because for normal compute VMs we instead want to keep and reuse the preempted resources.

def create_node(self, base_config: dict, tags: dict, count: int) -> Dict[str, dict]:

Explicitly calling terminate_cluster at a higher level might be better. @Michaelvll wdyt?

def terminate_cluster(self, max_retry: int = 3) -> None:

TODO

  • Spot-related smoke tests
  • Test the recovery from real preemption (in progress)

Comment on lines 310 to 314
# Note: Preempted TPU VM cannot be reused and needs to be
# cleaned up. Otherwise, it will occupy the quota.
is_tpuvm = tpu_utils.is_tpu_vm(new_resources)
if is_tpuvm:
self.terminate_cluster()
@Michaelvll (Collaborator) commented Dec 1, 2022

This will only be effective for a managed spot VM. What happens if sky status -r is called when there is a preempted spot TPU VM in the status table? What status do we show for that cluster?

I am leaning towards having the termination in refresh_cluster_status in backend_utils.py, so that a spot TPU VM launched with --use-spot can be handled correctly as well. @infwinston

@infwinston (Member, Author) replied:

Looks like we already clean up a preempted spot TPU VM during sky status -r:

# GCP does not clean up preempted TPU VMs. We remove it ourselves.
# TODO(wei-lin): handle multi-node cases.
if use_tpu_vm and len(status_list) == 0:
backend = backends.CloudVmRayBackend()
handle = global_user_state.get_handle_from_cluster_name(cluster)
backend.teardown_no_lock(handle,

Do you mean we should just call refresh_cluster_status here instead of self.terminate_cluster()?

@Michaelvll (Collaborator) replied:
Discussed offline. We decided to move the termination of the cluster into refresh_cluster_status.
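
As a rough sketch (not the PR's actual diff; the helper name, the tpu_utils import path, the launched_resources attribute, and the teardown_no_lock arguments are assumptions based on the snippets quoted above), the cleanup in the refresh path could look like:

# Hypothetical helper for the `sky status -r` refresh path in backend_utils.py.
# Names and argument values below are assumptions, not the merged code.
from sky import backends, global_user_state
from sky.utils import tpu_utils  # import path assumed

def _maybe_clean_up_preempted_tpu_vm(cluster: str, status_list: list) -> bool:
    """Terminate a preempted TPU VM found while refreshing cluster status.

    GCP does not remove preempted TPU VMs, and they cannot be restarted,
    so a TPU VM cluster with no live nodes is torn down and treated as gone.
    """
    handle = global_user_state.get_handle_from_cluster_name(cluster)
    if handle is None:
        return False
    # `launched_resources` is assumed to be the resources stored on the handle.
    if tpu_utils.is_tpu_vm(handle.launched_resources) and len(status_list) == 0:
        backend = backends.CloudVmRayBackend()
        # `terminate=True` is an assumed keyword argument of teardown_no_lock.
        backend.teardown_no_lock(handle, terminate=True)
        return True
    return False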

@Michaelvll (Collaborator) left a review:

Thank you for the fix and the refactoring @infwinston! Left several comments, mostly about the code comments. We need to run the smoke tests for managed spot to make sure the changes do not cause issues. : )

sky/spot/controller.py — 8 review threads (resolved)
@infwinston (Member, Author):

Smoke test passed. Still waiting for a real preemption to happen.

@infwinston (Member, Author) commented Dec 3, 2022

Just confirmed that status -r will clean up a preempted TPU VM!

@Michaelvll Michaelvll self-requested a review December 3, 2022 01:00
@concretevitamin (Member):
(Haven't read the PR.) Is there a strong reason to make status -r do a cleanup action? So far, users' mental model of this flag is to reconcile the cached cluster status with the cloud's status, which doesn't involve an active "cleanup".

@infwinston (Member, Author):
The main reason is that there's currently no suitable state for a preempted TPU VM. Right now we have:

global_user_state.ClusterStatus.INIT
global_user_state.ClusterStatus.UP
global_user_state.ClusterStatus.STOPPED

None of them applies to a preempted instance, and there seems to be no benefit in creating a new state just for it, because a preempted TPU VM only occupies quota and has no use. Users probably have no reason to keep it. This behavior also matches AWS, which removes the preempted instance automatically.

Hence status -r treats a preempted TPU VM as gone and removes it. Does that make sense to you?

@infwinston (Member, Author):
Just got a preemption and my spot TPU Pod successfully recovered. Also, the preempted one was cleaned up.

@concretevitamin (Member) commented Dec 3, 2022

Thanks for the details! Questions:

  1. What about using INIT? It already refers to the cluster being possibly up or down; basically any abnormal state can be mapped to INIT.

  2. What do we do in today's main branch for an AWS unmanaged spot instance getting preempted; what would status -r before this PR show?

  3. In AWS, the preemption behavior can be explicitly set to stop or terminate. So it's plausible to refresh a preempted node to INIT if it's stopped on preemption.

(wei-lin) Sorry, I accidentally edited your reply.

@infwinston (Member, Author):
  1. What about using INIT? It already refers to the cluster being possibly up or down; basically any abnormal state can be mapped to INIT.

I think it's also okay to map it to INIT, but I don't think there's much difference from removing it, because the VM/disk is gone and there's no way to recover. Users would have to manually remove it, which can be a hassle. @Michaelvll wdyt?

  2. What do we do in today's main branch for an AWS unmanaged spot instance getting preempted; what would status -r before this PR show?

I think a preempted AWS VM will be removed after status -r since it's terminated? @Michaelvll is that true?

  3. In AWS, the preemption behavior can be explicitly set to stop or terminate. So it's plausible to refresh a preempted node to INIT if it's stopped on preemption.

I think if a preempted VM is set to stop, then status -r will already show STOPPED for it?
This also happens to a normal GCP spot VM, which will be set to stopped if preempted; our status -r reflects this as well.
However, for a TPU VM there's no such option, and preemption always means termination.

To me, a preempted TPU VM is a garbage resource occupying quota. It needs to be cleaned up asap for the user's convenience. But both options are okay to me. I'll let @Michaelvll add more.
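
To make the distinction concrete, here is a small illustrative sketch (my own, not code from this PR; ClusterStatus below merely mirrors the global_user_state.ClusterStatus values listed earlier) of how preemption behavior maps to the refreshed status:

# Illustrative only: how preemption behavior could map to a refreshed status.
from enum import Enum
from typing import Optional

class ClusterStatus(Enum):  # mirrors global_user_state.ClusterStatus
    INIT = 'INIT'
    UP = 'UP'
    STOPPED = 'STOPPED'

def status_after_preemption(is_tpu_vm: bool,
                            stops_on_preemption: bool) -> Optional[ClusterStatus]:
    """Return the refreshed status, or None if the cluster record is removed."""
    if is_tpu_vm:
        # TPU VMs are always terminated on preemption and cannot be reused;
        # the VM is cleaned up and the record is removed.
        return None
    if stops_on_preemption:
        # e.g. spot VMs configured (or defaulting) to stop on preemption.
        return ClusterStatus.STOPPED
    # Instances terminated on preemption disappear from the cloud,
    # so the record is removed as well.
    return None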

@Michaelvll (Collaborator) commented Dec 4, 2022

Please correct me if I am wrong @infwinston: the main reason we can safely delete the VM after the preemption is that the preempted TPU VM cannot be operated on in any way except termination, i.e. it is a "garbage resource" as @infwinston mentioned. In that case, it should be fine to clean it up with sky status -r.

I think it's also okay to map it to INIT, but I don't think there's much difference from removing it, because the VM/disk is gone and there's no way to recover. Users would have to manually remove it, which can be a hassle. @Michaelvll wdyt?

In my mind, a cluster in INIT can be restarted with a simple sky launch. However, a preempted TPU VM cannot be correctly restarted with ray up. In this case, we need to either terminate the cluster during sky status -r or make sky launch automatically delete the instance before restarting the cluster when it finds the cluster has been preempted.

I think a preempted AWS VM will be removed after status -r since it's terminated? @Michaelvll is that true?

Yes, for AWS, the preempted spot cluster (single-node) will be removed from our cluster table, since the cloud will set it to terminate.

In AWS, the preemption behavior can be explicitly set to stop or terminate. So it's plausible to refresh a preempted node to INIT if it's stopped on preemption.

If STOPPED on preemption, I think the cluster should be set to STOPPED instead of INIT. We may want to stay aligned with the status shown by the cloud provider.
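
As a concrete (hedged) illustration of "the preempted TPU VM cannot be operated on in any way except termination": one could query the TPU VM's state via gcloud and check for PREEMPTED before tearing it down. The state field and value are my reading of the TPU API, not something taken from this PR.

# Hypothetical check, not part of the PR: detect a preempted TPU VM via gcloud.
import json
import subprocess

def tpu_vm_is_preempted(name: str, zone: str) -> bool:
    """Return True if the TPU VM's reported state is PREEMPTED.

    A PREEMPTED TPU VM can only be deleted; it cannot be stopped or restarted,
    which is why the refresh path terminates it outright.
    """
    out = subprocess.run(
        ['gcloud', 'compute', 'tpus', 'tpu-vm', 'describe', name,
         '--zone', zone, '--format', 'json'],
        capture_output=True, text=True, check=True)
    return json.loads(out.stdout).get('state') == 'PREEMPTED'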

@concretevitamin (Member):
In this case, we need to either terminate the cluster during sky status -r or make sky launch automatically delete the instance before restarting the cluster when it finds the cluster has been preempted.

Doing it in either place seems to break "each command does one thing". Given that a preempted TPU VM cannot have any action taken on it except termination (please document this rationale in code), the former seems slightly better for now. I think it may be OK not to update the --refresh help string, since this is a corner case.

Thanks @infwinston @Michaelvll for clarifying.

@Michaelvll (Collaborator) left a review:

Thanks for the fix @infwinston! After another pass, I think our code can be improved a bit more. : )

sky/spot/controller.py — 3 review threads (resolved)
@infwinston (Member, Author):
@Michaelvll thanks for the detailed reviews. I addressed the comments; let me know if it looks good enough to ship :)

@Michaelvll (Collaborator) left a review:

Thanks for fixing it @infwinston! LGTM.

@infwinston (Member, Author):
Thanks. Smoke test passed again. Merging.

@infwinston infwinston merged commit ee73e7d into master Dec 6, 2022
@infwinston infwinston deleted the spot-tpu-bug-fix branch December 6, 2022 01:20
iojw pushed a commit that referenced this pull request Feb 18, 2023
* fix in controller

* remove debug msg

* msg

* handle job_status == None case and refactor

* space

* update

* comments

* comments