Clean up preempted resources for TPU #1483

Merged: 9 commits, merged on Dec 6, 2022
Changes from 3 commits
5 changes: 1 addition & 4 deletions sky/backends/backend_utils.py
@@ -1226,11 +1226,8 @@ def _get_tpu_vm_pod_ips(ray_config: Dict[str, Any],

     cluster_name = ray_config['cluster_name']
     zone = ray_config['provider']['availability_zone']
-    # Excluding preempted VMs is safe as they are already terminated and
-    # do not charge.
     query_cmd = (f'gcloud compute tpus tpu-vm list --filter='
-                 f'"(labels.ray-cluster-name={cluster_name} AND '
-                 f'state!=PREEMPTED)" '
+                 f'\\(labels.ray-cluster-name={cluster_name}\\) '
                  f'--zone={zone} --format=value\\(name\\)')
     if not get_internal_ips:
         tpuvm_cmd = (f'gcloud compute tpus tpu-vm describe $({query_cmd})'
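For illustration, a minimal sketch of what the updated query renders to for a hypothetical cluster; the cluster name and zone below are made up, not taken from the PR. The parentheses are escaped for the shell, and the filter no longer excludes PREEMPTED instances, so preempted TPU VMs are now returned by the query as well.

# Hypothetical example values, for illustration only.
cluster_name = 'sky-tpu-example'
zone = 'us-central1-b'
query_cmd = (f'gcloud compute tpus tpu-vm list --filter='
             f'\\(labels.ray-cluster-name={cluster_name}\\) '
             f'--zone={zone} --format=value\\(name\\)')
print(query_cmd)
# gcloud compute tpus tpu-vm list --filter=\(labels.ray-cluster-name=sky-tpu-example\) --zone=us-central1-b --format=value\(name\)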
5 changes: 1 addition & 4 deletions sky/backends/cloud_vm_ray_backend.py
@@ -2680,12 +2680,9 @@ def teardown_no_lock(self,
             # check if gcloud includes TPU VM API
             backend_utils.check_gcp_cli_include_tpu_vm()
 
-            # Excluding preempted VMs is safe as they are already
-            # terminated and do not charge.
             query_cmd = (
                 f'gcloud compute tpus tpu-vm list --filter='
-                f'"(labels.ray-cluster-name={cluster_name} AND '
-                f'state!=PREEMPTED)" '
+                f'\\(labels.ray-cluster-name={cluster_name}\\) '
                 f'--zone={zone} --format=value\\(name\\)')
             terminate_cmd = (
                 f'gcloud compute tpus tpu-vm delete --zone={zone}'
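Continuing the hypothetical cluster_name, zone, and query_cmd from the sketch above, one plausible way the teardown path feeds the query result into the delete call is via shell command substitution. The exact continuation of terminate_cmd is truncated in the diff above, so the trailing flags here are an assumption, not the PR's code.

# Assumed composition, for illustration only; the real continuation of
# terminate_cmd is not shown in the truncated diff above.
terminate_cmd = (f'gcloud compute tpus tpu-vm delete --zone={zone}'
                 f' --quiet $({query_cmd})')
# Since query_cmd no longer filters out state=PREEMPTED, preempted TPU VMs are
# listed by the inner command and therefore deleted here as well.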
7 changes: 7 additions & 0 deletions sky/spot/recovery_strategy.py
@@ -12,6 +12,7 @@
 from sky.spot import spot_utils
 from sky.usage import usage_lib
 from sky.utils import common_utils
+from sky.utils import tpu_utils
 from sky.utils import ux_utils
 
 if typing.TYPE_CHECKING:
@@ -305,6 +306,12 @@ def recover(self) -> float:
             new_resources = resources.copy(cloud=launched_cloud,
                                            region=launched_region)
             task.set_resources({new_resources})
+
+            # Note: Preempted TPU VM cannot be reused and needs to be
+            # cleaned up. Otherwise, it will occupy the quota.
+            is_tpuvm = tpu_utils.is_tpu_vm(new_resources)
+            if is_tpuvm:
+                self.terminate_cluster()
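For context, a rough, self-contained sketch of the kind of check a helper like tpu_utils.is_tpu_vm presumably performs; this is an assumption for illustration, not the repository's implementation.

from typing import Any, Optional

def is_tpu_vm_sketch(resources: Optional[Any]) -> bool:
    # Assumption: a TPU VM is requested through the resources' accelerator
    # args (e.g. accelerator_args={'tpu_vm': True}); anything else is not a
    # TPU VM and does not need the extra cleanup on preemption.
    if resources is None or getattr(resources, 'accelerator_args', None) is None:
        return False
    return bool(resources.accelerator_args.get('tpu_vm', False))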
@Michaelvll (Collaborator) commented on Dec 1, 2022:
This will only be effective for a managed spot VM. What will happen if sky status -r is called when there is a preempted spot TPU VM in the status table? What status do we show for that cluster?

I am leaning towards having the termination in refresh_cluster_status in backend_utils.py, so that a spot TPU VM launched with --use-spot can be handled correctly as well. @infwinston

The PR author (Member) replied:
Looks like we clean up preempted spot TPU VMs during sky status -r:

# GCP does not clean up preempted TPU VMs. We remove it ourselves.
# TODO(wei-lin): handle multi-node cases.
if use_tpu_vm and len(status_list) == 0:
    backend = backends.CloudVmRayBackend()
    handle = global_user_state.get_handle_from_cluster_name(cluster)
    backend.teardown_no_lock(handle,

do you mean we should just call refresh_cluster_status here instead of self.terminate_cluster()?

Collaborator replied:
Discussed offline. We decided to move the termination of the cluster into refresh_cluster_status.

             # Not using self.launch to avoid the retry until up logic.
             launched_time = self._launch(raise_on_failure=False)
             # Restore the original dag, i.e. reset the region constraint.
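Per the resolution above, the cleanup moves from recover() into the status-refresh path. A rough sketch of that direction follows, reusing the names from the snippet quoted in the thread (cluster, use_tpu_vm, status_list); the terminate=True argument completes the truncated call shown earlier and is an assumption, not the final code merged in this PR.

# Rough sketch only; not the exact implementation merged in this PR.
if use_tpu_vm and len(status_list) == 0:
    # GCP does not clean up preempted TPU VMs; tear the cluster down ourselves
    # so the preempted instance stops occupying TPU quota.
    backend = backends.CloudVmRayBackend()
    handle = global_user_state.get_handle_from_cluster_name(cluster)
    backend.teardown_no_lock(handle, terminate=True)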