Add safe guard for provisioning/terminating TPU VM and fix spot launch TPU resource leak #1500
Conversation
sky/spot/recovery_strategy.py (Outdated)
# Clean up preempted TPU VM before launching the cluster.
# This is needed as status -r will not remove it if GCP
# turns the VM state to other than PREEMPTED.
is_tpuvm = tpu_utils.is_tpu_vm(new_resources)
if is_tpuvm:
    self.terminate_cluster()
After discussing offline, we found it might be better to have this special-case handling in the controller process instead of the recovery strategy, so that future recovery strategies will not need to handle it separately.
Ok, after some investigation I finally realize what's going on. Basically, when a TPU VM is preempted by GCP (i.e., we can no longer SSH into the server), its state may not be immediately set to PREEMPTED. To be specific, during the cloud status query (skypilot/sky/backends/backend_utils.py, line 1655 in 742c1dc) we got READY, which translated to ClusterStatus.UP (see the log below). However, we then immediately change the status again (skypilot/sky/backends/backend_utils.py, lines 1674 and 1696 in 742c1dc). That's why we see the cluster end up in INIT.
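To make the sequence concrete, here is a toy, self-contained sketch of that race (heavily simplified; this is not the actual backend_utils logic, and the exact downgraded status is an assumption based on #1514):

import enum


class ClusterStatus(enum.Enum):
    # Simplified subset, for illustration only.
    UP = 'UP'
    INIT = 'INIT'


def refresh_status_sketch(cloud_state: str,
                          ray_cluster_healthy: bool) -> ClusterStatus:
    # Right after a Spot TPU VM preemption, GCP may still report READY,
    # so the first mapping yields UP...
    status = ClusterStatus.UP if cloud_state == 'READY' else ClusterStatus.INIT
    # ...but the subsequent SSH/ray health check fails, so the cluster is
    # immediately downgraded.
    if not ray_cluster_healthy:
        status = ClusterStatus.INIT
    return status


print(refresh_status_sketch('READY', ray_cluster_healthy=False))
# -> ClusterStatus.INIT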
===================

In conclusion: to avoid adding specific logic that only applies to TPU VMs, I introduce a slightly better abstraction in this PR (skypilot/sky/spot/controller.py, lines 149 to 154 in b147c9a).
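A rough, self-contained sketch of the controller-side idea (the names need_cleanup_after_preemption and the executor's terminate_cluster are assumptions based on this thread, not necessarily the exact code in the PR):

class FakeResources:
    """Stand-in for sky.Resources; a Spot TPU VM would return True here."""

    def need_cleanup_after_preemption(self) -> bool:
        return True


class FakeStrategyExecutor:
    """Stand-in for the spot recovery strategy executor."""

    def terminate_cluster(self) -> None:
        print('Terminating the preempted (possibly leaked) cluster...')


def handle_preemption(resources: FakeResources,
                      strategy_executor: FakeStrategyExecutor) -> None:
    # Only resources that the cloud leaves behind after preemption
    # (e.g., GCP Spot TPU VM) need this explicit cleanup before relaunch.
    if resources.need_cleanup_after_preemption():
        strategy_executor.terminate_cluster()


handle_preemption(FakeResources(), FakeStrategyExecutor())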
@Michaelvll @concretevitamin what do you think?
Thanks. I read everything before "In conclusion" and it made sense to me. Question: from the log pasted, we see that ...

RE the new method that takes in a Resources: it's probably not enough to determine whether it's restartable by looking at the logical representation sky.Resources. For example, the .j2 template can change the preemption behavior to stopped.
Ah, this action takes effect only with this PR. Without this PR, the termination action won't be performed.
Yes, this is a problem. I was thinking we need to add a new field (such as ...).
Thanks for the update @infwinston!
It seems we skip the "retry in the same region first" behavior. That is because the ...
I think the problem may not be related to the preemption behavior, as the ... That said, I would prefer to rename the method ...
Great catch! I just updated the code.
Yes, I agree. However, I think the issue @concretevitamin mentioned still remains? If in our j2 template we specify a different interruption behavior such as ...
For those, we still retry in the same region first, which will automatically reuse the stopped cluster if needed, right? The current change does not make anything worse for supporting those in the future compared to the master branch, right?
Ah I see, so you mean for AWS ... I think this makes sense.

====

Another thing I'm not sure about: does Azure require manual cleanup after preemption? I'm not sure how to test it, as our Azure subscription disallows us from launching a spot VM.
Yes. Since our ...
Launched several long-running jobs on Spot TPU VMs/Pods a few days ago.
Thank you for the fix @infwinston! It looks good to me. Left some comments to make the code cleaner. It would be great if we can run the smoke tests before merging.
sky/backends/backend_utils.py (Outdated)
if len(stdout) == 0:
    logger.warning('No TPU VMs found with cluster name '
                   f'{cluster_name} in zone {zone}.')
if len(stdout.splitlines()) > 1:
    logger.warning('Found more than one TPU VM with cluster name '
                   f'{cluster_name} in zone {zone}.')
Should this be a warning? Also, we must be careful about the logs printed during the status refresh, since they will corrupt the progress-bar output of sky status -r. How about we change them to logger.debug?
Ah, I chose logger.warning because multiple TPU VMs/Pods with the same cluster name are considered an abnormal case that is not supposed to happen. When this happens, it means there's a resource leak. I think in this case we'd like to let the user know?
Isn't it a normal case for a spot VM? I think for a non-TPU cluster we don't show the warning; we handle the number of IPs not matching the expected amount in the caller function.
Also, is it true that a user can have multiple TPU VMs with the same name in the same zone?
Ah, sorry for the confusion; let me explain again. For spot VMs, it also shouldn't happen that multiple Spot TPU VMs have the same labels.ray-cluster-name.
Basically, the query command should return only one VM/Pod in the normal case:
query_cmd = (f'gcloud compute tpus tpu-vm list --filter='
             f'\\(labels.ray-cluster-name={cluster_name}\\) '
             f'--zone={zone} --format=value\\(name\\)')
But if there's a leaked resource (e.g., the controller failed to terminate a preempted Spot TPU), then this query command will return two VMs, which is an abnormal case.
Regarding "is it true that a user can have multiple TPU VMs with the same name in the same zone?": note that I was not referring to the "TPU name" shown on the console but to labels.ray-cluster-name. So yes, multiple TPU VMs can have the same labels.ray-cluster-name.
I'm fine with changing it to logger.debug, but I'm also afraid that users will never find out there's a leaked resource unless they manually check the console.
I am confused then: why does the problem not happen for a non-TPU-VM cluster? What ensures those clusters are not leaked?
For a non-TPU-VM cluster, since it doesn't require manual cleanup after preemption, the resource won't be leaked this way? But I'm not sure if there are other scenarios that could trigger leakage. Also, we mostly rely on ray up to handle non-TPU-VM clusters (probably irrelevant).
tpuvm_json = json.loads(stdout)
if tpuvm_json['state'] != 'READY':
    # May be a leaked preempted resource.
    logger.warning(f'TPU VM {tpu_id} is not in READY state. '
                   'Could be a garbage resource. Skipping...')
    continue
Will this state be different for different tpu_id?
Each TPU VM or TPU Pod maps to a single tpu_id. So yes, different tpu_ids can have different states.
But when this multiple-tpu_id situation happens, it means there is a leaked resource with the same cluster name as the current one. That's why I print the garbage-resource message. Normally there should be only one tpu_id returned by the query command below.
query_cmd = (f'gcloud compute tpus tpu-vm list --filter='
             f'\\(labels.ray-cluster-name={cluster_name}\\) '
             f'--zone={zone} --format=value\\(name\\)')
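For reference, here is an illustrative (and assumed) way each returned tpu_id could then be inspected; the exact describe command and fields used in the PR may differ:

import json
import subprocess

zone = 'us-central1-b'        # hypothetical zone
stdout = 'tpu-a\ntpu-b\n'     # hypothetical output of query_cmd above

for tpu_id in stdout.splitlines():
    describe_cmd = (f'gcloud compute tpus tpu-vm describe {tpu_id} '
                    f'--zone={zone} --format=json')
    proc = subprocess.run(describe_cmd, shell=True, capture_output=True,
                          text=True)
    if proc.returncode != 0:
        continue  # Query failed; nothing to check for this ID.
    tpuvm_json = json.loads(proc.stdout)
    if tpuvm_json['state'] != 'READY':
        # Likely a leaked, preempted resource; skip it.
        continue
    print(f'{tpu_id} is READY.')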
sky/backends/cloud_vm_ray_backend.py (Outdated)
returncode, stdout, stderr = log_lib.run_with_log(
    query_cmd,
    log_abs_path,
    shell=True,
    stream_logs=False,
    require_outputs=True)

# Needs to create a list as GCP does not allow deleting
# multiple TPU VMs at once
tpu_terminate_cmds = []
for tpu_id in stdout.splitlines():
    tpu_terminate_cmds.append(
        f'gcloud compute tpus tpu-vm delete --zone={zone} '
        f'--quiet {tpu_id}')
terminate_cmd = ' && '.join(tpu_terminate_cmds)
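Purely for illustration, this is what the joined command looks like for two hypothetical TPU IDs (the zone and IDs below are made up):

zone = 'us-central1-b'
tpu_ids = ['tpu-a', 'tpu-b']
tpu_terminate_cmds = []
for tpu_id in tpu_ids:
    tpu_terminate_cmds.append(
        f'gcloud compute tpus tpu-vm delete --zone={zone} '
        f'--quiet {tpu_id}')
terminate_cmd = ' && '.join(tpu_terminate_cmds)
print(terminate_cmd)
# Output (shown wrapped):
# gcloud compute tpus tpu-vm delete --zone=us-central1-b --quiet tpu-a &&
# gcloud compute tpus tpu-vm delete --zone=us-central1-b --quiet tpu-b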
We did not handle the first returncode here. How about:
Suggested change:

returncode, stdout, stderr = log_lib.run_with_log(
    query_cmd,
    log_abs_path,
    shell=True,
    stream_logs=False,
    require_outputs=True)
# Needs to create a list as GCP does not allow deleting
# multiple TPU VMs at once
# Skip the termination commands, if the TPU ID query commands fail.
tpu_terminate_cmds = [f'([[ "{returncode}" == "0" ]] || exit {returncode})']
for tpu_id in stdout.splitlines():
    tpu_terminate_cmds.append(
        f'gcloud compute tpus tpu-vm delete --zone={zone} '
        f'--quiet {tpu_id}')
terminate_cmd = ' && '.join(tpu_terminate_cmds)
Good point. Fixed with a minor modification.
LGTM! Thanks @infwinston!
sky/clouds/cloud.py (Outdated)
In most cases, spot resources do not need cleanup after preemption.
The only exception by far is GCP's Spot TPU VM. We override this method
in gcp.py.
Suggested change:

In most cases, spot resources do not need cleanup after preemption,
as long as the cluster can be launched with the same name and tag,
no matter whether the preemption behavior is to terminate or stop the cluster.
The only exception by far is GCP's Spot TPU VM. We override this method
in gcp.py.
Good suggestion. Fixed with a minor modification.
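For context, a rough, self-contained sketch of how such a base-class hook and its GCP override might look (the method name and the TPU-VM check here are assumptions based on this thread, not the exact implementation):

from typing import Any


def _is_tpu_vm(resources: Any) -> bool:
    # Stand-in for the tpu_utils.is_tpu_vm helper used elsewhere in this PR.
    args = getattr(resources, 'accelerator_args', None) or {}
    return bool(args.get('tpu_vm', False))


class Cloud:

    def need_cleanup_after_preemption(self, resources: Any) -> bool:
        """Whether a spot resource needs cleanup after preemption.

        In most cases, spot resources do not need cleanup after preemption;
        the only exception so far is GCP's Spot TPU VM (overridden below).
        """
        del resources  # Unused in the default implementation.
        return False


class GCP(Cloud):

    def need_cleanup_after_preemption(self, resources: Any) -> bool:
        # A preempted Spot TPU VM is left behind by GCP and must be
        # terminated explicitly before relaunching.
        return _is_tpu_vm(resources)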
sky/backends/cloud_vm_ray_backend.py (Outdated)
tpu_terminate_cmds = [f'exit {returncode}'
                     ] if returncode != 0 else []
for tpu_id in stdout.splitlines():
    tpu_terminate_cmds.append(
        f'gcloud compute tpus tpu-vm delete --zone={zone} '
        f'--quiet {tpu_id}')
terminate_cmd = ' && '.join(tpu_terminate_cmds)
nit: can we print out the information about the failed query?
Suggested change:

if returncode != 0:
    tpu_terminate_cmd = f'echo "cmd: {query_cmd}" && echo "{stdout}" && echo "{stderr}" >&2 && exit {returncode}'
else:
    tpu_terminate_cmds = [f'exit {returncode}'
                         ] if returncode != 0 else []
    for tpu_id in stdout.splitlines():
        tpu_terminate_cmds.append(
            f'gcloud compute tpus tpu-vm delete --zone={zone} '
            f'--quiet {tpu_id}')
    terminate_cmd = ' && '.join(tpu_terminate_cmds)
Yes, let me do it. You meant this, right?
if returncode != 0:
    terminate_cmd = (f'echo "cmd: {query_cmd}" && '
                     f'echo "{stdout}" && '
                     f'echo "{stderr}" >&2 && '
                     f'exit {returncode}')
else:
    tpu_terminate_cmds = []
    for tpu_id in stdout.splitlines():
        tpu_terminate_cmds.append(
            f'gcloud compute tpus tpu-vm delete --zone={zone} '
            f'--quiet {tpu_id}')
    terminate_cmd = ' && '.join(tpu_terminate_cmds)
OK, all smoke tests have passed. I just spot-launched 7 TPU VMs to see if they can handle preemptions correctly. Will merge tomorrow if everything works well. Thanks a lot for reviewing @Michaelvll @concretevitamin!
They all recovered successfully from preemptions (1 out of 7 was the special case situation). Merging this PR now!
Our user reported some errors when getting TPU IPs. This PR adds a safeguard for _get_tpu_vm_pod_ips with better error handling to prevent failures like leaked resources with duplicated cluster IDs. This PR also fixes #1514, which tried to remove an INIT TPU VM resource.

Tested: