Add safe guard for provisioning/terminating TPU VM and fix spot launch TPU resource leak #1500
Should this be a warning? Also, we must be careful about the logs printed during the status refresh, since they will corrupt the progress bar output of `sky status -r`. How about we change them to `logger.debug`?
Ah, I chose `logger.warning` because multiple TPU VMs/Pods with the same cluster name is an abnormal case that is not supposed to happen. When this happens, it means there is a resource leak, and I think in that case we'd like to let the user know?
Isn't it a normal case for spot VMs? I think for a non-TPU cluster we don't show the warning; I think we handle the case where the number of IPs does not match the expected amount in the caller function.
Also, is it true that a user can have multiple TPU VMs with the same name in the same zone?
Ah, sorry for the confusion, let me explain again. For spot VMs, it also shouldn't happen that multiple spot TPU VMs have the same `labels.ray-cluster-name`. Basically, the query command should only return one VM/Pod in the normal case. But if there is a leaked resource (e.g., the controller failed to terminate a preempted spot TPU), then this query command will return two VMs, which is the abnormal case.
Note that I was not referring to the "TPU name" shown on the console but to `labels.ray-cluster-name`, so yes, multiple TPU VMs can have the same `labels.ray-cluster-name`.
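As a rough illustration of the lookup being described, here is a sketch of querying TPU VMs/Pods by `labels.ray-cluster-name`. The exact `gcloud` invocation, filter, and helper name are assumptions for illustration and may differ from the command actually used in this PR:

```python
import json
import subprocess

def query_tpu_vms_by_cluster_name(cluster_name: str, zone: str) -> list:
    """List TPU VMs/Pods whose labels.ray-cluster-name matches cluster_name.

    Illustrative sketch only; the real query command may differ.
    """
    cmd = [
        'gcloud', 'compute', 'tpus', 'tpu-vm', 'list',
        f'--zone={zone}',
        f'--filter=labels.ray-cluster-name={cluster_name}',
        '--format=json',
    ]
    proc = subprocess.run(cmd, check=True, capture_output=True, text=True)
    tpus = json.loads(proc.stdout)
    # Normal case: exactly one entry. More than one entry means a leaked
    # TPU VM/Pod shares the cluster name (e.g., a preempted spot TPU the
    # controller failed to clean up).
    return tpus
```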
I'm fine with changing it to `logger.debug`, but I'm also afraid that the user will never find out there is a leaked resource unless they manually check the console.
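For context, a minimal sketch of the trade-off being discussed, not SkyPilot's actual code (the logger name and helper are hypothetical): a `warning` surfaces immediately but can interleave with the live progress output of `sky status -r`, while `debug` keeps the refresh output clean at the cost of the leak going unnoticed unless verbosity is raised:

```python
import logging

# Hypothetical logger name; SkyPilot's real logging setup may differ.
logger = logging.getLogger('sky.provision.tpu')

def report_duplicate_tpu_vms(cluster_name: str, tpu_ids: list) -> None:
    """Report that more than one TPU VM/Pod matched a single cluster name."""
    if len(tpu_ids) <= 1:
        return
    # logger.warning(...) would be visible to the user right away, but it
    # can corrupt the progress bar rendered during `sky status -r`.
    # logger.debug(...) stays out of the refresh output, but the leak goes
    # unnoticed unless the user raises verbosity or checks the console.
    logger.debug('Found %d TPU VMs/Pods with cluster name %r: %s',
                 len(tpu_ids), cluster_name, tpu_ids)
```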
I am confused, then: why does this problem not happen for a non-TPU VM cluster? What ensures those clusters are not leaked?
For a non-TPU VM cluster, since it doesn't require manual cleanup after preemption, resources won't be leaked this way? But I'm not sure if there are other scenarios that could trigger leakage.
Also, we mostly rely on `ray up` to handle non-TPU VM clusters (probably irrelevant).
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this state be different for different `tpu_id`s?
Each TPU VM or TPU Pod maps to only a single `tpu_id`, so yes, different `tpu_id`s can have different states.
But when this multiple-`tpu_id` situation happens, it means there is a leaked resource with the same cluster name as the current one. That's why I print the garbage-resource message.
Normally there should be only one `tpu_id` returned by the query command below.
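A minimal sketch of the safeguard described here, building on the hypothetical query helper sketched earlier (names, fields, and log level are illustrative assumptions, not the PR's actual implementation):

```python
import logging

logger = logging.getLogger('sky.provision.tpu')  # hypothetical logger name

def check_tpu_vm_states(cluster_name: str, zone: str) -> dict:
    """Collect per-tpu_id states and flag leaked TPU VMs/Pods (sketch)."""
    # Reuses the illustrative query helper from the earlier sketch.
    tpus = query_tpu_vms_by_cluster_name(cluster_name, zone)
    # Each TPU VM/Pod maps to a single tpu_id and carries its own state,
    # so states can differ across the returned entries.
    states = {tpu['name']: tpu.get('state') for tpu in tpus}
    if len(states) > 1:
        # More than one match means a leaked resource (e.g., a preempted
        # spot TPU the controller failed to terminate) shares this
        # cluster name.
        logger.debug('Multiple TPU VMs/Pods share cluster name %r: %s',
                     cluster_name, states)
    return states
```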