[TPU/Spot] TPU pods fail to be launched after preempted #1468

Closed
Michaelvll opened this issue Nov 29, 2022 · 3 comments · Fixed by #1470
Labels: bug, P0

Comments

@Michaelvll
Collaborator

A user reported that when a TPU pod is preempted and the spot controller tries to launch it again, the launch fails.

Just for more info that might help, I can confirm it seems to work fine if I manually delete the TPU instances. The spot controller then has no trouble detecting the preemption and creating + running a new instance. The only case where it does not work is when the TPU instance is preempted (goes into a red state on the TPU dashboard). Perhaps the old preempted instance is not being deleted properly?

@infwinston
Member

infwinston commented Dec 8, 2022

According to our user, the bug still exists.
Reason: during preemption, we expect GCP to change the VM state from READY to PREEMPTED, as described in the documentation.
However, this does not always happen. GCP sometimes changes the state to something other than PREEMPTED, which makes SkyPilot treat the cluster as being in the INIT state and fail to clean up the resource.

...
12-08 04:43:16 controller.py:118] Cluster is preempted (status: INIT). Recovering...
12-08 04:43:16 spot_state.py:134] === Recovering... ===
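
To make the failure mode concrete, here is a minimal sketch of the status mapping described above. It is not SkyPilot's actual code: `ClusterStatus`, `classify_tpu_state`, and the placeholder state `"SOME_OTHER_STATE"` are all hypothetical names used only for illustration.

```python
from enum import Enum
from typing import Optional


class ClusterStatus(Enum):
    """Hypothetical cluster statuses, for illustration only."""
    UP = "UP"
    INIT = "INIT"


def classify_tpu_state(tpu_state: str) -> Optional[ClusterStatus]:
    """Map a reported TPU state to a cluster status (assumed logic).

    Returns None when the cluster is considered gone and cleaned up.
    """
    if tpu_state == "READY":
        return ClusterStatus.UP
    if tpu_state == "PREEMPTED":
        # Expected preemption path: the resource is cleaned up.
        return None
    # Any other state falls through to INIT, and nothing is cleaned up.
    return ClusterStatus.INIT


# A preempted TPU that reports an unexpected state ends up as INIT.
status = classify_tpu_state("SOME_OTHER_STATE")
```

Under this assumption, the stale TPU is still present when the controller starts recovering, which matches the `status: INIT` in the log above.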

infwinston reopened this Dec 8, 2022
@infwinston
Member

#1500 proposes a fix: explicitly terminate the cluster before launching another one, since `status -r` may not work. See
https://github.com/skypilot-org/skypilot/pull/1500/files#diff-6749e0638b4e0e0bf9e5b2e0be361b6394a82458a729e81c2ff2ca6dcd6a653aR315
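
The idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in #1500: `recover`, `terminate`, and `launch` are hypothetical names, and the two callables stand in for whatever the controller really uses to tear down and start a cluster.

```python
from typing import Callable, List, Tuple


def recover(cluster_name: str,
            terminate: Callable[[str], None],
            launch: Callable[[str], None]) -> None:
    """Relaunch a cluster, always tearing down leftovers first (sketch)."""
    # Do not rely on the refreshed status to decide whether cleanup is
    # needed: terminate unconditionally so a stale TPU cannot block launch.
    terminate(cluster_name)
    launch(cluster_name)


# Usage with stand-in callables that only record the order of operations.
calls: List[Tuple[str, str]] = []
recover("my-tpu-pod",
        terminate=lambda name: calls.append(("terminate", name)),
        launch=lambda name: calls.append(("launch", name)))
```

The point of terminating unconditionally is that the recovery path no longer depends on GCP reporting exactly PREEMPTED.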

@infwinston
Member

Fixed by #1500.
