You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Our user reported that when the TPU pod is preempted, and the spot controller tries to launch it again, it fails.
Just for more info that might help, I can confirm it seems to work fine if I manually delete the TPU instances. The spot controller then has no trouble detecting the preemption and creating + running a new instance. The only case where it does not work is when the TPU instance is preempted (goes into a red state on the TPU dashboard). Perhaps the old preempted instance is not being deleted properly?
The text was updated successfully, but these errors were encountered:
According to our user, the bug still exists.
Reason: During preemption, we expect GCP to turn the VM state from READY to PREEMPTED as shown in the document .
However, this may be false sometimes. GCP seems to turn the state to something other than PREEMPTED and make skypilot recognize the cluster as INIT state and fail to clean up the resource.
Our user reported that when the TPU pod is preempted, and the spot controller tries to launch it again, it fails.
The text was updated successfully, but these errors were encountered: