Add safe guard for provisioning/terminating TPU VM and fix spot launch TPU resource leak #1500
Conversation
sky/spot/recovery_strategy.py (Outdated)
# Clean up preempted TPU VM before launching the cluster.
# This is needed as status -r will not remove it if GCP
# turns the VM state to other than PREEMPTED.
is_tpuvm = tpu_utils.is_tpu_vm(new_resources)
if is_tpuvm:
    self.terminate_cluster()
After discussing offline, we found it might be better to have this special-case handling in the controller process instead of the recovery strategy, so that future recovery strategies will not need to handle it separately.
Ok, after some investigation I finally realize what's going on. Basically, when a TPU VM is preempted by GCP (i.e., we can no longer SSH into the server), its state may not be immediately set to PREEMPTED. To be specific, during the cloud status query (skypilot/sky/backends/backend_utils.py, line 1655 in 742c1dc) we got READY, which translated to ClusterStatus.UP (see the log below). However, we then immediately change the status again (skypilot/sky/backends/backend_utils.py, lines 1674 and 1696 in 742c1dc). That's why we see the cluster end up in INIT.
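To make the sequence concrete, here is a toy, self-contained sketch of that race (heavily simplified; this is not the actual backend_utils logic, and the exact downgraded status is an assumption based on #1514):

import enum


class ClusterStatus(enum.Enum):
    # Simplified subset, for illustration only.
    UP = 'UP'
    INIT = 'INIT'


def refresh_status_sketch(cloud_state: str,
                          ray_cluster_healthy: bool) -> ClusterStatus:
    # Right after a Spot TPU VM preemption, GCP may still report READY,
    # so the first mapping yields UP...
    status = ClusterStatus.UP if cloud_state == 'READY' else ClusterStatus.INIT
    # ...but the subsequent SSH/ray health check fails, so the cluster is
    # immediately downgraded.
    if not ray_cluster_healthy:
        status = ClusterStatus.INIT
    return status


print(refresh_status_sketch('READY', ray_cluster_healthy=False))
# -> ClusterStatus.INIT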
===================

In conclusion: to avoid adding specific logic that only applies to TPU VMs, I introduce a slightly better abstraction in this PR (skypilot/sky/spot/controller.py, lines 149 to 154 in b147c9a).
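A rough, self-contained sketch of the controller-side idea (the names need_cleanup_after_preemption and the executor's terminate_cluster are assumptions based on this thread, not necessarily the exact code in the PR):

class FakeResources:
    """Stand-in for sky.Resources; a Spot TPU VM would return True here."""

    def need_cleanup_after_preemption(self) -> bool:
        return True


class FakeStrategyExecutor:
    """Stand-in for the spot recovery strategy executor."""

    def terminate_cluster(self) -> None:
        print('Terminating the preempted (possibly leaked) cluster...')


def handle_preemption(resources: FakeResources,
                      strategy_executor: FakeStrategyExecutor) -> None:
    # Only resources that the cloud leaves behind after preemption
    # (e.g., GCP Spot TPU VM) need this explicit cleanup before relaunch.
    if resources.need_cleanup_after_preemption():
        strategy_executor.terminate_cluster()


handle_preemption(FakeResources(), FakeStrategyExecutor())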
@Michaelvll @concretevitamin what do you think?
Thanks. I read everything before "In conclusion" and it made sense to me. Question: from the log pasted, we see that ...

RE the new method that takes in a Resources: it's probably not enough to determine whether it's restartable by looking at the logical representation sky.Resources. For example, the .j2 template can change the preemption behavior to stopped.
Ah, this action takes effect only with this PR. Without this PR, the termination action won't be performed.
Yes, this is a problem. I was thinking we need to add a new field (such as ...).
Thanks for the update @infwinston!
It seems we skip the "retry in the same region first" behavior. That is because the ...
I think the problem may not be related to the preemption behavior, as the ... That said, I would prefer to rename the method ...
Great catch! I just updated the code.
Yes, I agree. However, I think the issue @concretevitamin mentioned still remains? If in our j2 template we specify a different interruption behavior such as ...
For those, we still retry in the same region first, which will automatically reuse the stopped cluster if needed, right? The current change does not make anything worse for supporting those in the future compared to the master branch, right?
Ah I see, so you mean for AWS ... I think this makes sense.

====

Another thing I'm not sure about: does Azure require manual cleanup after preemption? I'm not sure how to test it, as our Azure subscription disallows us from launching a spot VM.
Yes. Since our ...
Launched several long-running jobs on Spot TPU VMs/Pods a few days ago.
Thank you for the fix @infwinston! It looks good to me. Left some comments to make the code cleaner. It would be great if we can run the smoke tests before merging.
sky/backends/backend_utils.py (Outdated)
if len(stdout) == 0:
    logger.warning('No TPU VMs found with cluster name '
                   f'{cluster_name} in zone {zone}.')
if len(stdout.splitlines()) > 1:
    logger.warning('Found more than one TPU VM with cluster name '
                   f'{cluster_name} in zone {zone}.')
Should this be a warning? Also, we must be careful about the logs printed during the status refresh, since they will corrupt the progress-bar output of sky status -r. How about we change them to logger.debug?
Ah, I chose logger.warning because multiple TPU VMs/Pods with the same cluster name are considered an abnormal case that is not supposed to happen. When this happens, it means there's a resource leak. I think in this case we'd like to let the user know?
Isn't it a normal case for a spot VM? I think for a non-TPU cluster we don't show the warning; we handle the number of IPs not matching the expected amount in the caller function.
Also, is it true that a user can have multiple TPU VMs with the same name in the same zone?
Ah, sorry for the confusion; let me explain again. For spot VMs, it also shouldn't happen that multiple Spot TPU VMs have the same labels.ray-cluster-name.
Basically, the query command should return only one VM/Pod in the normal case:
query_cmd = (f'gcloud compute tpus tpu-vm list --filter='
             f'\\(labels.ray-cluster-name={cluster_name}\\) '
             f'--zone={zone} --format=value\\(name\\)')
But if there's a leaked resource (e.g., the controller failed to terminate a preempted Spot TPU), then this query command will return two VMs, which is an abnormal case.
Regarding "is it true that a user can have multiple TPU VMs with the same name in the same zone?": note that I was not referring to the "TPU name" shown on the console but to labels.ray-cluster-name. So yes, multiple TPU VMs can have the same labels.ray-cluster-name.
I'm fine with changing it to logger.debug, but I'm also afraid that users will never find out there's a leaked resource unless they manually check the console.
I am confused then: why does the problem not happen for a non-TPU-VM cluster? What ensures those clusters are not leaked?
For a non-TPU-VM cluster, since it doesn't require manual cleanup after preemption, the resource won't be leaked this way? But I'm not sure if there are other scenarios that could trigger leakage. Also, we mostly rely on ray up to handle non-TPU-VM clusters (probably irrelevant).
tpuvm_json = json.loads(stdout)
if tpuvm_json['state'] != 'READY':
    # May be a leaked preempted resource.
    logger.warning(f'TPU VM {tpu_id} is not in READY state. '
                   'Could be a garbage resource. Skipping...')
    continue
Will this state be different for different tpu_id?
Each TPU VM or TPU Pod maps to a single tpu_id. So yes, different tpu_ids can have different states.
But when this multiple-tpu_id situation happens, it means there is a leaked resource with the same cluster name as the current one. That's why I print the garbage-resource message. Normally there should be only one tpu_id returned by the query command below.
query_cmd = (f'gcloud compute tpus tpu-vm list --filter='
             f'\\(labels.ray-cluster-name={cluster_name}\\) '
             f'--zone={zone} --format=value\\(name\\)')
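For reference, here is an illustrative (and assumed) way each returned tpu_id could then be inspected; the exact describe command and fields used in the PR may differ:

import json
import subprocess

zone = 'us-central1-b'        # hypothetical zone
stdout = 'tpu-a\ntpu-b\n'     # hypothetical output of query_cmd above

for tpu_id in stdout.splitlines():
    describe_cmd = (f'gcloud compute tpus tpu-vm describe {tpu_id} '
                    f'--zone={zone} --format=json')
    proc = subprocess.run(describe_cmd, shell=True, capture_output=True,
                          text=True)
    if proc.returncode != 0:
        continue  # Query failed; nothing to check for this ID.
    tpuvm_json = json.loads(proc.stdout)
    if tpuvm_json['state'] != 'READY':
        # Likely a leaked, preempted resource; skip it.
        continue
    print(f'{tpu_id} is READY.')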
sky/backends/cloud_vm_ray_backend.py (Outdated)
returncode, stdout, stderr = log_lib.run_with_log(
    query_cmd,
    log_abs_path,
    shell=True,
    stream_logs=False,
    require_outputs=True)

# Needs to create a list as GCP does not allow deleting
# multiple TPU VMs at once
tpu_terminate_cmds = []
for tpu_id in stdout.splitlines():
    tpu_terminate_cmds.append(
        f'gcloud compute tpus tpu-vm delete --zone={zone} '
        f'--quiet {tpu_id}')
terminate_cmd = ' && '.join(tpu_terminate_cmds)
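Purely for illustration, this is what the joined command looks like for two hypothetical TPU IDs (the zone and IDs below are made up):

zone = 'us-central1-b'
tpu_ids = ['tpu-a', 'tpu-b']
tpu_terminate_cmds = []
for tpu_id in tpu_ids:
    tpu_terminate_cmds.append(
        f'gcloud compute tpus tpu-vm delete --zone={zone} '
        f'--quiet {tpu_id}')
terminate_cmd = ' && '.join(tpu_terminate_cmds)
print(terminate_cmd)
# Output (shown wrapped):
# gcloud compute tpus tpu-vm delete --zone=us-central1-b --quiet tpu-a &&
# gcloud compute tpus tpu-vm delete --zone=us-central1-b --quiet tpu-b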
We did not handle the first returncode here. How about:
Suggested change:

returncode, stdout, stderr = log_lib.run_with_log(
    query_cmd,
    log_abs_path,
    shell=True,
    stream_logs=False,
    require_outputs=True)
# Needs to create a list as GCP does not allow deleting
# multiple TPU VMs at once
# Skip the termination commands, if the TPU ID query commands fail.
tpu_terminate_cmds = [f'([[ "{returncode}" == "0" ]] || exit {returncode})']
for tpu_id in stdout.splitlines():
    tpu_terminate_cmds.append(
        f'gcloud compute tpus tpu-vm delete --zone={zone} '
        f'--quiet {tpu_id}')
terminate_cmd = ' && '.join(tpu_terminate_cmds)
Good point. Fixed with a minor modification.
LGTM! Thanks @infwinston!
sky/clouds/cloud.py (Outdated)
In most cases, spot resources do not need cleanup after preemption.
The only exception by far is GCP's Spot TPU VM. We override this method
in gcp.py.
Suggested change:

In most cases, spot resources do not need cleanup after preemption,
as long as the cluster can be launched with the same name and tag,
no matter whether the preemption behavior is to terminate or stop the cluster.
The only exception by far is GCP's Spot TPU VM. We override this method
in gcp.py.
Good suggestion. Fixed with a minor modification.
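For context, a rough, self-contained sketch of how such a base-class hook and its GCP override might look (the method name and the TPU-VM check here are assumptions based on this thread, not the exact implementation):

from typing import Any


def _is_tpu_vm(resources: Any) -> bool:
    # Stand-in for the tpu_utils.is_tpu_vm helper used elsewhere in this PR.
    args = getattr(resources, 'accelerator_args', None) or {}
    return bool(args.get('tpu_vm', False))


class Cloud:

    def need_cleanup_after_preemption(self, resources: Any) -> bool:
        """Whether a spot resource needs cleanup after preemption.

        In most cases, spot resources do not need cleanup after preemption;
        the only exception so far is GCP's Spot TPU VM (overridden below).
        """
        del resources  # Unused in the default implementation.
        return False


class GCP(Cloud):

    def need_cleanup_after_preemption(self, resources: Any) -> bool:
        # A preempted Spot TPU VM is left behind by GCP and must be
        # terminated explicitly before relaunching.
        return _is_tpu_vm(resources)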
sky/backends/cloud_vm_ray_backend.py (Outdated)
tpu_terminate_cmds = [f'exit {returncode}'
                     ] if returncode != 0 else []
for tpu_id in stdout.splitlines():
    tpu_terminate_cmds.append(
        f'gcloud compute tpus tpu-vm delete --zone={zone} '
        f'--quiet {tpu_id}')
terminate_cmd = ' && '.join(tpu_terminate_cmds)
nit: can we print out the information about the failed query?
Suggested change:

if returncode != 0:
    tpu_terminate_cmd = f'echo "cmd: {query_cmd}" && echo "{stdout}" && echo "{stderr}" >&2 && exit {returncode}'
else:
    tpu_terminate_cmds = [f'exit {returncode}'
                         ] if returncode != 0 else []
    for tpu_id in stdout.splitlines():
        tpu_terminate_cmds.append(
            f'gcloud compute tpus tpu-vm delete --zone={zone} '
            f'--quiet {tpu_id}')
    terminate_cmd = ' && '.join(tpu_terminate_cmds)
Yes, let me do it. You meant this, right?
if returncode != 0:
    terminate_cmd = (f'echo "cmd: {query_cmd}" && '
                     f'echo "{stdout}" && '
                     f'echo "{stderr}" >&2 && '
                     f'exit {returncode}')
else:
    tpu_terminate_cmds = []
    for tpu_id in stdout.splitlines():
        tpu_terminate_cmds.append(
            f'gcloud compute tpus tpu-vm delete --zone={zone} '
            f'--quiet {tpu_id}')
    terminate_cmd = ' && '.join(tpu_terminate_cmds)
OK, all smoke tests have passed. I just spot-launched 7 TPU VMs to see if they can handle preemptions correctly. Will merge tomorrow if everything works well. Thanks a lot for reviewing @Michaelvll @concretevitamin!
They all recovered successfully from preemptions (1 out of 7 was the special case situation). Merging this PR now!
Our user reported some errors when getting TPU IPs. This PR adds a safeguard for _get_tpu_vm_pod_ips with better error handling to prevent failures like leaked resources with duplicated cluster IDs. This PR also fixes #1514, which tried to remove an INIT TPU VM resource.

Tested: