Add user identity to cluster status to avoid leakage when switching account #1513

Merged
87 commits merged on Dec 20, 2022
Changes shown from 12 commits

Commits (87)
028340b
add user identity check
Michaelvll Dec 12, 2022
f3a5488
fix
Michaelvll Dec 12, 2022
b164b73
fix
Michaelvll Dec 12, 2022
a393f54
continue when error happens in status refresh
Michaelvll Dec 12, 2022
aff8e61
format
Michaelvll Dec 12, 2022
cfd8ff3
fix
Michaelvll Dec 12, 2022
e6c7b0c
fix
Michaelvll Dec 12, 2022
4235517
fix table output
Michaelvll Dec 12, 2022
3f75d73
check the identity earlier
Michaelvll Dec 12, 2022
f006c81
fix
Michaelvll Dec 12, 2022
5446452
fix
Michaelvll Dec 12, 2022
b719495
suppress exception
Michaelvll Dec 12, 2022
aa59b5a
update message
Michaelvll Dec 12, 2022
f07629a
show old status in the table
Michaelvll Dec 12, 2022
978ebd4
fix message
Michaelvll Dec 12, 2022
c2a42ce
fix refresh
Michaelvll Dec 12, 2022
cf6df3c
Fix test smoke
Michaelvll Dec 12, 2022
20a5f4e
Avoid thread-safety issue for creating aws client
Michaelvll Dec 12, 2022
a49254c
rename
Michaelvll Dec 12, 2022
b17f5b3
Handle unknown exceptions
Michaelvll Dec 13, 2022
4586039
Reuse the aws utils
Michaelvll Dec 13, 2022
ce86160
update message
Michaelvll Dec 13, 2022
7acf924
Error handling
Michaelvll Dec 13, 2022
739d216
skip identity check for spot controller
Michaelvll Dec 13, 2022
9f8517f
fix
Michaelvll Dec 13, 2022
4df9d9e
Use direct client cache instead
Michaelvll Dec 13, 2022
8f2d543
Make client creation thread-safe
Michaelvll Dec 13, 2022
e734563
yapf
Michaelvll Dec 13, 2022
6ad60e4
Resource thread-safe
Michaelvll Dec 13, 2022
c4d679e
remove retry loop
Michaelvll Dec 13, 2022
f7980fd
Do not fail if the identity fetching errors out
Michaelvll Dec 13, 2022
97f972a
add back exception
Michaelvll Dec 13, 2022
2fb23e6
fix exceptions for GCP
Michaelvll Dec 13, 2022
41c4325
fix
Michaelvll Dec 13, 2022
55468c7
fix backward compat check
Michaelvll Dec 13, 2022
9e46673
Address comments
Michaelvll Dec 14, 2022
3368562
rename function
Michaelvll Dec 14, 2022
b9f23a1
Update sky/utils/env_options.py
Michaelvll Dec 14, 2022
548d32e
fix
Michaelvll Dec 14, 2022
3e8c046
Merge branch 'add-user-identity-to-cluster' of github.com:concretevit…
Michaelvll Dec 14, 2022
8cbdfcb
Fix the `sky check`
Michaelvll Dec 15, 2022
2e58894
fix
Michaelvll Dec 15, 2022
c69082d
Fix check public cloud
Michaelvll Dec 15, 2022
714d5a8
Make check identity more fine-grained
Michaelvll Dec 15, 2022
10acb75
Merge branch 'master' of github.com:concretevitamin/sky-experiments i…
Michaelvll Dec 15, 2022
b461c93
locking issue
Michaelvll Dec 15, 2022
09af9c7
backward compat test fix
Michaelvll Dec 15, 2022
10e8651
Merge branch 'master' of github.com:concretevitamin/sky-experiments i…
Michaelvll Dec 16, 2022
e158e96
address comments
Michaelvll Dec 16, 2022
eec7ee7
gcp project_id
Michaelvll Dec 16, 2022
405cded
rename a func
Michaelvll Dec 16, 2022
a664be7
skip identity check for dryruns
Michaelvll Dec 16, 2022
dc2a425
fix
Michaelvll Dec 16, 2022
d3f8731
Keep old owner value
Michaelvll Dec 17, 2022
287788a
Merge branch 'add-user-identity-to-cluster' of github.com:concretevit…
Michaelvll Dec 17, 2022
0cb6a45
fix
Michaelvll Dec 17, 2022
1276bbf
Add comments
Michaelvll Dec 17, 2022
47c529c
fix comment
Michaelvll Dec 17, 2022
46ab156
Merge branch 'master' of github.com:concretevitamin/sky-experiments i…
Michaelvll Dec 17, 2022
e6107d2
Fix comments
Michaelvll Dec 17, 2022
cb40ca6
remove the "|| true"
Michaelvll Dec 17, 2022
fd2e552
address comments
Michaelvll Dec 18, 2022
09f9edc
Warn the identity mismatch
Michaelvll Dec 18, 2022
82f1fff
Not fail for identity when ssh can be used
Michaelvll Dec 18, 2022
43a2c92
suppress stack trace
Michaelvll Dec 18, 2022
e8b376e
fix message
Michaelvll Dec 18, 2022
42abdd9
Merge branch 'master' of github.com:concretevitamin/sky-experiments i…
Michaelvll Dec 18, 2022
77a8783
Fix UX
Michaelvll Dec 18, 2022
5f06131
Fix spot
Michaelvll Dec 18, 2022
3dacf29
fix exec
Michaelvll Dec 18, 2022
45aeb42
format
Michaelvll Dec 18, 2022
d0f1888
fix ux
Michaelvll Dec 18, 2022
718802a
Fix UX
Michaelvll Dec 18, 2022
82c8e35
Address comments
Michaelvll Dec 18, 2022
53c755d
fix
Michaelvll Dec 19, 2022
c589ab6
Fix aws v2
Michaelvll Dec 19, 2022
577a38f
Change logs
Michaelvll Dec 19, 2022
c9899e4
Disallow all operations with different user identity
Michaelvll Dec 19, 2022
e451078
Better logging
Michaelvll Dec 19, 2022
ddb401d
fix comment
Michaelvll Dec 19, 2022
3645923
better logging
Michaelvll Dec 19, 2022
a1b9c7f
Update sky/backends/backend_utils.py
Michaelvll Dec 19, 2022
9b89b0a
address comments
Michaelvll Dec 19, 2022
aa61492
Merge branch 'add-user-identity-to-cluster' of github.com:concretevit…
Michaelvll Dec 19, 2022
5f54855
lint
Michaelvll Dec 19, 2022
7ed02d3
Merge branch 'master' of github.com:concretevitamin/sky-experiments i…
Michaelvll Dec 20, 2022
7874618
solve error
Michaelvll Dec 20, 2022
119 changes: 104 additions & 15 deletions sky/backends/backend_utils.py
@@ -14,7 +14,7 @@
import threading
import time
import typing
from typing import Any, Dict, List, Optional, Set, Tuple
from typing import Any, Dict, List, Optional, Set, Tuple, Type
import uuid

import colorama
@@ -1644,7 +1644,9 @@ def check_owner_identity(cluster_name: str) -> None:

Raises:
exceptions.ClusterOwnerIdentityMismatchError: if the current user is not the
same as the user who created the cluster.
same as the user who created the cluster.
sky.exceptions.CloudUserIdentityError: if failed to get the activated user
identity.
"""
if env_options.Options.SKIP_CLOUD_IDENTITY_CHECK.get():
return
@@ -1664,18 +1666,19 @@ def check_owner_identity(cluster_name: str) -> None:
# The user identity can be None, if the cluster is created by an older
# version of SkyPilot. In that case, we set the user identity to the
# current one.
# NOTE: a user upgrades SkyPilot and switches to a new cloud identity
# immediately without `sky status --refresh` first, can cause a leakage
# of the existing cluster.
# NOTE: a user who upgrades SkyPilot and switches to a new cloud identity
# immediately without `sky status --refresh` first, will cause a leakage
# of the existing cluster. We deem this an acceptable tradeoff mainly
# because multi-identity is not common (at least at the moment).
if owner_identity is None:
global_user_state.set_owner_identity_for_cluster(
cluster_name, current_user_identity)
elif owner_identity != current_user_identity:
with ux_utils.print_exception_no_traceback():
raise exceptions.ClusterOwnerIdentityMismatchError(
f'The cluster {cluster_name!r} (on {cloud}) is created by the account '
f'{owner_identity!r}, but the activated one is {current_user_identity!r}.'
)
f'Cluster {cluster_name!r} ({cloud}) is owned by account '
f'{owner_identity!r}, but the currently activated account '
f'is {current_user_identity!r}.')
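For context, a minimal sketch (not part of this diff) of the kind of identity comparison performed above, assuming the activated AWS identity is fetched via boto3's STS get_caller_identity; the stored identity format here is illustrative and may differ from SkyPilot's actual representation:

import boto3

def get_current_aws_identity():
    # get_caller_identity() reports the user and account of the currently
    # activated AWS credentials.
    sts = boto3.client('sts')
    identity = sts.get_caller_identity()
    return [identity['UserId'], identity['Account']]

def identities_match(owner_identity, current_identity):
    # A missing owner identity means the cluster was created by an older
    # SkyPilot version; treat it as a match and let the caller backfill it.
    if owner_identity is None:
        return True
    return owner_identity == current_identity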


def _get_cluster_status_via_cloud_cli(
@@ -1800,9 +1803,11 @@ def _update_cluster_identity_and_status(

Raises:
sky.exceptions.ClusterOwnerIdentityMismatchError: the cluster's owner
identity mismatches the current cloud user.
identity mismatches the current cloud user.
sky.exceptions.CloudUserIdentityError: if failed to get the activated user
identity.
sky.exceptions.ClusterStatusFetchingError: the cluster status cannot be
fetched from the cloud provider.
fetched from the cloud provider.
"""
check_owner_identity(cluster_name)

@@ -1829,8 +1834,19 @@ def refresh_cluster_status_handle(
*,
force_refresh: bool = False,
acquire_per_cluster_status_lock: bool = True,
suppress_error: bool = False
) -> Tuple[Optional[global_user_state.ClusterStatus],
Optional[backends.Backend.ResourceHandle]]:
"""Refresh the cluster status and return the status and handle.

Raises:
sky.exceptions.ClusterOwnerIdentityMismatchError: the cluster's owner
identity mismatches the current cloud user.
sky.exceptions.CloudUserIdentityError: if failed to get the activated user
identity.
sky.exceptions.ClusterStatusFetchingError: the cluster status cannot be
fetched from the cloud provider.
"""
record = global_user_state.get_cluster_from_name(cluster_name)
if record is None:
return None, None
@@ -1839,14 +1855,87 @@
if isinstance(handle, backends.CloudVmRayBackend.ResourceHandle):
use_spot = handle.launched_resources.use_spot
if (force_refresh or record['autostop'] >= 0 or use_spot):
record = _update_cluster_identity_and_status(
cluster_name,
acquire_per_cluster_status_lock=acquire_per_cluster_status_lock)
if record is None:
return None, None
try:
record = _update_cluster_identity_and_status(
cluster_name,
acquire_per_cluster_status_lock=
acquire_per_cluster_status_lock)
if record is None:
return None, None
except (exceptions.ClusterOwnerIdentityMismatchError,
exceptions.CloudUserIdentityError,
exceptions.ClusterStatusFetchingError) as e:
if suppress_error:
logger.debug(
f'Failed to refresh cluster {cluster_name!r} due to {e}'
)
return None, None
raise
return record['status'], record['handle']
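A hypothetical caller that only wants a best-effort refresh might use the new suppress_error flag like this (cluster name illustrative):

# With suppress_error=True, the identity/status-fetching exceptions above are
# logged at debug level and (None, None) is returned instead of being raised.
status, handle = refresh_cluster_status_handle(
    'my-cluster',            # illustrative cluster name
    force_refresh=True,
    suppress_error=True)
if handle is None:
    print('Cluster does not exist or its status could not be refreshed.')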


def check_cluster_available(
cluster_name: str,
operation: str,
*,
expected_backend: Optional[Type[backends.Backend]] = None
) -> backends.Backend.ResourceHandle:
"""Check if the cluster is available.

Raises:
exceptions.ClusterNotUpError: if the cluster is not UP.
exceptions.NotSupportedError: if the cluster is not based on
CloudVmRayBackend
"""
try:
cluster_status, handle = refresh_cluster_status_handle(cluster_name)
except (exceptions.ClusterOwnerIdentityMismatchError,
exceptions.CloudUserIdentityError,
exceptions.ClusterStatusFetchingError) as e:
# Failing to refresh the cluster status is not a fatal error: the caller
# can still proceed over SSH alone, but SSH may hang if the cluster is
# not up (e.g., autostopped).
ux_utils.console_newline()
logger.warning(
f'Failed to refresh the cluster status; this is not fatal, but '
f'{operation} cluster {cluster_name!r} might hang if the cluster '
'is not up.\n'
f'Detailed reason: {e}')
record = global_user_state.get_cluster_from_name(cluster_name)
cluster_status, handle = record['status'], record['handle']

if handle is None:
with ux_utils.print_exception_no_traceback():
raise exceptions.ClusterNotUpError(
f'{colorama.Fore.YELLOW}Cluster {cluster_name!r} does not '
f'exist... skipped.{colorama.Style.RESET_ALL}')
backend = get_backend_from_handle(handle)
if expected_backend is not None and not isinstance(backend,
expected_backend):
with ux_utils.print_exception_no_traceback():
raise exceptions.NotSupportedError(
f'{colorama.Fore.YELLOW}{operation} cluster '
f'{cluster_name!r} (status: {cluster_status.value})... skipped'
f'{colorama.Style.RESET_ALL}')
if cluster_status != global_user_state.ClusterStatus.UP:
if onprem_utils.check_if_local_cloud(cluster_name):
raise exceptions.ClusterNotUpError(
constants.UNINITIALIZED_ONPREM_CLUSTER_MESSAGE.format(
cluster_name))
with ux_utils.print_exception_no_traceback():
raise exceptions.ClusterNotUpError(
f'{colorama.Fore.YELLOW}{operation} cluster {cluster_name!r} '
f'(status: {cluster_status.value})... skipped.'
f'{colorama.Style.RESET_ALL}')

if handle.head_ip is None:
with ux_utils.print_exception_no_traceback():
raise exceptions.ClusterNotUpError(
f'Cluster {cluster_name!r} has been stopped or not properly '
'set up. Please re-launch it with `sky start`.')
return handle


class CloudFilter(enum.Enum):
# Filter for all types of clouds.
ALL = 'all'
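Typical usage of the new helper, mirroring the sky/core.py call sites updated below (the operation string is illustrative):

# Raises ClusterNotUpError / NotSupportedError with a user-friendly message
# if the cluster is missing, not UP, or backed by an unexpected backend.
handle = backend_utils.check_cluster_available(
    cluster_name,
    'tailing logs',
    expected_backend=backends.CloudVmRayBackend)
backend = backend_utils.get_backend_from_handle(handle)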
2 changes: 1 addition & 1 deletion sky/cli.py
@@ -3052,7 +3052,7 @@ def spot_cancel(name: Optional[str], job_ids: Tuple[int], all: bool, yes: bool):

"""

_, handle = _is_spot_controller_up(
_, handle = spot_lib.is_spot_controller_up(
'All managed spot jobs should have finished.')
if handle is None:
return
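Per this diff, the helper used above appears to have moved out of sky/core.py (where it was the private _is_spot_controller_up) into the spot package; a caller can now use it roughly as follows (import alias and surrounding function are illustrative):

from sky import spot as spot_lib

def cancel_all_spot_jobs():
    # The stopped-controller message is printed by is_spot_controller_up.
    _, handle = spot_lib.is_spot_controller_up(
        'All managed spot jobs should have finished.')
    if handle is None:
        return
    # ... proceed to cancel jobs on the controller ...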
125 changes: 43 additions & 82 deletions sky/core.py
@@ -13,7 +13,6 @@
from sky import sky_logging
from sky import spot
from sky.backends import backend_utils
from sky.backends import onprem_utils
from sky.skylet import constants
from sky.skylet import job_lib
from sky.usage import usage_lib
@@ -356,10 +355,23 @@ def autostop(
raise exceptions.NotSupportedError(
f'{operation} sky reserved cluster {cluster_name!r} '
f'is not supported.')
(cluster_status,
handle) = backend_utils.refresh_cluster_status_handle(cluster_name)
if handle is None:
raise ValueError(f'Cluster {cluster_name!r} does not exist.')
try:
handle = backend_utils.check_cluster_available(
cluster_name,
operation,
expected_backend=backends.CloudVmRayBackend)
except exceptions.ClusterNotUpError as e:
with ux_utils.print_exception_no_traceback():
e.message += (
f'\n auto{option_str} can only be set/unset for '
f'{global_user_state.ClusterStatus.UP.value} clusters.')
raise e from None
except exceptions.NotSupportedError as e:
with ux_utils.print_exception_no_traceback():
e.message += (f'\n auto{option_str} is only supported by backend: '
f'{backends.CloudVmRayBackend.NAME}')
raise e from None
Member (review comment): Does it make sense to establish the convention that programmatic APIs should throw full stacktraces if possible (to ease debugging)? I think it's ok to defer to the future.

Collaborator Author (reply): That is a good point! In that case, we may want to change the behavior of ux_utils.print_exception_no_traceback based on the entrypoint (CLI or programmatic API).
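One possible shape for that follow-up, sketched with a hypothetical environment variable to distinguish the entrypoint (none of this is in the PR):

import contextlib
import os
import sys

@contextlib.contextmanager
def print_exception_no_traceback():
    # Hypothetical behavior: only hide tracebacks for CLI invocations, so
    # programmatic API callers keep the full stack trace for debugging.
    if os.environ.get('SKYPILOT_ENTRYPOINT') != 'cli':   # assumed env var
        yield
        return
    original_limit = getattr(sys, 'tracebacklimit', None)
    sys.tracebacklimit = 0
    try:
        yield
    finally:
        if original_limit is None:
            del sys.tracebacklimit
        else:
            sys.tracebacklimit = original_limit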


if tpu_utils.is_tpu_vm_pod(handle.launched_resources):
# Reference:
# https://cloud.google.com/tpu/docs/managing-tpus-tpu-vm#stopping_a_with_gcloud # pylint: disable=line-too-long
@@ -368,21 +380,6 @@
'is not supported.')

backend = backend_utils.get_backend_from_handle(handle)
if not isinstance(backend, backends.CloudVmRayBackend):
with ux_utils.print_exception_no_traceback():
raise exceptions.NotSupportedError(
f'{colorama.Fore.YELLOW}{operation} cluster '
f'{cluster_name!r}... skipped{colorama.Style.RESET_ALL}'
f'\n auto{option_str} is only supported by backend: '
f'{backends.CloudVmRayBackend.NAME}')
if cluster_status != global_user_state.ClusterStatus.UP:
with ux_utils.print_exception_no_traceback():
raise exceptions.ClusterNotUpError(
f'{colorama.Fore.YELLOW}{operation} cluster '
f'{cluster_name!r} (status: {cluster_status.value})... skipped'
f'{colorama.Style.RESET_ALL}'
f'\n auto{option_str} can only be set/unset for '
f'{global_user_state.ClusterStatus.UP.value} clusters.')
usage_lib.record_cluster_name_for_current_operation(cluster_name)
backend.set_autostop(handle, idle_minutes, down)

@@ -392,39 +389,6 @@
# ==================


def _check_cluster_available(cluster_name: str,
operation: str) -> backends.Backend.ResourceHandle:
"""Check if the cluster is available."""
cluster_status, handle = backend_utils.refresh_cluster_status_handle(
cluster_name)
if handle is None:
with ux_utils.print_exception_no_traceback():
raise exceptions.ClusterNotUpError(
f'{colorama.Fore.YELLOW}Cluster {cluster_name!r} does not '
f'exist; skipped.{colorama.Style.RESET_ALL}')
backend = backend_utils.get_backend_from_handle(handle)
if isinstance(backend, backends.LocalDockerBackend):
# LocalDockerBackend does not support job queues
raise exceptions.NotSupportedError(
f'Cluster {cluster_name} with LocalDockerBackend does '
f'not support {operation}.')
if cluster_status != global_user_state.ClusterStatus.UP:
if onprem_utils.check_if_local_cloud(cluster_name):
raise exceptions.ClusterNotUpError(
constants.UNINITIALIZED_ONPREM_CLUSTER_MESSAGE.format(
cluster_name))
raise exceptions.ClusterNotUpError(
f'{colorama.Fore.YELLOW}Cluster {cluster_name!r} is not up '
f'(status: {cluster_status.value}); skipped.'
f'{colorama.Style.RESET_ALL}')

if handle.head_ip is None:
raise exceptions.ClusterNotUpError(
f'Cluster {cluster_name!r} has been stopped or not properly set up.'
' Please re-launch it with `sky start`.')
return handle


@usage_lib.entrypoint
def queue(cluster_name: str,
skip_finished: bool = False,
@@ -460,7 +424,10 @@ def queue(cluster_name: str,
username = None
code = job_lib.JobLibCodeGen.get_job_queue(username, all_jobs)

handle = _check_cluster_available(cluster_name, 'getting the job queue')
handle = backend_utils.check_cluster_available(
cluster_name,
'getting the job queue',
expected_backend=backends.CloudVmRayBackend)
backend = backend_utils.get_backend_from_handle(handle)

returncode, jobs_payload, stderr = backend.run_on_head(handle,
@@ -500,7 +467,10 @@ def cancel(cluster_name: str,
cluster_name, operation_str='Cancelling jobs')

# Check the status of the cluster.
handle = _check_cluster_available(cluster_name, 'cancelling jobs')
handle = backend_utils.check_cluster_available(
cluster_name,
'cancelling jobs',
expected_backend=backends.CloudVmRayBackend)
backend = backend_utils.get_backend_from_handle(handle)

if all:
@@ -532,7 +502,10 @@ def tail_logs(cluster_name: str,
sky.exceptions.NotSupportedError: the feature is not supported.
"""
# Check the status of the cluster.
handle = _check_cluster_available(cluster_name, 'tailing logs')
handle = backend_utils.check_cluster_available(
cluster_name,
'tailing logs',
expected_backend=backends.CloudVmRayBackend)
backend = backend_utils.get_backend_from_handle(handle)

job_str = f'job {job_id}'
@@ -561,7 +534,10 @@ def download_logs(
Dict[str, str]: a mapping of job_id to local log path.
"""
# Check the status of the cluster.
handle = _check_cluster_available(cluster_name, 'downloading logs')
handle = backend_utils.check_cluster_available(
cluster_name,
'downloading logs',
expected_backend=backends.CloudVmRayBackend)
backend = backend_utils.get_backend_from_handle(handle)

if job_ids is not None and len(job_ids) == 0:
@@ -593,7 +569,10 @@ def job_status(
{None: None}.
"""
# Check the status of the cluster.
handle = _check_cluster_available(cluster_name, 'getting job status')
handle = backend_utils.check_cluster_available(
cluster_name,
'getting job status',
expected_backend=backends.CloudVmRayBackend)
backend = backend_utils.get_backend_from_handle(handle)

if job_ids is not None and len(job_ids) == 0:
@@ -613,26 +592,6 @@
# =======================


def _is_spot_controller_up(
stopped_message: str,
) -> Tuple[Optional[global_user_state.ClusterStatus],
Optional[backends.Backend.ResourceHandle]]:
controller_status, handle = backend_utils.refresh_cluster_status_handle(
spot.SPOT_CONTROLLER_NAME, force_refresh=True)
if controller_status is None:
print('No managed spot job has been run.')
elif controller_status != global_user_state.ClusterStatus.UP:
msg = (f'Spot controller {spot.SPOT_CONTROLLER_NAME} '
f'is {controller_status.value}.')
if controller_status == global_user_state.ClusterStatus.STOPPED:
msg += f'\n{stopped_message}'
if controller_status == global_user_state.ClusterStatus.INIT:
msg += '\nPlease wait for the controller to be ready.'
print(msg)
handle = None
return controller_status, handle


@usage_lib.entrypoint
def spot_status(refresh: bool) -> List[Dict[str, Any]]:
"""[Deprecated] (alias of spot_queue) Get statuses of managed spot jobs."""
@@ -672,7 +631,7 @@ def spot_queue(refresh: bool) -> List[Dict[str, Any]]:
stop_msg = ''
if not refresh:
stop_msg = 'To view the latest job table: sky spot queue --refresh'
controller_status, handle = _is_spot_controller_up(stop_msg)
controller_status, handle = spot.is_spot_controller_up(stop_msg)

if controller_status is None:
return []
@@ -725,10 +684,12 @@ def spot_cancel(name: Optional[str] = None,
RuntimeError: failed to cancel the job.
"""
job_ids = [] if job_ids is None else job_ids
_, handle = _is_spot_controller_up(
_, handle = spot.is_spot_controller_up(
'All managed spot jobs should have finished.')
if handle is None or handle.head_ip is None:
raise exceptions.ClusterNotUpError('All jobs finished.')
# The error message is already printed in spot.is_spot_controller_up.
with ux_utils.print_exception_no_traceback():
raise exceptions.ClusterNotUpError()

job_id_str = ','.join(map(str, job_ids))
if sum([len(job_ids) > 0, name is not None, all]) != 1:
@@ -778,7 +739,7 @@ def spot_tail_logs(name: Optional[str], job_id: Optional[int],
sky.exceptions.ClusterNotUpError: the spot controller is not up.
"""
# TODO(zhwu): Automatically restart the spot controller
controller_status, handle = _is_spot_controller_up(
controller_status, handle = spot.is_spot_controller_up(
'Please restart the spot controller with '
f'`sky start {spot.SPOT_CONTROLLER_NAME}`.')
if handle is None or handle.head_ip is None: