Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spot] Disallow sky down spot controller when in-progress spot jobs exist #1667

Merged
merged 44 commits into from
Feb 12, 2023
Merged
Changes from 37 commits
Commits
Show all changes
44 commits
Select commit Hold shift + click to select a range
4ae4a2a
Expose failure reason for spot jobs
Michaelvll Feb 1, 2023
f88a408
Add failure reason for normal failure
Michaelvll Feb 1, 2023
3786936
Failure reason hint for sky logs sky-spot-controller
Michaelvll Feb 1, 2023
fbf720d
require failure reason for all
Michaelvll Feb 1, 2023
bd75460
Fix the conftest
Michaelvll Feb 1, 2023
585b268
fix controller name
Michaelvll Feb 1, 2023
b90bf51
revert SKYPILOT_USER
Michaelvll Feb 1, 2023
8aa176b
Show controller processs logs with sky spot logs for better UX
Michaelvll Feb 1, 2023
7ab8355
revert usage user ID
Michaelvll Feb 1, 2023
c6c0f45
do not overwrite failure reason for resource unavailable
Michaelvll Feb 2, 2023
5318c43
format
Michaelvll Feb 2, 2023
a609901
lint
Michaelvll Feb 2, 2023
bda6c72
address comments
Michaelvll Feb 2, 2023
4516596
fix comment
Michaelvll Feb 2, 2023
fcbabd2
Update docs/source/examples/spot-jobs.rst
Michaelvll Feb 3, 2023
d034b0b
improve readability and refactoring
Michaelvll Feb 3, 2023
cb501a7
Merge branch 'fix-spot-status-for-cluster-name' of github.com:concret…
Michaelvll Feb 3, 2023
555f0ec
address comments
Michaelvll Feb 3, 2023
5c2d005
format
Michaelvll Feb 3, 2023
903deb9
Add comment
Michaelvll Feb 3, 2023
11ef41c
address comments
Michaelvll Feb 3, 2023
0ee6703
format
Michaelvll Feb 3, 2023
78696c6
Add failover history to the error rasied by _launch
Michaelvll Feb 3, 2023
b054f10
Add comment
Michaelvll Feb 3, 2023
cb773ab
Update sky/spot/recovery_strategy.py
Michaelvll Feb 4, 2023
087c31b
refactor
Michaelvll Feb 4, 2023
5d6723b
Address comment
Michaelvll Feb 4, 2023
d3726ed
Update sky/spot/recovery_strategy.py
Michaelvll Feb 4, 2023
f524a41
format
Michaelvll Feb 4, 2023
1deab6f
Merge branch 'fix-spot-status-for-cluster-name' of github.com:concret…
Michaelvll Feb 4, 2023
1bf7993
format
Michaelvll Feb 4, 2023
5b43ae9
fix exception name
Michaelvll Feb 4, 2023
e1a5536
Disallow termination of spot controller when spot job is running
Michaelvll Feb 4, 2023
a7d540a
Merge branch 'fix-spot-status-for-cluster-name' of github.com:concret…
Michaelvll Feb 4, 2023
8db91f1
Fix message
Michaelvll Feb 4, 2023
b4ad820
Fix message
Michaelvll Feb 4, 2023
b1d9b85
Merge branch 'master' of github.com:concretevitamin/sky-experiments i…
Michaelvll Feb 5, 2023
da09263
Merge branch 'master' of github.com:concretevitamin/sky-experiments i…
Michaelvll Feb 9, 2023
88b151f
address comments
Michaelvll Feb 9, 2023
b2eb844
Update sky/cli.py
Michaelvll Feb 10, 2023
16f4400
Fix UX
Michaelvll Feb 10, 2023
17833ad
Update sky/cli.py
Michaelvll Feb 10, 2023
c6389a3
change confirm str to `delete`
Michaelvll Feb 10, 2023
5353f62
Merge branch 'disallow-termination-spot' of github.com:concretevitami…
Michaelvll Feb 10, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
42 changes: 24 additions & 18 deletions sky/cli.py
Original file line number Diff line number Diff line change
Expand Up @@ -2200,7 +2200,7 @@ def down(
purge=purge)


def _hint_for_down_spot_controller(controller_name: str):
def _hint_or_raise_for_down_spot_controller(controller_name: str):
# spot_jobs will be empty when the spot cluster is not running.
cluster_status, _ = backend_utils.refresh_cluster_status_handle(
controller_name)
Expand All @@ -2213,35 +2213,41 @@ def _hint_for_down_spot_controller(controller_name: str):
f'aware of the following:{colorama.Style.RESET_ALL}'
'\n * All logs and status information of the spot '
'jobs (output of `sky spot queue`) will be lost.')
if cluster_status == global_user_state.ClusterStatus.INIT:
msg += ('\n * If there are pending/in-progress spot jobs, those '
'resources will not be terminated and require manual cleanup.')
concretevitamin marked this conversation as resolved.
Show resolved Hide resolved
click.echo(msg)
if cluster_status == global_user_state.ClusterStatus.UP:
try:
spot_jobs = core.spot_queue(refresh=False)
except exceptions.ClusterNotUpError:
# The spot controller cluster status changed during querying
# the spot jobs, use the latest cluster status, so that the
# message for INIT and STOPPED states will be correctly
# added to the message.
cluster_status = backend_utils.refresh_cluster_status_handle(
controller_name)
spot_jobs = []
with backend_utils.safe_console_status(
'[bold cyan]Checking spot jobs on the controller[/]'):
Michaelvll marked this conversation as resolved.
Show resolved Hide resolved
Michaelvll marked this conversation as resolved.
Show resolved Hide resolved
try:
spot_jobs = core.spot_queue(refresh=False)
except exceptions.ClusterNotUpError:
# The spot controller cluster status changed during querying
# the spot jobs, use the latest cluster status, so that the
# message for INIT and STOPPED states will be correctly
# added to the message.
cluster_status = backend_utils.refresh_cluster_status_handle(
controller_name)
spot_jobs = []

# Find in-progress spot jobs, and hint users to cancel them.
non_terminal_jobs = [
job for job in spot_jobs if not job['status'].is_terminal()
]
if (cluster_status == global_user_state.ClusterStatus.UP and
non_terminal_jobs):
msg += ('\n * In-progress spot jobs found, their resources '
'will not be terminated and require manual cleanup:\n')
job_table = spot_lib.format_job_table(non_terminal_jobs,
show_all=False)
msg = (f'{colorama.Fore.RED}In-progress spot jobs found, '
'to avoid resources leakage, firstly run: '
Michaelvll marked this conversation as resolved.
Show resolved Hide resolved
f'{colorama.Style.BRIGHT}sky spot cancel -a'
f'{colorama.Style.RESET_ALL}\n')
# Add prefix to each line to align with the bullet point.
msg += '\n'.join(
[' ' + line for line in job_table.split('\n') if line != ''])
if cluster_status == global_user_state.ClusterStatus.INIT:
msg += ('\n * If there are pending/in-progress spot jobs, those '
'resources will not be terminated and require manual cleanup.')
click.echo(msg)
with ux_utils.print_exception_no_traceback():
raise exceptions.NotSupportedError(msg)


def _down_or_stop_clusters(
Expand Down Expand Up @@ -2316,7 +2322,7 @@ def _down_or_stop_clusters(
# TODO(zhwu): We can only have one reserved cluster (spot
# controller).
assert len(reserved_clusters) == 1, reserved_clusters
_hint_for_down_spot_controller(reserved_clusters[0])
_hint_or_raise_for_down_spot_controller(reserved_clusters[0])

if not no_confirm:
click.confirm('Proceed?',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One option, for discussion: we can check

value = click.prompt('Type 'delete' to proceed', type=str)
if value == 'delete':
  ...
else:
  raise click.Abort() from None

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point! I changed it to the following. PTAL. : )
image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kind of feel this is a bit too long; was proposing delete since it's used in AWS console for certain VPC resources. Wdyt? Feel free to do a poll too.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I used the I understand because that will remind the user to read the messages above, but I am not feeling strongly about this. Switched back to delete.

Expand Down