[Spot] Show spot controller in sky status and simplify tearing down #1270
Conversation
Nice @Michaelvll - Quick UX comments, haven't read the code in detail.
sky/utils/cli_utils/status_utils.py
Outdated
if is_reserved:
    click.echo(
        f'{colorama.Style.DIM}Reserved clusters will be autostopped '
        'when inactive. Refresh statuses with: sky status --refresh'
(didn't read code carefully) do we need to show "Refresh statuses with: sky status --refresh" here and again below? Can we show it once below and keep this hint about the controller shorter?
Yes, please refer to the latest output in the comment.
sky/cli.py
Outdated
    f'{reserved_clusters_str} is not supported.')
else:
    click.secho(
        'WARNING: Tearing down a SkyPilot reserved cluster. If '
ditto - just say managed spot controller.
Also, we should hint at all the consequences: (1) <what's already written here>; (2) managed spot jobs info (sky spot status) will be lost.
For (1), maybe add a TODO that we automatically warn if there are in-progress jobs.
Good point! I added the warning for the in-progress jobs as well. PTAL.
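For readers skimming the thread, a minimal sketch of what such a warning could look like; the helper name, status strings, and record fields here are illustrative stand-ins, not SkyPilot's actual implementation:

```python
# Illustrative sketch only: warn about in-progress spot jobs before tearing
# down the spot controller. Status names and record fields are assumptions.
from typing import Any, Dict, List

# Statuses treated as "in progress"; the real set of states may differ.
_NON_TERMINAL_STATUSES = {'PENDING', 'SUBMITTED', 'STARTING', 'RUNNING',
                          'RECOVERING'}


def in_progress_jobs_warning(spot_jobs: List[Dict[str, Any]]) -> str:
    """Return a warning string if any spot job is still in progress."""
    non_terminal = [job for job in spot_jobs
                    if job.get('status') in _NON_TERMINAL_STATUSES]
    if not non_terminal:
        return ''
    names = ', '.join(str(job.get('job_name', '<unnamed>'))
                      for job in non_terminal)
    return ('WARNING: tearing down the managed spot controller while jobs '
            f'are in progress ({names}); their resources will not be '
            'terminated and require manual cleanup.')
```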
Took a pass.
Thanks @Michaelvll - some questions.
sky/cli.py
Outdated
filtered_cluster_records = []
reserved_clusters = []
nonreserved_cluster_records = []
reserved_clusters = collections.defaultdict(list)
Is it possible to simplify this group-based logic? Maybe specialize it to just 1 spot controller?
Sounds good. Changed each group to contain only 1 cluster.
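A rough sketch of the simplified shape (the names are hypothetical; the real code lives in sky/cli.py and differs in its details):

```python
# Illustrative sketch: split cluster records into at most one reserved
# spot controller plus the non-reserved clusters. Names are hypothetical.
from typing import Any, Dict, List, Optional, Tuple

SPOT_CONTROLLER_PREFIX = 'sky-spot-controller'  # assumed prefix


def split_cluster_records(
    cluster_records: List[Dict[str, Any]]
) -> Tuple[Optional[Dict[str, Any]], List[Dict[str, Any]]]:
    """Return (spot_controller_record or None, non_reserved_records)."""
    controller: Optional[Dict[str, Any]] = None
    non_reserved: List[Dict[str, Any]] = []
    for record in cluster_records:
        if record['name'].startswith(SPOT_CONTROLLER_PREFIX):
            # There is at most one spot controller per user.
            controller = record
        else:
            non_reserved.append(record)
    return controller, non_reserved
```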
sky/cli.py
Outdated
    f'\n {cnt}. Resource leakage may happen caused by '
    'spot jobs being submitted, and in-progress spot jobs.')
nit: If there are pending/in-progress spot jobs, those resources will not be terminated and require manual cleanup.
Actually, why do we show this when the controller is INIT?
Because it is possible that another sky spot launch is running in parallel, which will make the cluster status INIT, i.e. while a spot job is being submitted to the spot controller.
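To make the INIT case concrete, a simplified sketch of the status-dependent hint (the enum stands in for global_user_state.ClusterStatus, and the messages are paraphrased rather than the exact code):

```python
# Illustrative sketch of the status-dependent teardown hint. The enum is a
# stand-in for global_user_state.ClusterStatus; messages are paraphrased.
import enum


class ClusterStatus(enum.Enum):
    INIT = 'INIT'
    UP = 'UP'
    STOPPED = 'STOPPED'


def teardown_hint(status: ClusterStatus) -> str:
    if status == ClusterStatus.INIT:
        # A concurrent `sky spot launch` can leave the controller in INIT,
        # so we cannot tell whether a job is being submitted right now.
        return ('If there are pending/in-progress spot jobs, those resources '
                'will not be terminated and will require manual cleanup.')
    if status == ClusterStatus.UP:
        # The controller is reachable: query the spot jobs and warn only if
        # some of them are non-terminal.
        return 'Check spot jobs and warn about in-progress ones.'
    return ''
```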
sky/cli.py
Outdated
        'spot jobs being submitted, and in-progress spot jobs.')
    cnt += 1
elif cluster_status == global_user_state.ClusterStatus.UP:
    spot_jobs = core.spot_status(refresh=False)
What if, before this line, a concurrent spot launch made the controller INIT? Would this call fail? The docstr of this func doesn't suggest what to expect.
Good point! Added error handling for it. PTAL.
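A hedged sketch of that error handling (the exact exception type raised by core.spot_status is not documented in the thread, so a broad catch is used here):

```python
# Illustrative sketch: guard the spot-status query against the controller
# turning INIT/unreachable between the status check and this call. The
# exception type and the return structure are assumptions; the real code
# may be narrower and more precise.
from typing import Any, Dict, List, Optional

from sky import core  # core.spot_status is referenced in the diff above


def try_get_spot_jobs() -> Optional[List[Dict[str, Any]]]:
    try:
        # Assuming spot_status returns a list of job records, as used above.
        return core.spot_status(refresh=False)
    except Exception as e:  # pylint: disable=broad-except
        print('WARNING: failed to query in-progress spot jobs '
              f'({e}); resource leakage may occur if jobs are running.')
        return None
```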
LGTM, thanks @Michaelvll for this UX improvement!
sky/cli.py
Outdated
        non_terminal_jobs):
    msg += ('\n * In-progress spot jobs found, their resources '
            'will not be terminated and require manual cleanup '
            '(sky spot cancel --all):\n')
Do we want to show (sky spot cancel --all)? User is downing the controller, and after it's downed, spot cancel won't work? I feel either of the following is fine:
- just say manual cleanup (I prefer this one)
- say manual cleanup via cloud consoles
Ahh, I was trying to say: abort the current down process, run sky spot cancel --all first, and then do the sky down. But that could take a long time, waiting for the controller process to hit the line that checks the cancel signal.
Removed. Thanks!
…kypilot-org#1270)
* Show controller in status and enable tear down.
* Reduce the autostop for controller to save cost
* fix test
* format
* address comments
* fix down output
* fix test
* address comments
* Fix the docs for spot job
* fix space
* address comments
According to the offline discussion with @concretevitamin, we decided to add the spot controller to the sky status output, to avoid confusion caused by the controller showing up in the cloud console. (Part of the effort to fix #1269.)

After this PR, sky status will be:

When tearing down the controller, the output would be (no --purge required, as purging can lead to resource leakage):

TODO: