-
Notifications
You must be signed in to change notification settings - Fork 559
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Spot] Disallow sky down
spot controller when in-progress spot jobs exist
#1667
Conversation
Co-authored-by: Zongheng Yang <[email protected]>
…evitamin/sky-experiments into fix-spot-status-for-cluster-name
Co-authored-by: Zongheng Yang <[email protected]>
Co-authored-by: Zongheng Yang <[email protected]>
…evitamin/sky-experiments into fix-spot-status-for-cluster-name
…evitamin/sky-experiments into disallow-termination-spot
May need to merge with master :) |
…nto disallow-termination-spot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Michaelvll. Some UX comments.
…nto disallow-termination-spot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Michaelvll. LGTM, this should serve as a much stronger guardrail than before.
sky/cli.py
Outdated
@@ -2316,7 +2331,7 @@ def _down_or_stop_clusters( | |||
# TODO(zhwu): We can only have one reserved cluster (spot | |||
# controller). | |||
assert len(reserved_clusters) == 1, reserved_clusters | |||
_hint_for_down_spot_controller(reserved_clusters[0]) | |||
_hint_or_raise_for_down_spot_controller(reserved_clusters[0]) | |||
|
|||
if not no_confirm: | |||
click.confirm('Proceed?', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One option, for discussion: we can check
value = click.prompt('Type 'delete' to proceed', type=str)
if value == 'delete':
...
else:
raise click.Abort() from None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kind of feel this is a bit too long; was proposing delete
since it's used in AWS console for certain VPC resources. Wdyt? Feel free to do a poll too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! I used the I understand
because that will remind the user to read the messages above, but I am not feeling strongly about this. Switched back to delete
.
Co-authored-by: Zongheng Yang <[email protected]>
Co-authored-by: Zongheng Yang <[email protected]>
…n/sky-experiments into disallow-termination-spot
… exist (skypilot-org#1667) * Expose failure reason for spot jobs * Add failure reason for normal failure * Failure reason hint for sky logs sky-spot-controller * require failure reason for all * Fix the conftest * fix controller name * revert SKYPILOT_USER * Show controller processs logs with sky spot logs for better UX * revert usage user ID * do not overwrite failure reason for resource unavailable * format * lint * address comments * fix comment * Update docs/source/examples/spot-jobs.rst Co-authored-by: Zongheng Yang <[email protected]> * improve readability and refactoring * address comments * format * Add comment * address comments * format * Add failover history to the error rasied by _launch * Add comment * Update sky/spot/recovery_strategy.py Co-authored-by: Zongheng Yang <[email protected]> * refactor * Address comment * Update sky/spot/recovery_strategy.py Co-authored-by: Zongheng Yang <[email protected]> * format * format * fix exception name * Disallow termination of spot controller when spot job is running * Fix message * Fix message * address comments * Update sky/cli.py Co-authored-by: Zongheng Yang <[email protected]> * Fix UX * Update sky/cli.py Co-authored-by: Zongheng Yang <[email protected]> * change confirm str to `delete` --------- Co-authored-by: Zongheng Yang <[email protected]>
… exist (skypilot-org#1667) * Expose failure reason for spot jobs * Add failure reason for normal failure * Failure reason hint for sky logs sky-spot-controller * require failure reason for all * Fix the conftest * fix controller name * revert SKYPILOT_USER * Show controller processs logs with sky spot logs for better UX * revert usage user ID * do not overwrite failure reason for resource unavailable * format * lint * address comments * fix comment * Update docs/source/examples/spot-jobs.rst Co-authored-by: Zongheng Yang <[email protected]> * improve readability and refactoring * address comments * format * Add comment * address comments * format * Add failover history to the error rasied by _launch * Add comment * Update sky/spot/recovery_strategy.py Co-authored-by: Zongheng Yang <[email protected]> * refactor * Address comment * Update sky/spot/recovery_strategy.py Co-authored-by: Zongheng Yang <[email protected]> * format * format * fix exception name * Disallow termination of spot controller when spot job is running * Fix message * Fix message * address comments * Update sky/cli.py Co-authored-by: Zongheng Yang <[email protected]> * Fix UX * Update sky/cli.py Co-authored-by: Zongheng Yang <[email protected]> * change confirm str to `delete` --------- Co-authored-by: Zongheng Yang <[email protected]>
Fixes #1666
With this PR the following message will show if
![image](https://user-images.githubusercontent.com/6753189/216758225-2654732a-ea1d-4800-8b84-1e5fa2927bcc.png)
sky down
the spot controller with in-progress spot jobs:Tested:
sky down sky-spot-controller-<hash>
with in-progress spot jobs