-
Notifications
You must be signed in to change notification settings - Fork 549
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make sky start
on the spot controller aware of autostop.
#1453
Conversation
@@ -1174,7 +1174,7 @@ def launch( | |||
and they undergo job queue scheduling. | |||
""" | |||
backend_utils.check_cluster_name_not_reserved( | |||
cluster, operation_str='Launching task on it') | |||
cluster, operation_str='Launching tasks on it') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does it
refer to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
» sky launch -c sky-spot-controller-8a3968f2 echo hi
ValueError: Cluster 'sky-spot-controller-8a3968f2' is reserved for managed spot controller. Launching task on it is not allowed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of passing in operation_str
(adding is not allowed
may not be general for all reserved clusters), pass in error_str
with error_str= Launching task on {cluster} is not allowed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks fine to me:
» ag -w check_cluster_name_not_reserved -A1
sky/execution.py
359: backend_utils.check_cluster_name_not_reserved(cluster_name,
360- operation_str='sky.launch')
--
437: backend_utils.check_cluster_name_not_reserved(cluster_name,
438- operation_str='sky.exec')
sky/backends/backend_utils.py
1950:def check_cluster_name_not_reserved(
1951- cluster_name: Optional[str],
sky/core.py
442: backend_utils.check_cluster_name_not_reserved(
443- cluster_name, operation_str='Cancelling jobs')
sky/cli.py
1176: backend_utils.check_cluster_name_not_reserved(
1177- cluster, operation_str='Launching tasks on it')
--
1314: backend_utils.check_cluster_name_not_reserved(
1315- cluster, operation_str='Executing task on it')
as it doesn't repeat "is not allowed". cc @Michaelvll
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like it would be better to show more detailed information as the operation_str
does, instead of a general is not allowed
. It will provide more information for the user as well as the developer for debugging.
Sorry, I miss understood the proposal. I think avoiding the repeat might be fine for now? We can change the behavior, when we meet other situations that require different strings for the last part?
sky/core.py
Outdated
@@ -149,7 +163,10 @@ def start( | |||
|
|||
Raises: | |||
ValueError: the specified cluster does not exist; or if ``down`` is set | |||
to True but ``idle_minutes_to_autostop`` is None. | |||
to True but ``idle_minutes_to_autostop`` is None; or if the specified |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The decision boundrary is getting rather complex here. Possible to simplify?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, agreed it is complex. We could show "argument values are invalid" instead, but I slightly prefer being more precise in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, maybe split this into bullet points, i.e.
-
specified cluster does not exists
-
idle_minutes_to_autostop
is None anddown
=True
etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, done.
I am leaning towards having the user able to customize the
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am leaning towards having the user able to customize the
autostop
for the spot controller, allowing bothsky autostop -i 30 sky-spot-controller-<>
andsky start -i 20 sky-spot-controller
, but havesky start sky-spot-controller [--force]
set the autostop by default.
Discussion result:
- if we allowed it, it may be easy to make mistakes like leaving spot controller up for a long time
- so for now we should try to make it “managed” until users requested otherwise.
@@ -1174,7 +1174,7 @@ def launch( | |||
and they undergo job queue scheduling. | |||
""" | |||
backend_utils.check_cluster_name_not_reserved( | |||
cluster, operation_str='Launching task on it') | |||
cluster, operation_str='Launching tasks on it') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
» sky launch -c sky-spot-controller-8a3968f2 echo hi
ValueError: Cluster 'sky-spot-controller-8a3968f2' is reserved for managed spot controller. Launching task on it is not allowed.
sky/core.py
Outdated
@@ -149,7 +163,10 @@ def start( | |||
|
|||
Raises: | |||
ValueError: the specified cluster does not exist; or if ``down`` is set | |||
to True but ``idle_minutes_to_autostop`` is None. | |||
to True but ``idle_minutes_to_autostop`` is None; or if the specified |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, agreed it is complex. We could show "argument values are invalid" instead, but I slightly prefer being more precise in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @concretevitamin! LGTM. I think it is good to go after the comments are resolved.
Thanks both. All addressed. Merging. |
Fixes #1447.
If controller is to be started:
Tested:
CLI
Python API
core.start()
on a stopped controller; verfied autostop is set to 10mSmoke
bash tests/run_smoke_tests.sh test_spot