Make sky start on the spot controller aware of autostop. #1453

Merged: 5 commits merged into master from ctrler on Dec 1, 2022

Conversation

concretevitamin
Member

Fixes #1447.

If the controller is among the clusters to be started:

  • disallow -i and --down; disallow specifying the controller together with other clusters
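
A minimal sketch of the checks described above, not the code merged in this PR. The helper name `validate_controller_start`, the prefix constant, and the error wording are assumptions made for illustration; only the rules themselves (no `-i`, no `--down`, no other clusters alongside the controller) come from the description.

```python
from typing import List, Optional

# Assumption: controller clusters are recognized by this name prefix, as in
# the cluster names used in this PR's test commands.
SPOT_CONTROLLER_PREFIX = 'sky-spot-controller-'


def validate_controller_start(cluster_names: List[str],
                              idle_minutes_to_autostop: Optional[int] = None,
                              down: bool = False) -> None:
    """Hypothetical helper: rejects disallowed ways of starting the controller.

    Raises:
        ValueError: if the controller is requested together with other
          clusters, or with an autostop setting (-i), or with --down.
    """
    controllers = [
        name for name in cluster_names
        if name.startswith(SPOT_CONTROLLER_PREFIX)
    ]
    if not controllers:
        # No controller involved; nothing to restrict here.
        return
    if len(cluster_names) > 1:
        raise ValueError(
            'The spot controller must be started on its own, not together '
            f'with other clusters: {cluster_names}')
    if idle_minutes_to_autostop is not None:
        raise ValueError(
            'Autostop (-i) cannot be customized when starting the spot '
            'controller.')
    if down:
        raise ValueError('--down is not allowed for the spot controller.')
```

Under these assumptions, each of the four failing commands in the CLI test list would raise `ValueError`, while starting the controller by itself passes the check.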

Tested:

CLI

# Expected to fail:
sky start sky-cpunode-zongheng sky-spot-controller-8a3968f2
sky start sky-spot-controller-8a3968f2 -i1
sky start sky-spot-controller-8a3968f2 -i1 --down
sky start sky-spot-controller-8a3968f2 --down

# Starting a stopped controller: OK. Autostop reset to 10m.
sky start sky-spot-controller-8a3968f2  

# Starting an UP controller: OK. Autostop reset to 10m.
sky start sky-spot-controller-8a3968f2 --force 

Python API

  • repeated the above and observed appropriate ValueError
  • manually called core.start() on a stopped controller; verified autostop is set to 10m

Smoke

  • bash tests/run_smoke_tests.sh test_spot

@@ -1174,7 +1174,7 @@ def launch(
     and they undergo job queue scheduling.
     """
     backend_utils.check_cluster_name_not_reserved(
-        cluster, operation_str='Launching task on it')
+        cluster, operation_str='Launching tasks on it')
Collaborator

What does it refer to?

Member Author

» sky launch -c sky-spot-controller-8a3968f2 echo hi
ValueError: Cluster 'sky-spot-controller-8a3968f2' is reserved for managed spot controller. Launching task on it is not allowed.

Collaborator

Instead of passing in operation_str (appending "is not allowed" may not generalize to all reserved clusters), pass in error_str, with error_str='Launching task on {cluster} is not allowed'.

Member Author

It looks fine to me:

» ag -w check_cluster_name_not_reserved -A1
sky/execution.py
359:    backend_utils.check_cluster_name_not_reserved(cluster_name,
360-                                                  operation_str='sky.launch')
--
437:    backend_utils.check_cluster_name_not_reserved(cluster_name,
438-                                                  operation_str='sky.exec')

sky/backends/backend_utils.py
1950:def check_cluster_name_not_reserved(
1951-        cluster_name: Optional[str],

sky/core.py
442:    backend_utils.check_cluster_name_not_reserved(
443-        cluster_name, operation_str='Cancelling jobs')

sky/cli.py
1176:    backend_utils.check_cluster_name_not_reserved(
1177-        cluster, operation_str='Launching tasks on it')
--
1314:    backend_utils.check_cluster_name_not_reserved(
1315-        cluster, operation_str='Executing task on it')

as it doesn't repeat "is not allowed". cc @Michaelvll

Collaborator

Michaelvll, Nov 30, 2022

I feel like it would be better to show more detailed information, as operation_str does, instead of a general "is not allowed". It will provide more information for the user as well as for the developer when debugging.

Sorry, I misunderstood the proposal. I think avoiding the repetition might be fine for now? We can change the behavior when we meet other situations that require different strings for the last part.
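
For readers following the `operation_str` discussion, here is a rough sketch of how such a check could compose its message. It is not the body of `backend_utils.check_cluster_name_not_reserved`, which is not shown in this thread; the function name `check_not_reserved_sketch` and the `reserved_names` parameter are hypothetical. The message format mirrors the `ValueError` quoted earlier in the conversation.

```python
from typing import Optional, Set


def check_not_reserved_sketch(cluster_name: Optional[str],
                              reserved_names: Set[str],
                              operation_str: str) -> None:
    """Hypothetical stand-in for a reserved-cluster-name check.

    The caller supplies only the operation ('Launching tasks on it');
    the shared suffix 'is not allowed.' is appended in one place.
    """
    if cluster_name in reserved_names:
        raise ValueError(
            f'Cluster {cluster_name!r} is reserved for managed spot '
            f'controller. {operation_str} is not allowed.')
```

Keeping the suffix inside the check is what lets call sites avoid repeating "is not allowed"; the trade-off raised in review is that the suffix is then fixed for every caller.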

sky/core.py Outdated
@@ -149,7 +163,10 @@ def start(

     Raises:
         ValueError: the specified cluster does not exist; or if ``down`` is set
-          to True but ``idle_minutes_to_autostop`` is None.
+          to True but ``idle_minutes_to_autostop`` is None; or if the specified
Collaborator

The decision boundary is getting rather complex here. Possible to simplify?

Member Author

Yes, agreed it is complex. We could show "argument values are invalid" instead, but I slightly prefer being more precise in this case.

Collaborator

In that case, maybe split this into bullet points, i.e.

  1. specified cluster does not exist

  2. idle_minutes_to_autostop is None and down=True

etc.

Member Author

Good call, done.
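
As an illustration of the bullet-point layout suggested above, a docstring could look roughly like the sketch below. This is not the final text in sky/core.py: the function name `start_sketch` and the `known_clusters` parameter are hypothetical, and the third condition (truncated in the diff above) is deliberately left out rather than guessed.

```python
from typing import Optional, Set


def start_sketch(cluster_name: str,
                 known_clusters: Set[str],
                 idle_minutes_to_autostop: Optional[int] = None,
                 down: bool = False) -> None:
    """Validates arguments for starting a cluster (illustrative only).

    Raises:
        ValueError: in any of the following cases:
          (1) the specified cluster does not exist;
          (2) ``down`` is set to True but ``idle_minutes_to_autostop`` is
              None.
    """
    if cluster_name not in known_clusters:
        raise ValueError(f'Cluster {cluster_name!r} does not exist.')
    if down and idle_minutes_to_autostop is None:
        raise ValueError(
            '`down` requires `idle_minutes_to_autostop` to be set.')
```

Listing each case separately keeps the docstring in step with the code: every numbered item corresponds to one `raise` in the body.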

@Michaelvll
Collaborator

I am leaning towards letting the user customize the autostop for the spot controller, allowing both sky autostop -i 30 sky-spot-controller-<> and sky start -i 20 sky-spot-controller, while having sky start sky-spot-controller [--force] set the default autostop when no value is given.
Reasons:

  1. It will not affect the normal user's usage
  2. It gives advanced users the flexibility to increase or cancel the autostop for the spot controller, for example to debug, or to reduce the overhead of launching spot jobs that may arrive within 15 minutes.

Member Author

concretevitamin left a comment

I am leaning towards letting the user customize the autostop for the spot controller, allowing both sky autostop -i 30 sky-spot-controller-<> and sky start -i 20 sky-spot-controller, while having sky start sky-spot-controller [--force] set the default autostop when no value is given.

Discussion result:

  • If we allowed it, it would be easy to make mistakes such as leaving the spot controller up for a long time.
  • So for now we should keep it “managed” until users request otherwise.

Collaborator

Michaelvll left a comment

Thanks for the PR @concretevitamin! LGTM. I think it is good to go after the comments are resolved.

@concretevitamin
Member Author

Thanks both. All addressed. Merging.

concretevitamin merged commit 92895d2 into master Dec 1, 2022
concretevitamin deleted the ctrler branch December 1, 2022 01:10
concretevitamin added a commit that referenced this pull request Dec 1, 2022
* Make `sky start` on the spot controller aware of autostop.

* pytest

* Fix

* Typo

* Bullet points
Development

Successfully merging this pull request may close these issues.

[spot] Interaction between sky start [--force] and spot controller's autostop
3 participants