[SkyServe] Support mixture of spot and on-demand (#3194)
* rebase

* shift string

* rebase & add autoscaler

* add example and format

* fix bug

* reimplement

* fix

* clear cnt when decision is made

* scale down status order

* fix bug & policy

* dont count overprovision

* log

* fix

* fix

* fix including not ready

* fix bootstrap

* fix status

* rewrite autoscaler logic

* cap num ondemand to scale up

* move to active after successful launch

* move to preemption list if the sky launch failed

* added on-demand policy;

* shrink downscale factor

* e2e experiment info dump

* fix on-demand check

* format

* fix policy

* update

* dynamicfailoverspot

* comment out logg

* checkpt

* overprovision=2

* zone awareness

* remove insufficient capacity

* filter out od fallback

* fix

* add num_extra as a yaml parameter

* fix pytest

* setstate

* format.sh

* add safety net

* get info.zone

* uncomment add to preemption list

* fix bug

* bug fix

* num_init_replicas

* delete todos: (tian): Change spot_mixer to boolean

* .

* deprecate original RequestRateAutoscaler

* spot zones

* bug fix

* bug fix

* fix

* remove cooldown

* drain the replica at scale down or up

* tmp

* update AutoscalerDecision, use target_qps, migrate _get_desired_num_replicas

* bug fix

* fix bugs

* update templates

* remove test.py

* move parameters to user config

* clean up code after merging master

* refactor code, autoscaler and controller

* update yaml and change spot_mixer to autoscaler

* added yaml examples and fix bugs

* added no spot zones examples

* min_on_demand_replicas

* add min_on_demand_replicas and a todo for preemption warning

* fix bug and update spot_placer

* spot_placer rename

* code review

* fix comments

* remove evenspread

* address code reviews

* fix pr

* deprecate spot_zones, infer spot_zones from resource field

* update yamls

* update yaml

* remove ordered

* num_extra_on_demand

* address pr reviews

* update resource handling

* fix print and dictionary issue

* address some comments

* use filter instead of list comprehension

* refactor autoscaler

* get_feasible_launchable_resources

* fix pr

* update evenly spreading zones among active zones

* fix any_of bug

* merge autoscalers

* _fill_in_launchable_resources

* add yaml description

* update yamls

* update existing zone

* Update sky/task.py

Co-authored-by: Tian Xia <[email protected]>

* Update sky/serve/spot_policy.py

Co-authored-by: Tian Xia <[email protected]>

* Update sky/cli.py

Co-authored-by: Tian Xia <[email protected]>

* Update sky/serve/autoscalers.py

Co-authored-by: Tian Xia <[email protected]>

* code review

* code review

* update other spotautoscaler variables

* fix

* remove anyof

* use defaultdict

* update name for init_subclass

* running format

* fixing pr

* fix PR

* update any_of

* ordered resources

* ordered resources

* add with ux_utils.print_exception_no_traceback():

* print final target_num_replicas

* update target based on qps target

* added check for both up and update

* add skyserve tests

* remove spot_placer, use spot_policy instead

* spot_placer

* format

* update use_spot or non_use_spot

* fix pr and add a newline to yaml

* dataclass

* update autoscaler

* bump autoscaler version

* request timestamps update

* update tests and error handling

* fix Location hash

* update comment

* name change

* _serve_check_service

* add newline to yamls

* spot_policies

* Update sky/serve/replica_managers.py

Co-authored-by: Tian Xia <[email protected]>

* code review

* format

* fix import

* add skyserve spot policy

* format

* assert len(task.resources) >= 1

* fix bug and added SpotOnDemandMix

* bug fix and edit wording on yaml

* spot_policy_str

* update examples/serve/policy/spot_on_demand_mix.yaml yaml

* yaml doc

* not expose the autoscaler option to the user

* update interface

* update initialization

* require use_spot explicitly

* went through spec

* remove NAME

* update yaml

* revert is True

* update yamls

* interface fix

* added multi accelerator support

* fix yaml

* fix UI issues

* fix pr reviews

* replica_ids_to_scale_down

* update autoscaler names

* remove initialization and add back checking active

* remove target_qps_per_replicas

* max_replicas required where target_qps_per_replica is set

* pr

* fix nits

* code review

* code review

* remove Autoscaler.from_spec

* num_ready_spot

* removed spot placer

* update yaml

* delete spot_only yaml

* format

* # use_spot is needed for ondemand fallback

* error msg

* update print

* Update sky/serve/autoscalers.py

Co-authored-by: Zhanghao Wu <[email protected]>

* code review

* added a todo

* added todo

* smoke test for base on demand fallback

* pr review

* update ports

* replace up validation to _validate_service_task

* updated status order

* get_dynamic_states and load dynamic_states

* move function locations

* add other statuses

* code review

* fix pr

* move interrupted position

* added smoke test test_skyserve_dynamic_ondemand_fallback

* _terminate_gcp_replica

* updated smoke tests

* update first_ready_time

* code review

* update smoke test sleep

---------

Co-authored-by: cblmemo <[email protected]>
Co-authored-by: Zhanghao Wu <[email protected]>
3 people authored Feb 28, 2024
1 parent 991d6fe commit a2f4a68
Showing 15 changed files with 825 additions and 229 deletions.
21 changes: 21 additions & 0 deletions examples/serve/spot_policy/base_on_demand_fallback_replicas.yaml
@@ -0,0 +1,21 @@
# SkyServe YAML to launch a service with mixed spot and on-demand instances.
# The policy will maintain `base_ondemand_fallback_replicas` number of on-demand instances, in addition to spot instances.
# On-demand instances are counted in autoscaling decisions (i.e., between `min_replicas` and `max_replicas`).

service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 2
    max_replicas: 3
    target_qps_per_replica: 1
    base_ondemand_fallback_replicas: 1

resources:
  ports: 8081
  cpus: 2+
  # use_spot is needed for ondemand fallback
  use_spot: true

workdir: examples/serve/http_server

run: python3 server.py
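
The header comments of this YAML describe a fixed on-demand base that counts toward the `min_replicas`/`max_replicas` bounds. A minimal sketch of that arithmetic follows. It is not the code in `sky/serve/autoscalers.py`: the `ReplicaPolicy` class and the `split_spot_and_on_demand` helper are hypothetical names introduced only for illustration, and the ceiling-based QPS rule is an assumption.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaPolicy:
    """Hypothetical container mirroring the `replica_policy` YAML fields."""
    min_replicas: int
    max_replicas: int
    target_qps_per_replica: float
    base_ondemand_fallback_replicas: int = 0


def split_spot_and_on_demand(policy: ReplicaPolicy, current_qps: float):
    """Return (num_spot, num_on_demand) for the observed request rate.

    Assumption: the total is ceil(qps / target_qps_per_replica), clamped to
    [min_replicas, max_replicas]; the on-demand base is carved out of that
    total rather than added on top of it.
    """
    wanted = math.ceil(current_qps / policy.target_qps_per_replica)
    total = max(policy.min_replicas, min(policy.max_replicas, wanted))
    num_on_demand = min(policy.base_ondemand_fallback_replicas, total)
    return total - num_on_demand, num_on_demand


# Values taken from the YAML above.
policy = ReplicaPolicy(min_replicas=2, max_replicas=3,
                       target_qps_per_replica=1,
                       base_ondemand_fallback_replicas=1)
print(split_spot_and_on_demand(policy, current_qps=2.5))
```

Under these assumptions the service never drops below one on-demand replica, and the spot count absorbs all scaling between the bounds.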
23 changes: 23 additions & 0 deletions examples/serve/spot_policy/dynamic_on_demand_fallback.yaml
@@ -0,0 +1,23 @@
# SkyServe YAML to launch a service with mixed spot and on-demand instances.
# The policy will dynamically fallback to on-demand instances when spot instances are not available.

service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 2
    max_replicas: 3
    target_qps_per_replica: 1
    dynamic_ondemand_fallback: true

resources:
  any_of:
    - zone: us-central1-a
    - region: us-east1
  ports: 8081
  cpus: 2+
  # use_spot is needed for ondemand fallback
  use_spot: true

workdir: examples/serve/http_server

run: python3 server.py
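
The header comment of this YAML says the policy falls back to on-demand instances when spot instances are unavailable. One way to picture that rule is to cover the shortfall of ready spot replicas with on-demand ones. The sketch below is an assumption about the behaviour, not the autoscaler code from this commit; `dynamic_on_demand_target` is a hypothetical helper name.

```python
def dynamic_on_demand_target(target_num_replicas: int,
                             num_ready_spot: int) -> int:
    """Return how many on-demand replicas would cover the spot shortfall.

    Assumption: every target replica should be spot when capacity allows,
    and each missing ready spot replica is temporarily backed by one
    on-demand replica.
    """
    if target_num_replicas < 0 or num_ready_spot < 0:
        raise ValueError("replica counts must be non-negative")
    return max(0, target_num_replicas - num_ready_spot)


# With a target of 3 replicas and only 1 ready spot replica,
# 2 on-demand replicas would be kept until spot capacity recovers.
print(dynamic_on_demand_target(target_num_replicas=3, num_ready_spot=1))
```

Once enough spot replicas are ready again, the helper returns 0, which corresponds to scaling the on-demand fallback replicas back down.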
23 changes: 23 additions & 0 deletions examples/serve/spot_policy/multi_accelerators.yaml
@@ -0,0 +1,23 @@
# SkyServe YAML to launch a service with mixed spot and on-demand instances and an ordered preference for accelerators.
# The policy will maintain `base_ondemand_fallback_replicas` number of on-demand instances, in addition to spot instances.

service:
  readiness_probe: /health
  replica_policy:
    min_replicas: 2
    max_replicas: 3
    target_qps_per_replica: 1
    base_ondemand_fallback_replicas: 1

resources:
  ordered:
    - accelerators: V100
    - accelerators: T4
  ports: 8081
  cpus: 2+
  # use_spot is needed for ondemand fallback
  use_spot: true

workdir: examples/serve/http_server

run: python3 server.py
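
The `ordered` list in this YAML expresses a preference order: V100 first, then T4. A rough sketch of what "ordered preference" means is shown below, assuming candidates are tried strictly in the listed order. `pick_first_available` and the `is_available` callback are hypothetical and stand in for whatever provisioning check the real system performs.

```python
from typing import Callable, Optional, Sequence


def pick_first_available(
        candidates: Sequence[str],
        is_available: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate, in listed order, that is available.

    Returns None when no candidate can be provisioned.
    """
    for accelerator in candidates:
        if is_available(accelerator):
            return accelerator
    return None


# Preference order taken from the `ordered` list above. The availability
# check here is a stand-in that pretends only T4 has capacity.
choice = pick_first_available(["V100", "T4"], lambda acc: acc == "T4")
print(choice)
```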
5 changes: 4 additions & 1 deletion sky/execution.py
@@ -274,8 +274,11 @@ def _execute(
task)

     if not cluster_exists:
+        # If spot is launched by skyserve controller or managed spot controller,
+        # We don't need to print out the logger info.
         if (Stage.PROVISION in stages and task.use_spot and
-                not _is_launched_by_spot_controller):
+                not _is_launched_by_spot_controller and
+                not _is_launched_by_sky_serve_controller):
             yellow = colorama.Fore.YELLOW
             bold = colorama.Style.BRIGHT
             reset = colorama.Style.RESET_ALL