[SkyServe] Support mixture of spot and on-demand (#3194)
* rebase
* shift string
* rebase & add autoscaler
* add example and format
* fix bug
* reimplement
* fix
* clear cnt when decision is made
* scale down status order
* fix bug & policy
* dont count overprovision
* log
* fix
* fix
* fix including not ready
* fix bootstrap
* fix status
* rewrite autoscaler logic
* cap num ondemand to scale up
* move to active after successful launch
* move to preemption list if the sky launch failed
* added on-demand policy;
* shrink downscale factor
* e2e experiment info dump
* fix on-demand check
* format
* fix policy
* update
* dynamicfailoverspot
* comment out logg
* checkpt
* overprovision=2
* zone awareness
* remove insufficient capacity
* filter out od fallback
* fix
* add num_extra as a yaml parameter
* fix pytest
* setstate
* format.sh
* add safety net
* get info.zone
* uncomment add to preemption list
* fix bug
* bug fix
* num_init_replicas
* delete todos: (tian): Change spot_mixer to boolean
* .
* deprecate original RequestRateAutoscaler
* spot zones
* bug fix
* bug fix
* fix
* remove cooldown
* drain the replica at scale down or up
* tmp
* update AutoscalerDecision, use target_qps, migrate _get_desired_num_replicas
* bug fix
* fix bugs
* update templates
* remove test.py
* move parameters to user config
* clean up code after merging master
* refactor code, autoscaler and controller
* update yaml and change spot_mixer to autoscaler
* added yaml examples and fix bugs
* added no spot zones examples
* min_on_demand_replicas
* add min_on_demand_replicas and a todo for preemption warning
* fix bug and update spot_placer
* spot_placer rename
* code review
* fix comments
* remove evenspread
* address code reviews
* fix pr
* deprecate spot_zones, infer spot_zones from resource field
* update yamls
* update yaml
* remove ordered
* num_extra_on_demand
* address pr reviews
* update resource handling
* fix print and dictionary issue
* address some comments
* use filter instead of list comprehension
* refactor autoscaler
* get_feasible_launchable_resources
* fix pr
* update evenly spreading zones among active zones
* fix any_of bug
* merge autoscalers
* _fill_in_launchable_resources
* add yaml description
* update yamls
* update existing zone
* Update sky/task.py
  Co-authored-by: Tian Xia
* Update sky/serve/spot_policy.py
  Co-authored-by: Tian Xia
* Update sky/cli.py
  Co-authored-by: Tian Xia
* Update sky/serve/autoscalers.py
  Co-authored-by: Tian Xia
* code review
* code review
* update other spotautoscaler variables
* fix
* remove anyof
* use defaultdict
* update name for init_subclass
* running format
* fixing pr
* fix PR
* update any_of
* ordered resources
* ordered resources
* add with ux_utils.print_exception_no_traceback():
* print final target_num_replicas
* update target based on qps target
* added check for both up and update
* add skyserve tests
* remove spot_placer, use spot_policy instead
* spot_placer
* format
* update use_spot or non_use_spot
* fix pr and add a newline to yaml
* dataclass
* update autoscaler
* bump autoscaler version
* request timestamps update
* update tests and error handling
* fix Location hash
* update comment
* name change
* _serve_check_service
* add newline to yamls
* spot_policies
* Update sky/serve/replica_managers.py
  Co-authored-by: Tian Xia
* code review
* format
* fix import
* add skyserve spot policy
* format
* assert len(task.resources) >= 1
* fix bug and added SpotOnDemandMix
* bug fix and edit wording on yaml
* spot_policy_str
* update examples/serve/policy/spot_on_demand_mix.yaml yaml
* yaml doc
* not expose the autoscaler option to the user
* update interface
* update initialization
* require use_spot explicitly
* went through spec
* remove NAME
* update yaml
* revert is True
* update yamls
* interface fix
* added multi accelerator support
* fix yaml
* fix UI issues
* fix pr reviews
* replica_ids_to_scale_down
* update autoscaler names
* remove initialization and add back checking active
* remove target_qps_per_replicas
* max_replicas required where target_qps_per_replica is set
* pr
* fix nits
* code review
* code review
* remove Autoscaler.from_spec
* num_ready_spot
* removed spot placer
* update yaml
* delete spot_only yaml
* format
* # use_spot is needed for ondemand fallback
* error msg
* update print
* Update sky/serve/autoscalers.py
  Co-authored-by: Zhanghao Wu
* code review
* added a todo
* added todo
* smoke test for base on demand fallback
* pr review
* update ports
* replace up validation to _validate_service_task
* updated status order
* get_dynamic_states and load dynamic_states
* move function locations
* add other statuses
* code review
* fix pr
* move interrupted position
* added smoke test test_skyserve_dynamic_ondemand_fallback
* _terminate_gcp_replica
* updated smoke tests
* update first_ready_time
* code review
* update smoke test sleep

---------

Co-authored-by: cblmemo
Co-authored-by: Zhanghao Wu