[SkyServe] Introducing smoke test and fix bugs #2411
Merged
infwinston approved these changes on Aug 18, 2023
LGTM!
cblmemo added a commit that referenced this pull request on Nov 15, 2023
* init * format * format * reademe * update * [SkyServe] add http server example (#2260) add http server example * [SkyServe] `sky serve` CLI prototype (#2276) * Add service schema * use new serve YAML * change to qpm * change to fix node * refactor init of SkyServiceSpec * change http example to new yaml format * update default value of from_yaml_config and handle service in task * Launching successfully * use argument in controller & redirector * resolve comments * use qps instead * raise when multiple task found * change to qps * introduce constants * introduce constants & fix bugs * add sky down * add Services No existing services. without STATUS (but with #healthy replica * format * add llama2 example * add fields to service db * status with replica information * fix policy parsing bug * add auth todo * add replica status todo * change cluster name prefix and order of the column * minor fixes * reorder status * change name: controller --> control plane * change name: middleware --> controller * clean code * rename default service name * env vars * add purge and skip identity check on serve controller * upload filemounts and workdir to storage & enhance --purge * [SkyServe] Refactoring, Introducing multiprocess for provisioning and `sky serve logs` prototype (#2311) * introducing multiprocessing prototype * add run env to controller & redirector * reefactor and format * add control-plane and redirector logs * minor * minor * Refactor: move to infra provider * Refactor: move load balancer to redirector * refactor, add more logging * add replica status * resolve some TODOs * add post data feature * rename, format * add error message handling * bug fix & logging * fix a bug in continuous unhealthy * add error when user port is same with control plane * fix None post_data bug * add stable diffusion example * remove response body when code == 200 * add some TODOs and change RUNNING to READY * add failed status * add TODO for return failed replica info * fix sky 
serve status --help error * add console help messages * remove redundant stable diffusion setup files * rename healthy_replica --> ready_replica * adopt advice from code review * rename to service_name * adopt advice from comment * [SkyServe] Use SSH for Authentication, new replica status, `sky serve logs` for replica info (#2353) * introducing multiprocessing prototype * add run env to controller & redirector * reefactor and format * add control-plane and redirector logs * minor * minor * Refactor: move to infra provider * Refactor: move load balancer to redirector * refactor, add more logging * add replica status * resolve some TODOs * add post data feature * rename, format * add error message handling * bug fix & logging * fix a bug in continuous unhealthy * add error when user port is same with control plane * fix None post_data bug * add stable diffusion example * remove response body when code == 200 * add some TODOs and change RUNNING to READY * add failed status * add TODO for return failed replica info * fix sky serve status --help error * add console help messages * remove redundant stable diffusion setup files * rename healthy_replica --> ready_replica * finish replica info & num * finish * adopt advice from code review * rename to service_name * finish state machine; TODO property based implementation * adopt advice from comment * adopt comments in #2311 * finish new replica status * modify http example more resonable * UX details & set default controller resources to VCPU=4 * add spinner for launching contorl plane & redirector process * add sky serve logs CLI for replicas * add uptime section for service table * relaunch replicas which terminated by exceeding consecutive failure threshold * UX details * code style * move serve dependency to controller yaml setup section * add launch log for replica * add resources preview * stop jupyter service to avoid port conflict * Apply suggestions from code review Co-authored-by: Wei-Lin Chiang <[email 
protected]> * fix userjob failed and launch failed not terminate replica; replica status FAILED --> CLEANUP_FAILED since we terminate all FAILED replica immediately now; remove --purge in termination * ux nits * 0.0.0.0 -> localhost * new log logic: use cluster status == UP instead of waiting 10s; early quit for replica not exist; skip all detailed file sync log * ux nits * change readiness timeout to initial delay seconds * disable some logging when SKYPILOT_DEBUG is not set * restore debug yaml * remove debug message * sync down log before teardown * rename failed status name (replica) * change controller resources vcpu to 4+ to avoid no 4 vcpu cloud * disable -c, -r, -i in sky serve logs CLI * add REPLICA column in service status * add CONTROLLER_FAILED status; wait until control plane & redirector job to be running. * add color for CONTROLLER_FAILED and a prompt to cleanup first if re-up a failed service * change uptime to first time ready * format * add comment for replica/service status in sky serve status -h * simplify yaml design * remove controller resources cloud=gcp * remove controller resources cloud=gcpsome comment * redirect setup logs to devnull * redirector listen on 0.0.0.0 & add app_port to controller resources * ux * fix readiness suffix * fix * fix * remove cloud=gcp * ux: remove reduncant str * disable launch & down & stop with reserved prefix controller- * support sky serve down service-* * ux * cleanup cloud storage when terminate * enable customized controller resources * abort if ports specified in resources * reorder service status column * new sky serve status: show replica all the time; refresh in parallel; check network first * remove name since we have service name column * at least one replica is ready -> service ready * Update sky/cli.py Co-authored-by: Wei-Lin Chiang <[email protected]> * Update sky/backends/backend_utils.py Co-authored-by: Wei-Lin Chiang <[email protected]> * Update sky/status_lib.py Co-authored-by: Wei-Lin Chiang 
<[email protected]> * Update sky/backends/backend_utils.py Co-authored-by: Wei-Lin Chiang <[email protected]> * Update sky/backends/backend_utils.py Co-authored-by: Wei-Lin Chiang <[email protected]> * Update sky/serve/redirector.py Co-authored-by: Wei-Lin Chiang <[email protected]> * Update sky/serve/redirector.py Co-authored-by: Wei-Lin Chiang <[email protected]> * add vllm example * upd http example * change uptime to None and merge get_uptime and get_replica_info * restore debug comment out code * add comment for DEFAULT_INITIAL_DELAY_SECONDS * min_replica -> min_replcias * format * Apply suggestions from code review Co-authored-by: Wei-Lin Chiang <[email protected]> * upd tgi example * upd examples * format, remove unnecessary refresh in sky serve logs, raise valueerror instead of click.secho red * add minimal http example * Apply suggestions from code review Co-authored-by: Wei-Lin Chiang <[email protected]> * fix typo * Apply suggestions from code review --------- Co-authored-by: Wei-Lin Chiang <[email protected]> * [SkyServe] Final changes for v0 release (#2396) * add vicuna v1.5 example * add replica ip in table; rename some vars * warning if sky launch a service yaml * format * start progress after error log * fix type name * log format * logger with skylogging format * dump user app fail to control plane log * ux * add launched_at and service_yaml to local DB; delete cloud storage locally * rapid bootstraping * format * move skyserve controller to separate section in sky status * add hint to see detailed sky serve status * restore example * rename control plane to controller * rename to hello_skyserve * rename to hello_skyserve * change port to align doc * inline controller failed checking * override user resources parameter * format * add some todos * remove redundant return * use handle to store information * fix error const name * simplify resources representation * check cluster status earlier * minor * minor * add back service section since we still 
need it in controller * restore vicuna example * print all info when use sky serve status -a * better handling of unknown status * add warning for status that cannot be sky.down * minor comment fixes * remove Tip: to reuse an existing cluster * enable extra port on controller * more detailed info when acc is None * Apply suggestions from code review Co-authored-by: Wei-Lin Chiang <[email protected]> * add doc string --------- Co-authored-by: Wei-Lin Chiang <[email protected]> * fix serve example (#2406) fix * surface debug msg (#2407) * add msg * shorten * fix * add msg * [SkyServe] Fix port failover (#2408) fix * [SkyServe] Introducing smoke test and fix bugs (#2411) * add gcp tests * add azure and aws test * fix cloud dependencies * use larger disk size to enable azure controller * mixed cloud test & install gcloud cli * format * fix * add prehook * minor & add smoke test function * [SkyServe] Add cancel and gorilla example (#2417) * add cancel and gorilla example * update yaml & add readme * add CLI request cancel * Update sky/serve/examples/gorilla/gorilla.yaml Co-authored-by: Wei-Lin Chiang <[email protected]> * Update sky/serve/examples/misc/cancel/service.yaml Co-authored-by: Wei-Lin Chiang <[email protected]> * advice from code review * upd fschat installation --------- Co-authored-by: Wei-Lin Chiang <[email protected]> * [SkyServe] Fix interrupt process group and format (#2449) * fix * format * update skyserve prompt * resolve comments * fix vllm * add authentication init * typo fix & remove prehook * fix --no-follow in replica log and disable cancel log when skyserve down * set controller task resources for when controller failed to provision best resources * finish llm & interrupt test * early check cluster name is valid * add hint message for tailing replica job status * make dict thread safe * upd doc * rename redirector to lb * rename sky serve controller prefix * restore example * upd smoke test * use asyncio * change core with underlying function to 
avoid usage collection on status * reatore comma * add comment * adopt comments in #2473 * Apply suggestions from code review Co-authored-by: Wei-Lin Chiang <[email protected]> * Apply suggestions from code review * fix example * Fix serve probe (#2513) reset * [SkyServe] Add option to auto restart (#2518) * add auto restart * add smoke test * add task yaml for smoke test * Apply suggestions from code review Co-authored-by: Wei-Lin Chiang <[email protected]> * apply suggestions from code review --------- Co-authored-by: Wei-Lin Chiang <[email protected]> * [SkyServe] Fix auto restart (#2521) * add auto restart * add smoke test * add task yaml for smoke test * Apply suggestions from code review Co-authored-by: Wei-Lin Chiang <[email protected]> * apply suggestions from code review * fix * fix --------- Co-authored-by: Wei-Lin Chiang <[email protected]> * [SkyServe] Using Multi-service controller (#2489) * finish * remove finished TODO * remove task debug * fix cli flags * fix cli * format * filter our not UP controller * add --new-controller * add autodown * not expose controller * fix * early exit when job is pending * fix typo * add pending cnt * move get ports to serve_utils * early store service handle * move controller_port to infraprovider * add constants * format * format * format * add autodown hint for skyservecontroller * add todo * fix bug * extent timeout * broad except * group by services * fix bug * endpoint -> endpoint ip; disable user controller port * remove local replica infoo; move controller cluster name to DB; add service nam * auto switch to new controller * fix & add TODO * nit * Apply suggestions from code review Co-authored-by: Wei-Lin Chiang <[email protected]> * Apply suggestions from code review * prototype. todo: debug * fix some bugs * finish most of them. 
TODO: merge all jobs into one * fix * merge all jobs into serve-controller.yaml * add service name precheck * minor * make python API more robust * apply suggestion from code review * aggregate env vars for controller * rename record * upd comments * Update sky/execution.py Co-authored-by: Wei-Lin Chiang <[email protected]> * change to set, format * extract _get_service_num_on_controller into another function * use autostop * set task default vCPU to 0.125 * minor * cleanup utility files after sky serve down * fix spot controller * Update sky/execution.py Co-authored-by: Wei-Lin Chiang <[email protected]> * apply suggestions from code review * Update sky/cli.py Co-authored-by: Wei-Lin Chiang <[email protected]> * Update sky/backends/cloud_vm_ray_backend.py Co-authored-by: Wei-Lin Chiang <[email protected]> * apply suggestion from code review --------- Co-authored-by: Wei-Lin Chiang <[email protected]> * Update sky/cli.py Co-authored-by: Zongheng Yang <[email protected]> * Update sky/serve/examples/stable_diffusion_service.yaml Co-authored-by: Zongheng Yang <[email protected]> * style for example * format & add comments & rephrase * move serve_down to core and refactor * Update sky/execution.py Co-authored-by: Zhanghao Wu <[email protected]> * apply suggestions from code review * move examples * Update sky/cli.py Co-authored-by: Zhanghao Wu <[email protected]> * use _make_task_or_dag_from_entrypoint_with_overrides & minor * make sky serve status accept multiple service names * minor * minor * upd docstring * fix * better programmatic api * ux * use flag to control logging * combine reserved prefix & name * fix * expand user * better UX for auto restart * fix consecutive timeout threshold * minor * nnit * temp remove * add back * only open ports used * remove redundant task yaml for load balancer * move task ports handling to python API * fix controller generation bug * UX nits * nit * Fix sky serve down --purge when storage cleanup failed * ux * reuse service handle 
* revert * add todo * [SkyServe] Add Ray Serve example (#2621) * Add Ray Serve example * Update serve YAML * restore job id type * Apply suggestions from code review Co-authored-by: Zhanghao Wu <[email protected]> * lint * move service section to the top * add docstr * make controller port not optinal * remove cpu demand for gpu workloads * make ReservedClusterGroup.get_group accept none arg * nit * remove yaml_only and task_only in _make_task_or_dag_from_entrypoint_with_overrides * cli nits * remove get_glob_service_names * fix pop CPU * remove CPU demand for job when presented in CLI * remove cancel and use os.kill now * add db on controller VM, remove job id and use skylet to refresh service status * minor * merge controllers with normal clusters * deprecate controller port adn refresh in infra provider * Apply suggestions from code review Co-authored-by: Zhanghao Wu <[email protected]> Co-authored-by: Zongheng Yang <[email protected]> * use enum in serve logs + minor * remove stop hint * env vars & monir * use cluster regex and remove service regex * remove pylint hint & rename app_port -> replica_port * change constants used in _maybe_translate_local_file_mounts_and_sync_up * add --endpoint and ux * new termination of controller & lb; minor suggestions * move statuses to serve_state, minors * minor * speedup terminate * replica db mypy & pylint passed * fix * fix scale down & cleanup failed replica * minor * move controller resources to config.yaml * fix smoke test * refactor request information report * architecture: autoscaler no longer talk with infra provider again * fix auto restart * sync down logs and then streaming * terminate log streaming when service is downed * refactor autoscaler & argument pass in controller * move several var from local db * use a control process to handle signal and terminate service * move port selection to the controller VM * minor * refactor dataabse * comments, docs and function reorder for infra providers * change 
launch/terminate replica to python API * fix pass task in multiprocessing * minor & fix smoke test * rename infra_provider to replica_manager * ux * not count as failure if UP for more than initial_delay_seconds * default auto restart to true * add todo * upd test for auto_restart * Use only one controller and remove local database. TODO: change to SERVICE_ID to avoid name conflict. * minor * Apply suggestions from code review Co-authored-by: Zongheng Yang <[email protected]> * apply suggestion from code review * mske sky status showing service as well * fix * replica manager ux; use sky logger for uvicorn * UX, refactoring * rephrase hint after sky serve up * Update sky/execution.py Co-authored-by: Zongheng Yang <[email protected]> * comments * add service name check before sky serve up * rename reserved cluster to controller * upd schema * change to async function for fastapi * add multiple ports TODO * fix outdated example * [SkyServe] Serving with Spot (#2749) * rebase and fix bugs * fix PR reviews * fix * fix comments * rename tests * fix yaml replica_num * fix sky status pool wait * fix sync down logs failed * upd examples * add gorilla notebook * add todo for customizable setup commands * add launch log to streaming * move comment position * catch error and print log * align output * ux * fix storage cleanup failure * fix extra newline * comments * Apply suggestions from code review Co-authored-by: Zongheng Yang <[email protected]> * rename * Update sky/core.py Co-authored-by: Zongheng Yang <[email protected]> * apply suggestion from code review * apply suggestion from code review * format * move controller related functions/classes to controller_utils * apply suggestion from code review * Update sky/exceptions.py Co-authored-by: Zhanghao Wu <[email protected]> * Update sky/serve/replica_managers.py Co-authored-by: Zhanghao Wu <[email protected]> * import * Update sky/serve/autoscalers.py Co-authored-by: Zhanghao Wu <[email protected]> * move max #sky.launch 
to replica manager and limit total # across services * refactor autoscaler * pass json dict rather than pickle * apply suggestion from code review * Apply suggestions from code review Co-authored-by: Zhanghao Wu <[email protected]> * apply suggestion from code review * bug fix & apply suggestion from code review * apply suggestions * comments * move port to resources * add cancel test * fix controller resources cloud not specified * ux * ux for down * add smoke retry teardown * resolve conflict * add endpoint, move check for name conflict to _execute * apply suggestion from code review * fix jinja2 var * rename reserved cluster, move controller_utils function back * stream logs * fix not showing controller * move controllers back to controller_utils * fix wrong controller resources when controller is exist * use return value to indicate success * default controller resources & better error handling * stress test passed * nits * fix examples * refactor: moving some funcs in execution.py to controller_utils * smoke test passed * remove --target & minor * teardown failed services with --purge flag * move core api to sky/serve/api.py * resolve controller_utils circular import * fix spot config * minor ux * add todo for default argument for sky serve logs * resolve circular import. * fix all circular import * minor * apply suggestion from code review --------- Co-authored-by: Wei-Lin Chiang <[email protected]> Co-authored-by: Wei-Lin Chiang <[email protected]> Co-authored-by: Zongheng Yang <[email protected]> Co-authored-by: Zhanghao Wu <[email protected]> Co-authored-by: Isaac Ong <[email protected]> Co-authored-by: Ziming Mao <[email protected]>
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh