Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introducing SkyServe #2458

Merged
merged 242 commits into from
Nov 15, 2023
Merged

Introducing SkyServe #2458

merged 242 commits into from
Nov 15, 2023

Conversation

cblmemo
Copy link
Collaborator

@cblmemo cblmemo commented Aug 24, 2023

This PR introduces SkyServe, a lightweight multi-cloud serving system.

User doc: https://docs.google.com/document/d/1vVmzLF-EkG3Moj-q47DQBGvFipK4PNfkz0V6LyaPstE/edit#heading=h.m6tcx7d50j23

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

infwinston and others added 30 commits July 13, 2023 22:51
* Add service schema

* use new serve YAML

* change to qpm

* change to fix node

* refactor init of SkyServiceSpec

* change http example to new yaml format

* update default value of from_yaml_config and handle service in task

* Launching successfully

* use argument in controller & redirector

* resolve comments

* use qps instead

* raise when multiple task found

* change to qps

* introduce constants

* introduce constants & fix bugs

* add sky down

* add Services
No existing services. without STATUS (but with #healthy replica

* format

* add llama2 example

* add fields to service db

* status with replica information

* fix policy parsing bug

* add auth todo

* add replica status todo

* change cluster name prefix and order of the column

* minor fixes

* reorder status

* change name: controller --> control plane

* change name: middleware --> controller

* clean code

* rename default service name

* env vars

* add purge and skip identity check on serve controller

* upload filemounts and workdir to storage & enhance --purge
… `sky serve logs` prototype (#2311)

* introducing multiprocessing prototype

* add run env to controller & redirector

* reefactor and format

* add control-plane and redirector logs

* minor

* minor

* Refactor: move  to infra provider

* Refactor: move load balancer to redirector

* refactor, add more logging

* add replica status

* resolve some TODOs

* add post data feature

* rename, format

* add error message handling

* bug fix & logging

* fix a bug in continuous unhealthy

* add error when user port is same with control plane

* fix None post_data bug

* add stable diffusion example

* remove response body when code == 200

* add some TODOs and change RUNNING to READY

* add failed status

* add TODO for return failed replica info

* fix sky serve status --help error

* add console help messages

* remove redundant stable diffusion setup files

* rename healthy_replica --> ready_replica

* adopt advice from code review

* rename to service_name

* adopt advice from comment
… logs` for replica info (#2353)

* introducing multiprocessing prototype

* add run env to controller & redirector

* reefactor and format

* add control-plane and redirector logs

* minor

* minor

* Refactor: move  to infra provider

* Refactor: move load balancer to redirector

* refactor, add more logging

* add replica status

* resolve some TODOs

* add post data feature

* rename, format

* add error message handling

* bug fix & logging

* fix a bug in continuous unhealthy

* add error when user port is same with control plane

* fix None post_data bug

* add stable diffusion example

* remove response body when code == 200

* add some TODOs and change RUNNING to READY

* add failed status

* add TODO for return failed replica info

* fix sky serve status --help error

* add console help messages

* remove redundant stable diffusion setup files

* rename healthy_replica --> ready_replica

* finish replica info & num

* finish

* adopt advice from code review

* rename to service_name

* finish state machine; TODO property based implementation

* adopt advice from comment

* adopt comments in #2311

* finish new replica status

* modify http example more resonable

* UX details & set default controller resources to VCPU=4

* add spinner for launching contorl plane & redirector process

* add sky serve logs CLI for replicas

* add uptime section for service table

* relaunch replicas which terminated by exceeding consecutive failure threshold

* UX details

* code style

* move serve dependency to controller yaml setup section

* add launch log for replica

* add resources preview

* stop jupyter service to avoid port conflict

* Apply suggestions from code review

Co-authored-by: Wei-Lin Chiang <[email protected]>

* fix userjob failed and launch failed not terminate replica; replica status FAILED --> CLEANUP_FAILED since we terminate all FAILED replica immediately now; remove --purge in termination

* ux nits

* 0.0.0.0 -> localhost

* new log logic: use cluster status == UP instead of waiting 10s; early quit for replica not exist; skip all detailed file sync log

* ux nits

* change readiness timeout to initial delay seconds

* disable some logging when SKYPILOT_DEBUG is not set

* restore debug yaml

* remove debug message

* sync down log before teardown

* rename failed status name (replica)

* change controller resources vcpu to 4+ to avoid no 4 vcpu cloud

* disable -c, -r, -i in sky serve logs CLI

* add REPLICA column in service status

* add CONTROLLER_FAILED status; wait until control plane & redirector job to be running.

* add color for CONTROLLER_FAILED and a prompt to cleanup first if re-up a failed service

* change uptime to first time ready

* format

* add comment for replica/service status in sky serve status -h

* simplify yaml design

* remove controller resources cloud=gcp

* remove controller resources cloud=gcpsome comment

* redirect setup logs to devnull

* redirector listen on 0.0.0.0 & add app_port to controller resources

* ux

* fix readiness suffix

* fix

* fix

* remove cloud=gcp

* ux: remove reduncant str

* disable launch & down & stop with reserved prefix controller-

* support sky serve down service-*

* ux

* cleanup cloud storage when terminate

* enable customized controller resources

* abort if ports specified in resources

* reorder service status column

* new sky serve status: show replica all the time; refresh in parallel; check network first

* remove name since we have service name column

* at least one replica is ready -> service ready

* Update sky/cli.py

Co-authored-by: Wei-Lin Chiang <[email protected]>

* Update sky/backends/backend_utils.py

Co-authored-by: Wei-Lin Chiang <[email protected]>

* Update sky/status_lib.py

Co-authored-by: Wei-Lin Chiang <[email protected]>

* Update sky/backends/backend_utils.py

Co-authored-by: Wei-Lin Chiang <[email protected]>

* Update sky/backends/backend_utils.py

Co-authored-by: Wei-Lin Chiang <[email protected]>

* Update sky/serve/redirector.py

Co-authored-by: Wei-Lin Chiang <[email protected]>

* Update sky/serve/redirector.py

Co-authored-by: Wei-Lin Chiang <[email protected]>

* add vllm example

* upd http example

* change uptime to None and merge get_uptime and get_replica_info

* restore debug comment out code

* add comment for DEFAULT_INITIAL_DELAY_SECONDS

* min_replica -> min_replcias

* format

* Apply suggestions from code review

Co-authored-by: Wei-Lin Chiang <[email protected]>

* upd tgi example

* upd examples

* format, remove unnecessary refresh in sky serve logs, raise valueerror instead of click.secho red

* add minimal http example

* Apply suggestions from code review

Co-authored-by: Wei-Lin Chiang <[email protected]>

* fix typo

* Apply suggestions from code review

---------

Co-authored-by: Wei-Lin Chiang <[email protected]>
* add vicuna v1.5 example

* add replica ip in table; rename some vars

* warning if sky launch a service yaml

* format

* start progress after error log

* fix type name

* log format

* logger with skylogging format

* dump user app fail to control plane log

* ux

* add launched_at and service_yaml to local DB; delete cloud storage locally

* rapid bootstraping

* format

* move skyserve controller to separate section in sky status

* add hint to see detailed sky serve status

* restore example

* rename control plane to controller

* rename to hello_skyserve

* rename to hello_skyserve

* change port to align doc

* inline controller failed checking

* override user resources parameter

* format

* add some todos

* remove redundant return

* use handle to store information

* fix error const name

* simplify resources representation

* check cluster status earlier

* minor

* minor

* add back service section since we still need it in controller

* restore vicuna example

* print all info when use sky serve status -a

* better handling of unknown status

* add warning for status that cannot be sky.down

* minor comment fixes

* remove Tip: to reuse an existing cluster

* enable extra port on controller

* more detailed info when acc is None

* Apply suggestions from code review

Co-authored-by: Wei-Lin Chiang <[email protected]>

* add doc string

---------

Co-authored-by: Wei-Lin Chiang <[email protected]>
* add msg

* shorten

* fix

* add msg
* add gcp tests

* add azure and aws test

* fix cloud dependencies

* use larger disk size to enable azure controller

* mixed cloud test & install gcloud cli

* format

* fix

* add prehook

* minor & add smoke test function
* add cancel and gorilla example

* update yaml & add readme

* add CLI request cancel

* Update sky/serve/examples/gorilla/gorilla.yaml

Co-authored-by: Wei-Lin Chiang <[email protected]>

* Update sky/serve/examples/misc/cancel/service.yaml

Co-authored-by: Wei-Lin Chiang <[email protected]>

* advice from code review

* upd fschat installation

---------

Co-authored-by: Wei-Lin Chiang <[email protected]>
* fix

* format

* update skyserve prompt

* resolve comments

* fix vllm
@cblmemo
Copy link
Collaborator Author

cblmemo commented Nov 14, 2023

mipl@ip-192-168-1-3 skypilot % sky serve status
E 11-14 14:21:46 subprocess_utils.py:73] Traceback (most recent call last):
E 11-14 14:21:46 subprocess_utils.py:73]   File "<string>", line 1, in <module>
E 11-14 14:21:46 subprocess_utils.py:73] AttributeError: module 'sky.serve.serve_utils' has no attribute 'get_service_status_encoded'. Did you mean: 'get_serve_status_encoded'?
E 11-14 14:21:46 subprocess_utils.py:73] 
Services
Failed to fetch service statuses due to connection issues. Please try again later. Details: [RuntimeError] Failed to fetch services
mipl@ip-192-168-1-3 skypilot % 

getting this on latest commits

This is because we made an API change. sky serve down -a -y and re-up them should fix the issue 🫡

@manishiitg
Copy link
Contributor

E 11-15 09:05:20 subprocess_utils.py:73] Traceback (most recent call last):
E 11-15 09:05:20 subprocess_utils.py:73]   File "<string>", line 1, in <module>
E 11-15 09:05:20 subprocess_utils.py:73] TypeError: terminate_services() got an unexpected keyword argument 'purge'
E 11-15 09:05:20 subprocess_utils.py:73] 
sky.exceptions.CommandError: Command python3 -u -c 'from sky.serve import serve_state; from sky.serve import serve_utils; msg = serve_utils.terminate_services(None, purge=False); print(msg, end="", flush=True)' failed with return code 1.
Failed to terminate service

The above exception was the direct cause of the following exception:

RuntimeError: Failed to terminate service
mipl@ip-192-168-1-3 ~ % 

getting the same issue stil

@cblmemo
Copy link
Collaborator Author

cblmemo commented Nov 15, 2023


E 11-15 09:05:20 subprocess_utils.py:73] Traceback (most recent call last):

E 11-15 09:05:20 subprocess_utils.py:73]   File "<string>", line 1, in <module>

E 11-15 09:05:20 subprocess_utils.py:73] TypeError: terminate_services() got an unexpected keyword argument 'purge'

E 11-15 09:05:20 subprocess_utils.py:73] 

sky.exceptions.CommandError: Command python3 -u -c 'from sky.serve import serve_state; from sky.serve import serve_utils; msg = serve_utils.terminate_services(None, purge=False); print(msg, end="", flush=True)' failed with return code 1.

Failed to terminate service



The above exception was the direct cause of the following exception:



RuntimeError: Failed to terminate service

mipl@ip-192-168-1-3 ~ % 



getting the same issue stil

Oh, probably try roll back to previous commit, run sky serve down and then switch to latest commit?

@manishiitg
Copy link
Contributor

hi,

i went to a previous version and did sky server down -a -y and it worked.

but when i again switch to latest commits i get the same errors again?

@cblmemo
Copy link
Collaborator Author

cblmemo commented Nov 15, 2023

hi,

i went to a previous version and did sky server down -a -y and it worked.

but when i again switch to latest commits i get the same errors again?

Sorry for not making it clear 🤧 You can use sky start -f <your-controller-name> to update the code in your controller. Or sky serve up any new services should work too 🫡

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @cblmemo! LGTM. Just tried the branch in a new docker container and it works nicely.

sky/serve/api.py Outdated Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
sky/utils/resources_utils.py Outdated Show resolved Hide resolved
@Michaelvll
Copy link
Collaborator

nit: a UX question: is it possible for our load balancer to notice that the user is using a curl instead of a curl -L and return a output that -L should be added?

@cblmemo
Copy link
Collaborator Author

cblmemo commented Nov 15, 2023

nit: a UX question: is it possible for our load balancer to notice that the user is using a curl instead of a curl -L and return a output that -L should be added?

It seems like -L is only a client-side argument and it only affects if the curl will retry in the client if receives a redirect response. See man curl here:

       -L, --location
              (HTTP)  If  the  server  reports  that  the requested page has moved to a different location (indicated with a Location: header and a 3XX response
              code), this option will make curl redo the request on the new place. If used together with -i, --include or -I, --head, headers from all requested
              pages  will  be  shown.  When authentication is used, curl only sends its credentials to the initial host. If a redirect takes curl to a different
              host, it won't be able to intercept the user+password. See also --location-trusted on how to change this. You can limit the amount of redirects to
              follow by using the --max-redirs option.

              When  curl follows a redirect and if the request is a POST, it will do the following request with a GET if the HTTP response was 301, 302, or 303.
              If the response code was any other 3xx code, curl will re-send the following request using the same unmodified method.

              You can tell curl to not change POST requests to GET after a 30x response by using the  dedicated  options  for  that:  --post301,  --post302  and
              --post303.

              The method set with -X, --request overrides the method curl would otherwise select to use.

Also, I tried to print the fastapi.Request information w/ & w/o -L and indeed found them identical:

image

@cblmemo cblmemo merged commit cddfef8 into master Nov 15, 2023
19 checks passed
@cblmemo cblmemo deleted the serve-dev branch November 15, 2023 18:14
@cblmemo cblmemo restored the serve-dev branch November 16, 2023 01:41
@manishiitg
Copy link
Contributor

i got stuck due the -L issue a lot.

i was using https://github.com/huggingface/text-generation-inference to deploy by models for inference and using their official client https://github.com/huggingface/text-generation-inference/tree/main/clients/python

there client doesn't support redirects, took few hours for me to figure this out :)

this should be mentioned in documentation as a note

@cblmemo
Copy link
Collaborator Author

cblmemo commented Nov 16, 2023

i got stuck due the -L issue a lot.

i was using https://github.com/huggingface/text-generation-inference to deploy by models for inference and using their official client https://github.com/huggingface/text-generation-inference/tree/main/clients/python

there client doesn't support redirects, took few hours for me to figure this out :)

this should be mentioned in documentation as a note

Thanks! We'll definitely mentioned it in the doc 🫡

@concretevitamin
Copy link
Member

@manishiitg Do you have a preference for the default behavior of sky serve logs without any arguments?

  1. print a replica's logs (e.g., the oldest / latest live replica)
  2. print controller logs
  3. (current behavior) don't print any logs; print hints for arguments
  4. no preference
  5. something else?

@manishiitg
Copy link
Contributor

No preference

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants