Commit
DAG API Enhancements: Introducing Downstream Task Parsing and Explicit Flow Definition (#4067)

* provide an example, edited from pipeline.yml
* more focus on dependencies for user dag lib
* more powerful user interface
* load and dump new yaml format
* fix
* fix: reversed logic in add_edge
* [docs] Unroll k8s internal load balancer docs (#4083)
  unroll load balancer docs
* rename
* refactor due to reviewer's comments
* generate task.name if not given
* [docs] `sky status --kubernetes` docs (#4064)
* observability docs
* comments
* [UX] Show log after failure and fix the color issue with narrow window (#4084)
* fix narrow window and show log path during exception
* format
* format
* [k8s] `sky status --k8s` refactor (#4079)
* refactor
* lint
* refactor, dataclass
* refactor, dataclass
* refactor
* lint
* add comments for add_edge
* add `print_exception_no_traceback` when raise
* make `Dag.tasks` a property
* print dependencies for `__repr__`
* move `get_unique_task_name` to common_utils
* [Performance] Use new GCP custom images (#4027)
* [Performance] Use new custom image to create GCP GPU VMs
* update image tags for both CPU and GPU
* always generate .sky/python_path
  ---------
  Co-authored-by: Yika Luo <[email protected]>
* [GCP] Add H100 mega (#4099)
* Add H100 mega support on GCP
* fix for some other regions
* format
* fix resource type
* fix catalog fetching
* [GCP] Add gVNIC support (#4095)
* add gvnic support through config.yaml
* lint
* docs
* [Lambda] Lambda Cloud SkyPilot provisioner (#3865)
* feat: lambda cloud new provisioner
* feat: address cblmemo reviews and other reviews + make multi-node work again
* fix: quotes
* fix: address some reviews
* chore: rm unused option
* chore: update typedef
* feat: use lists directly
* fix: formatting
* chore: address reviews
* fix: formatting
* chore: rm query ports since default impl per review
* feat: add back query ports
* fix: formatting
* chore: add newline at eof
* feat: try removing query ports again
* [Docs] GKE Nvidia Driver installation instructions update (#4106)
* docs
* docs
* docs
* [Performance] Use new AWS custom images (#4091)
* rename methods to use downstream/edge terminology
* [Performance] Add Packer image generation scripts for GCP and AWS (#4068)
* [Performance] Add Packer image generation scripts for GCP and AWS
* Add docker install and tests
* solve nvidia container issue
* Install cuDNN
* [Performance] Scripts to copy/delete AWS images for all regions and add cloud deps (#4073)
* [Performance] Add AWS script to copy images for all regions
* script to delete all AWS images across regions
* Add cloud dependencies to image
  ---------
  Co-authored-by: Yika Luo <[email protected]>
* Disable AWS images.csv refreshing (#4116)
* [Docs] .skyignore doc (#4114)
* [Docs] .skyignore doc
* Correct typos
  Co-authored-by: Zongheng Yang <[email protected]>
  ---------
  Co-authored-by: Zongheng Yang <[email protected]>
* [Core] Raise error for none existing cluster when endpoint is called (#4117)
  raise error for none existing cluster
* Refresh local aws images.csv when image not found (#4127)
  Refresh local aws images.csv by pulling from github catalog when image tag not found
* [Docs] News revamps. (#4126)
* News revamps.
  updates updates updates updates updates updates updates updates
* Apply suggestions from code review
  Co-authored-by: Zhanghao Wu <[email protected]>
  ---------
  Co-authored-by: Zhanghao Wu <[email protected]>
* [Serve] Support manually terminating a replica and with purge option (#4032)
* define replica id param in cli
* create endpoint on controller
* call controller endpoint to scale down replica
* add classmethod decorator
* add handler methods for readability in cli
* update docstr and error msg, and inline in cli
* update log and return err msg
* add docstr, catch and reraise err, add stopped and nonexistent message
* inline constant to avoid circular import
* fix error statement and return encoded str
* add purge feature
* add purge replica usage in docstr
* use .get to handle unexpected packages
* fix: diff terminate replica when failed/purging or not
* fix: stay up to date for `is_controller_accessible`
* revert
* up to date with current APIs
* error handling
* when purged remove record in the main loop
* refactor due to reviewer's suggestions
* combine functions
* fix: terminate the healthy replica even with purge option
* remove abbr
* Update sky/serve/core.py
  Co-authored-by: Tian Xia <[email protected]>
* Update sky/serve/core.py
  Co-authored-by: Tian Xia <[email protected]>
* Update sky/serve/controller.py
  Co-authored-by: Tian Xia <[email protected]>
* Update sky/serve/controller.py
  Co-authored-by: Tian Xia <[email protected]>
* Update sky/cli.py
  Co-authored-by: Tian Xia <[email protected]>
* got services hint
* check if not yes in the outside if branch
* fix some output messages
* Update sky/serve/core.py
  Co-authored-by: Tian Xia <[email protected]>
* set conflict status code for already scheduled termination
* combine purge and normal terminating down branch together
* bump version
* global exception handler to render a json response with error messages
* fix: use responses.JSONResponse for dict serialize
* error messages for old controller
* fix: check version mismatch in generated code
* revert mistakenly change update_service
* refine already in terminating message
* fix: branch code workaround in cls.build
* wording
  Co-authored-by: Tian Xia <[email protected]>
* refactor due to reviewer's comments
* fix use ux_utils
  Co-authored-by: Tian Xia <[email protected]>
* add changelog as comments
* fix messages
* edit the message for mismatch error
  Co-authored-by: Tian Xia <[email protected]>
* no traceback when raising in `terminate_replica`
* messages decode
* Apply suggestions from code review
  Co-authored-by: Tian Xia <[email protected]>
* format
* forma
* Empty commit
  ---------
  Co-authored-by: David Tran <[email protected]>
  Co-authored-by: David Tran <[email protected]>
  Co-authored-by: Tian Xia <[email protected]>
* [Provisioner] Support docker in Lambda Cloud and TPU (#4115)
* [Provisioner] Support docker in Lambda Cloud
* fix permission issue
* merge with check docker installed
* add tpu support & test
* patch lambda cloud
* add comment
* Apply suggestions from code review
  Co-authored-by: Tian Xia <[email protected]>
* change wording all to up/downstream style
* Add unique suffix to task names, fallback to timestamp if unnamed
* Unify handling of single and multiple tasks without dependencies
* Refactor tasks initialization: use list comprehension and fail fast
* Fix remove task dependency description: upstream, not downstream
  Co-authored-by: Tian Xia <[email protected]>
* Remove duplicated `self.edges`, use nx api instead
* [Serve] Add `ux_utils.print_exception_no_traceback()` for cleaner error output (#4111)
* add `ux_utils.print_exception_no_traceback()` for cleaner error output
* Empty commit
* remove unnecessary with block
* Partially revert: Remove unnecessary `ux_utils.print_exception_no_traceback()` wrappers (#4130)
  fix unnecessary with block for returning
* Revert "Add unique suffix to task names, fallback to timestamp if unnamed"
  Otherwise, users can not refer to the task by name in the DAG.
  This reverts commit 8486352.
* comment the checking used as upstream logic
* [examples] Deepspeed fixes + k8s support (#4124)
  deepspeed kubernetes fixes
* Empty commit
* [OCI] Support more OS types in addition to ubuntu (#4080)
* Bug fix for sky config file path resolution.
* format
* [OCI] Bug fix for image_id in Task YAML
* [OCI]: Support more OS types (esp. oraclelinux) in addition to ubuntu.
* format
* Disable system firewall
* Bug fix for validation of the Marketplace images
* Update sky/clouds/oci.py
  Co-authored-by: Zhanghao Wu <[email protected]>
* Update sky/clouds/oci.py
  Co-authored-by: Zhanghao Wu <[email protected]>
* variable/function naming
* address review comments: not to change the service_catalog api. call oci_catalog directly for get os type for a image.
* Update sky/clouds/oci.py
  Co-authored-by: Zhanghao Wu <[email protected]>
* Update sky/clouds/oci.py
  Co-authored-by: Zhanghao Wu <[email protected]>
* Update sky/clouds/oci.py
  Co-authored-by: Zhanghao Wu <[email protected]>
* address review comments
  ---------
  Co-authored-by: Zhanghao Wu <[email protected]>
* Apply suggestions from code review
  Co-authored-by: Tian Xia <[email protected]>
* fix: typing.cast
* add TODOs for future function migration
* remove dependencies wording to reduce ambiguity
* temporarily add github actions

---------

Co-authored-by: Romil Bhardwaj <[email protected]>
Co-authored-by: Zhanghao Wu <[email protected]>
Co-authored-by: yika-luo <[email protected]>
Co-authored-by: Yika Luo <[email protected]>
Co-authored-by: Kote Mushegiani <[email protected]>
Co-authored-by: Zongheng Yang <[email protected]>
Co-authored-by: David Tran <[email protected]>
Co-authored-by: David Tran <[email protected]>
Co-authored-by: Tian Xia <[email protected]>
Co-authored-by: Hysun He <[email protected]>