[Core] Add support for detach setup #1379

Merged
merged 22 commits into from
Nov 7, 2022

Conversation

Collaborator

@Michaelvll Michaelvll commented Nov 5, 2022

Closes #730.

This PR adds the following option to sky.launch and sky launch: --async-setup

Note:

  1. The setup runs on all the nodes in the cluster, no matter how many nodes the task requires, so that every node ends up in the same state and nodes do not behave inconsistently after many setups have been run. (There is still a barrier after setup, i.e. the setup must finish on all the nodes before the run is executed.)
  2. Added a JobStatus called SETUP; the life cycle of a job is now INIT -> SETUP -> PENDING -> RUNNING -> SUCCEEDED/FAILED.
  3. Discussion: Should we combine --async-setup and --detach-run into a single --detach, i.e. setup and run detach together?
    Pro: simpler API for the user.
    Con: less control; some users' logic will need to change if it relies on the previous behavior.
    My thought: we can add --async-setup first in this PR to see how people feel about it, and defer the combination.
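
The job life cycle in note 2 can be sketched as a small transition table. This is only an illustration, assuming each state has exactly the successors listed in the note; the `JobStatus` class, `_NEXT`, `can_transition`, and `is_terminal` below are hypothetical stand-ins and not the code in sky/skylet/job_lib.py.

```python
import enum


class JobStatus(enum.Enum):
    """Hypothetical stand-in for the job states named in note 2."""
    INIT = 'INIT'
    SETUP = 'SETUP'
    PENDING = 'PENDING'
    RUNNING = 'RUNNING'
    SUCCEEDED = 'SUCCEEDED'
    FAILED = 'FAILED'


# Allowed forward transitions, following
# INIT -> SETUP -> PENDING -> RUNNING -> SUCCEEDED/FAILED.
_NEXT = {
    JobStatus.INIT: {JobStatus.SETUP},
    JobStatus.SETUP: {JobStatus.PENDING},
    JobStatus.PENDING: {JobStatus.RUNNING},
    JobStatus.RUNNING: {JobStatus.SUCCEEDED, JobStatus.FAILED},
    JobStatus.SUCCEEDED: set(),
    JobStatus.FAILED: set(),
}


def can_transition(src: JobStatus, dst: JobStatus) -> bool:
    """Return True if dst is a direct successor of src in the life cycle."""
    return dst in _NEXT[src]


def is_terminal(status: JobStatus) -> bool:
    """A state with no successors ends the life cycle."""
    return not _NEXT[status]
```

For example, `can_transition(JobStatus.SETUP, JobStatus.RUNNING)` is false in this sketch, because a job has to pass through PENDING after its setup.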

Example:

> sky launch -c async-setup4 -s --num-nodes 4 examples/minimal.yaml
Task from YAML spec: examples/minimal.yaml
I 11-05 15:21:13 optimizer.py:605] == Optimizer ==
I 11-05 15:21:13 optimizer.py:628] Estimated cost: $1.5 / hour
I 11-05 15:21:13 optimizer.py:628] 
I 11-05 15:21:13 optimizer.py:684] Considered resources (4 nodes):
I 11-05 15:21:13 optimizer.py:713] ------------------------------------------------------------------
I 11-05 15:21:13 optimizer.py:713]  CLOUD   INSTANCE      vCPUs   ACCELERATORS   COST ($)   CHOSEN   
I 11-05 15:21:13 optimizer.py:713] ------------------------------------------------------------------
I 11-05 15:21:13 optimizer.py:713]  AWS     m6i.2xlarge   8       -              1.54          ✔     
I 11-05 15:21:13 optimizer.py:713] ------------------------------------------------------------------
I 11-05 15:21:13 optimizer.py:713] 
Launching a new cluster 'async-setup4'. Proceed? [Y/n]: 
I 11-05 15:21:14 cloud_vm_ray_backend.py:2813] Creating a new cluster: "async-setup4" [4x AWS(m6i.2xlarge)].
I 11-05 15:21:14 cloud_vm_ray_backend.py:2813] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 11-05 15:21:14 cloud_vm_ray_backend.py:958] To view detailed progress: tail -n100 -f /Users/zhwu/sky_logs/sky-2022-11-05-15-21-13-272410/provision.log
I 11-05 15:21:15 cloud_vm_ray_backend.py:1218] Launching on AWS us-west-2 (us-west-2d,us-west-2b,us-west-2a,us-west-2c)
I 11-05 15:22:45 log_utils.py:45] Head node is up.
I 11-05 15:23:55 cloud_vm_ray_backend.py:1326] Successfully provisioned or found existing head VM. Waiting for workers.
I 11-05 15:26:18 cloud_vm_ray_backend.py:1053] Successfully provisioned or found existing VMs.
I 11-05 15:26:38 cloud_vm_ray_backend.py:2079] Preparing setup for 4 nodes.
I 11-05 15:26:45 cloud_vm_ray_backend.py:2155] Job submitted with Job ID: 1
I 11-05 22:26:45 log_lib.py:388] Start streaming logs for job 1.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
(setup pid=23992) running setup
(setup pid=22961, ip=172.31.37.209) running setup
(setup pid=22958, ip=172.31.43.180) running setup
(setup pid=22955, ip=172.31.37.178) running setup
INFO: Setup finished.
INFO: Waiting for task resources on 4 nodes. This will block if the cluster is full.
INFO: All task resources reserved.
INFO: Reserved IPs: ['172.31.43.180', '172.31.37.209', '172.31.45.100', '172.31.37.178']
(node-2 pid=23992) # conda environments:
(node-2 pid=23992) #
(node-2 pid=23992) base                  *  /opt/conda
(node-2 pid=23992) pytorch                  /opt/conda/envs/pytorch
(node-0 pid=22958, ip=172.31.43.180) # conda environments:
(node-0 pid=22958, ip=172.31.43.180) #
(node-0 pid=22958, ip=172.31.43.180) base                  *  /opt/conda
(node-0 pid=22958, ip=172.31.43.180) pytorch                  /opt/conda/envs/pytorch
(node-3 pid=22955, ip=172.31.37.178) # conda environments:
(node-3 pid=22955, ip=172.31.37.178) #
(node-3 pid=22955, ip=172.31.37.178) base                  *  /opt/conda
(node-3 pid=22955, ip=172.31.37.178) pytorch                  /opt/conda/envs/pytorch
(node-1 pid=22961, ip=172.31.37.209) # conda environments:
(node-1 pid=22961, ip=172.31.37.209) #
(node-1 pid=22961, ip=172.31.37.209) base                  *  /opt/conda
(node-1 pid=22961, ip=172.31.37.209) pytorch                  /opt/conda/envs/pytorch
(node-1 pid=22961, ip=172.31.37.209) 
(node-0 pid=22958, ip=172.31.43.180) 
(node-2 pid=23992) 
(node-3 pid=22955, ip=172.31.37.178) 
INFO: Job finished (status: SUCCEEDED).
Shared connection to 35.88.43.10 closed.
I 11-05 15:28:32 cloud_vm_ray_backend.py:2184] Job ID: 1
I 11-05 15:28:32 cloud_vm_ray_backend.py:2184] To cancel the job:       sky cancel async-setup4 1
I 11-05 15:28:32 cloud_vm_ray_backend.py:2184] To stream the logs:      sky logs async-setup4 1
I 11-05 15:28:32 cloud_vm_ray_backend.py:2184] To view the job queue:   sky queue async-setup4
I 11-05 15:28:32 cloud_vm_ray_backend.py:2297] 
I 11-05 15:28:32 cloud_vm_ray_backend.py:2297] Cluster name: async-setup4
I 11-05 15:28:32 cloud_vm_ray_backend.py:2297] To log into the head VM: ssh async-setup4
I 11-05 15:28:32 cloud_vm_ray_backend.py:2297] To submit a job:         sky exec async-setup4 yaml_file
I 11-05 15:28:32 cloud_vm_ray_backend.py:2297] To stop the cluster:     sky stop async-setup4
I 11-05 15:28:32 cloud_vm_ray_backend.py:2297] To teardown the cluster: sky down async-setup4
Clusters
NAME          LAUNCHED    RESOURCES                        STATUS   AUTOSTOP  COMMAND                        
async-setup4  2 mins ago  4x AWS(m6i.2xlarge)              UP       -         sky launch -c async-setup4...  
smoke-test    2 days ago  1x AWS(m6i.xlarge)               STOPPED  -         sky start smoke-test           
workspace     1 week ago  1x AWS(m6i.large, disk_size=50)  UP       -         sky launch -c workspace -...   

Managed spot controller (will be autostopped if idle for 10min)
NAME                          LAUNCHED    RESOURCES                          STATUS   AUTOSTOP  COMMAND                  
sky-spot-controller-9ce1ce58  5 days ago  1x AWS(m6i.2xlarge, disk_size=50)  STOPPED  10m       sky spot launch echo hi  

Local clusters:
NAME         USER  HEAD_IP  RESOURCES  COMMAND  
gpu-cluster  -     -        -          -        

TODO:

  • Add tests for this PR.

Tested:

  • sky launch -c async-setup -s examples/minimal.yaml: the status shows SETUP correctly during setup and transitions to RUNNING later.
  • sky launch -c async-setup4 -s --num-nodes 4 examples/minimal.yaml
  • ./tests/run_smoke_tests.sh test_n_node_job_queue
  • ./tests/run_smoke_tests.sh
  • sky launch -c async-setup -s examples/minimal.yaml (with exit 1 in the setup)
I 11-07 00:01:25 log_utils.py:45] Head node is up.
I 11-07 00:02:46 cloud_vm_ray_backend.py:1063] Successfully provisioned or found existing VM.
I 11-07 00:02:47 cloud_vm_ray_backend.py:2094] Preparing setup for 1 node.
I 11-07 00:02:50 cloud_vm_ray_backend.py:2173] Job submitted with Job ID: 1
I 11-07 08:02:51 log_lib.py:388] Start streaming logs for job 1.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
(setup pid=24911) running setup
ERROR: Job 1 setup failed with return code list: [1]
INFO: Job finished (status: FAILED_SETUP).
Shared connection to 35.86.185.118 closed.
> sky queue async-setup
Fetching and parsing job queue...

Job queue of cluster async-setup
ID  NAME     SUBMITTED  STARTED  DURATION  RESOURCES     STATUS        LOG                                        
1   minimal  1 min ago  -        -         1x [CPU:0.5]  FAILED_SETUP  ~/sky_logs/sky-2022-11-06-23-59-54-232565  
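
The failure path in the log above (a return code list of [1] leading to FAILED_SETUP) can be sketched as follows. This is a guess at the decision rule, assuming any non-zero return code on any node fails the whole setup and a clean setup moves the job on to PENDING; `status_after_setup` is a hypothetical helper, not a function from this PR.

```python
from typing import List


def status_after_setup(returncodes: List[int]) -> str:
    """Map the per-node setup return codes to the next job status.

    Assumption: one failing node is enough to mark the job FAILED_SETUP.
    """
    if any(code != 0 for code in returncodes):
        return 'FAILED_SETUP'
    # All nodes finished setup cleanly; the job now waits for resources.
    return 'PENDING'
```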

handle: ResourceHandle,
task: 'task_lib.Task',
detach_run: bool,
setup_cmd: Optional[str] = None) -> None:
Member

Just from reading the interface -- this seems odd as I'd expect we can read from task.setup?

Collaborator Author

@Michaelvll Michaelvll Nov 6, 2022

I agree this is a bit odd. We now pass the setup_cmd through the member variable in the backend object instead.

@concretevitamin
Member

This is great @Michaelvll!

Quick look. Some thoughts, having skimmed through only:

  1. --async-setup may be misunderstood as "run setup asynchronously in parallel to other stages (e.g., run)". --detach-setup seems like a great choice, which if specified will imply --detach-run; it will have no effect if --no-setup is set. Wdyt?
  2. Seems like setup commands are now part of a job?
    • how does a user cancel a long-running setup (e.g., 1-2hr)? previously they can ctrl-c.
    • we should think more about implications. The first bullet point in PR description is one of them -- but I think it may be fine since it's the status quo.
  3. As a user-facing state, SETTING_UP seems more readable than SETUP (I assume this will be displayed in sky queue).
  4. handle / may need to update help strs
  -d, --detach-run                If True, run setup first (blocking), then
                                  detach from the job's execution.

e.g., may not be blocking now.

@Michaelvll
Collaborator Author

Michaelvll commented Nov 6, 2022

--async-setup may be misunderstood as "run setup asynchronously in parallel to other stages (e.g., run)". --detach-setup seems like a great choice, which if specified will imply --detach-run; it will have no effect if --no-setup is set. Wdyt?

I tried --detach-setup at the beginning, but I found the name can imply that, when it is not specified, the setup process can still be detached with Ctrl-C just as with --detach-run, which is not true.

I would prefer to rename --detach-run to --detach-logs (or --no-wait, as ray job submit does), so that --async-setup (we can think more about this name) will not imply --detach-logs, as it can still stream the output for the job (including both setup and run). Wdyt @concretevitamin?

The current behavior already ensures that when --no-setup is specified, --async-setup does not take effect.
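
The interaction between the two flags can be sketched with the standard library's argparse. This is purely illustrative: `parse_flags` and the `effective_async_setup` attribute are made-up names, and this is not how sky/cli.py defines its options.

```python
import argparse
from typing import List


def parse_flags(argv: List[str]) -> argparse.Namespace:
    """Parse the two flags and resolve how they interact (hypothetical)."""
    parser = argparse.ArgumentParser(prog='launch-sketch')
    parser.add_argument('--async-setup', action='store_true')
    parser.add_argument('--no-setup', action='store_true')
    args = parser.parse_args(argv)
    # With --no-setup there is no setup to hand off, so --async-setup
    # becomes a no-op instead of an error.
    args.effective_async_setup = args.async_setup and not args.no_setup
    return args
```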

how does a user cancel a long-running setup (e.g., 1-2hr)? previously they can ctrl-c.

Now sky cancel cluster-name job-id will cancel the setup as well if the job is in the SETTING_UP state.

we should think more about implications. The first bullet point in PR description is one of them -- but I think it may be fine since it's the status quo.

Sorry, I did not get this one. Could you say more about it? Which implications are you referring to?

As a user-facing state, SETTING_UP seems more readable than SETUP (I assume this will be displayed in sky queue).

Good point. Renamed it.

@Michaelvll
Collaborator Author

Michaelvll commented Nov 6, 2022

After discussing offline, we decided to keep both --detach-setup and --detach-run for now, and not to let --detach-setup imply --detach-run.
There are several things that remain to be decided:

  1. Whether we should rename --detach-run to another name, such as --no-log-stream, since our logs can also contain setup logs.
  2. When and where should the setup be executed? The current behavior is that the setup will be executed immediately on all the nodes when the job is sky launch'ed, but the job will wait until the required resources are fulfilled.
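
The current behavior described in point 2 (setup on every node first, then the task on the nodes it needs) can be sketched with a thread pool standing in for the ray tasks the backend actually uses. `setup_then_run` and its parameters are hypothetical, and resource reservation is deliberately not modeled.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def setup_then_run(all_nodes: Sequence[str], num_task_nodes: int,
                   setup: Callable[[str], int],
                   run: Callable[[str], str]) -> List[str]:
    """Run setup on every node, wait for all of them, then run the task."""
    # Setup goes to all nodes, even when the task needs fewer of them.
    with ThreadPoolExecutor(max_workers=max(1, len(all_nodes))) as pool:
        # list() drains the iterator, so this line acts as the barrier:
        # it returns only after every node's setup has finished.
        list(pool.map(setup, all_nodes))
    # Assumption: the task simply takes the first num_task_nodes nodes;
    # the real job would wait here until its resources are reserved.
    return [run(node) for node in all_nodes[:num_task_nodes]]
```

Running setup everywhere costs extra work on nodes the task never uses, which is the trade-off the open question above is about.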

@Michaelvll Michaelvll marked this pull request as ready for review November 6, 2022 21:00
@concretevitamin
Member

Previously, users could enter Y/n if a confirmation was prompted during setup, e.g.,

conda install pytorch==1.12.1 cudatoolkit=11.3 -c pytorch

Does this work with --detach-setup? What would it look like (block forever)?

Member

@concretevitamin concretevitamin left a comment

This looks great @Michaelvll! Did a pass.

sky/skylet/job_lib.py
@@ -115,7 +117,7 @@ def __lt__(self, other):
 # reserved resources and actually started running: it will set the
 # status in the DB to RUNNING.
 'PENDING': JobStatus.INIT,
-'RUNNING': JobStatus.PENDING,
+'RUNNING': JobStatus.SETUP,
Member

Need to update L109-118?

Need a comment: what if there's no "setup" provided in the task? Will this SETTING_UP status be shown to users somehow (sky queue?)? That may be confusing to the user.

Collaborator Author

If the task specifies no setup, our generated ray program sets the job status to PENDING as soon as it starts (i.e. when ray's status becomes RUNNING), so it will be very rare for update_job_status to set the job to SETTING_UP.
That only happens when all of the following occur:

  1. The generated ray program starts and ray's status changes to RUNNING.
  2. The generated ray program has not yet reached the line that sets the job status to PENDING.
  3. At the same time, the skylet updates the job status to SETTING_UP.
  4. Before the generated ray program changes the status to PENDING, the user executes sky queue.

Also, I think it is fine even if the user sees the status SETTING_UP for a job without a setup section, since SETTING_UP can also be read as setting up some sky-related runtime for that job.
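
The mapping under review and the race described above can be sketched as follows. The ray-to-job mapping mirrors the diff (with SETUP renamed to SETTING_UP); the rule that the skylet keeps the later of the two statuses is an assumption suggested by the `__lt__` in the diff context, and `reconcile` is a hypothetical name rather than the actual update_job_status.

```python
import enum


class JobStatus(enum.Enum):
    """Ordered subset of the job states (hypothetical stand-in)."""
    INIT = 0
    SETTING_UP = 1
    PENDING = 2
    RUNNING = 3

    def __lt__(self, other):
        if self.__class__ is other.__class__:
            return self.value < other.value
        return NotImplemented


# What the skylet can infer from ray's own job status.
_RAY_TO_JOB_STATUS = {
    'PENDING': JobStatus.INIT,
    'RUNNING': JobStatus.SETTING_UP,
}


def reconcile(db_status: JobStatus, ray_status: str) -> JobStatus:
    """Combine the DB status with the ray-derived one, never going backwards.

    Assumption: the later status in the life cycle wins.
    """
    derived = _RAY_TO_JOB_STATUS.get(ray_status, db_status)
    return max(db_status, derived)
```

Under this assumption, a job whose program has already written PENDING stays PENDING when the skylet sees ray report RUNNING; SETTING_UP only shows up if the skylet's update lands first, which is the narrow window listed in the steps above.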

Member

Can we add this explanation to L120 here? Just so in the future we know why and possibly how to change this behavior.

with_ray=True,
use_sudo={is_local},
) for i in range(total_num_nodes)]
ray.get(setup_workers)
Member

nit: setup_tasks?

Collaborator Author

Will _tasks be a bit confusing with the sky task?

@Michaelvll Michaelvll changed the title [Core] Add support for async setup [Core] Add support for detach setup Nov 7, 2022
Collaborator Author

@Michaelvll Michaelvll left a comment

Does this work with --detach-setup? What would it look like (block forever)?

It seems that conda install and sudo apt install detect that the script is running non-interactively and continue with the default answer (yes).
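
One common way a tool can detect a non-interactive run is to check whether its input is a terminal. The sketch below shows that pattern only; it is not how conda or apt implement their prompts, and `confirm` is a hypothetical helper.

```python
import sys
from typing import Optional, TextIO


def confirm(prompt: str, default: bool = True,
            stream: Optional[TextIO] = None) -> bool:
    """Ask a yes/no question, falling back to the default without a TTY."""
    stream = sys.stdin if stream is None else stream
    if not stream.isatty():
        # Non-interactive (e.g. a detached setup): nobody can answer,
        # so take the default instead of blocking forever.
        return default
    print(f'{prompt} [Y/n]: ', end='', flush=True)
    answer = stream.readline().strip().lower()
    if not answer:
        return default
    return answer in ('y', 'yes')
```

A setup script that relies on such defaults is still fragile; passing an explicit flag such as -y to the installer avoids depending on this detection.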

Member

@concretevitamin concretevitamin left a comment

LGTM, thanks for shipping this massive UX improvement @Michaelvll!

@Michaelvll
Collaborator Author

Thank you for the detailed review @concretevitamin! Passed ./tests/run_smoke_tests.sh. Merging.

@Michaelvll Michaelvll merged commit b2d3555 into master Nov 7, 2022
@Michaelvll Michaelvll deleted the detached-setup branch November 7, 2022 22:19
sumanthgenz pushed a commit to sumanthgenz/skypilot that referenced this pull request Jan 15, 2023
* Add support for async setup

* Fix logging

* Add test for async setup

* add parens

* fix

* refactor a bit

* Fix status

* fix smoke test

* rename

* fix is_cluster_idle function

* format

* address comments

* fix

* Add setup failed

* Fix failed setup

* Add comment

* Add comments

* format

* fix logs

* format

* address comments
Michaelvll added a commit that referenced this pull request Jan 18, 2023
* add cost tracking for clusters that handles launching, re-starting, getting status, stopping and downing clusters, but no auto-stopping

* address Romil PR comments

* address Zhanghao PR comments

* fix nit

* address more PR comments

* address last wave of PR comments

* sky

* address fixing argument for requested resources and fixing spot tests for CI

* address more PR comments

* make tests resources a list to prevent errors

* fix tests again

* address PR comments, including adding fetchall to fix status one cluster only bug

* fix PR comments

* change progress bar interference on stop/down

* add sky report instead of showing cost on other commands

* address cost report PR comments

* address more PR comments on sky report

* [Core] Port ray 2.0.1 (#1133)

* update ray node provider to 2.0.0

update patches

Adapt to ray functions in 2.0.0

update azure-cli version for faster installation

format

[Onprem] Automatically install sky dependencies (#1116)

* Remove root user, move ray cluster to admin

* Automatically install sky dependencies

* Fix admin alignment

* Fix PR

* Address romil's comments

* F

* Addressed Romil's comments

Add `--retry-until-up`, `--region`, `--zone`, and `--idle-minutes-to-autostop` for interactive nodes (#1207)

* Add --retry-until-up flag for interactive nodes

* Add --region flag for interactive nodes

* Add --idle-minutes-to-autostop flag for interactive nodes

* Add --zone flag for interactive nodes

* Update help messages

* Address nit

Add all region option in catalog fetcher and speed up azure fetcher (#1204)

* Port changes

* format

* add t2a exclusion back

* fix A100 for GCP

* fix aws fetching for p4de.24xlarge

* Fill GPUInfo

* fix

* address part of comments

* address comments

* add test for A100

* patch GpuInfo

* Add generation info

* Add capabilities back to azure and fix aws

* fix azure catalog

* format

* lint

* remove zone from azure

* fix azure

* Add analyze for csv

* update catalog analysis

* format

* backward compatible for azure_catalog

* yapf

* fix GCP catalog

* fix A100-80GB

* format

* increase version number

* only keep useful columns for aws

* remove capabilities from azure

* add az to AWS

Revert "Add `--retry-until-up`, `--region`, `--zone`, and `--idle-minutes-to-autostop` for interactive nodes" (#1220)

Revert "Add `--retry-until-up`, `--region`, `--zone`, and `--idle-minutes-to-autostop` for interactive nodes (#1207)"

This reverts commit f06416d.

[Storage] Add `StorageMode` to __init__ (#1223)

* Add storage mode to __init__

* fix

[Example] Minimal containerized app example (#1212)

* Container example

* parenthesis

* Add explicit StorageMode

* lint

Fix Mac Version in Setup.py (#1224)

Fix mac

Reduce iops for aws instances (#1221)

* set the default iops to be same as console for AWS

* fix

Revert "Reduce iops for aws instances" (#1229)

Revert "Reduce iops for aws instances (#1221)"

This reverts commit 29f1458.

update back compat test

* parent 06afd93
author Zhanghao Wu <[email protected]> 1665364265 -0700
committer Zhanghao Wu <[email protected]> 1665899898 -0700

parent 06afd93
author Zhanghao Wu <[email protected]> 1665364265 -0700
committer Zhanghao Wu <[email protected]> 1665899681 -0700

Support for autodown

Change API to terminate

fix flag

address comment

format

Rename terminate to down

add smoke test

format

fix syntax

use gcp for autodown test

fix smoke test

fix smoke test

address comments

Switch back to terminate

Change back to tear down

Change to tear down

fix comment

* Fix rebase issue

* address comments

* address

* fix setup.py

* upgrade to 2.0.1

* Fix docs for ray version

* Fix example

* fix backward compatibility test

* Fix onprem job submission

* add steps for backward compat test

* docs: Remove version from docs html titles. (#1303)

Remove version from docs html titles.

* Fix unnecessary ssh hanging issue on Ray (#851)

* Fix ray hanging ssh issue

* Fix

* change the order back

* Update node status after first attempt

* Set `--rename-dir-limit` for gcsfuse to allow dir renames (#1296)

Set rename_dir_lim for gcsfuse

* Docs: polish `sky.Task` doc strings. (#1302)

* WIP

* Polish sky.Task doc strings.

* docs: expose Task (a subset of methods); hide Dag.

* Tweak Task method order; in docs display methods by source order.

* CLI docs: tweak order; tweak `spot launch`.

* Address comments.

* Code block formatting.

* [Launch/Backward Compatibility] Fix incorrect Ray YAML issue (#1287)

* Fix incorrect Ray YAML issue

* yapf

* fix

* comments

* [Storage] add `--implicit-dirs` for gcsfuse (#1312)

add --implicit-dirs

* Improving README. (#1308)

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Port landing paras to docs index.rst.

* [UX] Disable python output buffer by default (#1290)

disable python output buffer

* [Storage][Filemounts] Set relative dir root to workdir (#1315)

Set relative dir root to workdir for file_mounts

* Fix Sky Storage Delete more than 256 Items/folders + Bulk Deletion Tests (#1285)

* fix

* Add romil's suggestions

* Add bulk deletion tests

* ok

* Fix

* [Storage] Add lazy unmount flag (#1320)

Add lazy unmount flag

* [Core] Fix skylet checking (#1325)

* Fix skylet checking

* exclude grep

* [UX] remove stacktrace for pipe and ssh info (#1324)

* UX: remove stacktrace for pipe and ssh info

* Add comment

* Avoid ray output in the logs

* format

* revert ssh quiet option

* [Dependency] Fix colorama dependency issue with awscli (#1323)

* Fix colorama dependency issue with awscli

* fix ux for storage delete

* Add roadmap. (#1317)

* Add roadmap.

* Update ROADMAP.md

* Fix SKY_NODE_RANK environment variable (#1291)

* Add flag for retrieving internal node ips

* Ensure SKY_NODE_RANK is 0 for head and stable

* Clearer comment for get_internal_ips

* Handle different num nodes correctly

* Address PR comments

* Address nits

* [Spot] Fix race condition for spot logs (#1329)

* Fix race condition for spot logs

* fix

* fix comment

* address comments

* add comment

* Add TPU Pod to doc (#1318)

* add pod to doc

* Apply suggestions from code review

Co-authored-by: Zongheng Yang <[email protected]>

* comments

* comments

* update bucket note

* Apply suggestions from code review

Co-authored-by: Zongheng Yang <[email protected]>

* update

* update

* fix

* fix

* comments

* fix

Co-authored-by: Zongheng Yang <[email protected]>

* [UX] Add environment variable `SKY_NUM_GPUS_PER_NODE` (#1337)

* add SKY_NUM_GPUS_PER_NODE

* increase multi-node progress timeout

* pin torch version

* add comment

* address comment

* fix smoke test

* address comments

* [Image] Fix blocking by unattended-upgrade (#1347)

* Fix blocking by unattended-upgrade

* adopt to gcp and azure

* [Test/Azure] Fix the torch version in examples for smoke test and change the credential for Azure  (#1330)

* Upgrade images for three clouds

* Fix cuda version

* pin cuda version for torch

* Fix torch version

* fix comments

* Fix azure provider

* fix credential

* revert back to previous azure image

* switch back to cuda 11.3 for pytorch due to azure's image

* fix torch installation

* increase the multi-node timeout

* Update sky/clouds/azure.py

Co-authored-by: Zongheng Yang <[email protected]>

* revert aws image version

* pin cu113 for huggingface

* Add comment

* format

* Update sky/clouds/aws.py

Co-authored-by: Zongheng Yang <[email protected]>

* Update sky/clouds/gcp.py

Co-authored-by: Zongheng Yang <[email protected]>

* revert gcp image

* Fix doc

Co-authored-by: Zongheng Yang <[email protected]>

* [Docs] Reorganizing docs. (#1316)

* Reorganizing docs.

* V2.

* Reorg + rewording

* Address comments

* Remove 'convenient'

* Update `SKY_NODE_RANK` docs (#1350)

* Add tip for node rank to docs

* Update formatting

* Indent fix.

Co-authored-by: Zongheng Yang <[email protected]>

* Add `--retry-until-up`, `--region`, `--zone`, and `--idle-minutes-to-autostop` for interactive nodes (v2) (#1297)

* Add --region, --zone, --idle-minutes-to-autostop, and --retry-until-up for
interactive nodes

* Update user_requested_resources

* Add --down for interactive nodes and refactor auto{stop,down} edge case

* Refactor click options

* Revert "Refactor click options"

This reverts commit 10a06a9.

* Fix TPU Pod (#1358)

* fix pod

* yapf

* Minor fix for yapf warnings (#1362)

* [Docs] Clarify Storage mounting details (#1365)

* fix incorrect statements

* fix incorrect statements

* fix incorrect statements

* Fix bugs in GCP A100 prices (#1368)

* Fix GCP A100 price bugs

* yapf

* [Custom Image] Support tag for the images and global regions (#1366)

* Support image tag for AWS

* add gcp image support

* address comments

* fix

* remove pandas warning

* Add example for using ubuntu1804

* add ubuntu 1804 in the test

* Enforce trying us regions first

* format

* address comments

* address comments

* Add docs and rename methods

* Add fetch global regions for GCP

* Add all regions for Azure

* rename and add doc

* remove accidentally added folder

* fix service_catalog

* remove extra line

* Address comments

* mkdir for catalog path

* increase waiting time in test

* fix test recovery

* format

* [UX/Doc] Add disk size in resource display and a minor fix for the doc (#1371)

Minor fix for docs and ux

* [Onprem] Support for Different Type of GPUs + Small Bugfix (#1356)

* Ok

* Great suggestion from Zhanghao

* fix

* Update tutorial.rst

* Pin `torch` in various examples to avoid cuda version issues. (#1378)

* tutorial.rst: pin `torch` to avoid version issues.

Tested:
- Ran on both AWS and GCP.

* Fixes two more yamls

* [Env] SKYPILOT_JOB_ID for all tasks (#1377)

* Add run id for normal job

* add example for the run id

* fix env_check

* fix env_check

* fix

* address comments

* Rename to SKYPILOT_JOB_ID

* rename the controller's job id to avoid confusion

* rename env variables

* fix

* [Core] Add support for detach setup (#1379)

* Add support for async setup

* Fix logging

* Add test for async setup

* add parens

* fix

* refactor a bit

* Fix status

* fix smoke test

* rename

* fix is_cluster_idle function

* format

* address comments

* fix

* Add setup failed

* Fix failed setup

* Add comment

* Add comments

* format

* fix logs

* format

* address comments

* Minor UX fix: `sky cancel` should not print stacktraces. (#1385)

* Minor UX fix: `sky cancel` should not print stacktraces.

* Wording fix.

* exit 1

* [UX] Disable ssh connection sharing for setup (#1390)

* Disable ssh connection sharing for setup

* format

* remove redundant

* fix type hint

* Docs: multi-node clarifications, and ssh into workers. (#1363)

* Fixes #1338: add docs on logging into workers.

* Fixes #1340 and fixes #1339.

* Address comments

* Reword.

* Hint.

* Fix Logging for `sky launch` on new machine (#1382)

* ok

* ok

* Ok

* ok

* Unify methods

* ok

* fix

* [Image] Support passing AMIs for different regions (#1384)

* image dict in resources

* fix

* fix

* add tests

* add per region example

* address comments

* Fix checking

* fix

* fix smoke test

* [LocalDockerBackend] Update `is_local_cluster` check for docker backend (#1396)

Update is_local_cluster check for LocalDockerBackend

* [Setup] unset CUDA_VISIBLE_DEVICES for detach setup (#1404)

* unset CUDA_VISIBLE_DEVICES

* add env check example

* Add setting CUDA_VISIBLE_DEVICES test

* fix

* Update sky/backends/cloud_vm_ray_backend.py

Co-authored-by: Zongheng Yang <[email protected]>

* format

Co-authored-by: Zongheng Yang <[email protected]>

* [Spot] Keep SKYPILOT_JOB_ID the same for the same spot job (#1400)

* fix SKYPILOT_JOB_ID

* Fix test

* fix

* format

* Add SKYPILOT_JOB_ID to sky spot queue

* nit

* don't set job_id_env_var for spot controller task

* address comments

* Revert SKYPILOT_JOB_ID in spot queue

* format

* Change default value of task.envs to dict

* [UX] fix the error for the first time `sky launch` (#1405)

* fix ux

* test

* fix no public cloud

* address comments

* Fix logging

* format

* Remove the error type for CLI

* yellow

* fix

* Fix logging

* [Spot] Fix spot recovery for multi node (#1411)

* Add cluster status check even job is RUNNING for multi-node

* Disable autoscaler logs and fix finished when partially preempted

* format

* Add test

* address comments

* update

* Add time

* [Release] Fix pypi description (#1416)

* Fix pypi description

* fix

* format

* [Bug fix] head_ip extraction from Ray stdout (#1421)

* Fix bug in head_ip extraction from Ray stdout after launching cluster by using regex to exactly match ip.

* Remove unneeded comment.

* Update sky/backends/cloud_vm_ray_backend.py

Co-authored-by: Zhanghao Wu <[email protected]>

* Run yapf and pylint

Co-authored-by: Zhanghao Wu <[email protected]>

* [Global Regions] Add data fetchers into wheel (#1406)

* Add data fetchers into wheel

* yapf

* Fix gcp fetcher

* Add check

* exclude analyze.py

* Link to blog on README and docs. (#1430)

* [Spot] Let cancel interrupt the spot job (#1414)

* Let cancel interrupt the job

* Add test

* Fix test

* Cancel early

* fix test

* fix test

* Fix exceptions

* pass test

* increase waiting time

* address comments

* add job id

* remove 'auto' in ray.init

* Revert "[Spot] Let cancel interrupt the spot job" (#1432)

Revert "[Spot] Let cancel interrupt the spot job (#1414)"

This reverts commit 3bbf4aa.

* [AWS] Avoid key pair permission issue by using cloud-init for authorized keys (#1427)

* Switch to UserData to add public key for AWS

* fix

* Avoid hardcoding username

* Fix backward compatibility test

* address comments

* address comments

* Minor spot logs fix: don't print job id not provided on spot launch. (#1434)

Minor spot logs fix: don't print job id not provided.

* [Catalog] Remove hardcoded A2 pricing URL & Fix a bug in A2 machine zones (#1426)

* Update no 16xA100-40GB zones

* [Catalog] Remove GCP A2 price URL & Fix GCP A100 zone issues

* Add more type annotations

* Minor

* yapf

* Do not add GCP URL prefix

* Minor

* Address comments

* Address comment1

* Minor

* Add comments about the case when a100.empty is True

* Assert not duplicated

* [Spot] Let cancel interrupt the spot job (#1414) (#1433)

* Let cancel interrupt the job

* Add test

* Fix test

* Cancel early

* fix test

* fix test

* Fix exceptions

* pass test

* increase waiting time

* address comments

* add job id

* remove 'auto' in ray.init

* Fix serialization problem

* refactor a bit

* Fix

* Add comments

* format

* pylint

* revert a format change

* Add docstr

* Move ray.init

* replace ray with multiprocess.Process

* Add test for setup cancelation

* Fix logging

* Fix test

* lint

* Use SIGTERM instead

* format

* Change exception type

* revert to KeyboardInterrupt

* remove

* Fix test

* fix test

* fix test

* typo

* [Usage] Robustify the user hash to avoid empty string (#1442)

* Robustify the user hash to avoid empty string

* fix

* Check valid user hash with hexadecimal

* format

* fix

* Add fallback

* Add comment

* lint

* [Storage] Support multiple files in Storage (#1311)

* Set rename_dir_lim for gcsfuse

* Add support for list of sources for Storage

* fix demo yaml

* tests

* lint

* lint

* test

* add validation

* address zhwu comments

* add error on basename conflicts

* use gsutil cp -n instead of gsutil rsync

* lint

* fix name

* parallelize gsutil rsync

* parallelize aws s3 rsync

* lint

* address comments

* refactor

* lint

* address comments

* update schema

* Logging fixes. (#1452)

* Logging fixes.

* yapf

* sys.exit(1)

* [Storage] Fix copy mounts for file with s3 bucket url (#1457)

* test file download with s3

* fix test

* fix storage file mounts

* format

* remove mkdir for `make_sync_dir_command`

* Print errors for GCP timeout. (#1454)

* [autostop] Support restarting the autostop timer. (#1458)

* [autostop] Support restarting the autostop timer.

* Logging

* Make each job submission call set_active_time_to_now().

* Fix test and pylint.

* Fix comments.

* Change tests; some fixes

* logging remnant

* remnant

* [Spot] Make sure the cluster status is not None when showing (#1464)

* Make sure the cluster status is not None when showing

* Fix another potential issue with NoneType of handle

* Add assert

* fix

* format

* Address comments

* Address comments

* format

* format

* fix

* fix

* fix spot cancellation

* format

* Add a few small warnings to README and CONTRIBUTING. (#1422)

* Add a couple small warnings to README and CONTRIBUTING.

* Update README.md

Co-authored-by: Zongheng Yang <[email protected]>

Co-authored-by: Zongheng Yang <[email protected]>

* Hotfix for spot TPU pod recovery (#1470)

* hotfix

* comment

* [Spot] Better spot logs (#1412)

* Add cluster status check even if job is RUNNING for multi-node

* Disable autoscaler logs and fix finished when partially preempted

* format

* Add test

* Better spot logging

* Add logs

* format

* address comments

* address comments part 2

* Finish the logging early

* format

* better logging

* Address comments

* Fix message

* Address comments

* Improve UX for logs to include SSH name and rank (#1380)

* Messy WIP

* Fixes two more yamls

* Improve log UX and ensure stableness

* Remove print statement

* Remove task name from logs

* Fix name for single-node tasks

* Update var names and comments for clarity

* Update logic for single and multi-node clusters

* Cache stable cluster IP list in ResourceHandle

* Properly cache and invalidate stable list

* Add back SKYPILOT_NODE_IPS

* Update log file name

* Refactor backend to use cached stable IP list

* Fix spot test

* Fix formatting

* Refactor ResourceHandle

* Fixes for correctness

* Remove unneeded num_nodes arg

* Fix _gang_schedule_ray_up

* Ensure stable IP list is cached

* Formatting fixes

* Refactor updating stable IPs to be part of handle

* Merge max attempts constant

* Fix ordering for setting TPU name

* Fix bugs and clean up code

* Fix backwards compatibility

* Fix bug with old autostopped clusters

* Fix comment

* Fix assertion statement

* Update assertion message

Co-authored-by: Zhanghao Wu <[email protected]>

* Fix linting

* Fix retrieving IPs for TPU vm

* Add optimization for updating IPs

* Linting fix

* Update comment

Co-authored-by: Zongheng Yang <[email protected]>
Co-authored-by: Zhanghao Wu <[email protected]>

* add cost tracking for clusters that handles launching, re-starting, getting status, stopping and downing clusters, but no auto-stopping

* fix some artifacts from rebase error

* handle linting

* make it cost-report

* address PR changes for approval

* last changes

* address last changes

* move around comments for sort

* add cost_report func to init.all list

Co-authored-by: Sumanth <[email protected]>
Co-authored-by: Sumanth <[email protected]>
Co-authored-by: Zhanghao Wu <[email protected]>
Co-authored-by: Zongheng Yang <[email protected]>
Co-authored-by: Wei-Lin Chiang <[email protected]>
Co-authored-by: Romil Bhardwaj <[email protected]>
Co-authored-by: Michael Luo <[email protected]>
Co-authored-by: Isaac Ong <[email protected]>
Co-authored-by: ewzeng <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>
Co-authored-by: Donny Greenberg <[email protected]>
Development

Successfully merging this pull request may close these issues.

Support asynchronous setup and file_mounts
2 participants