Add docker support for SkyPilot (#1910)
* successfully launched on GCP

* update image with rsync installed && fix name error in conda command

* support dockerimage on task yaml && reformat code

* recoup sudo in setup commands

* reformat code

* support direct ssh to docker && fix mesg warning && reformat

* fix gcp port issue

* aws successfully launched

* fix port range

* successfully launched on Azure

* remove unused exception class

* reformat setup command to azure.py in plain text for readability

* fix job queue cannot cancel

* reformat code

* switch to port 22 for docker ssh

* move docker image to resources: image_id: docker:<image> and change to an optional function

* minor fix for ray yaml j2

* fix error when image_id is a dict

* support multi-node

* move docker user setup after handle is created

* support images without rsync

* add aws && azure support

* remove redundant pip3 install

* remove error merging

* move docker image to resources

* minor fixes

* format

* adjust extract docker image

* change back to port 22 for host and port 10022 for docker, passed stop-start recovery test

* add ulimit and gcp 10022 enable outside oslogin

* fix gcp authentication & move ssh authorized keys setup to run_init

* fix len(image_id) when image_id is None

* use docker stop & start to recover

* temporary remove conflict

* add back docker user

* fix wrong username in add job

* use ssh jump server to access docker

* remove inbound rules of 10022

* update comment

* now ssh into docker

* ux: raise rather than assert

* move some setup commands to SkyDockerCommandRunner

* minor fix

* format

* fix multinode ssh config

* Update sky/backends/backend_utils.py

Co-authored-by: Zhanghao Wu <[email protected]>

* Update sky/backends/backend_utils.py

Co-authored-by: Zhanghao Wu <[email protected]>

* Update sky/backends/backend_utils.py

Co-authored-by: Zhanghao Wu <[email protected]>

* Update sky/clouds/azure.py

Co-authored-by: Zhanghao Wu <[email protected]>

* Update sky/skylet/providers/gcp/config.py

Co-authored-by: Zhanghao Wu <[email protected]>

* minor fixes

* format

* move get_docker_user to backend_utils.py

* move two constants to skylet and move resources vars to make deployment vars

* move SkyDockerCommandRunner to skylet/providers

* format

* support proxy command with docker

* Update sky/backends/backend_utils.py

Co-authored-by: Zhanghao Wu <[email protected]>

* Update sky/clouds/aws.py

Co-authored-by: Zhanghao Wu <[email protected]>

* add comment

* quote docker ssh proxy command

* explicit checking for Optional object

* fix credentials

* remove -m in bash script

* fix job queue owner

* fix job owner and username

* add job queue smoke test

* add test_docker_preinstalled_package

* fix restart error

* nit: code style for proxy command

* update CloudVmRayResourceHandle version

* disable unattended-upgrade with cloud-init

* fix UnboundLocalError

* fix variable shadow

* move checking for targetTags to #2210

* rename

* restore some deprecated changes

* format

* disable tpu with docker

* fix acc inexisting problem

* add progress bar to docker image pulling

* Apply suggestions from code review

Co-authored-by: Zhanghao Wu <[email protected]>

* Apply suggestions from code review

* Try cloud init without base64 encode

Co-authored-by: Zhanghao Wu <[email protected]>

* Revert "Try cloud init without base64 encode"

This reverts commit 418c912.

* add comment in cmd runner

* add failover for clouds that do not support docker yet

* temporary remove check of docker image

* add docker to resources.get_required_cloud_features

* format

* install pip in run_init

* format

* disable ssh control when docker is used

* add docker user to resource handle's repr

* stash some changes for easier merge

* minor

* add back previously stashed function

* disable docker with proxy for now

* change constants.DEFAULT_DOCKER_PORT to int

* rename NATIVE_DOCKER_SUPPORT to DOCKER_IMAGE

* add todo for only support debian-based images

* add check for proxy command

* fix docker_config={}

* upd docker test

* upd docker test

* move proxy command check to cloud.check_features_are_supported

---------

Co-authored-by: Zhanghao Wu <[email protected]>
cblmemo and Michaelvll authored Aug 11, 2023
1 parent 8dbdc89 commit 5dd9aa1
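
The commit message above mentions keeping port 22 for the host, using port 10022 for the container, and reaching the container through an SSH jump server. A minimal sketch of what such a command could look like follows. The function name `docker_ssh_command` and its parameters are hypothetical and are not taken from the SkyPilot code base; only the port number 10022 and the jump-host idea come from the commit message, and the assumption that the container's sshd is reachable at `localhost:10022` on the host is mine.

```python
import shlex
from typing import List


def docker_ssh_command(host_ip: str,
                       host_user: str,
                       docker_user: str,
                       key_path: str,
                       docker_port: int = 10022) -> List[str]:
    """Build an ssh argv that reaches a container through its host VM.

    Hypothetical helper for illustration only. The outer ssh targets
    localhost:<docker_port>; the ProxyCommand first connects to the host
    on its default port 22 and forwards the stream (-W %h:%p), so the
    container port does not need to be opened to the internet.
    """
    proxy = (f"ssh -i {shlex.quote(key_path)} -W %h:%p "
             f"{host_user}@{host_ip}")
    return [
        "ssh",
        "-i", key_path,
        "-o", f"ProxyCommand={proxy}",
        "-p", str(docker_port),
        f"{docker_user}@localhost",
    ]


if __name__ == "__main__":
    # Print a shell-ready command line for inspection.
    argv = docker_ssh_command("203.0.113.7", "ubuntu", "root", "/tmp/key")
    print(" ".join(shlex.quote(a) for a in argv))
```

Routing through the host this way is consistent with the later bullet that removes the inbound rules for port 10022, since only port 22 on the host has to be reachable from outside.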
Showing 31 changed files with 854 additions and 89 deletions.
11 changes: 9 additions & 2 deletions docs/source/reference/yaml-spec.rst
@@ -109,8 +109,15 @@ Available fields:
  tpu_vm: False # False to use TPU nodes (the default); True to use TPU VMs.
  # Custom image id (optional, advanced). The image id used to boot the
- # instances. Only supported for AWS and GCP. If not specified, SkyPilot
- # will use the default debian-based image suitable for machine learning tasks.
+ # instances. Only supported for AWS and GCP (for non-docker image). If not
+ # specified, SkyPilot will use the default debian-based image suitable for
+ # machine learning tasks.
+ #
+ # Docker support
+ # You can specify docker image to use by setting the image_id to
+ # `docker:<image name>` for Azure, AWS and GCP. For example,
+ # image_id: docker:ubuntu:latest
+ # Currently, only debian and ubuntu images are supported.
  #
  # AWS
  # To find AWS AMI ids: https://leaherb.com/how-to-find-an-aws-marketplace-ami-image-id
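
The `docker:<image name>` convention described in the spec comment above can be illustrated with a small parsing sketch. The helper name `extract_docker_image` and the constant `DOCKER_PREFIX` are hypothetical and are not claimed to match SkyPilot's implementation. The sketch also assumes `image_id` is a plain string or `None`; the commit message indicates it can be a dict as well, which is not handled here.

```python
from typing import Optional

# Prefix taken from the spec comment; the constant name is an assumption.
DOCKER_PREFIX = "docker:"


def extract_docker_image(image_id: Optional[str]) -> Optional[str]:
    """Return the image name for 'docker:<image>' ids, otherwise None.

    Hypothetical helper for illustration only. A None image_id (no custom
    image requested) and non-docker ids such as AMI ids both yield None.
    """
    if image_id is None:
        return None
    if not image_id.startswith(DOCKER_PREFIX):
        return None
    image = image_id[len(DOCKER_PREFIX):]
    if not image:
        raise ValueError("image_id 'docker:' is missing an image name")
    return image


if __name__ == "__main__":
    for candidate in ("docker:ubuntu:20.04", "ami-0123456789abcdef0", None):
        print(candidate, "->", extract_docker_image(candidate))
```

Only the first `docker:` prefix is stripped, so the tag separator in an id such as `docker:ubuntu:20.04` stays part of the returned image name.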
11 changes: 11 additions & 0 deletions examples/job_queue/cluster_docker.yaml
@@ -0,0 +1,11 @@
# A dummy task for cluster creation.
#
# Runs a dummy task that provision a cluster.
#
# Usage:
#   sky launch -c djq cluster_docker.yaml
#   sky exec djq job_docker.yaml

resources:
  accelerators: T4
  image_id: docker:ubuntu:20.04
24 changes: 24 additions & 0 deletions examples/job_queue/job_docker.yaml
@@ -0,0 +1,24 @@
# A task submitted to an existing cluster.
#
# Runs a task on a existing cluster with docker.
#
# Usage:
#   sky launch -c djq cluster_docker.yaml
#   sky exec djq job_docker.yaml

name: job_docker

resources:
  accelerators: T4:0.5
  image_id: docker:ubuntu:20.04

setup: |
  echo "running setup"
run: |
  timestamp=$(date +%s)
  conda env list
  for i in {1..120}; do
    echo "$timestamp $i"
    sleep 1
  done