
[Provisioner] Open ports on AWS & GCP #2210

Merged: 45 commits merged into skypilot-org:master on Aug 1, 2023

Conversation

@cblmemo (Collaborator) commented Jul 11, 2023

This PR resolves #1255 on AWS & GCP.

Interface:

resources:
  cloud: gcp
  ports:
    - 33828
    - 18000-18100
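
For reference, each entry is either a single port or an inclusive range. Below is a minimal sketch of how such entries can be interpreted (illustrative only; this is not the PR's actual parsing code):

# Illustrative sketch: expand port entries such as 33828 or '18000-18100'
# (an inclusive range) into a flat list of integer ports.
from typing import List, Union

def expand_ports(entries: List[Union[int, str]]) -> List[int]:
    ports: List[int] = []
    for entry in entries:
        text = str(entry)
        if '-' in text:
            start, end = (int(x) for x in text.split('-'))
            ports.extend(range(start, end + 1))  # range is inclusive
        else:
            ports.append(int(text))
    return ports

print(expand_ports([33828, '18000-18100'])[:3])  # [33828, 18000, 18001]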

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch --cloud gcp; sky down <cluster-name> and check firewall rules on the GCP console
    • sky launch --cloud aws; sky down <cluster-name> and check firewall rules on the AWS console
    • Launching an HTTP server on a 2-node cluster
    • Stop & restart cluster
    • TPU-related tests: launch the following YAMLs and access http://<head_ip>:33828 in a web browser
name: mnist-tpu-node-port

resources:
  ports:
    - 33828
  instance_type: n1-highmem-8
  accelerators: tpu-v2-8
  accelerator_args:
    runtime_version: 2.7.0

workdir: examples/http_server_with_custom_ports  # For the server.py

run: python3 server.py

---

name: mnist-tpu-vm-port

resources:
  ports:
    - 33828
  accelerators: tpu-v2-8
  accelerator_args:
    runtime_version: tpu-vm-base
    tpu_vm: True

workdir: examples/http_server_with_custom_ports  # For the server.py

run: python3 server.py
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_gcp_http_server_with_custom_ports
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_aws_http_server_with_custom_ports
    • For some reason, all pytest tests marked with aws are skipped on my machine, so I manually ran the script and it performed as expected.
  • Backward compatibility tests: bash tests/backward_compatibility_tests.sh

@cblmemo (Collaborator Author) commented Jul 11, 2023

API design discussion: currently I use a section in ~/.sky/config.yaml, since the ports are opened for all VMs launched by SkyPilot rather than only for the VMs launched for a specific task. We might want to discuss this, since configuring ~/.sky/config.yaml is not intuitive and would break our one-click workflow for a single task.yaml.

@concretevitamin (Member) replied:

> API design discussion: currently I use a section in ~/.sky/config.yaml, since the ports are opened for all VMs launched by SkyPilot rather than only for the VMs launched for a specific task. We might want to discuss this, since configuring ~/.sky/config.yaml is not intuitive and would break our one-click workflow for a single task.yaml.

Agreed that it's not intuitive. If we put ports in task YAML, can we make it work as follows: create or reuse a port-specific security group (AWS concept; for other clouds we need to check) that opens the requested ports, then let this cluster use this specific security group?

@cblmemo (Collaborator Author) commented Jul 11, 2023

> API design discussion: currently I use a section in ~/.sky/config.yaml, since the ports are opened for all VMs launched by SkyPilot rather than only for the VMs launched for a specific task. We might want to discuss this, since configuring ~/.sky/config.yaml is not intuitive and would break our one-click workflow for a single task.yaml.

> Agreed that it's not intuitive. If we put ports in task YAML, can we make it work as follows: create or reuse a port-specific security group (AWS concept; for other clouds we need to check) that opens the requested ports, then let this cluster use this specific security group?

If I remember correctly, the overhead of creating a VPC network in GCP is not negligible; I'll do some benchmarking later.

BTW, the concept of a port-specific security group might become complex when we consider that users might want to open two or more ports in a single task. For example, in task1.yaml we have

ports:
  - TCP:10022
  - UDP:10023

and in task2.yaml we have

ports:
  - TCP:10022
  - UDP:10024

IIUC, these two tasks cannot use the same security group, and it is likely that no two tasks could share one security group. Considering that checking whether an existing security group meets the requirements takes non-negligible time (and likely none of the existing security groups would satisfy the requirements anyway), we might want to consider task-specific security groups too.

cblmemo added a commit to cblmemo/skypilot that referenced this pull request Jul 12, 2023
@cblmemo (Collaborator Author) commented Jul 13, 2023

Current implementation logic: each cluster has its own security group (for GCP, a set of firewall rules).

For GCP, we create a new firewall rule for each specified port. The rule applies only to the VMs inside the current cluster and is deleted when the cluster is terminated.

For AWS, we create a security group for each cluster and delete it when the cluster is terminated.
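
A hedged boto3 sketch of the AWS side of this logic (illustrative only; the resource naming and rule shape here are assumptions, not the PR's implementation):

# Illustrative sketch: one security group per cluster, opened for the
# requested ports at launch and deleted at cluster termination.
import boto3

def open_ports(cluster_name: str, ports: list, vpc_id: str) -> str:
    ec2 = boto3.client('ec2')
    sg = ec2.create_security_group(
        GroupName=f'sky-sg-{cluster_name}',  # hypothetical naming scheme
        Description=f'SkyPilot ports for cluster {cluster_name}',
        VpcId=vpc_id)
    ec2.authorize_security_group_ingress(
        GroupId=sg['GroupId'],
        IpPermissions=[{
            'IpProtocol': 'tcp',
            'FromPort': port,
            'ToPort': port,
            'IpRanges': [{'CidrIp': '0.0.0.0/0'}],
        } for port in ports])
    return sg['GroupId']

def cleanup_ports(security_group_id: str) -> None:
    # Called on cluster termination: delete the cluster's security group.
    boto3.client('ec2').delete_security_group(GroupId=security_group_id)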

@cblmemo (Collaborator Author) commented Jul 14, 2023

Update:

  • It now raises an error if ports are specified together with a cloud other than GCP or AWS;
  • Auto failover is enabled: when no cloud is specified and ports are requested, only GCP and AWS are considered (a sketch of the idea follows this list);
  • Cleanup after failover is handled here:
    CloudVmRayBackend().teardown_no_lock(handle,
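
A minimal sketch of the failover restriction described above (illustrative only; the function and constant names are assumptions, not SkyPilot's actual optimizer code):

# Illustrative sketch: when ports are requested, only clouds that can open
# them (AWS and GCP in this PR) remain candidates for provisioning.
from typing import List

CLOUDS_SUPPORTING_PORTS = {'AWS', 'GCP'}

def filter_clouds_for_ports(candidate_clouds: List[str],
                            ports_requested: bool) -> List[str]:
    if not ports_requested:
        return candidate_clouds
    eligible = [c for c in candidate_clouds if c in CLOUDS_SUPPORTING_PORTS]
    if not eligible:
        raise ValueError('Ports were requested, but none of the candidate '
                         f'clouds support opening ports: {candidate_clouds}')
    return eligible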

@cblmemo (Collaborator Author) commented Jul 25, 2023

@Michaelvll The PR is ready for another review 🙂

@Michaelvll (Collaborator) left a comment:

Thanks for the update of this PR @cblmemo! The code looks mostly good to me, but there is an issue with the TPU VM.

Comment on lines 153 to 154
# We don't wait for the instances to be terminated, as it can take a long
# time (same as what we did in ray's node_provider).
Collaborator:

On GCP, do we need to wait for the instance to actually be terminated before cleaning up the ports, or is it fine to clean up the ports while the instance is still in the TERMINATING state?

Here, we don't wait for the instance to actually be terminated (deleted).

@cblmemo (Collaborator Author) replied Jul 28, 2023:

The firewall rules have no dependency on the VM; I think it is safe to clean up the ports while the instance is still terminating.

Comment on lines 168 to 169
sg_name = (f'sky-sg-{common_utils.user_and_hostname_hash()}'
f'-{common_utils.hash_cluster_name(cluster_name)}')
Collaborator:

How about the following, for better readability in the console?

Suggested change
sg_name = (f'sky-sg-{common_utils.user_and_hostname_hash()}'
f'-{common_utils.hash_cluster_name(cluster_name)}')
sg_name = (f'sky-sg-{common_utils.user_and_hostname_hash()}'
f'-{cluster_name[:10]}-{common_utils.hash_cluster_name(cluster_name)}')

The 10 may need to be changed based on the total length of the string.

Collaborator Author (@cblmemo):

In fact, the common_utils.hash_cluster_name function already implements this: it checks the cluster name length, and if the name is too long it truncates it and appends a hash; otherwise it returns the original cluster name. The function name is a little bit confusing 🤧 I'll consider another name.

Collaborator Author (@cblmemo):

It is now renamed to truncate_and_hash_cluster_name. If you have a better name, please let me know 👀
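
A minimal sketch of the truncate-and-hash behavior described above (illustrative only; the real common_utils.truncate_and_hash_cluster_name may use different length limits and a different hash):

# Illustrative sketch: short names are returned unchanged; long names are
# truncated and a short hash suffix is appended to keep them distinguishable.
import hashlib

def truncate_and_hash_cluster_name(cluster_name: str,
                                   max_length: int = 15) -> str:
    if len(cluster_name) <= max_length:
        return cluster_name
    digest = hashlib.md5(cluster_name.encode()).hexdigest()[:4]
    return f'{cluster_name[:max_length - 5]}-{digest}'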

sky/skylet/providers/gcp/config.py (comment outdated, resolved)
Comment on lines 882 to 886
if "tags" not in node_config:
node_config["tags"] = {"items": []}
# Add cluster name to the tags so that firewall rules will apply to
# the created VM.
node_config["tags"]["items"].append(config["cluster_name"])
Collaborator:

This does not seem to work with the TPU VM. To reproduce:

sky launch -c test-tpuvm examples/tpu/tpuvm_mnist.yaml --use-spot -i 0 --down

The following error occurs even when the user does not specify any ports.

Got googleapiclient.errors.HttpError: <HttpError 400 when requesting https://tpu.googleapis.com/v2alpha1/projects/skypilot-375900/locations/us-central1-c/nodes?nodeId=ray-test-tpuvm-head-480658f3-tpu&alt=json returned "Invalid value at 'node' (tags), Starting an object on a scalar field". Details: "[{'@type': 'type.googleapis.com/google.rpc.BadRequest', 'fieldViolations': [{'field': 'node', 'description': "Invalid value at 'node' (tags), Starting an object on a scalar field"}]}]">

It would be nice to run the smoke tests as well. :)

Collaborator Author (@cblmemo):

Resolved. Also, this version, which disables ports with TPU, has passed all smoke tests. 🫡
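
For context, the HttpError above is a schema mismatch: GCE instances take network tags as an object ({"items": [...]}), while the TPU API takes tags as a plain list of strings. A hedged sketch of a guard for that difference is below (illustrative only; as noted above, the PR ultimately disabled ports for TPU rather than adopting this):

# Illustrative sketch of the schema difference that triggered the error.
# GCE instance: node_config['tags'] == {'items': ['tag1', ...]}
# TPU node:     node_config['tags'] == ['tag1', ...]
def add_cluster_tag(node_config: dict, cluster_name: str,
                    is_tpu_node: bool) -> None:
    if is_tpu_node:
        node_config.setdefault('tags', []).append(cluster_name)
    else:
        tags = node_config.setdefault('tags', {})
        tags.setdefault('items', []).append(cluster_name)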

A user commented:

Hi,
When I use this feature and then take down a cluster, I get the error below:

Traceback (most recent call last):
File "/usr/local/bin/sky", line 8, in
sys.exit(cli())
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1128, in call
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.8/dist-packages/sky/utils/common_utils.py", line 239, in _record
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/sky/cli.py", line 1110, in invoke
return super().invoke(ctx)
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/sky/utils/common_utils.py", line 260, in _record
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/sky/cli.py", line 2500, in down
_down_or_stop_clusters(clusters,
File "/usr/local/lib/python3.8/dist-packages/sky/cli.py", line 2755, in _down_or_stop_clusters
subprocess_utils.run_in_parallel(_down_or_stop, clusters)
File "/usr/local/lib/python3.8/dist-packages/sky/utils/subprocess_utils.py", line 54, in run_in_parallel
return list(p.imap(func, args))
File "/usr/lib/python3.8/multiprocessing/pool.py", line 868, in next
raise value
File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/usr/local/lib/python3.8/dist-packages/sky/cli.py", line 2727, in _down_or_stop
core.down(name, purge=purge)
File "/usr/local/lib/python3.8/dist-packages/sky/utils/common_utils.py", line 260, in _record
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/sky/core.py", line 365, in down
backend.teardown(handle, terminate=True, purge=purge)
File "/usr/local/lib/python3.8/dist-packages/sky/utils/common_utils.py", line 260, in _record
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/sky/utils/common_utils.py", line 239, in _record
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/sky/backends/backend.py", line 110, in teardown
self._teardown(handle, terminate, purge)
File "/usr/local/lib/python3.8/dist-packages/sky/backends/cloud_vm_ray_backend.py", line 3122, in _teardown
self.teardown_no_lock(
File "/usr/local/lib/python3.8/dist-packages/sky/backends/cloud_vm_ray_backend.py", line 3404, in teardown_no_lock
self.post_teardown_cleanup(handle, terminate, purge)
File "/usr/local/lib/python3.8/dist-packages/sky/backends/cloud_vm_ray_backend.py", line 3622, in post_teardown_cleanup
provision_api.cleanup_ports(repr(cloud), handle.cluster_name,
File "/usr/local/lib/python3.8/dist-packages/sky/provision/init.py", line 29, in _wrapper
return impl(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/sky/provision/aws/instance.py", line 121, in cleanup_ports
list(sgs)[0].delete()
File "/usr/local/lib/python3.8/dist-packages/boto3/resources/factory.py", line 580, in do_action
response = action(self, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/boto3/resources/action.py", line 88, in call
response = getattr(parent.meta.client, operation_name)(*args, **params)
File "/usr/local/lib/python3.8/dist-packages/botocore/client.py", line 534, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/local/lib/python3.8/dist-packages/botocore/client.py", line 976, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (DependencyViolation) when calling the DeleteSecurityGroup operation: resource sg-01d0e8429fdc77751 has a dependent object

It seems the security group cleanup is hitting an error?

Thanks.
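
For what it's worth, DependencyViolation on DeleteSecurityGroup usually means the security group is still attached to an instance that has not finished terminating. A hedged mitigation sketch (illustrative only; not necessarily the fix that landed in this PR) is to retry the deletion until the dependency clears:

# Illustrative sketch: retry security group deletion while the instance is
# still detaching/terminating.
import time
import boto3
from botocore.exceptions import ClientError

def delete_security_group_with_retry(group_id: str,
                                     attempts: int = 10,
                                     delay_seconds: int = 5) -> None:
    ec2 = boto3.client('ec2')
    for _ in range(attempts):
        try:
            ec2.delete_security_group(GroupId=group_id)
            return
        except ClientError as e:
            if e.response['Error']['Code'] != 'DependencyViolation':
                raise
            time.sleep(delay_seconds)  # instance likely still terminating
    raise RuntimeError(f'Could not delete security group {group_id}')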

Collaborator Author (@cblmemo):

Which commit are you on? This is a bug that should have been fixed, and I cannot reproduce this error on the latest commit.
Also, could you provide some more information, such as the provisioning log, the teardown log, the task.yaml you used, and the security group name shown in the cloud console?

The user replied:

Thanks for the reply. The environment has been destroyed. I found that this change was not committed to master, so I used this repo: https://github.com/skypilot-org/skypilot/tree/f3ffaaf25744a8fe0926e89dcc0a8ee69febd690. I am not sure if it is suitable.

Collaborator Author (@cblmemo):

For the corrupted environment, maybe you could use sky down -p <cluster_name> to force-teardown the cluster and manually clean up the resources in the cloud console; then you could use the stable version on the current master branch.

The user replied:

So you mean the stable version on master now supports the open-ports feature?

@cblmemo (Collaborator Author) replied Aug 2, 2023:

Yes, we just merged this PR. By checking out master's HEAD and installing from source, you can use the open-ports feature now.

The user replied:

Oh, that's very nice. Thanks a lot for the great job that you have done.

@Michaelvll (Collaborator) left a comment:

Thanks for adding this feature @cblmemo! Many users have requested it.

(Maybe in another PR) We should add an item about the permissions needed for opening ports to our minimal-permissions documentation for AWS/GCP: https://skypilot.readthedocs.io/en/latest/cloud-setup/cloud-permissions/gcp.html#user

I just tried launching JupyterLab on VMs on AWS and GCP, and also on a TPU node / TPU VM. All work well.

Here is my testing YAML; I guess we can add a Jupyter example in the future.

resources:
  # accelerators: tpu-v2-8
  # accelerator_args:
  #   runtime_version: 2.12.1
  # use_spot: true
  ports:
    - 29324


run: |
  pip install jupyter
  myip=`curl ifconfig.me`
  echo $myip
  jupyter lab --port 29324 --no-browser --ip=0.0.0.0

sky/skylet/providers/gcp/config.py (comment resolved)
@@ -0,0 +1,7 @@
resources:
Collaborator:

This example is nice. It would be great to have an end-to-end Jupyter notebook example in a future PR. :)


workdir: ./examples/http_server_with_custom_ports

run: python3 server.py
Collaborator:

Is it possible to provide an env var in the run and setup sections so that the user can do something like the following:

run: |
  # python serve.py --port $SKYPILOT_PORT
  jupyter notebook --port $SKYPILOT_PORT

Collaborator Author (@cblmemo):

What is the expected behavior when multiple ports are specified? Undefined?

Collaborator Author (@cblmemo):

Oh, never mind. I guess I could follow what SKYPILOT_NODE_IPS does: multiple lines, with each line representing one item in our task.yaml.
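
A hedged sketch of how a task could consume such a variable, assuming a hypothetical SKYPILOT_PORTS environment variable formatted like SKYPILOT_NODE_IPS (one entry per line); this variable does not exist as of this PR:

# Illustrative sketch: read a hypothetical newline-separated SKYPILOT_PORTS
# variable (modeled after SKYPILOT_NODE_IPS) and bind a server to the first
# opened port.
import os

ports = os.environ.get('SKYPILOT_PORTS', '').splitlines()
if ports:
    port = int(ports[0])
    print(f'Serving on port {port}')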

@cblmemo cblmemo merged commit fa4dd65 into skypilot-org:master Aug 1, 2023
@cblmemo cblmemo deleted the open-ports branch August 1, 2023 22:58
cblmemo added a commit that referenced this pull request Aug 11, 2023
* successfully launched on GCP

* update image with rsync installed && fix name error in conda command

* support dockerimage on task yaml && reformat code

* recoup sudo in setup commands

* reformat code

* support direct ssh to docker && fix mesg warning && reformat

* fix gcp port issue

* aws successfully launched

* fix port range

* successfullt launched on Azure

* remove unused exception class

* reformat setup command to azure.py in plain text for readbility

* fix job queue cannot cancel

* reformat code

* switch to port 22 for docker ssh

* move docker image to resources: image_id: docker:<image> and change to an optional function

* minor fix for ray yaml j2

* fix error when image_id is a dict

* support muli-node

* mode docker user setup after handle is created

* support images without rsync

* add aws && azure support

* remove redundant pip3 install

* remove error merging

* move docker image to resources

* minor fixes

* format

* adjust extrack docker image

* change back to port 22 for host and port 10022 for docker, passed stop-start recovery test

* add ulimit and gcp 10022 enable outside oslogin

* fix gcp suthentication & move ssh authorized keys setup to run_init

* fix len(image_id) when image_id is None

* use docker stop & start to recover

* temporary remove conflict

* add back docker user

* fix wrong username in add job

* use ssh jump server to access docker

* remove inbloud rules of 10022

* update comment

* now ssh into docker

* ux: raise rather than assert

* move some setup commands to SkyDockerCommandRunner

* monir fix

* format

* fix multinode ssh config

* Update sky/backends/backend_utils.py

Co-authored-by: Zhanghao Wu <[email protected]>

* Update sky/backends/backend_utils.py

Co-authored-by: Zhanghao Wu <[email protected]>

* Update sky/backends/backend_utils.py

Co-authored-by: Zhanghao Wu <[email protected]>

* Update sky/clouds/azure.py

Co-authored-by: Zhanghao Wu <[email protected]>

* Update sky/skylet/providers/gcp/config.py

Co-authored-by: Zhanghao Wu <[email protected]>

* minor fixes

* format

* move get_docker_user to backend_utils.py

* move two constants to skylet and move resources vars to make deployment vars

* move SkyDockerCommandRunner to skylet/providers

* format

* support proxy command with docker

* Update sky/backends/backend_utils.py

Co-authored-by: Zhanghao Wu <[email protected]>

* Update sky/clouds/aws.py

Co-authored-by: Zhanghao Wu <[email protected]>

* add comment

* quote docker ssh proxy command

* explicit checking for Optional object

* fix credentials

* remove -m in bash script

* fix job queue owner

* fix job owner and username

* add job queue smoke test

* add test_docker_preinstalled_package

* fix restart error

* nit: code style for proxy command

* update CloudVmRayResourceHandle version

* disable unattended-upgrade with cloud-init

* fix UnboundLocalError

* fix variable shadow

* move checking for targetTags to #2210

* rename

* restore some deprecated changes

* format

* disable tpu with docker

* fix acc inexisting problem

* add progress bar to docker image pulling

* Apply suggestions from code review

Co-authored-by: Zhanghao Wu <[email protected]>

* Apply suggestions from code review

* Try cloud init without base64 encode

Co-authored-by: Zhanghao Wu <[email protected]>

* Revert "Try cloud init without base64 encode"

This reverts commit 418c912.

* add comment in cmd runner

* add failover for clous that not support docker yet

* temporary remove check of docker image

* add docker to resources.get_required_cloud_features

* format

* install pip in run_init

* format

* disable ssh control when docker is used

* add docker user to resource handle's repr

* stash some changes for easier merge

* minor

* add back previously stashed function

* disable docker with proxy for now

* change constants.DEFAULT_DOCKER_PORT to int

* rename NATIVE_DOCKER_SUPPORT to DOCKER_IMAGE

* add todo for only support debian-based images

* add check for proxy command

* fix docker_config={}

* upd docker test

* upd docker test

* move proxy command check to cloud.check_features_are_supported

---------

Co-authored-by: Zhanghao Wu <[email protected]>
Successfully merging this pull request may close these issues.

[Provisioner] Add support for opening ports on VMs