[Provisioner] Open ports on AWS & GCP #2210
Conversation
API design discussion: Currently I use a section in …
Agreed that it's not intuitive. If we put …
In my vague memory, the overhead of creating a VPC network in GCP is not negligible; gonna do some benchmarks later. BTW, the concept of a port-specific security group might become complex when considering that users might want to open 2 or more ports in one single task. For example, in

```yaml
ports:
  - TCP:10022
  - UDP:10023
```

and in

```yaml
ports:
  - TCP:10022
  - UDP:10024
```

IIUC, these two tasks cannot use the same security group, and it is likely that neither task could share one security group with the other. Considering that checking whether a security group meets the requirements takes a non-negligible amount of time (and likely none of the existing security groups satisfy the requirements), we might want to consider task-specific security groups too.
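To make the mismatch concrete, a tiny sketch (hypothetical structures, not SkyPilot code):

```python
# Each task's required inbound rules, as (protocol, port) pairs.
task_a = {('tcp', 10022), ('udp', 10023)}
task_b = {('tcp', 10022), ('udp', 10024)}

# A security group shared by both tasks would have to open the union of
# the two rule sets, which over-exposes each task to the other's ports.
shared_rules = task_a | task_b
assert shared_rules != task_a and shared_rules != task_b
```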
Current implementation logic: each cluster has its own security group (for GCP, a series of firewall rules). For GCP, we create a new firewall rule for each port specified; this rule only applies to the VMs inside the current cluster and is deleted on termination of the cluster. For AWS, we create a security group for each cluster and delete it on termination of the cluster.
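As a rough illustration of the GCP side, a per-port firewall rule scoped to the cluster via a target tag might look like the following. This is a minimal sketch of a Compute Engine firewall resource body; the naming scheme, source range, and values are placeholders, not the PR's exact code:

```python
cluster_name = 'my-cluster'  # placeholder
port = 10022                 # placeholder

# Sketch of a Compute Engine firewall rule body for one opened port.
# The target tag matches the cluster name added to each VM's tags, so
# the rule only applies to VMs in this cluster.
firewall_body = {
    'name': f'sky-{cluster_name}-tcp-{port}',  # hypothetical naming scheme
    'direction': 'INGRESS',
    'allowed': [{'IPProtocol': 'tcp', 'ports': [str(port)]}],
    'sourceRanges': ['0.0.0.0/0'],
    'targetTags': [cluster_name],
}
```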
UPD: …
@Michaelvll The PR is ready for another review 🙂 |
Thanks for the update of this PR @cblmemo! The code looks mostly good to me, but there is an issue with TPU VM.
```python
# We don't wait for the instances to be terminated, as it can take a long
# time (same as what we did in ray's node_provider).
```
On GCP, does it need to wait for the instance to be actually terminated before cleaning up the ports, or is it fine to clean up the ports while the instance is in the TERMINATING status? Here, we don't wait for the instance to be actually terminated (deleted).
The firewall rules have no dependency on the VM; I think it is safe to clean up the ports while the instance is terminating.
sky/provision/aws/instance.py (Outdated)

```python
sg_name = (f'sky-sg-{common_utils.user_and_hostname_hash()}'
           f'-{common_utils.hash_cluster_name(cluster_name)}')
```
How about the following for better readability from the console?

```diff
-sg_name = (f'sky-sg-{common_utils.user_and_hostname_hash()}'
-           f'-{common_utils.hash_cluster_name(cluster_name)}')
+sg_name = (f'sky-sg-{common_utils.user_and_hostname_hash()}'
+           f'-{cluster_name[:10]}-{common_utils.hash_cluster_name(cluster_name)}')
```
The `10` may need to be changed based on the total length of the string.
In fact, the `common_utils.hash_cluster_name` function already implements this: it checks the cluster name length, truncates it and appends a hash if it is too long, and otherwise just returns the original cluster name. The function name is a little bit confusing 🤧 I'll consider another name.
Now renamed to `truncate_and_hash_cluster_name`. If you have a better name, please let me know 👀
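For readers following along, a minimal sketch of what such a helper does; the real implementation lives in `common_utils`, and the length budget and hash scheme below are assumptions:

```python
import hashlib

def truncate_and_hash_cluster_name(cluster_name: str, max_len: int = 15) -> str:
    # Short names pass through unchanged.
    if len(cluster_name) <= max_len:
        return cluster_name
    # Long names are truncated, with a short hash appended so the result
    # stays distinguishable from other truncated names.
    digest = hashlib.md5(cluster_name.encode()).hexdigest()[:4]
    return f'{cluster_name[:max_len - len(digest) - 1]}-{digest}'
```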
sky/skylet/providers/gcp/config.py (Outdated)

```python
if "tags" not in node_config:
    node_config["tags"] = {"items": []}
# Add cluster name to the tags so that firewall rules will apply to
# the created VM.
node_config["tags"]["items"].append(config["cluster_name"])
```
This does not seem to work with TPU VM. To reproduce:

```
sky launch -c test-tpuvm examples/tpu/tpuvm_mnist.yaml --use-spot -i 0 --down
```

The following error occurs even when the user does not specify any ports:

```
Got googleapiclient.errors.HttpError: <HttpError 400 when requesting https://tpu.googleapis.com/v2alpha1/projects/skypilot-375900/locations/us-central1-c/nodes?nodeId=ray-test-tpuvm-head-480658f3-tpu&alt=json returned "Invalid value at 'node' (tags), Starting an object on a scalar field". Details: "[{'@type': 'type.googleapis.com/google.rpc.BadRequest', 'fieldViolations': [{'field': 'node', 'description': "Invalid value at 'node' (tags), Starting an object on a scalar field"}]}]">
```

It would be nice to run the smoke tests as well. : )
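For context, the 400 above is consistent with the two APIs using different shapes for `tags`. A sketch of the request bodies (field shapes per the public GCP APIs; values are placeholders):

```python
# Compute Engine instances: `tags` is an object wrapping an "items" list,
# which is what the config.py snippet above appends to.
compute_node_config = {'tags': {'items': ['my-cluster']}}

# TPU VM nodes (tpu.googleapis.com): `tags` is a plain list of strings,
# so sending the Compute-style object triggers
# "Invalid value at 'node' (tags), Starting an object on a scalar field".
tpu_node_config = {'tags': ['my-cluster']}
```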
Resolved. Plus, this version, which disables ports for TPU, has passed all smoke tests. 🫡
Hi, when I used this feature to take down a cluster, I got an error as below:
```
Traceback (most recent call last):
  File "/usr/local/bin/sky", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/dist-packages/sky/utils/common_utils.py", line 239, in _record
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/sky/cli.py", line 1110, in invoke
    return super().invoke(ctx)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/sky/utils/common_utils.py", line 260, in _record
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/sky/cli.py", line 2500, in down
    _down_or_stop_clusters(clusters,
  File "/usr/local/lib/python3.8/dist-packages/sky/cli.py", line 2755, in _down_or_stop_clusters
    subprocess_utils.run_in_parallel(_down_or_stop, clusters)
  File "/usr/local/lib/python3.8/dist-packages/sky/utils/subprocess_utils.py", line 54, in run_in_parallel
    return list(p.imap(func, args))
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.8/dist-packages/sky/cli.py", line 2727, in _down_or_stop
    core.down(name, purge=purge)
  File "/usr/local/lib/python3.8/dist-packages/sky/utils/common_utils.py", line 260, in _record
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/sky/core.py", line 365, in down
    backend.teardown(handle, terminate=True, purge=purge)
  File "/usr/local/lib/python3.8/dist-packages/sky/utils/common_utils.py", line 260, in _record
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/sky/utils/common_utils.py", line 239, in _record
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/sky/backends/backend.py", line 110, in teardown
    self._teardown(handle, terminate, purge)
  File "/usr/local/lib/python3.8/dist-packages/sky/backends/cloud_vm_ray_backend.py", line 3122, in _teardown
    self.teardown_no_lock(
  File "/usr/local/lib/python3.8/dist-packages/sky/backends/cloud_vm_ray_backend.py", line 3404, in teardown_no_lock
    self.post_teardown_cleanup(handle, terminate, purge)
  File "/usr/local/lib/python3.8/dist-packages/sky/backends/cloud_vm_ray_backend.py", line 3622, in post_teardown_cleanup
    provision_api.cleanup_ports(repr(cloud), handle.cluster_name,
  File "/usr/local/lib/python3.8/dist-packages/sky/provision/__init__.py", line 29, in _wrapper
    return impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/sky/provision/aws/instance.py", line 121, in cleanup_ports
    list(sgs)[0].delete()
  File "/usr/local/lib/python3.8/dist-packages/boto3/resources/factory.py", line 580, in do_action
    response = action(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/boto3/resources/action.py", line 88, in __call__
    response = getattr(parent.meta.client, operation_name)(*args, **params)
  File "/usr/local/lib/python3.8/dist-packages/botocore/client.py", line 534, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/botocore/client.py", line 976, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (DependencyViolation) when calling the DeleteSecurityGroup operation: resource sg-01d0e8429fdc77751 has a dependent object
```
It seems the security group cleanup is hitting some errors?
Thanks.
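For context, a `DependencyViolation` on `DeleteSecurityGroup` usually means the group is still attached to a network interface of an instance that has not finished terminating. A minimal sketch of an ordering that avoids it (boto3, with placeholder region and IDs; not necessarily how the PR resolves this):

```python
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')  # placeholder region
instance_ids = ['i-0123456789abcdef0']  # the cluster's instances (placeholder)

# Wait until the instances are fully terminated, so their network
# interfaces no longer reference the security group...
ec2.get_waiter('instance_terminated').wait(InstanceIds=instance_ids)

# ...then the security group can be deleted without a DependencyViolation.
ec2.delete_security_group(GroupId='sg-01d0e8429fdc77751')
```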
Which commit are you at? This is a bug that should have been fixed, and I cannot reproduce this error on the latest commit.
Also, could you provide some more information, such as the provisioning log, the teardown log, the task.yaml you used, and the security group name shown in the cloud console?
Thanks for the reply. The environment has been destroyed. I found that this change was not committed to master, so I used this repo: https://github.com/skypilot-org/skypilot/tree/f3ffaaf25744a8fe0926e89dcc0a8ee69febd690. I am not sure if it is suitable.
For the corrupted environment, maybe you could use `sky down -p <cluster_name>` to force-teardown the cluster and manually clean up the resources in the cloud console; then you could use the stable version on the current master branch.
So you mean the stable version on master supports the open-ports feature now?
Yes, we just merged this PR. By checking out master's HEAD and installing from source, you can use the open-ports feature now.
Oh, that's very nice. Thanks a lot for the great job that you have done.
Thanks for adding this feature @cblmemo! Many users have requested it.
(Maybe in another PR) In our docs on minimal permissions for AWS/GCP, we should probably add another item about the permissions needed for opening ports: https://skypilot.readthedocs.io/en/latest/cloud-setup/cloud-permissions/gcp.html#user
I just tried to launch jupyter lab on VMs on AWS, GCP, and also TPU node / TPU VM. All work well.
Here is my testing yaml; I guess we can have a jupyter example in the future.
```yaml
resources:
  # accelerators: tpu-v2-8
  # accelerator_args:
  #   runtime_version: 2.12.1
  # use_spot: true
  ports:
    - 29324

run: |
  pip install jupyter
  myip=`curl ifconfig.me`
  echo $myip
  jupyter lab --port 29324 --no-browser --ip=0.0.0.0
```
```yaml
resources:
  ...

workdir: ./examples/http_server_with_custom_ports

run: python3 server.py
```

This example is nice. It would be great to have an end-to-end example for jupyter notebook in a next PR. : )
Is it possible to provide an env var in the run and setup sections, so that the user can do something like the following:

```yaml
run: |
  # python serve.py --port $SKYPILOT_PORT
  jupyter notebook --port $SKYPILOT_PORT
```
What is the expected behavior when multiple ports are specified? Undefined?
Oh nvm. I guess I could reference what SKYPILOT_NODE_IPS does: multiple lines, where each line represents one item in our task.yaml.
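If such a variable followed the SKYPILOT_NODE_IPS convention (one item per line), user code could simply split it. A sketch, assuming a hypothetical SKYPILOT_PORTS variable that SkyPilot does not necessarily provide:

```python
import os

# Hypothetical: newline-separated ports mirroring the SKYPILOT_NODE_IPS
# format, one entry per item under `ports:` in task.yaml.
ports = os.environ.get('SKYPILOT_PORTS', '').splitlines()
first_port = int(ports[0]) if ports else 8888  # fall back to a default
print(f'Serving on port {first_port}')
```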
This PR resolves #1255 on AWS & GCP.
Interface:
Tested (run the relevant ones):
- bash format.sh
- sky launch --cloud gcp; sky down <cluster-name>, and check the firewall rules on the GCP console
- sky launch --cloud aws; sky down <cluster-name>, and check the security group on the AWS console
- Visit http://<head_ip>:33828 in a web browser
- pytest tests/test_smoke.py
- pytest tests/test_smoke.py::test_gcp_http_server_with_custom_ports
- pytest tests/test_smoke.py::test_aws_http_server_with_custom_ports
- bash tests/backward_compatibility_tests.sh