-
Notifications
You must be signed in to change notification settings - Fork 549
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GCP] Fix --disk-size
for Custom Machine Images
#2718
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for contributing to SkyPilot! This is exciting ; ) Left a suggestion 🙂
Just want to make sure, from your PR description it seems like the resize will happen only after a restart? Is that true? If this is the case, shall we add some CLI hints to notify the user?
…mages only, Address Formatting Issues
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the prompt fix! Left several comments 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation! Left several comments 👀
Every remaining issue we talked about in the meeting should have been resolved. Please let me know if any additional clarification is needed. |
Could you provide all the tests you have done? I don't see any TPUVM test in your PR description 🤔 |
Updated the PR descriptions to include additional tests for |
If this is the case, what if we specify a custom (machine) image with a size that is more than 100GB to a TPU VM? Will it directly fail? If so, we should remove related code in the |
Thanks for the prompt reply! No, it will not fail directly. This issue isn't specifically tied to the custom image itself or #2488. Even when launching a TPU VM with Here is the output of the
|
Then at least we should remove related code in the |
Yes, that's a good point. Regarding the issue tracking, I've already filed it under issue #2387. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost done..! Left several nits
Also, it would be great to run the smoke test to make sure the PR doesn't break anything ; )
Co-authored-by: Tian Xia <[email protected]>
Co-authored-by: Tian Xia <[email protected]>
Thank you very much for your contributions to this PR! Let me know if there are any more changes needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except for one nit! It should be good to go ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the great work @jackyk02! Left a comment. ; )
sky/resources.py
Outdated
@@ -834,13 +834,15 @@ def _try_validate_image_id(self) -> None: | |||
# Check the image exists and get the image size. | |||
# It will raise ValueError if the image does not exist. | |||
image_size = self.cloud.get_image_size(image_id, region) | |||
if image_size > self.disk_size: | |||
if image_size >= self.disk_size: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
=
seems unexpected. Do you mean you cannot create a machine with the same size as the image? I think we did try creating image with 256GB and launch a VM from the image with the same disk size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Users should be able to create instances with the same size as the image in the latest commit. I’ve updated the PR description to include additional tests for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix @jackyk02! Just tried this PR:
sky launch -c test-machine-image --cloud gcp "echo hi > hello.txt"
- create a machine image for test-machine-image on the dashboard
sky launch -c test-use-machine-image --cloud gcp --image-id projects/xxxx/global/machineIma ges/test-machine-image-resize --disk-size 500
It works well! I am going to merge this PR soon.
@@ -18,7 +18,7 @@ docker: | |||
{%- endif %} | |||
|
|||
provider: | |||
# We use a custom node provider for GCP to support instance stop and reuse. | |||
# We use a custom node provider for GCP to create, stop and reuse instances. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
Issue Addressed: #2488
As reported in the issue, it's clear that when launching a GCP cluster with a machine image originating from a 150 GB disk, specifying the
--disk-size 200
parameter results in an instance retaining the original 150 GB size.Based on GCP's documentation on resizing persistent disks, when we create a custom Linux image, we have to manually increase the size of the boot and non-boot disks. To achieve this, I used the Compute Engine API to modify disk sizes within our custom GCP node provider. Comments have been added to the code within the commits to provide additional context into the changes made. Please feel free to comment if further modifications are required.
Custom Image Test:
Launch the custom image with a specified disk size:
sky launch -c custom-image --cloud gcp --image-id [image id path] --disk-size 200
sky launch -c custom-image --cloud gcp --image-id [image id path] --disk-size 150
Confirmed the disk size has been increased to 200GB from the original 150GB for the first test, and ensured that users are able to create an instance with the image's original size.
Multi-Node Test:
sky launch -c my-cluster --num-nodes 8 --image-id [image id path] --cloud gcp --disk-size 200
TPU Test:
Launch the custom image with a specified TPU accelerator and disk size:
sky launch tpu_node.yaml -c custom-image --cloud gcp --image-id [image id path] --disk-size 200
, where the YAML file is provided below:Verified that the disk has been resized to 200GB for TPU node.
Note: TPU VMs include a 100GB boot disk that cannot be resized. Users have to add a Persistent Disk to expand their local disk capacity according to the documentation. This requirement may be associated with a feature request at: #2387