[GCP] Fix --disk-size for Custom Machine Images #2718

Merged
merged 27 commits into from
Nov 15, 2023
91fb6b6
initial commit for gcp resizing disk
jackyk02 Oct 18, 2023
3162784
disk resizing for new instances
jackyk02 Oct 19, 2023
2a7474d
Merge branch 'skypilot-org:master' into master
jackyk02 Oct 25, 2023
92053b7
Change from compute_v1 to googleapiclient, Disk Resizing for custom i…
jackyk02 Oct 26, 2023
5b845a9
Address Formatting Issues
jackyk02 Oct 26, 2023
0916d2c
Address Formatting Issues
jackyk02 Oct 26, 2023
432fbfc
update gcp-ray.yml comments
jackyk02 Oct 26, 2023
9c2771f
address reformatting
jackyk02 Oct 26, 2023
b95f717
Create documentation for custom image cloud permissions
jackyk02 Oct 26, 2023
8d75181
[GCP] Minimal permissions for custom image
jackyk02 Oct 26, 2023
6358322
Check exisiting disk size & Move implementation to GCPCompute
jackyk02 Oct 28, 2023
ac79d11
Fix Format
jackyk02 Oct 28, 2023
91234b8
Removed Unnecessary Imports & Return type
jackyk02 Oct 28, 2023
f480eda
remove unnecessary check and resizing for restart
jackyk02 Nov 6, 2023
30059e0
add disk resizing for TPUVMs
jackyk02 Nov 6, 2023
a3e3311
Add TODO for TPUVM resize
jackyk02 Nov 7, 2023
caf9846
Update reference to github issue for TPUVM
jackyk02 Nov 7, 2023
f0c0699
Return None for TPU Resize
jackyk02 Nov 7, 2023
df4adff
shorten comments for TPU function
jackyk02 Nov 7, 2023
f409bb9
Remove redundant check
jackyk02 Nov 7, 2023
ae554f7
Updated try_validate_image_id to raise error when users specify the s…
jackyk02 Nov 7, 2023
3eb0778
update try_validate_image_id
jackyk02 Nov 7, 2023
b639adb
update resources.py
jackyk02 Nov 7, 2023
a193406
Fix formatting for resources.py
jackyk02 Nov 7, 2023
c3f4fe9
Update resources.py
jackyk02 Nov 8, 2023
9cce23f
Update resources.py
jackyk02 Nov 8, 2023
b73f9a9
Allowing users to create instances with the same size as the image
jackyk02 Nov 14, 2023
17 changes: 13 additions & 4 deletions docs/source/cloud-setup/cloud-permissions/gcp.rst
Expand Up @@ -134,17 +134,26 @@ User
compute.firewalls.list
compute.firewalls.update

8. Click **Create** to create the role.
9. Go back to the "IAM" tab and click on **GRANT ACCESS**.
10. Fill in the email address of the user in the “Add principals” section, and select ``minimal-skypilot-role`` in the “Assign roles” section. Click **Save**.
8. **Optional**: If the user needs to use custom machine images with ``sky launch --image-id``, you can additionally add the following permissions:

.. code-block:: text

    compute.disks.get
    compute.disks.resize
    compute.images.get
    compute.images.useReadOnly

9. Click **Create** to create the role.
10. Go back to the "IAM" tab and click on **GRANT ACCESS**.
11. Fill in the email address of the user in the “Add principals” section, and select ``minimal-skypilot-role`` in the “Assign roles” section. Click **Save**.


.. image:: ../../images/screenshots/gcp/create-iam.png
:width: 80%
:align: center
:alt: GCP Grant Access

11. The user should receive an invitation to the project and should be able to setup SkyPilot by following the instructions in :ref:`Installation <installation-gcp>`.
12. The user should receive an invitation to the project and should be able to setup SkyPilot by following the instructions in :ref:`Installation <installation-gcp>`.

.. note::

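As a rough illustration of the optional permission set added in this docs change, the sketch below reports which of the four custom-image permissions a role still lacks. The function name `missing_custom_image_permissions` and the sample role contents are hypothetical and not part of SkyPilot; only the four permission strings come from the diff.

```python
# The four permissions listed in the docs change for custom machine images.
CUSTOM_IMAGE_PERMISSIONS = frozenset({
    "compute.disks.get",
    "compute.disks.resize",
    "compute.images.get",
    "compute.images.useReadOnly",
})


def missing_custom_image_permissions(granted):
    """Return the custom-image permissions absent from `granted`, sorted.

    `granted` is any iterable of permission strings (hypothetical input,
    for example copied from a role definition).
    """
    return sorted(CUSTOM_IMAGE_PERMISSIONS - set(granted))


# Hypothetical role that has only part of the optional set.
sample_role = [
    "compute.firewalls.list",
    "compute.disks.get",
    "compute.images.get",
]
missing = missing_custom_image_permissions(sample_role)
```

A check like this could run before `sky launch --image-id` to surface a clearer message than a failed resize call, though the PR itself does not add one.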
68 changes: 66 additions & 2 deletions sky/skylet/providers/gcp/node.py
Expand Up @@ -290,6 +290,16 @@ def create_instance(
        """
        return

    @abc.abstractmethod
    def resize_disk(
        self, base_config: dict, instance_name: str, wait_for_operation: bool = True
    ) -> dict:
        """Resize a Google Cloud disk based on the provided configuration.

        Returns the response of resize operation.
        """
        return

def create_instances(
self,
base_config: dict,
Expand Down Expand Up @@ -518,7 +528,6 @@ def _convert_resources_to_urls(
    def create_instance(
        self, base_config: dict, labels: dict, wait_for_operation: bool = True
    ) -> Tuple[dict, str]:

        config = self._convert_resources_to_urls(base_config)
        # removing TPU-specific default key set in config.py
        config.pop("networkConfig", None)
Expand Down Expand Up @@ -621,6 +630,53 @@ def delete_instance(self, node_id: str, wait_for_operation: bool = True) -> dict

        return result

    def resize_disk(
        self, base_config: dict, instance_name: str, wait_for_operation: bool = True
    ) -> dict:
        """Resize a Google Cloud disk based on the provided configuration."""

        # Extract the specified disk size from the configuration
        new_size_gb = base_config["disks"][0]["initializeParams"]["diskSizeGb"]

        # Fetch the instance details to get the disk name and current disk size
        response = (
            self.resource.instances()
            .get(
                project=self.project_id,
                zone=self.availability_zone,
                instance=instance_name,
            )
            .execute()
        )
        disk_name = response["disks"][0]["source"].split("/")[-1]

        try:
            # Execute the resize request and return the response
            operation = (
                self.resource.disks()
                .resize(
                    project=self.project_id,
                    zone=self.availability_zone,
                    disk=disk_name,
                    body={
                        "sizeGb": str(new_size_gb),
                    },
                )
                .execute()
            )
        except HttpError as e:
            # Catch HttpError when provided with invalid value for new disk size.
            # Allowing users to create instances with the same size as the image
            logger.warning(f"googleapiclient.errors.HttpError: {e.reason}")
            return {}

        if wait_for_operation:
            result = self.wait_for_operation(operation)
        else:
            result = operation

        return result


class GCPTPU(GCPResource):
    """Abstraction around GCP TPU resource"""
Expand Down Expand Up @@ -698,7 +754,6 @@ def _list_instances(
label_filters[TAG_RAY_CLUSTER_NAME] = self.cluster_name

def filter_instance(instance: GCPTPUNode) -> bool:

labels = instance.get_labels()
if label_filters:
for key, value in label_filters.items():
Expand Down Expand Up @@ -839,3 +894,12 @@ def delete_instance(self, node_id: str, wait_for_operation: bool = True) -> dict
            result = operation

        return result

    def resize_disk(
        self, base_config: dict, instance_name: str, wait_for_operation: bool = True
    ) -> dict:
        """
        TODO: Implement the feature to attach persistent disks for TPU VMs.
        The boot disk of TPU VMs is not resizable, and users need to add a
        persistent disk to expand disk capacity. Related issue: #2387
        """
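
To illustrate how `GCPCompute.resize_disk` in this file derives its request, here is a minimal sketch of the two pure steps: reading `diskSizeGb` from the node config and taking the disk name from the last segment of the boot disk's `source` URL. The helper names and the sample project, zone, and disk values are assumptions for illustration only; the method in the diff additionally issues the API calls, catches `HttpError`, and optionally waits for the operation.

```python
def disk_name_from_source(source_url: str) -> str:
    """Return the last path segment of a disk `source` URL."""
    return source_url.split("/")[-1]


def build_resize_request(base_config: dict, instance: dict) -> dict:
    """Assemble the disk name and request body used for a resize call.

    `base_config` follows the node-config shape read in the diff;
    `instance` stands in for the response of an instances().get() call.
    """
    new_size_gb = base_config["disks"][0]["initializeParams"]["diskSizeGb"]
    disk_name = disk_name_from_source(instance["disks"][0]["source"])
    # The diff sends the size as a string under "sizeGb".
    return {"disk": disk_name, "body": {"sizeGb": str(new_size_gb)}}


# Hypothetical inputs: project, zone, and disk names are made up.
sample_config = {"disks": [{"initializeParams": {"diskSizeGb": 256}}]}
sample_instance = {
    "disks": [
        {
            "source": "https://www.googleapis.com/compute/v1/projects/"
            "my-project/zones/us-central1-a/disks/my-node"
        }
    ]
}
request = build_resize_request(sample_config, sample_instance)
```

Only the first entry of `disks` is consulted, matching the diff, so this sketch assumes the boot disk is listed first.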
3 changes: 3 additions & 0 deletions sky/skylet/providers/gcp/node_provider.py
Expand Up @@ -286,6 +286,9 @@ def get_order_key(node):
count -= len(reuse_node_ids)
if count:
results = resource.create_instances(base_config, labels, count)
if "sourceMachineImage" in base_config:
for _, instance_id in results:
resource.resize_disk(base_config, instance_id)
result_dict.update(
{instance_id: result for result, instance_id in results}
)
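
The control flow of this hunk can be sketched in isolation: instances are created first, and a resize is requested for each new instance only when the node config contains `sourceMachineImage`. `FakeResource` and `create_and_maybe_resize` are hypothetical stand-ins written for this example, not SkyPilot classes, and the fake operation payloads are placeholders.

```python
class FakeResource:
    """Hypothetical stand-in for a GCP resource wrapper; it only records calls."""

    def __init__(self):
        self.resized = []

    def create_instances(self, base_config, labels, count):
        # Same (result, instance_id) tuple order that the diff unpacks.
        return [({"name": f"op-{i}"}, f"node-{i}") for i in range(count)]

    def resize_disk(self, base_config, instance_id):
        self.resized.append(instance_id)
        return {}


def create_and_maybe_resize(resource, base_config, labels, count):
    """Create `count` instances, resizing disks only for custom machine images."""
    results = resource.create_instances(base_config, labels, count)
    if "sourceMachineImage" in base_config:
        for _, instance_id in results:
            resource.resize_disk(base_config, instance_id)
    return {instance_id: result for result, instance_id in results}


custom = FakeResource()
created = create_and_maybe_resize(custom, {"sourceMachineImage": "img"}, {}, 2)

stock = FakeResource()
create_and_maybe_resize(stock, {}, {}, 2)
```

Because the resize happens after creation, a failed resize leaves the instance running with the image's original disk size, which is consistent with the diff logging a warning and returning `{}` rather than raising.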
2 changes: 1 addition & 1 deletion sky/templates/gcp-ray.yml.j2
Expand Up @@ -18,7 +18,7 @@ docker:
{%- endif %}

provider:
# We use a custom node provider for GCP to support instance stop and reuse.
# We use a custom node provider for GCP to create, stop and reuse instances.
Review comment (Collaborator): Nice!

type: external # type: gcp
module: sky.skylet.providers.gcp.GCPNodeProvider
region: {{region}}