[Provisioner] Support multi level performance disk #1812

Merged
merged 41 commits into from
Apr 16, 2023

cblmemo
Collaborator

@cblmemo cblmemo commented Mar 25, 2023

This PR introduces multi-level performance disks in all clouds. By setting disk_tier in sky.Resources to one of high, medium, or low, a user can request a disk with the corresponding performance level. The disk type currently chosen on each cloud and its price (for a 256 GB disk, as an example) are shown below.

disk_tier=low

| Cloud | Disk Type | Benchmarked Read Throughput | Benchmarked Read IOPS | Benchmarked Write Throughput | Benchmarked Write IOPS | Price (GiB/mo) | Price (IOPS/mo) | Total Price Each Month |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GCP | pd-standard | 13.87 MB/s | 211.63 | 37.17 MB/s | 567.13 | 0.048 | - | 12.288 |
| Azure | Standard_LRS | 861.92 MB/s | 13151.92 | 36.67 MB/s | 559.54 | - | - | 11.328 |
| AWS | standard | 20.33 MB/s | 310.14 | 282.19 MB/s | 4305.91 | 0.066 | - | 16.896 |

disk_tier=medium

| Cloud | Disk Type | Benchmarked Read Throughput | Benchmarked Read IOPS | Benchmarked Write Throughput | Benchmarked Write IOPS | Price (GiB/mo) | Price (IOPS/mo) | Total Price Each Month |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GCP | pd-balanced | 222.86 MB/s | 3400.58 | 222.25 MB/s | 3391.25 | 0.12 | - | 30.72 |
| Azure | Premium_LRS | 211.80 MB/s | 3231.88 | 176.12 MB/s | 2687.44 | - | - | 36.29 |
| AWS | gp3 with IOPS=3500 | 241.43 MB/s | 3683.87 | 199.15 MB/s | 3038.86 | 0.09 | 0.005 | 25.54 |

disk_tier=high

| Cloud | Disk Type | Benchmarked Read Throughput | Benchmarked Read IOPS | Benchmarked Write Throughput | Benchmarked Write IOPS | Price (GiB/mo) | Price (IOPS/mo) | Total Price Each Month |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GCP | pd-ssd | 342.91 MB/s | 5232.42 | 268.17 MB/s | 4091.91 | 0.203 | - | 51.968 |
| Azure | - | - | - | - | - | - | - | - |
| AWS | gp3 with IOPS=7000 | 323.54 MB/s | 4936.80 | 254.82 MB/s | 3888.22 | 0.09 | 0.005 | 43.04 |

disk_tier=high is currently disabled on Azure, since a high-performance disk in Azure cannot be launched as an OS disk.
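As a rough illustration of the choices listed in this description, the tier-to-disk-type mapping could be written as a lookup table. This is only a sketch: the names `DISK_TYPE_BY_TIER`, `AWS_GP3_IOPS`, and `disk_type_for` are hypothetical and are not taken from the PR's code; only the disk types and IOPS values come from the tables in this description.

```python
from typing import Optional

# Disk types per (cloud, tier), copied from the tables in this description.
# None marks a tier that is currently disabled on that cloud.
DISK_TYPE_BY_TIER = {
    'gcp': {'low': 'pd-standard', 'medium': 'pd-balanced', 'high': 'pd-ssd'},
    'azure': {'low': 'Standard_LRS', 'medium': 'Premium_LRS', 'high': None},
    'aws': {'low': 'standard', 'medium': 'gp3', 'high': 'gp3'},
}

# On AWS, the medium and high tiers differ only in the gp3 IOPS setting.
AWS_GP3_IOPS = {'medium': 3500, 'high': 7000}


def disk_type_for(cloud: str, tier: str) -> Optional[str]:
    """Returns the disk type for a cloud/tier pair, or None if disabled.

    Hypothetical helper for illustration; not the PR's actual API.
    """
    if tier not in ('low', 'medium', 'high'):
        raise ValueError(f'Invalid disk_tier: {tier!r}')
    return DISK_TYPE_BY_TIER[cloud][tier]
```

Returning `None` for Azure's high tier, rather than raising, leaves the caller free to skip that cloud and try another one.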

Benchmark command:

fio --name=sync_rand_64k_{read, write}s  --rw=rand{read, write}  --direct=1 --ioengine=sync --bs=64k --numjobs=4 --iodepth=128 --size=1G --group_reporting --directory=/tmp/
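The `{read, write}` notation above stands for two separate runs, one for random reads and one for random writes. A small sketch that expands it into the two concrete command lines follows; the helper name `fio_command` is hypothetical, while the flags are copied verbatim from the command above.

```python
import shlex


def fio_command(mode: str, directory: str = '/tmp/') -> str:
    """Builds the fio command line for one benchmark run.

    `mode` is 'read' or 'write'; all flags mirror the PR description.
    """
    if mode not in ('read', 'write'):
        raise ValueError(f'mode must be read or write, got {mode!r}')
    args = [
        'fio',
        f'--name=sync_rand_64k_{mode}s',
        f'--rw=rand{mode}',
        '--direct=1',
        '--ioengine=sync',
        '--bs=64k',
        '--numjobs=4',
        '--iodepth=128',
        '--size=1G',
        '--group_reporting',
        f'--directory={directory}',
    ]
    # shlex.join quotes arguments where needed so the result is shell-safe.
    return shlex.join(args)


for mode in ('read', 'write'):
    print(fio_command(mode))
```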

Tested (run the relevant ones):

  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@Michaelvll Michaelvll linked an issue Mar 25, 2023 that may be closed by this pull request
@Michaelvll
Collaborator

Awesome! Thanks for submitting the PR. Will get to it soon. Could you also add the performance (claimed by cloud, and benchmarked result) for each disk type to the table in your PR description, so we can have an idea of how close each tier would be? Thanks!

Collaborator

@Michaelvll Michaelvll left a comment

Awesome @cblmemo! Thanks for adding this in a short time! Left some initial comments. : )

Inline review threads (resolved): sky/backends/backend_utils.py; sky/resources.py (4 threads); sky/skylet/providers/azure/azure-vm-template.json (2 threads); sky/templates/aws-ray.yml.j2; sky/templates/gcp-ray.yml.j2
@Michaelvll Michaelvll added this to the v0.3 milestone Mar 26, 2023
@cblmemo
Collaborator Author

cblmemo commented Mar 27, 2023

> Awesome! Thanks for submitting the PR. Will get to it soon. Could you also add the performance (claimed by cloud, and benchmarked result) for each disk type to the table in your PR description, so we can have an idea of how close each tier would be? Thanks!

🫡 will upload results asap

@cblmemo cblmemo marked this pull request as ready for review April 5, 2023 03:04
Collaborator

@Michaelvll Michaelvll left a comment

Awesome! Thanks for fixing the PR @cblmemo! It looks pretty close to being ready. Left several comments mostly for better code style.
Several general comments:

  1. If we try to launch a VM with disk_type: high, will it successfully fail over through all the clouds, instead of failing outright when it reaches Azure or Lambda? We can try it with sky launch --gpus A100:8 to see whether it fails over through the clouds.
  2. Can we add a CLI as well? We should have a sky launch --disk-type option.
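The failover behavior asked about in the first point can be sketched as follows. Everything here is an assumption made for illustration: `SUPPORTED_TIERS`, `launch_with_failover`, and the `try_launch` callback are hypothetical names, and the empty tier set for Lambda is a guess based on this thread rather than a statement about the actual implementation.

```python
from typing import Callable, Dict, Iterable, Optional, Set

# Assumed tier support per cloud. Azure lacks 'high' per the PR description;
# Lambda is assumed to support no explicit tier.
SUPPORTED_TIERS: Dict[str, Set[str]] = {
    'gcp': {'low', 'medium', 'high'},
    'aws': {'low', 'medium', 'high'},
    'azure': {'low', 'medium'},
    'lambda': set(),
}


def launch_with_failover(clouds: Iterable[str], disk_tier: Optional[str],
                         try_launch: Callable[[str], bool]) -> str:
    """Tries each cloud in order and returns the first one that launches.

    Clouds that cannot provide the requested tier are skipped instead of
    aborting the whole launch. A disk_tier of None means no requirement.
    """
    for cloud in clouds:
        supported = SUPPORTED_TIERS.get(cloud, set())
        if disk_tier is not None and disk_tier not in supported:
            continue  # Skip this cloud and keep trying the rest.
        if try_launch(cloud):
            return cloud
    raise RuntimeError(f'No cloud could satisfy disk_tier={disk_tier!r}')
```

With this shape, a request for the high tier that reaches Azure or Lambda first moves on to the next candidate rather than failing immediately.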

Inline review threads (resolved): sky/clouds/aws.py (2 threads); sky/backends/backend_utils.py; sky/clouds/cloud.py (2 threads); sky/clouds/azure.py; sky/resources.py (3 threads); sky/templates/aws-ray.yml.j2
@concretevitamin
Member

Skimming the description, this looks great! One quick thing I noticed is disk_type=low maps to very different performance on the three clouds. Is this desirable or by design?

Asking because previous users have found that the default GCP disk is too slow. If we default to medium, this may be fine.

@Michaelvll
Collaborator

> Skimming the description, this looks great! One quick thing I noticed is disk_type=low maps to very different performance on the three clouds. Is this desirable or by design?

Please correct me if I am wrong @cblmemo. That is the best compromise we can reach to match the performance across all three clouds, as the lowest-tier disks on those clouds perform very differently.
It should be fine, as we only guarantee a lower bound on the disk performance.

> Asking because previous users have found that the default GCP disk is too slow. If we default to medium, this may be fine.

Yep, good idea. Defaulting to medium is a good choice.
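To make the "lower bound" point concrete, the minimum benchmarked throughput across clouds can be computed per tier from the tables in the PR description. This is a sketch over those reported numbers only; the names `BENCHMARKS_MBPS` and `lower_bound` are hypothetical.

```python
from typing import Tuple

# (read MB/s, write MB/s) per tier and cloud, from the PR description.
# Azure has no entry for 'high' because that tier is disabled there.
BENCHMARKS_MBPS = {
    'low': {
        'gcp': (13.87, 37.17),
        'azure': (861.92, 36.67),
        'aws': (20.33, 282.19),
    },
    'medium': {
        'gcp': (222.86, 222.25),
        'azure': (211.80, 176.12),
        'aws': (241.43, 199.15),
    },
    'high': {
        'gcp': (342.91, 268.17),
        'aws': (323.54, 254.82),
    },
}


def lower_bound(tier: str) -> Tuple[float, float]:
    """Returns the worst-case (read, write) throughput across clouds."""
    reads, writes = zip(*BENCHMARKS_MBPS[tier].values())
    return min(reads), min(writes)


for tier in ('low', 'medium', 'high'):
    read_mbps, write_mbps = lower_bound(tier)
    print(f'{tier}: read >= {read_mbps} MB/s, write >= {write_mbps} MB/s')
```

On these numbers the low tier's floor is set by GCP for reads and by Azure for writes, which is why the same tier can look very different from one cloud to another while still meeting its guarantee.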

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for the quick fix @cblmemo! The latest code looks good to me. I am testing it for now.

Several questions:

  1. should we update sky/utils/schemas.py to ensure the resources field can take the disk_tier argument?
  2. If we have the current PR merged, will [Optimizer] Add GCP disk price to the optimizer #1708, [Optimizer] Add Azure disk price to the optimizer #1744 need to be updated?
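For the first question, one possible shape of such a schema entry is sketched below. This is a guess at a JSON-Schema-style fragment, not the contents of sky/utils/schemas.py; `RESOURCES_SCHEMA_FRAGMENT` and `validate_disk_tier` are hypothetical names, and the check is hand-rolled so the example has no third-party dependency.

```python
# Hypothetical JSON-Schema-style fragment for the resources field.
RESOURCES_SCHEMA_FRAGMENT = {
    'type': 'object',
    'properties': {
        'disk_tier': {
            'type': 'string',
            'enum': ['high', 'medium', 'low'],
        },
    },
}


def validate_disk_tier(resources_config: dict) -> None:
    """Raises ValueError if disk_tier is present but not an allowed value.

    A missing disk_tier is accepted so that existing configs keep working.
    """
    allowed = RESOURCES_SCHEMA_FRAGMENT['properties']['disk_tier']['enum']
    tier = resources_config.get('disk_tier')
    if tier is not None and tier not in allowed:
        raise ValueError(
            f'disk_tier must be one of {allowed}, got {tier!r}')
```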

Inline review threads (resolved): docs/source/reference/yaml-spec.rst; sky/clouds/aws.py; sky/clouds/cloud.py; sky/cli.py (4 threads)
@cblmemo
Collaborator Author

cblmemo commented Apr 10, 2023

> Thanks for the quick fix @cblmemo! The latest code looks good to me. I am testing it for now.
>
> Several questions:
>
>   1. should we update sky/utils/schemas.py to ensure the resources field can take the disk_tier argument?
>   2. If we have the current PR merged, will [Optimizer] Add GCP disk price to the optimizer #1708, [Optimizer] Add Azure disk price to the optimizer #1744 need to be updated?

For the second question, yes, we need to add multi-tier prices into the disk price calculation. The previous implementation just fetches the default tier disk type's price and uses it to calculate the total price.
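A tier-aware price calculation could look roughly like the sketch below, using the per-GiB and per-IOPS prices from the PR description. The function name and parameters are hypothetical. The 3,000 free baseline IOPS for AWS gp3 is an assumption that is not stated in this thread, though it is consistent with the totals reported in the description.

```python
# Assumption: gp3 includes 3,000 IOPS at no extra charge, so only the IOPS
# provisioned above that baseline are billed.
GP3_FREE_IOPS = 3000


def monthly_disk_price(size_gib: float,
                       price_per_gib: float,
                       iops: int = 0,
                       price_per_iops: float = 0.0,
                       free_iops: int = 0) -> float:
    """Monthly price: capacity cost plus cost of IOPS above the free tier."""
    billable_iops = max(0, iops - free_iops)
    return size_gib * price_per_gib + billable_iops * price_per_iops


# 256 GiB examples matching the totals in the PR description.
gcp_low = monthly_disk_price(256, 0.048)
aws_medium = monthly_disk_price(256, 0.09, iops=3500, price_per_iops=0.005,
                                free_iops=GP3_FREE_IOPS)
aws_high = monthly_disk_price(256, 0.09, iops=7000, price_per_iops=0.005,
                              free_iops=GP3_FREE_IOPS)
print(round(gcp_low, 3), round(aws_medium, 2), round(aws_high, 2))
```

Azure is left out because the description gives no per-GiB price for it, only monthly totals.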

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for the quick fix! The code looks pretty good with a small nit.
A problem I met during testing:

  1. sky launch --disk-tier medium. This errors out because the Azure instance_type does not support the disk tier, which is unexpected, since we should let Azure create a CPU instance. It also seems inconsistent that running sky launch directly produces no error. Is it safe to let Azure default to medium for CPU instances?

Inline review threads (resolved): sky/clouds/aws.py; sky/clouds/azure.py; sky/clouds/gcp.py
@Michaelvll Michaelvll self-requested a review April 14, 2023 19:55
@Michaelvll Michaelvll modified the milestones: v0.3, llm/training Apr 14, 2023
Collaborator

@Michaelvll Michaelvll left a comment

Thank you for adding this fantastic feature @cblmemo! This will significantly improve the performance of large model training.

Several TODOs (maybe in the future PR, please file issues for them if we decide to leave it to the future):

  • Identify the performance of Lambda's disks and enable it when the user specifies the corresponding disk_tier.
  • Add disk cost into the optimizer

Inline review threads (resolved): sky/clouds/aws.py (2 threads); sky/clouds/azure.py (2 threads); sky/clouds/service_catalog/aws_catalog.py; sky/clouds/service_catalog/lambda_catalog.py; sky/clouds/service_catalog/gcp_catalog.py; tests/test_smoke.py (2 threads)
Collaborator

@Michaelvll Michaelvll left a comment

Thanks for the quick fix @cblmemo! It looks good to me. Let's merge this once the master branch is merged in and all smoke tests pass.

@cblmemo
Collaborator Author

cblmemo commented Apr 16, 2023

I've resolved the merge conflict and passed all smoke tests. I think it is ready to merge. 🫡

@Michaelvll
Collaborator

Michaelvll commented Apr 16, 2023

Thanks for the awesome work and the quick fixes @cblmemo! This is an important feature for workloads that require high-performance disks, such as LLM training.
The PR looks good to me. I also ran the smoke tests on GCP, manually tried several combinations of options with the disk tier, and launched a VM on Azure. They all look good! Merging the PR now.

Successfully merging this pull request may close these issues.

[Feature Request] Different Types of Disk Storage