[Provisioner] Support multi level performance disk #1812

cblmemo · 2023-03-25T17:45:35Z

This PR introduces multi-level performance disks on all clouds. By setting disk_tier in sky.Resources to one of high, medium, or low, a user can choose a disk with the corresponding performance level. Each cloud's current disk choice and price (using a 256 GB disk as an example) are shown below.

disk_tier=low

| Cloud | Disk Type | Benchmarked Read Throughput | Benchmarked Read IOPS | Benchmarked Write Throughput | Benchmarked Write IOPS | Price (per GiB/mo) | Price (per IOPS/mo) | Total Monthly Price |
|---|---|---|---|---|---|---|---|---|
| GCP | pd-standard | 13.87 MB/s | 211.63 | 37.17 MB/s | 567.13 | 0.048 | - | 12.288 |
| Azure | Standard_LRS | 861.92 MB/s | 13151.92 | 36.67 MB/s | 559.54 | - | - | 11.328 |
| AWS | standard | 20.33 MB/s | 310.14 | 282.19 MB/s | 4305.91 | 0.066 | - | 16.896 |

disk_tier=medium

| Cloud | Disk Type | Benchmarked Read Throughput | Benchmarked Read IOPS | Benchmarked Write Throughput | Benchmarked Write IOPS | Price (per GiB/mo) | Price (per IOPS/mo) | Total Monthly Price |
|---|---|---|---|---|---|---|---|---|
| GCP | pd-balanced | 222.86 MB/s | 3400.58 | 222.25 MB/s | 3391.25 | 0.12 | - | 30.72 |
| Azure | Premium_LRS | 211.80 MB/s | 3231.88 | 176.12 MB/s | 2687.44 | - | - | 36.29 |
| AWS | gp3 with IOPS=3500 | 241.43 MB/s | 3683.87 | 199.15 MB/s | 3038.86 | 0.09 | 0.005 | 25.54 |

disk_tier=high

| Cloud | Disk Type | Benchmarked Read Throughput | Benchmarked Read IOPS | Benchmarked Write Throughput | Benchmarked Write IOPS | Price (per GiB/mo) | Price (per IOPS/mo) | Total Monthly Price |
|---|---|---|---|---|---|---|---|---|
| GCP | pd-ssd | 342.91 MB/s | 5232.42 | 268.17 MB/s | 4091.91 | 0.203 | - | 51.968 |
| Azure | - | - | - | - | - | - | - | - |
| AWS | gp3 with IOPS=7000 | 323.54 MB/s | 4936.80 | 254.82 MB/s | 3888.22 | 0.09 | 0.005 | 43.04 |

Azure disk_tier=high is currently disabled because a high-performance disk in Azure cannot be launched as an OS disk.
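
The monthly totals for GCP and AWS in the tables above can be reproduced from the per-GiB and per-IOPS prices. The following is a minimal sketch, not code from this PR: the function and parameter names are hypothetical, the disk size is taken as 256 GiB, and the 3000-IOPS free baseline for gp3 is an assumption inferred from the listed totals rather than something stated here.

```python
def monthly_disk_price(size_gib: float,
                       price_per_gib: float,
                       provisioned_iops: int = 0,
                       price_per_iops: float = 0.0,
                       free_iops: int = 0) -> float:
    """Capacity charge plus a charge for IOPS above the free baseline."""
    billable_iops = max(provisioned_iops - free_iops, 0)
    return size_gib * price_per_gib + billable_iops * price_per_iops


SIZE_GIB = 256
GP3_FREE_IOPS = 3000  # assumption: inferred from the totals in the tables

# Each entry: (label, total listed in the table, computed total)
rows = [
    ("GCP pd-standard", 12.288, monthly_disk_price(SIZE_GIB, 0.048)),
    ("GCP pd-balanced", 30.72, monthly_disk_price(SIZE_GIB, 0.12)),
    ("GCP pd-ssd", 51.968, monthly_disk_price(SIZE_GIB, 0.203)),
    ("AWS standard", 16.896, monthly_disk_price(SIZE_GIB, 0.066)),
    ("AWS gp3 IOPS=3500", 25.54,
     monthly_disk_price(SIZE_GIB, 0.09, 3500, 0.005, GP3_FREE_IOPS)),
    ("AWS gp3 IOPS=7000", 43.04,
     monthly_disk_price(SIZE_GIB, 0.09, 7000, 0.005, GP3_FREE_IOPS)),
]
for label, listed, computed in rows:
    print(f"{label}: listed {listed}, computed {computed:.3f}")
```

Azure is left out because its tables list no per-GiB price, only a monthly total.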

Benchmark command (run once for reads and once for writes, substituting read or write for {read, write}):

fio --name=sync_rand_64k_{read, write}s  --rw=rand{read, write}  --direct=1 --ioengine=sync --bs=64k --numjobs=4 --iodepth=128 --size=1G --group_reporting --directory=/tmp/
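
The throughput and IOPS columns in the benchmark results are two views of the same measurement: with --bs=64k every I/O moves 64 KiB, so throughput is IOPS multiplied by the block size. A minimal sketch of that arithmetic follows, assuming the reported MB/s figures use decimal megabytes (10^6 bytes); the helper name is illustrative and is not part of fio or SkyPilot.

```python
BLOCK_SIZE_BYTES = 64 * 1024  # matches --bs=64k in the fio command


def iops_to_mb_per_s(iops: float, block_size: int = BLOCK_SIZE_BYTES) -> float:
    """Convert IOPS at a fixed block size to MB/s (1 MB = 10**6 bytes)."""
    return iops * block_size / 1e6


# (disk, benchmarked read IOPS, benchmarked read throughput in MB/s),
# copied from the tables in the PR description.
samples = [
    ("GCP pd-standard", 211.63, 13.87),
    ("GCP pd-balanced", 3400.58, 222.86),
    ("AWS gp3 with IOPS=7000", 4936.80, 323.54),
]
for disk, iops, reported in samples:
    computed = iops_to_mb_per_s(iops)
    print(f"{disk}: computed {computed:.2f} MB/s, reported {reported} MB/s")
```

This is only a consistency check on the reported numbers; it says nothing about how close each disk comes to the performance claimed by its cloud.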

Tested (run the relevant ones):

Any manual or new tests for this PR (please specify below)
All smoke tests: pytest tests/test_smoke.py
Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

Michaelvll · 2023-03-25T19:31:29Z

Awesome! Thanks for submitting the PR. Will get to it soon. Could you also add the performance (claimed by cloud, and benchmarked result) for each disk type to the table in your PR description, so we can have an idea of how close each tier would be? Thanks!

Michaelvll

Awesome @cblmemo! Thanks for adding this in a short time! Left some initial comments. : )

sky/backends/backend_utils.py

sky/resources.py

sky/skylet/providers/azure/azure-vm-template.json

sky/templates/aws-ray.yml.j2

sky/templates/gcp-ray.yml.j2

cblmemo · 2023-03-27T01:44:35Z

> Awesome! Thanks for submitting the PR. Will get to it soon. Could you also add the performance (claimed by cloud, and benchmarked result) for each disk type to the table in your PR description, so we can have an idea of how close each tier would be? Thanks!

🫡 will upload results asap

Michaelvll

Awesome! Thanks for fixing the PR @cblmemo! It looks pretty close to being ready. Left several comments mostly for better code style.
Several general comments:

If we try to launch a VM with disk_type: high, will it fail over across all the clouds instead of failing outright when it reaches Azure or Lambda? We can test this with sky launch --gpus A100:8 to see whether it fails over across the clouds.
Can we add a CLI option as well? sky launch should accept --disk-type.
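
The failover behavior asked about above can be sketched as follows. This is not SkyPilot's provisioner code: the support table, the exception class, and the function names are all hypothetical, and the per-cloud tiers only mirror what this thread states (Azure has no high tier, and Lambda is modeled as supporting no explicit tier).

```python
from typing import Dict, List, Optional, Set

# Hypothetical support table, based only on statements in this discussion.
SUPPORTED_TIERS: Dict[str, Set[str]] = {
    "gcp": {"low", "medium", "high"},
    "aws": {"low", "medium", "high"},
    "azure": {"low", "medium"},
    "lambda": set(),
}


class DiskTierNotSupportedError(Exception):
    """Raised when a cloud cannot provide the requested disk tier."""


def check_disk_tier(cloud: str, disk_tier: Optional[str]) -> None:
    # No explicit tier means the cloud may use its own default disk.
    if disk_tier is None:
        return
    if disk_tier not in SUPPORTED_TIERS[cloud]:
        raise DiskTierNotSupportedError(
            f"{cloud} does not support disk tier {disk_tier!r}")


def candidate_clouds(clouds: List[str], disk_tier: Optional[str]) -> List[str]:
    """Keep clouds that can satisfy the tier; skip the rest instead of aborting."""
    usable = []
    for cloud in clouds:
        try:
            check_disk_tier(cloud, disk_tier)
        except DiskTierNotSupportedError as e:
            print(f"Skipping {cloud}: {e}")
            continue
        usable.append(cloud)
    return usable


print(candidate_clouds(["azure", "lambda", "gcp", "aws"], "high"))
```

The point of the sketch is that an unsupported tier on one cloud is treated as a reason to move on to the next candidate, not as a fatal error for the whole launch.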

sky/clouds/aws.py

sky/backends/backend_utils.py

sky/clouds/cloud.py

sky/clouds/azure.py

sky/resources.py

sky/templates/aws-ray.yml.j2

sky/clouds/cloud.py

sky/resources.py

concretevitamin · 2023-04-09T16:24:00Z

Skimming the description, this looks great! One quick thing I noticed is disk_type=low maps to very different performance on the three clouds. Is this desirable or by design?

Asking because previous users have found that the default GCP disk is too slow. If we default to medium, this may be fine.

Michaelvll · 2023-04-09T20:51:22Z

> Skimming the description, this looks great! One quick thing I noticed is disk_type=low maps to very different performance on the three clouds. Is this desirable or by design?

Please correct me if I am wrong @cblmemo. That is the best compromise we can reach to match performance across all three clouds, as the slowest disks on those clouds perform very differently.
It should be fine, as we just guarantee the lower bound of the disk performance.

> Asking because previous users have found that the default GCP disk is too slow. If we default to medium, this may be fine.

Yep, good idea. Defaulting to medium is a good choice.

Michaelvll

Thanks for the quick fix @cblmemo! The latest code looks good to me. I am testing it now.

Several questions:

Should we update sky/utils/schemas.py to ensure the resources field can take the disk_tier argument?
If we have the current PR merged, will [Optimizer] Add GCP disk price to the optimizer #1708, [Optimizer] Add Azure disk price to the optimizer #1744 need to be updated?
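
For the first question, the kind of check a schema entry would enforce can be sketched in plain Python. The fragment below is hypothetical and does not reproduce sky/utils/schemas.py; it only shows restricting disk_tier to the three values named in the PR description while keeping the field optional.

```python
from typing import Any, Dict

# Hypothetical schema fragment for the resources section (illustration only).
RESOURCES_SCHEMA_FRAGMENT: Dict[str, Dict[str, Any]] = {
    "disk_tier": {"type": "string", "enum": ["high", "medium", "low"]},
}


def validate_resources(config: Dict[str, Any]) -> None:
    """Raise ValueError if disk_tier is present but not an allowed value."""
    value = config.get("disk_tier")
    if value is None:
        return  # optional field: the caller falls back to its default tier
    allowed = RESOURCES_SCHEMA_FRAGMENT["disk_tier"]["enum"]
    if not isinstance(value, str) or value not in allowed:
        raise ValueError(
            f"Invalid disk_tier {value!r}; expected one of {allowed}")


validate_resources({"disk_tier": "medium"})  # accepted
validate_resources({})                       # accepted: field omitted
try:
    validate_resources({"disk_tier": "fast"})
except ValueError as e:
    print(e)
```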

docs/source/reference/yaml-spec.rst

sky/clouds/aws.py

sky/clouds/cloud.py

sky/cli.py

cblmemo · 2023-04-10T02:40:54Z

> Thanks for the quick fix @cblmemo! The latest code looks good to me. I am testing it now.
>
> Several questions:
>
> Should we update sky/utils/schemas.py to ensure the resources field can take the disk_tier argument?
>
> If we have the current PR merged, will [Optimizer] Add GCP disk price to the optimizer #1708, [Optimizer] Add Azure disk price to the optimizer #1744 need to be updated?

For the second question, yes, we need to add multi-tier prices into the disk price calculation. The previous implementation just fetches the default tier disk type's price and uses it to calculate the total price.

Michaelvll

Thanks for the quick fix! The code looks pretty good with a small nit.
A problem I ran into during testing:

sky launch --disk-tier medium. This errors out because the Azure instance_type does not support the disk tier, which is unexpected, as we should let Azure create a CPU instance. It also seems inconsistent that running sky launch directly produces no error. Is it safe to let Azure default to medium for CPU instances?

sky/clouds/aws.py

sky/clouds/azure.py

sky/clouds/gcp.py

Michaelvll

Thank you for adding this fantastic feature @cblmemo! This will significantly improve the performance of large model training.

Several TODOs (maybe for a future PR; please file issues for them if we decide to defer them):

Identify the performance of Lambda's disk and enable it when the user specifies the corresponding disk_tier.
Add disk cost into the optimizer

sky/clouds/aws.py

sky/clouds/azure.py

sky/clouds/service_catalog/aws_catalog.py

sky/clouds/service_catalog/lambda_catalog.py

sky/clouds/service_catalog/gcp_catalog.py

tests/test_smoke.py

Michaelvll

Thanks for the quick fix @cblmemo! It looks good to me. Let's merge this once the master branch is merged in and all smoke tests pass.

Co-authored-by: Zhanghao Wu <[email protected]>

cblmemo · 2023-04-16T15:36:23Z

I've resolved the merge conflict and passed all smoke tests. I think it is ready to merge. 🫡

Michaelvll · 2023-04-16T21:30:14Z

Thanks for the awesome work and the quick fixes @cblmemo! This is an important feature for workloads that require high-performance disks, such as LLM training.
The PR looks good to me. I also ran the smoke tests on GCP, manually tried several combinations of the options with the disk tier, and launched a VM on Azure. They all look good! Merging the PR now.

Michaelvll linked an issue Mar 25, 2023 that may be closed by this pull request

[Feature Request] Different Types of Disk Storage #1272

Closed

Michaelvll reviewed Mar 26, 2023

Michaelvll added this to the v0.3 milestone Mar 26, 2023

cblmemo marked this pull request as ready for review April 5, 2023 03:04

Michaelvll reviewed Apr 8, 2023

Michaelvll reviewed Apr 9, 2023

Michaelvll reviewed Apr 10, 2023

Michaelvll self-requested a review April 14, 2023 19:55

Michaelvll modified the milestones: v0.3, llm/training Apr 14, 2023

Michaelvll approved these changes Apr 15, 2023

This was referenced Apr 15, 2023

[Provisioner] Add disk_tier support for lambda #1866

Closed

[Provisioner] Add multi-tier disk price in optimizer #1867

Closed

Michaelvll approved these changes Apr 15, 2023

cblmemo added 7 commits April 16, 2023 02:15

GCP & AWS finished

fe86508

Azure finished

cfbe585

reformat code

d945c76

fix some of problem mentioned in PR discussion

8e2dbc7

fix wrong cloud disk type check and modify default disk type behaviour

dbf491c

fix aws bug & add type notation for disk

868e681

add aws throughput & reformat azure

81e5f29

cblmemo and others added 22 commits April 16, 2023 02:17

fix typo

b6c734b

Update docs/source/reference/yaml-spec.rst

6b41a10

Co-authored-by: Zhanghao Wu <[email protected]>

Update sky/cli.py

12ddabf

Co-authored-by: Zhanghao Wu <[email protected]>

Update sky/cli.py

ee25c2c

Co-authored-by: Zhanghao Wu <[email protected]>

Update sky/cli.py

e108a19

Co-authored-by: Zhanghao Wu <[email protected]>

Update sky/cli.py

b94419b

Co-authored-by: Zhanghao Wu <[email protected]>

default to medium tier

9f964ec

remove unnecessary API

61d9ba5

reformat code

6c8d757

update resources schema

bc45106

fix None->default_tier handle code style

f4b9f47

add auto selection for instance type corresponding to disk_tier

86d563e

use default Ds series to enable disk_tier=medium in Azure

4edfd5c

use s-series in default E too

471c2c1

fix unittest on default instance type

34e23e5

add aws unittest

c590fa0

add gcp unittest

bcdb529

fix None is s series bug

ddd8c5d

add azure unittest

654d9b6

better code style

20a8c91

quick workaround for basic_a is s series

1b3165f

reimplement Azure._is_s_series

8ab4d34

cblmemo force-pushed the multi-disk-type branch from 4eeac4c to 8ab4d34 Compare April 16, 2023 02:18

fix typos

f14d07f

Michaelvll merged commit 1b89f5e into skypilot-org:master Apr 16, 2023

Michaelvll mentioned this pull request Jun 12, 2023

[OCI] Support configurable boot volume size (disk_size) and performance (disk_tier) #2067

Merged

cblmemo deleted the multi-disk-type branch July 12, 2023 17:56

cblmemo mentioned this pull request Aug 27, 2024

[Core] Disk tier ultra for AWS and GCP #3860

Merged
