[Provisioner] Support multi level performance disk #1812
Conversation
Awesome! Thanks for submitting the PR. Will get to it soon. Could you also add the performance (claimed by the cloud, and the benchmarked result) for each disk type to the table in your PR description, so we can have an idea of how close each tier would be? Thanks!
Awesome @cblmemo! Thanks for adding this in a short time! Left some initial comments. : )
🫡 will upload results asap
Awesome! Thanks for fixing the PR @cblmemo! It looks pretty close to being ready. Left several comments mostly for better code style.
Several general comments:
- If we try to launch a VM with `disk_tier: high`, will it successfully fail over through all the clouds, instead of directly failing when we meet Azure or Lambda? We can actually try it by `sky launch --gpus A100:8` to see if it will fail over through the clouds.
- Can we add a CLI option as well? We should have a `sky launch --disk-tier` option.
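To illustrate the failover concern above, here is a minimal sketch of tier-based disk selection with per-cloud failover. The cloud names, disk-type mapping, and helper function are all illustrative assumptions, not SkyPilot's actual implementation:

```python
# Hypothetical mapping from cloud to the disk type offered per tier.
# None means the cloud has no disk matching that tier (e.g. Azure's
# high-performance disk cannot be used as an OS disk).
_TIER_TO_DISK = {
    'gcp':   {'low': 'pd-standard', 'medium': 'pd-balanced', 'high': 'pd-ssd'},
    'aws':   {'low': 'standard', 'medium': 'gp3', 'high': 'io2'},
    'azure': {'low': 'Standard_LRS', 'medium': 'Premium_LRS', 'high': None},
}


def clouds_supporting(tier):
    """Return the clouds that can satisfy the requested tier, so the
    provisioner can skip (rather than fail on) clouds lacking a match."""
    return [cloud for cloud, tiers in _TIER_TO_DISK.items() if tiers.get(tier)]
```

With this shape, requesting `disk_tier: high` would simply exclude Azure from the failover candidates instead of erroring out when Azure is reached.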
Skimming the description, this looks great! One quick thing I noticed is […]. Asking because previous users have found that the default GCP disk is too slow. If we default to […]?
Please correct me if I am wrong @cblmemo. That is the best compromise we can reach to match performance across all three clouds, since the worst disks on those clouds have very different performance.
Yep, good idea. Defaulting to medium is a good choice.
Thanks for the quick fix @cblmemo! The latest code looks good to me. I am testing it for now.
Several questions:
- Should we update `sky/utils/schemas.py` to ensure the `resources` field can take the `disk_tier` argument?
- If we have the current PR merged, will [Optimizer] Add GCP disk price to the optimizer #1708 and [Optimizer] Add Azure disk price to the optimizer #1744 need to be updated?
For the second question, yes, we need to add multi-tier prices into the disk price calculation. The previous implementation just fetches the default-tier disk type's price and uses it to calculate the total price.
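The per-tier pricing change could be sketched roughly as below. The rates are made-up placeholders (real per-GB prices differ by cloud and region), and the function name is hypothetical:

```python
# Illustrative per-tier disk pricing; the $/GB-month rates below are
# placeholders, not real cloud prices.
DISK_PRICE_PER_GB_MONTH = {'low': 0.04, 'medium': 0.10, 'high': 0.17}


def disk_cost(tier, size_gb, hours):
    """Prorate a monthly per-GB disk price to an hourly-billed cost,
    keyed on the requested tier instead of a single default disk type."""
    monthly = DISK_PRICE_PER_GB_MONTH[tier] * size_gb
    return monthly / (30 * 24) * hours
```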
Thanks for the quick fix! The code looks pretty good, with one small nit.
A problem I met during testing: `sky launch --disk-tier medium`. This errors out because the Azure `instance_type` does not support the disk tier, which is unexpected, as we should let Azure create a CPU instance. Also, it seems inconsistent that if I use `sky launch` directly, there is no error. Is it safe to let Azure default to medium for CPU instances?
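One way to avoid that hard error is a best-effort fallback: if the chosen instance type cannot attach the requested tier, degrade to the closest supported one instead of failing. This is only a sketch of the idea; the ordering and function are assumptions, not the PR's actual fix:

```python
# Tiers ordered from best to worst for fallback purposes (illustrative).
_TIER_ORDER = ['high', 'medium', 'low']


def best_supported_tier(requested, supported):
    """Return the requested tier if the instance type supports it,
    otherwise the next lower supported tier; None if nothing fits."""
    start = _TIER_ORDER.index(requested)
    for tier in _TIER_ORDER[start:]:
        if tier in supported:
            return tier
    return None
```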
Thank you for adding this fantastic feature @cblmemo! This will significantly improve the performance of large model training.
Several TODOs (maybe in a future PR; please file issues for them if we decide to leave them for later):
- Identify the performance of Lambda's disks and enable Lambda when the user specifies the corresponding `disk_tier`.
- Add disk cost into the optimizer.
Thanks for the quick fix @cblmemo! It looks good to me. Let's merge this once the master branch is merged in and all smoke tests pass.
Co-authored-by: Zhanghao Wu <[email protected]>
I've resolved the merge conflict and passed all smoke tests. I think it is ready to merge. 🫡
Thanks for the awesome work and the quick fixes @cblmemo! This is an important feature for workloads that require high-performance disks, such as LLM training.
This PR introduces multi-level performance disks in all clouds. By setting `disk_tier` in `sky.Resources` to one of `high`, `medium`, and `low`, a user can use a disk with custom performance. Each cloud's current disk choice and price (for 256 GB, as an example) are shown below.

| Cloud | `disk_tier=low` | `disk_tier=medium` | `disk_tier=high` |
|-------|-----------------|--------------------|------------------|
| Azure | … | … | … |

`disk_tier=high` is disabled for Azure for now, since a high-performance disk in Azure cannot be launched as an OS disk.

Benchmark command:
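The `disk_tier` field described above could be used in a task spec along these lines. This is a sketch; the `cloud` and `disk_size` values are just example placeholders, and exact field placement follows SkyPilot's `resources` section:

```yaml
# Hypothetical task spec exercising the new disk_tier field.
resources:
  cloud: gcp        # example cloud; any supported cloud works
  disk_size: 256    # GB, matching the pricing example above
  disk_tier: high   # one of: low, medium, high
```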
Tested (run the relevant ones):
- [ ] `pytest tests/test_smoke.py`
- [ ] `pytest tests/test_smoke.py::test_fill_in_the_name`
- [ ] `bash tests/backward_compatibility_tests.sh`