Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimizing & Provisioning Retries at the granularity of regions/zones #975

Merged
merged 108 commits into from
Dec 29, 2022

Conversation

WoosukKwon
Copy link
Collaborator

@WoosukKwon WoosukKwon commented Jul 15, 2022

This PR closes #930

For on-demand instances, the optimizer selects the cloud region (e.g., AWS us-west-1) to run the task on. And a provisioning request is made for that region. If it fails, Sky re-runs the optimizer and retries for the next cheapest cloud region (e.g., it can be GCP us-east1).

For spot instances, the optimizer additionally decides the availability zone (e.g., AWS us-west-1a). A provisioning request is made for that zone. If it fails, Sky re-runs the optimizer and retries for other zones (possibly in other clouds and regions).

This PR also fixes many bugs when a region is specified with sky launch. For example,

  • sky launch "" -c test --gpus V100 --cloud aws --region us-west-1: us-west-1 does not have p3 instances. However, Sky does not return errors before launching.
  • sky launch "" -c test --gpus K80 --region southcentralus: The price of Standard_NC6_Promo in the southcentralus region is $ 0.22 / hour, but the optimizer says it is $ 0.18/hour, which is the price of the instance in other regions.

Tests:

  • Smoke tests
  • sky launch "" --use-spot --cloud aws --gpus A100:8
  • sky tpunode --region europe-west4 (NOTE: this command fails because our GCP catalog does not have any N1 machines for the regions outside the US.)
  • sky tpunode --region europe-west4 --tpu-vm (NOTE: this command succeeds because TPU VMs do not require N1 machines.)

@WoosukKwon WoosukKwon requested a review from Michaelvll July 15, 2022 08:10
@Michaelvll
Copy link
Collaborator

Thanks for the fantastic and quick work! Just scanned the PR and have several questions:

  1. What are the bugs fixed in this PR? It would be nice to have a record here in the description.
  2. It would be nice to list some tests in the description.

I will take a deeper look at the modifications soon. : )

Future TODO: We may want to consider adding the sky storages with object store to the input automatically so that the optimizer can take the data movement of the file mounts into consideration.

@WoosukKwon
Copy link
Collaborator Author

Future TODO: We may want to consider adding the sky storages with object store to the input automatically so that the optimizer can take the data movement of the file mounts into consideration.

@Michaelvll That's a great idea. Especially in the copy mode, Sky can do that since it knows the amount of data in the bucket. But the problem is 1) I'm not sure if Sky can accurately estimate the egress cost (because I think the formula is pretty complex), 2) Sky should also consider the VM idle time caused by the data egress, which is hard to estimate.

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice @WoosukKwon! Did a quick pass, except for optimizer.py.


A high-level question:

» sky cpunode
I 08-10 19:57:09 optimizer.py:602] == Optimizer ==
I 08-10 19:57:09 optimizer.py:614] Target: minimizing cost
I 08-10 19:57:09 optimizer.py:625] Estimated cost: $0.4 / hour
I 08-10 19:57:09 optimizer.py:625]
I 08-10 19:57:09 optimizer.py:677] Considered resources (1 node):
I 08-10 19:57:09 optimizer.py:705] ------------------------------------------------------------------------------
I 08-10 19:57:09 optimizer.py:705]  CLOUD   INSTANCE         ACCELERATORS   REGION (ZONE)    COST ($)   CHOSEN
I 08-10 19:57:09 optimizer.py:705] ------------------------------------------------------------------------------
I 08-10 19:57:09 optimizer.py:705]  AWS     m6i.2xlarge      -              us-east-1        0.38          ✔
I 08-10 19:57:09 optimizer.py:705]  AWS     m6i.2xlarge      -              us-east-2        0.38
I 08-10 19:57:09 optimizer.py:705]  AWS     m6i.2xlarge      -              us-west-2        0.38
I 08-10 19:57:09 optimizer.py:705]  Azure   Standard_D8_v4   -              eastus           0.38
I 08-10 19:57:09 optimizer.py:705]  Azure   Standard_D8_v4   -              eastus2          0.38
I 08-10 19:57:09 optimizer.py:705]  Azure   Standard_D8_v4   -              northcentralus   0.38
I 08-10 19:57:09 optimizer.py:705]  Azure   Standard_D8_v4   -              westus2          0.38
I 08-10 19:57:09 optimizer.py:705]  Azure   Standard_D8_v4   -              centralus        0.43
I 08-10 19:57:09 optimizer.py:705]  AWS     m6i.2xlarge      -              us-west-1        0.45
I 08-10 19:57:09 optimizer.py:705]  Azure   Standard_D8_v4   -              westus           0.45
I 08-10 19:57:09 optimizer.py:705]  Azure   Standard_D8_v4   -              southcentralus   0.46
I 08-10 19:57:09 optimizer.py:705]  Azure   Standard_D8_v4   -              westcentralus    0.46
I 08-10 19:57:09 optimizer.py:705]  GCP     n1-highmem-8     -              us-west1         0.47
I 08-10 19:57:09 optimizer.py:705]  GCP     n1-highmem-8     -              us-central1      0.47
I 08-10 19:57:09 optimizer.py:705]  GCP     n1-highmem-8     -              us-east1         0.47
I 08-10 19:57:09 optimizer.py:705]  GCP     n1-highmem-8     -              us-east4         0.47
I 08-10 19:57:09 optimizer.py:705]  GCP     n1-highmem-8     -              us-west2         0.47
I 08-10 19:57:09 optimizer.py:705]  GCP     n1-highmem-8     -              us-west4         0.47
I 08-10 19:57:09 optimizer.py:705] ------------------------------------------------------------------------------

» sky gpunode --gpus V100                                                       1 ↵
I 08-10 19:57:34 optimizer.py:602] == Optimizer ==
I 08-10 19:57:34 optimizer.py:614] Target: minimizing cost
I 08-10 19:57:34 optimizer.py:625] Estimated cost: $3.0 / hour
I 08-10 19:57:34 optimizer.py:625]
I 08-10 19:57:34 optimizer.py:677] Considered resources (1 node):
I 08-10 19:57:34 optimizer.py:705] --------------------------------------------------------------------------------
I 08-10 19:57:34 optimizer.py:705]  CLOUD   INSTANCE           ACCELERATORS   REGION (ZONE)    COST ($)   CHOSEN
I 08-10 19:57:34 optimizer.py:705] --------------------------------------------------------------------------------
I 08-10 19:57:34 optimizer.py:705]  GCP     n1-highmem-8       V100:1         us-central1      2.95          ✔
I 08-10 19:57:34 optimizer.py:705]  GCP     n1-highmem-8       V100:1         us-west1         2.95
I 08-10 19:57:34 optimizer.py:705]  GCP     n1-highmem-8       V100:1         us-east1         2.95
I 08-10 19:57:34 optimizer.py:705]  AWS     p3.2xlarge         V100:1         us-east-1        3.06
I 08-10 19:57:34 optimizer.py:705]  AWS     p3.2xlarge         V100:1         us-east-2        3.06
I 08-10 19:57:34 optimizer.py:705]  AWS     p3.2xlarge         V100:1         us-west-2        3.06
I 08-10 19:57:34 optimizer.py:705]  Azure   Standard_NC6s_v3   V100:1         eastus           3.06
I 08-10 19:57:34 optimizer.py:705]  Azure   Standard_NC6s_v3   V100:1         eastus2          3.06
I 08-10 19:57:34 optimizer.py:705]  Azure   Standard_NC6s_v3   V100:1         westus2          3.06
I 08-10 19:57:34 optimizer.py:705]  Azure   Standard_NC6s_v3   V100:1         southcentralus   3.37
I 08-10 19:57:34 optimizer.py:705]  Azure   Standard_NC6s_v3   V100:1         centralus        3.46
I 08-10 19:57:34 optimizer.py:705]  Azure   Standard_NC6s_v3   V100:1         westus           3.98
I 08-10 19:57:34 optimizer.py:705] --------------------------------------------------------------------------------

Feels much longer than before. Do you think it makes sense to still keep 1 row per cloud (not showing REGION (ZONE)) but make the COST ($) column correctly pick the cheapest among valid region-zones per cloud?

sky/clouds/cloud.py Outdated Show resolved Hide resolved
sky/resources.py Outdated Show resolved Hide resolved
sky/resources.py Outdated Show resolved Hide resolved
sky/resources.py Outdated
return False

if self.zone is not None and self.zone != other.zone:
return False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add something like L453?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I seems like it was added by #1014?

sky/resources.py Outdated Show resolved Hide resolved
sky/clouds/service_catalog/__init__.py Outdated Show resolved Hide resolved
sky/clouds/service_catalog/aws_catalog.py Outdated Show resolved Hide resolved
sky/clouds/service_catalog/azure_catalog.py Outdated Show resolved Hide resolved
sky/clouds/service_catalog/common.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Show resolved Hide resolved
@WoosukKwon WoosukKwon marked this pull request as draft November 21, 2022 22:57
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/resources.py Outdated Show resolved Hide resolved
@WoosukKwon
Copy link
Collaborator Author

@concretevitamin @Michaelvll
I have a dumb question on one of our CLI tests: what is the right behavior for sky launch --cloud aws --instance-type p3.2xlarge --gpus V100:0.5 ?

  1. Raise an error
  2. Current master branch:
I 12-16 21:00:22 optimizer.py:606] == Optimizer ==
I 12-16 21:00:22 optimizer.py:629] Estimated cost: $3.1 / hour
I 12-16 21:00:22 optimizer.py:629] 
I 12-16 21:00:22 optimizer.py:685] Considered resources (1 node):
I 12-16 21:00:22 optimizer.py:714] -----------------------------------------------------------------
I 12-16 21:00:22 optimizer.py:714]  CLOUD   INSTANCE     vCPUs   ACCELERATORS   COST ($)   CHOSEN   
I 12-16 21:00:22 optimizer.py:714] -----------------------------------------------------------------
I 12-16 21:00:22 optimizer.py:714]  AWS     p3.2xlarge   8       V100:0.5       3.06          ✔     
I 12-16 21:00:22 optimizer.py:714] -----------------------------------------------------------------
  1. Promote V100:0.5 to V100:1 and launch the VM

@concretevitamin
Copy link
Member

Hmm, it's a tricky one. The intention to me reads like "I need a p3.2xlarge node, run this task on it with 0.5 V100 as requirements". That said, it's arguable this should be banned. What do you think about implementing this intention ("promote" 0.5 to 1 only in that optimizer display; for the actual task requirements, still use 0.5 for scheduling)?

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great effort @WoosukKwon! LGTM. I think it is ready to ship.

A kind bump for the previous comment: #975 (review)

what is the right behavior for sky launch --cloud aws --instance-type p3.2xlarge --gpus V100:0.5 ?

For this, I agree with @concretevitamin that we may want to show V100:1 in the optimizer's output but still schedule with the V100:0.5.
Btw, with the current branch, I got ValueError: Accelerator "V100" is not available in "eu-west-2". and the error is non-deterministic (changed to ValueError: Accelerator "V100" is not available in "us-east-2". in the second run).

sky/clouds/gcp.py Show resolved Hide resolved
Comment on lines 1154 to 1157
if to_provision.zone is None:
message = ('Failed to acquire resources in all zones in '
f'{to_provision.region}. Try changing resource '
'requirements or use another region.')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wondered whether the optimizer will always fill in the region for the to_provision for the public clouds. If it is true, I think it should be fine to keep the current way (I could not think of an exception).
nit: It would be nice if we could comment it somewhere saying that the optimizer will always fill region for the public clouds.

@WoosukKwon
Copy link
Collaborator Author

Btw, with the current branch, I got ValueError: Accelerator "V100" is not available in "eu-west-2". and the error is non-deterministic (changed to ValueError: Accelerator "V100" is not available in "us-east-2". in the second run).

Thanks @Michaelvll for pointing this out. The non-deterministic behavior was because we returned Set[Region] in the regions_with_offerings method. When the price is the same for two regions, the optimizer's tie-breaking mechanism respects the order of the regions in the set, which can be non-deterministic. Thus, I changed the return type to List[Region] and made it preserve the order determined by common.py#get_region_zones(), which uses price -> region_str -> zone_str for the sorting key.

@WoosukKwon
Copy link
Collaborator Author

Hmm, it's a tricky one. The intention to me reads like "I need a p3.2xlarge node, run this task on it with 0.5 V100 as requirements". That said, it's arguable this should be banned. What do you think about implementing this intention ("promote" 0.5 to 1 only in that optimizer display; for the actual task requirements, still use 0.5 for scheduling)?

Thanks @concretevitamin for the suggestion. I think we need more discussions on this though; I personally lean toward forbidding such cases. For this PR, I followed the current master's behavior. Let's address the issue in another PR.

sky/resources.py Show resolved Hide resolved
@Michaelvll Michaelvll mentioned this pull request Dec 26, 2022
4 tasks
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the great effort for the fine-grained optimizer @WoosukKwon! Left two nits. It should be good to go. : )

sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/clouds/gcp.py Show resolved Hide resolved
@WoosukKwon WoosukKwon merged commit 63d057b into master Dec 29, 2022
@WoosukKwon WoosukKwon deleted the fine-grained-optimizer branch January 9, 2023 07:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fine-grained optimizer
3 participants