-
Notifications
You must be signed in to change notification settings - Fork 539
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimizing & Provisioning Retries at the granularity of regions/zones #975
Conversation
Thanks for the fantastic and quick work! Just scanned the PR and have several questions:
I will take a deeper look at the modifications soon. : ) Future TODO: We may want to consider adding the sky storages with object store to the |
@Michaelvll That's a great idea. Especially in the copy mode, Sky can do that since it knows the amount of data in the bucket. But the problem is 1) I'm not sure if Sky can accurately estimate the egress cost (because I think the formula is pretty complex), 2) Sky should also consider the VM idle time caused by the data egress, which is hard to estimate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice @WoosukKwon! Did a quick pass, except for optimizer.py.
A high-level question:
» sky cpunode
I 08-10 19:57:09 optimizer.py:602] == Optimizer ==
I 08-10 19:57:09 optimizer.py:614] Target: minimizing cost
I 08-10 19:57:09 optimizer.py:625] Estimated cost: $0.4 / hour
I 08-10 19:57:09 optimizer.py:625]
I 08-10 19:57:09 optimizer.py:677] Considered resources (1 node):
I 08-10 19:57:09 optimizer.py:705] ------------------------------------------------------------------------------
I 08-10 19:57:09 optimizer.py:705] CLOUD INSTANCE ACCELERATORS REGION (ZONE) COST ($) CHOSEN
I 08-10 19:57:09 optimizer.py:705] ------------------------------------------------------------------------------
I 08-10 19:57:09 optimizer.py:705] AWS m6i.2xlarge - us-east-1 0.38 ✔
I 08-10 19:57:09 optimizer.py:705] AWS m6i.2xlarge - us-east-2 0.38
I 08-10 19:57:09 optimizer.py:705] AWS m6i.2xlarge - us-west-2 0.38
I 08-10 19:57:09 optimizer.py:705] Azure Standard_D8_v4 - eastus 0.38
I 08-10 19:57:09 optimizer.py:705] Azure Standard_D8_v4 - eastus2 0.38
I 08-10 19:57:09 optimizer.py:705] Azure Standard_D8_v4 - northcentralus 0.38
I 08-10 19:57:09 optimizer.py:705] Azure Standard_D8_v4 - westus2 0.38
I 08-10 19:57:09 optimizer.py:705] Azure Standard_D8_v4 - centralus 0.43
I 08-10 19:57:09 optimizer.py:705] AWS m6i.2xlarge - us-west-1 0.45
I 08-10 19:57:09 optimizer.py:705] Azure Standard_D8_v4 - westus 0.45
I 08-10 19:57:09 optimizer.py:705] Azure Standard_D8_v4 - southcentralus 0.46
I 08-10 19:57:09 optimizer.py:705] Azure Standard_D8_v4 - westcentralus 0.46
I 08-10 19:57:09 optimizer.py:705] GCP n1-highmem-8 - us-west1 0.47
I 08-10 19:57:09 optimizer.py:705] GCP n1-highmem-8 - us-central1 0.47
I 08-10 19:57:09 optimizer.py:705] GCP n1-highmem-8 - us-east1 0.47
I 08-10 19:57:09 optimizer.py:705] GCP n1-highmem-8 - us-east4 0.47
I 08-10 19:57:09 optimizer.py:705] GCP n1-highmem-8 - us-west2 0.47
I 08-10 19:57:09 optimizer.py:705] GCP n1-highmem-8 - us-west4 0.47
I 08-10 19:57:09 optimizer.py:705] ------------------------------------------------------------------------------
» sky gpunode --gpus V100 1 ↵
I 08-10 19:57:34 optimizer.py:602] == Optimizer ==
I 08-10 19:57:34 optimizer.py:614] Target: minimizing cost
I 08-10 19:57:34 optimizer.py:625] Estimated cost: $3.0 / hour
I 08-10 19:57:34 optimizer.py:625]
I 08-10 19:57:34 optimizer.py:677] Considered resources (1 node):
I 08-10 19:57:34 optimizer.py:705] --------------------------------------------------------------------------------
I 08-10 19:57:34 optimizer.py:705] CLOUD INSTANCE ACCELERATORS REGION (ZONE) COST ($) CHOSEN
I 08-10 19:57:34 optimizer.py:705] --------------------------------------------------------------------------------
I 08-10 19:57:34 optimizer.py:705] GCP n1-highmem-8 V100:1 us-central1 2.95 ✔
I 08-10 19:57:34 optimizer.py:705] GCP n1-highmem-8 V100:1 us-west1 2.95
I 08-10 19:57:34 optimizer.py:705] GCP n1-highmem-8 V100:1 us-east1 2.95
I 08-10 19:57:34 optimizer.py:705] AWS p3.2xlarge V100:1 us-east-1 3.06
I 08-10 19:57:34 optimizer.py:705] AWS p3.2xlarge V100:1 us-east-2 3.06
I 08-10 19:57:34 optimizer.py:705] AWS p3.2xlarge V100:1 us-west-2 3.06
I 08-10 19:57:34 optimizer.py:705] Azure Standard_NC6s_v3 V100:1 eastus 3.06
I 08-10 19:57:34 optimizer.py:705] Azure Standard_NC6s_v3 V100:1 eastus2 3.06
I 08-10 19:57:34 optimizer.py:705] Azure Standard_NC6s_v3 V100:1 westus2 3.06
I 08-10 19:57:34 optimizer.py:705] Azure Standard_NC6s_v3 V100:1 southcentralus 3.37
I 08-10 19:57:34 optimizer.py:705] Azure Standard_NC6s_v3 V100:1 centralus 3.46
I 08-10 19:57:34 optimizer.py:705] Azure Standard_NC6s_v3 V100:1 westus 3.98
I 08-10 19:57:34 optimizer.py:705] --------------------------------------------------------------------------------
Feels much longer than before. Do you think it makes sense to still keep 1 row per cloud (not showing REGION (ZONE)
) but make the COST ($)
column correctly pick the cheapest among valid region-zones per cloud?
sky/resources.py
Outdated
return False | ||
|
||
if self.zone is not None and self.zone != other.zone: | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add something like L453?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I seems like it was added by #1014?
@concretevitamin @Michaelvll
|
Hmm, it's a tricky one. The intention to me reads like "I need a p3.2xlarge node, run this task on it with 0.5 V100 as requirements". That said, it's arguable this should be banned. What do you think about implementing this intention ("promote" 0.5 to 1 only in that optimizer display; for the actual task requirements, still use 0.5 for scheduling)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the great effort @WoosukKwon! LGTM. I think it is ready to ship.
A kind bump for the previous comment: #975 (review)
what is the right behavior for sky launch --cloud aws --instance-type p3.2xlarge --gpus V100:0.5 ?
For this, I agree with @concretevitamin that we may want to show V100:1
in the optimizer's output but still schedule with the V100:0.5
.
Btw, with the current branch, I got ValueError: Accelerator "V100" is not available in "eu-west-2".
and the error is non-deterministic (changed to ValueError: Accelerator "V100" is not available in "us-east-2".
in the second run).
sky/backends/cloud_vm_ray_backend.py
Outdated
if to_provision.zone is None: | ||
message = ('Failed to acquire resources in all zones in ' | ||
f'{to_provision.region}. Try changing resource ' | ||
'requirements or use another region.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wondered whether the optimizer will always fill in the region for the to_provision
for the public clouds. If it is true, I think it should be fine to keep the current way (I could not think of an exception).
nit: It would be nice if we could comment it somewhere saying that the optimizer will always fill region
for the public clouds.
Thanks @Michaelvll for pointing this out. The non-deterministic behavior was because we returned |
Thanks @concretevitamin for the suggestion. I think we need more discussions on this though; I personally lean toward forbidding such cases. For this PR, I followed the current master's behavior. Let's address the issue in another PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the great effort for the fine-grained optimizer @WoosukKwon! Left two nits. It should be good to go. : )
This PR closes #930
For on-demand instances, the optimizer selects the cloud region (e.g., AWS us-west-1) to run the task on. And a provisioning request is made for that region. If it fails, Sky re-runs the optimizer and retries for the next cheapest cloud region (e.g., it can be GCP us-east1).
For spot instances, the optimizer additionally decides the availability zone (e.g., AWS us-west-1a). A provisioning request is made for that zone. If it fails, Sky re-runs the optimizer and retries for other zones (possibly in other clouds and regions).
This PR also fixes many bugs when a region is specified with
sky launch
. For example,sky launch "" -c test --gpus V100 --cloud aws --region us-west-1
: us-west-1 does not have p3 instances. However, Sky does not return errors before launching.sky launch "" -c test --gpus K80 --region southcentralus
: The price of Standard_NC6_Promo in the southcentralus region is $ 0.22 / hour, but the optimizer says it is $ 0.18/hour, which is the price of the instance in other regions.Tests:
sky launch "" --use-spot --cloud aws --gpus A100:8
sky tpunode --region europe-west4
(NOTE: this command fails because our GCP catalog does not have any N1 machines for the regions outside the US.)sky tpunode --region europe-west4 --tpu-vm
(NOTE: this command succeeds because TPU VMs do not require N1 machines.)