[Spot] Add eager failover strategy #2234
Conversation
Thanks for this important improvement @Michaelvll -- did a pass. Some of the comments probably need discussion.
job_submitted_at = self._launch(max_retry=self._MAX_RETRY_CNT,
                                raise_on_failure=False)
if job_submitted_at is None:
    # Failed to launch the cluster.
We should think about who should set self._launched_cloud_region to None reliably. In _launch()?
I don't think we should have _launch() set self._launched_cloud_region to None, because then we would lose control over how many failover retries exclude the current region before we start failing over with the current region included.
(discussed offline) Maybe we should put it in _launch() to make it the sole accessor.
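A minimal sketch of the option discussed here, with hypothetical names and simplified signatures (not the actual SkyPilot implementation): _launch() is the only place that reads or writes self._launched_cloud_region, clearing it up front and re-populating it only on a successful launch.

```python
import time
from typing import Optional, Tuple


class StrategyExecutorSketch:
    """Illustrative only: _launch() is the sole accessor of _launched_cloud_region."""

    def __init__(self) -> None:
        # (cloud, region) of the most recent successful launch, or None.
        self._launched_cloud_region: Optional[Tuple[str, str]] = None

    def _launch(self,
                max_retry: int = 3,
                raise_on_failure: bool = True) -> Optional[float]:
        # Clear the record up front so a failed launch never leaves a stale
        # region behind; only a successful launch re-populates it.
        self._launched_cloud_region = None
        for _ in range(max_retry):
            launched = self._try_provision_once()  # hypothetical helper
            if launched is not None:
                self._launched_cloud_region = launched
                return time.time()  # stands in for job_submitted_at
        if raise_on_failure:
            raise RuntimeError('Failed to launch the cluster.')
        return None

    def _try_provision_once(self) -> Optional[Tuple[str, str]]:
        # Placeholder for the real cloud provisioning call.
        return None
```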
sky/spot/recovery_strategy.py
Outdated
def recover(self) -> float:
    # 1. Terminate the current cluster
    # 2. Launch the cluster without retrying the previously launched region
    # 3. Launch the cluster with no cloud/region constraint or respect the
What does "or respect the original user specification." mean? It seems like we should respect the original requirements.
Changed to resource requirements. Thanks!
How about:
2. Launch again by explicitly blocking the previously launched region (this will failover through the entire search space except the previously launched region)
3. (If step 2 failed) Retry forever: Launch again with no blocked locations (this will failover through the entire search space)
The entire search space is defined by the original task request, task.resources.
Good point! Updated. Thanks!
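For reference, a rough sketch of the recovery flow described in the suggested steps above, using hypothetical names (_terminate_cluster, _launch, blocked_regions) and simplified signatures; the actual SkyPilot code differs.

```python
import time
from typing import List, Optional


class EagerFailoverSketch:
    """Illustrative only; not the actual SkyPilot recovery strategy."""

    RETRY_GAP_SECONDS = 60

    def __init__(self) -> None:
        self._launched_cloud_region: Optional[str] = None

    def recover(self) -> float:
        # 1. Terminate the current (preempted) cluster.
        self._terminate_cluster()

        # 2. Launch again, explicitly blocking the previously launched
        #    region: this fails over through the entire search space
        #    defined by task.resources, except that one region.
        blocked: List[str] = (
            [self._launched_cloud_region]
            if self._launched_cloud_region is not None else [])
        job_submitted_at = self._launch(blocked_regions=blocked,
                                        raise_on_failure=False)
        if job_submitted_at is not None:
            return job_submitted_at

        # 3. (If step 2 failed) Retry forever with no blocked locations,
        #    i.e. the entire search space from the original task.resources.
        while True:
            job_submitted_at = self._launch(blocked_regions=[],
                                            raise_on_failure=False)
            if job_submitted_at is not None:
                return job_submitted_at
            time.sleep(self.RETRY_GAP_SECONDS)

    # Placeholders standing in for the real provisioning calls
    # (they always "fail" in this sketch).
    def _terminate_cluster(self) -> None:
        pass

    def _launch(self, blocked_regions: List[str],
                raise_on_failure: bool = True) -> Optional[float]:
        return None
```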
@CodiumAI-Agent /review
@@ -194,6 +195,9 @@ def __init__(
         self.estimated_outputs_size_gigabytes = None
         # Default to CPUNode
         self.resources = {sky.Resources()}
+        # Resources that this task cannot run on.
+        self.blocked_resources = blocked_resources
Consider adding a comment to explain the purpose of the blocked_resources attribute in the Task class. This will help developers understand its role and how it is used. [medium]
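As an illustration of this suggestion, a simplified sketch of how the documented constructor could look; TaskSketch and the placeholder Resources class are trimmed-down stand-ins for sky.task.Task and sky.Resources, not the real API.

```python
from typing import Iterable, Optional


class Resources:
    """Placeholder standing in for sky.Resources in this sketch."""


class TaskSketch:
    """Trimmed-down stand-in for sky.task.Task, for illustration only."""

    def __init__(self,
                 blocked_resources: Optional[Iterable[Resources]] = None):
        # Resources (e.g. a specific cloud/region combination) that this task
        # must NOT be launched on; failover/recovery skips any location that
        # matches one of these when searching for a feasible placement.
        self.blocked_resources = blocked_resources
```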
Added CurrentPolicy to the simulation. CurrentPolicy: retry the current zone three times; if all retries are preempted within a short time (e.g. the changeover delay), pick a different zone. Setup: 7 days of real Spot V100 traces in 8 US zones; sampling frequency: 3 min; job duration: 60 hours; deadline: 7 days; delay: 1.5 hours; averaged over 100 runs.
CurrentPolicy(3,1) means 3 retries, with 1 = immediately preempted.
@MaoZiming Wow, this is very nice. IIUC, this is saying our master branch's policy optimizes for low cost but risks a lot of preemptions/wasted time, while the current PR's policy would significantly lower preemptions and possibly increase cost a little. What does the second number in CurrentPolicy(3,1) mean?
@concretevitamin I think so.
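If helpful, a rough sketch of the CurrentPolicy(3,1) rule described above, with assumed semantics for the "preempted within a short time" check (function name, parameters, and the 1.5-hour threshold are illustrative; the real simulator may differ).

```python
from typing import List


def current_policy_next_zone(current_zone: str,
                             zones: List[str],
                             recent_lifetimes_hours: List[float],
                             num_retries: int = 3,
                             quick_preempt_hours: float = 1.5) -> str:
    """Retry the current zone up to num_retries times; switch zones only if
    every recent attempt there was preempted within a short window."""
    # Look at the lifetimes of the most recent attempts in the current zone.
    recent = recent_lifetimes_hours[-num_retries:]
    exhausted = len(recent) >= num_retries
    all_quickly_preempted = all(t <= quick_preempt_hours for t in recent)
    if exhausted and all_quickly_preempted:
        # Give up on the current zone and pick a different one.
        others = [z for z in zones if z != current_zone]
        return others[0] if others else current_zone
    # Otherwise keep retrying in the same zone.
    return current_zone
```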
Thanks @Michaelvll!
sky/spot/recovery_strategy.py
Outdated
In the YAML file, the user can specify the strategy to use for spot jobs.

    resources:
      spot_recovery: EAGER_FAILOVER
How about "EAGER_NEXT_REGION"?
Good point! Fixed. Thanks!
Our users have been experiencing frequent preemptions in the same region when running spot jobs. This is because our original recovery strategy retries in the same region a few times before it starts failing over to other regions. It suffers in the case where the preempted cluster can always be relaunched in the same region but the new instance only survives for a very short time. As a result, the original recovery strategy can leave a cluster stuck in a single region and frequently preempted (not enough exploration).

This PR introduces a new EAGER_FAILOVER strategy, which skips the region the previously preempted cluster was in and goes directly to other regions on the first retry. This trades locality/data egress for better availability. We now default to EAGER_FAILOVER, as it might be more desirable for our users.

To use the new strategy, the user can set spot_recovery: EAGER_FAILOVER under resources in the spot job YAML (see the snippet discussed above).
Tested (run the relevant ones):
- bash format.sh
- sky spot launch -n test-aggressive test.yaml; manually terminated the cluster on the console; sky spot logs --controller to check that the cluster moves to another region.
- sky spot launch -n test-aggressive --cloud gcp --region us-central1 test.yaml; manually terminated the cluster on the console; sky spot logs --controller to check that the cluster is relaunched in the same region.
- pytest tests/test_smoke.py
- pytest tests/test_smoke.py::test_fill_in_the_name
- bash tests/backward_comaptibility_tests.sh