[Spot] Add eager failover strategy #2234

Merged: 10 commits into master, Aug 3, 2023

Conversation

@Michaelvll (Collaborator) commented on Jul 13, 2023

Our users have been experiencing frequent preemptions in the same region when running spot jobs. This is because our original recovery strategy retries in the same region a few times before it starts failing over to other regions. It suffers in the case where the preempted cluster can always be relaunched in the same region, but each new instance survives only a short time. In other words, the original recovery strategy can leave a cluster stuck in a single region and frequently preempted (not enough exploration).

This PR introduces another strategy, EAGER_FAILOVER, which skips the region the previously preempted cluster was in and goes directly to other regions on the first try. This trades off locality/data egress for better availability.

We now default to EAGER_FAILOVER, as it is likely more desirable for our users.

To use the new strategy, the user can specify the following field in their spot job:

resources:
    use_spot: true
    spot_recovery: EAGER_FAILOVER
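The core idea is just an ordering change in where recovery looks for capacity. A minimal sketch (illustrative only, not SkyPilot's actual implementation; the function name is hypothetical):

```python
# Hypothetical sketch of the EAGER_FAILOVER ordering: on recovery, try every
# region except the one the preempted cluster was in, and only fall back to
# that region if nothing else has capacity.

def candidate_regions(all_regions, preempted_region):
    """Order launch candidates so the preempted region is tried last."""
    others = [r for r in all_regions if r != preempted_region]
    return others + [preempted_region]

print(candidate_regions(['us-east-1', 'us-west-2', 'eu-west-1'], 'us-west-2'))
# ['us-east-1', 'eu-west-1', 'us-west-2']
```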

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky spot launch -n test-aggressive test.yaml; manually terminate the cluster on the console; sky spot logs --controller to check that the cluster goes to another region.

      resources:
        cpus: 2
        use_spot: true
        spot_recovery: AGGRESSIVE_FAILOVER

      run: |
        sleep 100000

    • sky spot launch -n test-aggressive --cloud gcp --region us-central1 test.yaml; manually terminate the cluster on the console; sky spot logs --controller to check that the cluster relaunched in the same region.
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@Michaelvll Michaelvll marked this pull request as ready for review July 13, 2023 21:47
@Michaelvll Michaelvll changed the title [Spot] Add aggressive failover strategy [Spot] Add eager failover strategy Jul 14, 2023
@concretevitamin (Member) left a comment:

Thanks for this important improvement @Michaelvll -- did a pass. Probably some of the comments need discussions.

sky/spot/recovery_strategy.py (5 resolved review threads, outdated)
job_submitted_at = self._launch(max_retry=self._MAX_RETRY_CNT,
raise_on_failure=False)
if job_submitted_at is None:
# Failed to launch the cluster.
Member:

We should think about who should set self._launched_cloud_region to None reliably. In _launch()?

Collaborator Author:

I don't think _launch() should set self._launched_cloud_region to None, because then we would not be able to control how many retries of exhausted failover happen without the current region before we start failing over with the current region included.

Member:

(discussed offline) Maybe we should put it in _launch() to make it the sole accessor.
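The "sole accessor" idea discussed here can be sketched as follows (class and attribute names mirror the discussion; the launch logic itself is mocked and hypothetical):

```python
# Sketch: only _launch() ever mutates self._launched_cloud_region, so every
# other method sees a value that is consistent with the last launch attempt.

class StrategyExecutor:
    def __init__(self):
        # (cloud, region) of the currently live cluster, or None.
        self._launched_cloud_region = None

    def _launch(self, region=None, succeed=True):
        # Reset first, so a failed launch never leaves a stale value behind.
        self._launched_cloud_region = None
        if not succeed:
            return None  # failed to launch the cluster
        self._launched_cloud_region = ('aws', region or 'us-east-1')
        return 1690000000.0  # mock job submission timestamp
```

With this invariant, callers such as recover() never need to remember to clear the attribute on failure paths.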

def recover(self) -> float:
# 1. Terminate the current cluster
# 2. Launch the cluster without retrying the previously launched region
# 3. Launch the cluster with no cloud/region constraint or respect the
#    original user specification.
Member:

What does "or respect the original user specification." mean? It seems like we should respect the original requirements.

Collaborator Author:

Changed to resources requirements. Thanks!

Member:

How about:
2. Launch again by explicitly blocking the previously launched region (this will failover through the entire search space except the previously launched region)
3. (If step 2 failed) Retry forever: Launch again with no blocked locations (this will failover through the entire search space)

The entire search space is defined by the original task request, task.resources.
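The suggested steps can be sketched as a small control flow (hypothetical helpers `launch` and `terminate_cluster` stand in for the real methods; `max_retries` bounds the "retry forever" loop for illustration):

```python
# Sketch of the three recovery steps as worded above.

def recover(launch, terminate_cluster, prev_region, max_retries=100):
    # 1. Terminate the current (preempted) cluster.
    terminate_cluster()
    # 2. Launch again, explicitly blocking the previously launched region:
    #    this fails over through the entire search space except prev_region.
    submitted_at = launch(blocked_regions=[prev_region])
    if submitted_at is not None:
        return submitted_at
    # 3. If step 2 failed, retry with no blocked locations, i.e. the full
    #    search space defined by the original task.resources.
    for _ in range(max_retries):
        submitted_at = launch(blocked_regions=[])
        if submitted_at is not None:
            return submitted_at
    raise RuntimeError('Failed to recover the cluster.')
```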

Collaborator Author:

Good point! Updated. Thanks!

sky/spot/recovery_strategy.py (resolved review thread)
sky/task.py (resolved review thread, outdated)
sky/backends/cloud_vm_ray_backend.py (resolved review thread)
@GadiZimerman commented:
@CodiumAI-Agent /review

@CodiumAI-Agent commented:

PR Analysis

  • 🎯 Main theme: Adding a new failover strategy for spot jobs
  • 📌 Type of PR: Enhancement
  • 🧪 Relevant tests added: Yes
  • Focused PR: Yes, the PR is focused as it introduces a new failover strategy for spot jobs and all changes are related to this enhancement.
  • 🔒 Security concerns: No, the PR does not introduce any apparent security concerns. It mainly deals with the failover strategy for spot jobs and does not involve any security-sensitive operations or data.

PR Feedback

  • 💡 General PR suggestions: The PR is well-structured and the changes are well-documented. The new failover strategy is a good addition to handle frequent preemptions in the same region. However, it would be beneficial to add more comments in the code to explain the logic behind the new strategy and its expected behavior.


@@ -194,6 +195,9 @@ def __init__(
self.estimated_outputs_size_gigabytes = None
# Default to CPUNode
self.resources = {sky.Resources()}
# Resources that this task cannot run on.
self.blocked_resources = blocked_resources


Consider adding a comment to explain the purpose of the blocked_resources attribute in the Task class. This will help developers understand its role and how it is used. [medium]

@MaoZiming (Collaborator) commented on Jul 22, 2023

Added CurrentPolicy to the simulation.

CurrentPolicy: Retry the current zone three times. If all preempted within a short time (e.g. changeover delay), then pick a different zone.
Eager Random: Pick a random zone different from the current preempted one
Static Order: Pick a different zone following an order (e.g. by cost)
Epsilon Greedy: Exploit (pick a zone with fewer past preemptions in a window) + Explore (pick a different zone at random) with epsilon probability
Upper Confidence Bound: A time-window based multi-armed bandit. Exploit (pick a zone with more reward) + Explore (upweight zones that are seldom picked). Reward: uptime - changeover delay if uptime > changeover delay, else 0
Optimal: Solved with ILP

7 days of Spot V100 real traces in 8 US zones. Frequency: 3min. Job duration = 60 hours. Deadline: 7 days. Delay: 1.5 hours. Averaged over 100 runs

              strategy  job_fail_rate  avg_finish_time  total_vm_time  total_vm_cost
0   CurrentPolicy(3,1)         100.00              NaN            NaN            NaN
1  CurrentPolicy(3,30)          19.00           114.07          63.78       1,242.43
2          NaiveRandom          52.00           115.40          63.62       1,328.43
3          EagerRandom           0.00            84.38          71.75       1,402.70
4       StaticOrder(0)           0.00            83.35          71.57       1,332.24
5   EpsGreedy(200,0.2)           0.00            84.55          72.41       1,351.99
6             UCB(100)           0.00            83.15          69.60       1,292.75
7              Optimal           0.00            80.60          64.50       1,184.22

CurrentPolicy(3,1) - 3 retries, 1 = immediately preempted
CurrentPolicy(3,30) - 3 retries, preempted within changeover delay (30 * 3 mins)
StaticOrder(0) - Start from 0-th zone.
EpsGreedy(200,0.2) - Window size 200, Eps = 0.2
UCB(100) - Window Size 100
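For concreteness, the EpsGreedy(window, eps) policy described above can be sketched as follows (function and parameter names are illustrative, not from the simulation code):

```python
# Hypothetical sketch of epsilon-greedy zone selection: with probability eps,
# explore a random different zone; otherwise exploit the zone with the fewest
# preemptions observed in the recent window.
import random
from collections import Counter

def eps_greedy_pick(zones, current_zone, preemption_history,
                    window=200, eps=0.2, rng=random):
    """preemption_history: zones of past preemptions, oldest first."""
    recent = Counter(preemption_history[-window:])
    candidates = [z for z in zones if z != current_zone]
    if rng.random() < eps:
        return rng.choice(candidates)                 # explore
    return min(candidates, key=lambda z: recent[z])   # exploit
```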

@concretevitamin (Member) commented:

@MaoZiming Wow, this is very nice. IIUC, this is saying our master branch's policy optimizes for low cost but risks getting a lot of preemptions/wasting time? While the current PR's policy would significantly lower preemptions and possibly increase a little bit of cost.

What does the second number in CurrentPolicy(3,30) stand for?

@MaoZiming (Collaborator) commented on Jul 22, 2023

@concretevitamin I think so. CurrentPolicy(num_retry, retry_time).
If the spot instance that is relaunched in the same zone gets preempted again within retry_time * 3 mins for num_retry of times, we next launch the instance in a different zone.
CurrentPolicy(3,30) means we check whether the instance is preempted within the changeover delay (cold start time, etc.) = 1.5 hours to decide whether we can pick a different zone next time. The idea is to use a more relaxed bound for checking since if the instance is preempted during cold start it is not doing any work.
CurrentPolicy(3,1) means we check whether the instance is immediately preempted (or launch unsuccessful)

@concretevitamin (Member) left a comment:

Thanks @Michaelvll!

In the YAML file, the user can specify the strategy to use for spot jobs.

resources:
  spot_recovery: EAGER_FAILOVER
Member:

How about "EAGER_NEXT_REGION"?

Collaborator Author:

Good point! Fixed. Thanks!

def recover(self) -> float:
# 1. Terminate the current cluster
# 2. Launch the cluster without retrying the previously launched region
# 3. Launch the cluster with no cloud/region constraint or respect the
#    original user specification.
Member:

How about:
2. Launch again by explicitly blocking the previously launched region (this will failover through the entire search space except the previously launched region)
3. (If step 2 failed) Retry forever: Launch again with no blocked locations (this will failover through the entire search space)

The entire search space is defined by the original task request, task.resources.

sky/spot/recovery_strategy.py (resolved review thread)
job_submitted_at = self._launch(max_retry=self._MAX_RETRY_CNT,
raise_on_failure=False)
if job_submitted_at is None:
# Failed to launch the cluster.
Member:

(discussed offline) Maybe we should put it in _launch() to make it the sole accessor.

sky/spot/recovery_strategy.py (resolved review thread, outdated)
@Michaelvll (Collaborator Author) commented on Aug 3, 2023

Tested:

  • pytest tests/test_smoke.py --managed-spot
  • pytest tests/test_smoke.py --managed-spot --aws
  • sky spot launch -n text-next-region --cpus 2+ --cloud gcp; manually delete the spot cluster and check that it fails over to the next region; manually delete the spot cluster again and check that it fails over back to the first region.

@Michaelvll Michaelvll merged commit ca2a092 into master Aug 3, 2023
@Michaelvll Michaelvll deleted the spot-aggresive-failover branch August 3, 2023 16:12