[Spot] Add eager failover strategy #2234

Merged: 10 commits into master, Aug 3, 2023

Conversation

@Michaelvll (Collaborator) commented on Jul 13, 2023

Our users have been experiencing frequent preemptions in the same region when running spot jobs. This is because our original recovery strategy retries in the same region a few times before it starts failing over to other regions. It suffers in the case where the preempted cluster can always be relaunched in the same region, but each new instance survives only a short time. In other words, the original recovery strategy can leave a cluster stuck in a single region and frequently preempted (not enough exploration).

This PR introduces another strategy, EAGER_FAILOVER, which skips the region the previously preempted cluster was in and goes directly to other regions on the first try. This trades off locality/data egress for better availability.

We now default to EAGER_FAILOVER, as it is likely more desirable for our users.

To use the new strategy, the user can specify the following field in their spot job:

resources:
    use_spot: true
    spot_recovery: EAGER_FAILOVER
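The core idea is just an ordering change in where recovery looks for capacity. A minimal sketch (illustrative only, not SkyPilot's actual implementation; the function name is hypothetical):

```python
# Hypothetical sketch of the EAGER_FAILOVER ordering: on recovery, try every
# region except the one the preempted cluster was in, and only fall back to
# that region if nothing else has capacity.

def candidate_regions(all_regions, preempted_region):
    """Order launch candidates so the preempted region is tried last."""
    others = [r for r in all_regions if r != preempted_region]
    return others + [preempted_region]

print(candidate_regions(['us-east-1', 'us-west-2', 'eu-west-1'], 'us-west-2'))
# ['us-east-1', 'eu-west-1', 'us-west-2']
```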

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky spot launch -n test-aggressive test.yaml; manually terminate the cluster on the console; sky spot logs --controller to check that the cluster goes to another region.

      resources:
        cpus: 2
        use_spot: true
        spot_recovery: AGGRESSIVE_FAILOVER

      run: |
        sleep 100000

    • sky spot launch -n test-aggressive --cloud gcp --region us-central1 test.yaml; manually terminate the cluster on the console; sky spot logs --controller to check that the cluster relaunched in the same region.
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@Michaelvll Michaelvll marked this pull request as ready for review July 13, 2023 21:47
@Michaelvll Michaelvll changed the title [Spot] Add aggressive failover strategy [Spot] Add eager failover strategy Jul 14, 2023
@concretevitamin (Member) left a comment:

Thanks for this important improvement @Michaelvll -- did a pass. Probably some of the comments need discussions.

sky/spot/recovery_strategy.py (5 resolved review threads, outdated)
job_submitted_at = self._launch(max_retry=self._MAX_RETRY_CNT,
raise_on_failure=False)
if job_submitted_at is None:
# Failed to launch the cluster.
Member:

We should think about who should set self._launched_cloud_region to None reliably. In _launch()?

Collaborator Author:

I don't think _launch() should set self._launched_cloud_region to None, because then we would not be able to control how many retries of exhausted failover happen without the current region before we start failing over with the current region included.

Member:

(discussed offline) Maybe we should put it in _launch() to make it the sole accessor.
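The "sole accessor" idea discussed here can be sketched as follows (class and attribute names mirror the discussion; the launch logic itself is mocked and hypothetical):

```python
# Sketch: only _launch() ever mutates self._launched_cloud_region, so every
# other method sees a value that is consistent with the last launch attempt.

class StrategyExecutor:
    def __init__(self):
        # (cloud, region) of the currently live cluster, or None.
        self._launched_cloud_region = None

    def _launch(self, region=None, succeed=True):
        # Reset first, so a failed launch never leaves a stale value behind.
        self._launched_cloud_region = None
        if not succeed:
            return None  # failed to launch the cluster
        self._launched_cloud_region = ('aws', region or 'us-east-1')
        return 1690000000.0  # mock job submission timestamp
```

With this invariant, callers such as recover() never need to remember to clear the attribute on failure paths.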

def recover(self) -> float:
# 1. Terminate the current cluster
# 2. Launch the cluster without retrying the previously launched region
# 3. Launch the cluster with no cloud/region constraint or respect the
#    original user specification.
Member:

What does "or respect the original user specification." mean? It seems like we should respect the original requirements.

Collaborator Author:

Changed to resources requirements. Thanks!

Member:

How about:
2. Launch again by explicitly blocking the previously launched region (this will failover through the entire search space except the previously launched region)
3. (If step 2 failed) Retry forever: Launch again with no blocked locations (this will failover through the entire search space)

The entire search space is defined by the original task request, task.resources.
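The suggested steps can be sketched as a small control flow (hypothetical helpers `launch` and `terminate_cluster` stand in for the real methods; `max_retries` bounds the "retry forever" loop for illustration):

```python
# Sketch of the three recovery steps as worded above.

def recover(launch, terminate_cluster, prev_region, max_retries=100):
    # 1. Terminate the current (preempted) cluster.
    terminate_cluster()
    # 2. Launch again, explicitly blocking the previously launched region:
    #    this fails over through the entire search space except prev_region.
    submitted_at = launch(blocked_regions=[prev_region])
    if submitted_at is not None:
        return submitted_at
    # 3. If step 2 failed, retry with no blocked locations, i.e. the full
    #    search space defined by the original task.resources.
    for _ in range(max_retries):
        submitted_at = launch(blocked_regions=[])
        if submitted_at is not None:
            return submitted_at
    raise RuntimeError('Failed to recover the cluster.')
```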

Collaborator Author:

Good point! Updated. Thanks!

sky/spot/recovery_strategy.py (resolved review thread)
sky/task.py (resolved review thread, outdated)
sky/backends/cloud_vm_ray_backend.py (resolved review thread)
@GadiZimerman commented:
@CodiumAI-Agent /review

@CodiumAI-Agent commented:

PR Analysis

  • 🎯 Main theme: Adding a new failover strategy for spot jobs
  • 📌 Type of PR: Enhancement
  • 🧪 Relevant tests added: Yes
  • Focused PR: Yes, the PR is focused as it introduces a new failover strategy for spot jobs and all changes are related to this enhancement.
  • 🔒 Security concerns: No, the PR does not introduce any apparent security concerns. It mainly deals with the failover strategy for spot jobs and does not involve any security-sensitive operations or data.

PR Feedback

  • 💡 General PR suggestions: The PR is well-structured and the changes are well-documented. The new failover strategy is a good addition to handle frequent preemptions in the same region. However, it would be beneficial to add more comments in the code to explain the logic behind the new strategy and its expected behavior.


@@ -194,6 +195,9 @@ def __init__(
self.estimated_outputs_size_gigabytes = None
# Default to CPUNode
self.resources = {sky.Resources()}
# Resources that this task cannot run on.
self.blocked_resources = blocked_resources


Consider adding a comment to explain the purpose of the blocked_resources attribute in the Task class. This will help developers understand its role and how it is used. [medium]

@MaoZiming (Collaborator) commented on Jul 22, 2023

Added CurrentPolicy to the simulation.

CurrentPolicy: Retry the current zone three times. If all preempted within a short time (e.g. changeover delay), then pick a different zone.
Eager Random: Pick a random zone different from the current preempted one
Static Order: Pick a different zone following an order (e.g. by cost)
Epsilon Greedy: Exploit (pick a zone with fewer past preemptions in a window) + Explore (pick a different zone at random) with epsilon probability
Upper Confidence Bound: A time-window based multi-armed bandit. Exploit (pick a zone with more reward) + Explore (upweight zones that are seldom picked). Reward: uptime - changeover delay if uptime > changeover delay, else 0
Optimal: Solved with ILP

7 days of Spot V100 real traces in 8 US zones. Frequency: 3min. Job duration = 60 hours. Deadline: 7 days. Delay: 1.5 hours. Averaged over 100 runs

              strategy  job_fail_rate  avg_finish_time  total_vm_time  total_vm_cost
0   CurrentPolicy(3,1)         100.00              NaN            NaN            NaN
1  CurrentPolicy(3,30)          19.00           114.07          63.78       1,242.43
2          NaiveRandom          52.00           115.40          63.62       1,328.43
3          EagerRandom           0.00            84.38          71.75       1,402.70
4       StaticOrder(0)           0.00            83.35          71.57       1,332.24
5   EpsGreedy(200,0.2)           0.00            84.55          72.41       1,351.99
6             UCB(100)           0.00            83.15          69.60       1,292.75
7              Optimal           0.00            80.60          64.50       1,184.22

CurrentPolicy(3,1) - 3 retries, 1 = immediately preempted
CurrentPolicy(3,30) - 3 retries, preempted within changeover delay (30 * 3 mins)
StaticOrder(0) - Start from 0-th zone.
EpsGreedy(200,0.2) - Window size 200, Eps = 0.2
UCB(100) - Window Size 100
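For concreteness, the EpsGreedy(window, eps) policy described above can be sketched as follows (function and parameter names are illustrative, not from the simulation code):

```python
# Hypothetical sketch of epsilon-greedy zone selection: with probability eps,
# explore a random different zone; otherwise exploit the zone with the fewest
# preemptions observed in the recent window.
import random
from collections import Counter

def eps_greedy_pick(zones, current_zone, preemption_history,
                    window=200, eps=0.2, rng=random):
    """preemption_history: zones of past preemptions, oldest first."""
    recent = Counter(preemption_history[-window:])
    candidates = [z for z in zones if z != current_zone]
    if rng.random() < eps:
        return rng.choice(candidates)                 # explore
    return min(candidates, key=lambda z: recent[z])   # exploit
```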

@concretevitamin (Member) commented:

@MaoZiming Wow, this is very nice. IIUC, this is saying our master branch's policy optimizes for low cost but risks getting a lot of preemptions/wasting time? While the current PR's policy would significantly lower preemptions and possibly increase a little bit of cost.

What does the second number in CurrentPolicy(3,30) stand for?

@MaoZiming (Collaborator) commented on Jul 22, 2023

@concretevitamin I think so. CurrentPolicy(num_retry, retry_time).
If the spot instance that is relaunched in the same zone gets preempted again within retry_time * 3 mins for num_retry of times, we next launch the instance in a different zone.
CurrentPolicy(3,30) means we check whether the instance is preempted within the changeover delay (cold start time, etc.) = 1.5 hours to decide whether we can pick a different zone next time. The idea is to use a more relaxed bound for checking since if the instance is preempted during cold start it is not doing any work.
CurrentPolicy(3,1) means we check whether the instance is immediately preempted (or launch unsuccessful)

@concretevitamin (Member) left a comment:

Thanks @Michaelvll!

In the YAML file, the user can specify the strategy to use for spot jobs.

resources:
  spot_recovery: EAGER_FAILOVER
Member:

How about "EAGER_NEXT_REGION"?

Collaborator Author:

Good point! Fixed. Thanks!

def recover(self) -> float:
# 1. Terminate the current cluster
# 2. Launch the cluster without retrying the previously launched region
# 3. Launch the cluster with no cloud/region constraint or respect the
#    original user specification.
Member:

How about:
2. Launch again by explicitly blocking the previously launched region (this will failover through the entire search space except the previously launched region)
3. (If step 2 failed) Retry forever: Launch again with no blocked locations (this will failover through the entire search space)

The entire search space is defined by the original task request, task.resources.

sky/spot/recovery_strategy.py (resolved review thread)
job_submitted_at = self._launch(max_retry=self._MAX_RETRY_CNT,
raise_on_failure=False)
if job_submitted_at is None:
# Failed to launch the cluster.
Member:

(discussed offline) Maybe we should put it in _launch() to make it the sole accessor.

sky/spot/recovery_strategy.py (resolved review thread, outdated)
@Michaelvll (Collaborator Author) commented on Aug 3, 2023

Tested:

  • pytest tests/test_smoke.py --managed-spot
  • pytest tests/test_smoke.py --managed-spot --aws
  • sky spot launch -n text-next-region --cpus 2+ --cloud gcp; manually delete the spot cluster and check that it fails over to the next region; manually delete the spot cluster again and check that it fails over back to the first region.

@Michaelvll Michaelvll merged commit ca2a092 into master Aug 3, 2023
@Michaelvll Michaelvll deleted the spot-aggresive-failover branch August 3, 2023 16:12