Fix jobs longer than 12 days
Michaelvll committed Apr 22, 2024
1 parent 1bdcd01 commit 31dc1c8
Showing 1 changed file with 7 additions and 1 deletion.
8 changes: 7 additions & 1 deletion sky/backends/cloud_vm_ray_backend.py
@@ -268,7 +268,13 @@ def get_or_fail(futures, pg) -> List[int]:
"""Wait for tasks, if any fails, cancel all unready."""
returncodes = [1] * len(futures)
# Wait for 1 task to be ready.
- ready, unready = ray.wait(futures)
+ ready = []
+ # Call ray.wait again if ready is empty. ray.wait with timeout=None
+ # only waits for 10**6 seconds (about 11.6 days), so for a task that
+ # runs longer than that it returns before any future is ready.
+ # Reference: https://github.com/ray-project/ray/blob/ray-2.9.3/python/ray/_private/worker.py#L2845-L2846
+ while not ready:
+     ready, unready = ray.wait(futures)
idx = futures.index(ready[0])
returncodes[idx] = ray.get(ready[0])
while unready:
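
The retry loop in this change can be illustrated without a Ray cluster. The sketch below is not Ray code: `capped_wait`, `wait_for_first`, `slow_task`, and `CAP_SECONDS` are hypothetical names, and `concurrent.futures.wait` with a short timeout stands in for a wait call that silently gives up after an internal cap. It only shows why a single call can return an empty ready list and how re-calling until the list is non-empty avoids indexing `ready[0]` too early.

```python
import concurrent.futures as cf
import time

# Hypothetical stand-in for the 10**6-second internal cap described above,
# shrunk to a fraction of a second so the effect is visible immediately.
CAP_SECONDS = 0.05


def capped_wait(futures):
    """Wait for one future, but give up silently after CAP_SECONDS."""
    done, not_done = cf.wait(
        futures, timeout=CAP_SECONDS, return_when=cf.FIRST_COMPLETED
    )
    return list(done), list(not_done)


def wait_for_first(futures):
    """Call capped_wait repeatedly until at least one future is ready."""
    ready = []
    attempts = 0
    while not ready:
        ready, unready = capped_wait(futures)
        attempts += 1
    return ready, unready, attempts


def slow_task(seconds):
    """Simulate a long-running job that exits with return code 0."""
    time.sleep(seconds)
    return 0


with cf.ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(slow_task, 0.5), pool.submit(slow_task, 1.0)]
    ready, unready, attempts = wait_for_first(futures)
    first_returncode = ready[0].result()
    print(f"ready after {attempts} wait calls, returncode={first_returncode}")
```

With a single `capped_wait` call instead of the loop, `ready` would usually be empty here and `ready[0]` would raise `IndexError`, which mirrors the failure mode for jobs that outlive the wait cap.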