[RLlib] Enhance node-failure tolerance. #50007

sven1977 · 2025-01-22T12:32:14Z

Enhance node-failure tolerance.

Rename some functions/methods from ...worker... to ...env_runner... for clarity.
Switch the default value of mark_healthy in EnvRunnerGroup.fetch_ready_async_reqs() from True to False, which is crucial for detecting EnvRunner restoration (and triggering the respective callback) inside Algorithm.
Make Learner tolerant of ray.exceptions.OwnerDiedError errors in case an entire node goes down and the object reference to an episode list passed to Learner (or AggregatorActor) can no longer be resolved via ray.get().
Minor cleanups (remove redundant decorators, etc.).

Why are these changes needed?

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

sven1977 · 2025-01-22T12:35:47Z

rllib/env/env_runner_group.py

    def fetch_ready_async_reqs(
        self,
        *,
        timeout_seconds: Optional[float] = 0.0,
        return_obj_refs: bool = False,
-        mark_healthy: bool = True,
+        mark_healthy: bool = False,


Most important change here:
An async result may still be in the object store even though the actor that produced it has since crashed and been restarted. Fetching a valid result from such a restarted actor should NOT, by default, mark it as healthy. Only dedicated health checks, such as EnvRunnerGroup.probe_unhealthy_workers, should set mark_healthy=True.
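To illustrate the reasoning above, here is a minimal, Ray-free sketch. ToyActorManager, its fields, probe_unhealthy(), and run() are hypothetical names invented for this example and are not RLlib's implementation. The only thing it models is why a fetch that marks actors healthy can hide a restart from the dedicated health check.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class ToyActorManager:
    """Toy model of health tracking; hypothetical, not RLlib's actual code."""

    # actor_id -> whether the manager currently considers the actor healthy.
    healthy: Dict[int, bool] = field(default_factory=dict)
    # actor_id -> whether the (possibly restarted) actor answers a ping.
    responsive: Dict[int, bool] = field(default_factory=dict)
    # (actor_id, result) pairs already available in the object store.
    ready: List[Tuple[int, Any]] = field(default_factory=list)

    def fetch_ready_async_reqs(self, *, mark_healthy: bool = False):
        """Return ready results; touch health flags only if asked to."""
        results, self.ready = self.ready, []
        if mark_healthy:
            for actor_id, _ in results:
                self.healthy[actor_id] = True
        return results

    def probe_unhealthy(self) -> List[int]:
        """Dedicated health check: ping unhealthy actors, report restored ones."""
        restored = []
        for actor_id, is_healthy in self.healthy.items():
            if not is_healthy and self.responsive.get(actor_id, False):
                self.healthy[actor_id] = True
                restored.append(actor_id)
        return restored


def run(mark_healthy: bool):
    # Actor 1 crashed and was restarted; an earlier result is still ready.
    mgr = ToyActorManager(
        healthy={1: False}, responsive={1: True}, ready=[(1, "episodes")]
    )
    results = mgr.fetch_ready_async_reqs(mark_healthy=mark_healthy)
    # Whatever the probe reports is what a restoration callback would see.
    return results, mgr.probe_unhealthy()
```

In this toy model, run(False) lets the probe report actor 1 as restored, whereas run(True) flips the flag during the fetch, so the probe finds nothing to report and a restoration callback would never fire.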

simonsays1980 · 2025-01-22T13:00:50Z

rllib/algorithms/utils.py

+        # In this case, try each ref individually and collect only valid results.
+        try:
+            episodes = tree.flatten(ray.get(episode_refs))
+        except ray.exceptions.OwnerDiedError:


Very nice! Didn't know about this error type
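The fallback described in the code comment above ("try each ref individually and collect only valid results") could look roughly like the sketch below. It is an assumption-laden illustration, not the code from this PR: get_valid_episodes, OwnerDiedStandIn, store, and fake_get are hypothetical names, the getter is injected so the example runs without Ray, and a one-level list flatten stands in for tree.flatten. With Ray installed, one would presumably pass ray.get and ray.exceptions.OwnerDiedError instead.

```python
from typing import Any, Callable, List, Sequence, Type


class OwnerDiedStandIn(Exception):
    """Stand-in for ray.exceptions.OwnerDiedError (keeps the sketch Ray-free)."""


def get_valid_episodes(
    episode_refs: Sequence[Any],
    get_fn: Callable[[Any], Any],
    lost_error: Type[Exception] = OwnerDiedStandIn,
) -> List[Any]:
    """Resolve refs to episode lists, skipping refs whose owner is gone."""
    try:
        # Fast path: resolve all refs in one call.
        chunks = get_fn(list(episode_refs))
    except lost_error:
        # Slow path: resolve refs one by one and keep only the valid ones.
        chunks = []
        for ref in episode_refs:
            try:
                chunks.append(get_fn(ref))
            except lost_error:
                continue
    # Each chunk is a list of episodes; flatten one level.
    return [episode for chunk in chunks for episode in chunk]


# Fake object store: ref "b" was owned by a node that went down.
store = {"a": ["e1", "e2"], "b": None, "c": ["e3"]}


def fake_get(ref_or_refs):
    """Mimics a getter that accepts a single ref or a list of refs."""
    if isinstance(ref_or_refs, list):
        return [fake_get(ref) for ref in ref_or_refs]
    value = store[ref_or_refs]
    if value is None:
        raise OwnerDiedStandIn(ref_or_refs)
    return value


episodes = get_valid_episodes(["a", "b", "c"], fake_get)
```

The batched call stays the common path, so the per-ref loop only costs extra round trips after a failure has actually occurred.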

wip

95776f7

Signed-off-by: sven1977 <[email protected]>

sven1977 requested a review from simonsays1980 as a code owner January 22, 2025 12:32

sven1977 assigned simonsays1980 Jan 22, 2025

sven1977 added the rllib (RLlib related issues) and rllib-checkpointing-or-recovery (An issue related to checkpointing/recovering RLlib Trainers.) labels Jan 22, 2025

sven1977 added 3 commits January 22, 2025 13:50

wip

fe4892a

Signed-off-by: sven1977 <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into enhance_spot_failure_support

c729c1f

Merge branch 'master' into enhance_spot_failure_support

785c73e

simonsays1980 approved these changes Jan 22, 2025

sven1977 added 2 commits January 22, 2025 15:44

fix

e588ec3

Signed-off-by: sven1977 <[email protected]>

Merge remote-tracking branch 'origin/enhance_spot_failure_support' into enhance_spot_failure_support

f61931f

sven1977 enabled auto-merge (squash) January 22, 2025 14:45

github-actions bot added the go (add ONLY when ready to merge, run all tests) label Jan 22, 2025

sven1977 merged commit 8d83686 into ray-project:master Jan 22, 2025
6 of 7 checks passed

win5923 pushed a commit to win5923/ray that referenced this pull request Jan 23, 2025

[RLlib] Enhance node-failure tolerance. (ray-project#50007)

9a96b34

simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request Jan 23, 2025

[RLlib] Enhance node-failure tolerance. (ray-project#50007)

aac5037

anson627 pushed a commit to anson627/ray that referenced this pull request Jan 31, 2025

[RLlib] Enhance node-failure tolerance. (ray-project#50007)

8b2b742

Signed-off-by: Anson Qian <[email protected]>

anson627 pushed a commit to anson627/ray that referenced this pull request Jan 31, 2025

[RLlib] Enhance node-failure tolerance. (ray-project#50007)

9d9b87d

Signed-off-by: Anson Qian <[email protected]>

srinathk10 pushed a commit that referenced this pull request Feb 2, 2025

[RLlib] Enhance node-failure tolerance. (#50007)

f9cd897

anyadontfly pushed a commit to anyadontfly/ray that referenced this pull request Feb 13, 2025

[RLlib] Enhance node-failure tolerance. (ray-project#50007)

2a0b14f

Signed-off-by: Puyuan Yao <[email protected]>
