[RLlib] Enhance node-failure tolerance. #50007

Merged: 6 commits, Jan 22, 2025

@sven1977 sven1977 commented Jan 22, 2025

Enhance node-failure tolerance.

  • Rename some functions/methods from ...worker... to ...env_runner... for clarity.
  • Switch the default of mark_healthy in EnvRunnerGroup.fetch_ready_async_reqs() from True to False, which is crucial for EnvRunner restoration detection (and the respective callback) inside Algorithm.
  • Make Learner tolerant of ray.exceptions.OwnerDiedError in case an entire node goes down and the object reference to an episode list passed to Learner (or AggregatorActor) cannot be ray.get()'d.
  • Minor cleanups (remove redundant decorators, etc.).

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: sven1977 <[email protected]>
@sven1977 sven1977 added rllib RLlib related issues rllib-checkpointing-or-recovery An issue related to checkpointing/recovering RLlib Trainers. labels Jan 22, 2025
  def fetch_ready_async_reqs(
      self,
      *,
      timeout_seconds: Optional[float] = 0.0,
      return_obj_refs: bool = False,
-     mark_healthy: bool = True,
+     mark_healthy: bool = False,
Contributor Author

Most important change here:
An async result may still be in the object store even though the actor that produced it has since crashed and been restarted. Fetching a valid result from such a restarted actor should NOT mark it as healthy by default. Only dedicated health checks, such as EnvRunnerGroup.probe_unhealthy_workers, should set mark_healthy=True.
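
To illustrate why the default matters, here is a minimal toy sketch. It is not RLlib's implementation: ToyActorManager and all of its methods are hypothetical names invented for this example. It only models the idea that a leftover result from a restarted actor should not flip the health flag, so that a later dedicated probe can still detect the restoration.

```python
from typing import Dict, List


class ToyActorManager:
    """Hypothetical stand-in for an actor manager (NOT RLlib's actual code)."""

    def __init__(self, actor_ids: List[int]) -> None:
        self.healthy: Dict[int, bool] = {a: True for a in actor_ids}
        self._pending: Dict[int, List[str]] = {a: [] for a in actor_ids}

    def submit(self, actor_id: int, result: str) -> None:
        # Simulates an async result that already sits in the object store.
        self._pending[actor_id].append(result)

    def crash_and_restart(self, actor_id: int) -> None:
        # The restarted actor counts as unhealthy until explicitly probed.
        # Results it produced before the crash remain fetchable.
        self.healthy[actor_id] = False

    def fetch_ready(self, *, mark_healthy: bool = False) -> List[str]:
        out: List[str] = []
        for actor_id, results in self._pending.items():
            if results and mark_healthy:
                self.healthy[actor_id] = True
            out.extend(results)
            results.clear()
        return out

    def probe_unhealthy(self) -> List[int]:
        # Dedicated health check: the only place that should restore actors.
        restored = [a for a, ok in self.healthy.items() if not ok]
        for a in restored:
            self.healthy[a] = True
        return restored


# Default (mark_healthy=False): the probe still sees the restored actor.
mgr = ToyActorManager([1, 2])
mgr.submit(1, "episodes-from-1")
mgr.crash_and_restart(1)
fetched = mgr.fetch_ready()
restored = mgr.probe_unhealthy()

# Old default (mark_healthy=True): the fetch hides the restoration.
mgr_old = ToyActorManager([1])
mgr_old.submit(1, "episodes-from-1")
mgr_old.crash_and_restart(1)
mgr_old.fetch_ready(mark_healthy=True)
missed = mgr_old.probe_unhealthy()
```

In the second scenario the probe returns nothing, so any restoration callback keyed on the probe's result would never fire for that actor.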

# In this case, try each ref individually and collect only valid results.
try:
    episodes = tree.flatten(ray.get(episode_refs))
except ray.exceptions.OwnerDiedError:
Collaborator

Very nice! Didn't know about this error type
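
For readers unfamiliar with the pattern, the fallback described in the code comment above ("try each ref individually and collect only valid results") might look roughly like the sketch below. This is an assumption-laden illustration, not the code merged in this PR: FakeOwnerDiedError stands in for ray.exceptions.OwnerDiedError, and fake_get stands in for ray.get, so that the example runs without a Ray cluster. The helper name get_valid_results is hypothetical.

```python
from typing import Any, Callable, List, Sequence, Type


class FakeOwnerDiedError(Exception):
    """Stand-in for ray.exceptions.OwnerDiedError (assumption for this sketch)."""


def get_valid_results(
    refs: Sequence[Any],
    get_fn: Callable[[Any], Any],
    lost_error: Type[Exception] = FakeOwnerDiedError,
) -> List[Any]:
    """Fetch all refs at once; on failure, fetch one by one and skip lost ones."""
    try:
        # Fast path: a single bulk fetch.
        return list(get_fn(list(refs)))
    except lost_error:
        pass
    # Slow path: keep whatever can still be retrieved.
    results = []
    for ref in refs:
        try:
            results.append(get_fn(ref))
        except lost_error:
            continue
    return results


# Toy object store: "ref_b" belonged to a node that went down.
_STORE = {"ref_a": ["ep0", "ep1"], "ref_c": ["ep2"]}


def fake_get(ref_or_refs: Any) -> Any:
    # Accepts a single ref or a list of refs, like ray.get does.
    if isinstance(ref_or_refs, list):
        return [fake_get(r) for r in ref_or_refs]
    if ref_or_refs not in _STORE:
        raise FakeOwnerDiedError(ref_or_refs)
    return _STORE[ref_or_refs]


batches = get_valid_results(["ref_a", "ref_b", "ref_c"], fake_get)
# Flatten the surviving per-ref episode lists into one list.
episodes = [ep for batch in batches for ep in batch]
```

With real Ray, one would presumably pass ray.get as get_fn and ray.exceptions.OwnerDiedError as lost_error; the trade-off is that the learner trains on a smaller batch rather than crashing.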

@sven1977 sven1977 enabled auto-merge (squash) January 22, 2025 14:45
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Jan 22, 2025
@sven1977 sven1977 merged commit 8d83686 into ray-project:master Jan 22, 2025
6 of 7 checks passed
win5923 pushed a commit to win5923/ray that referenced this pull request Jan 23, 2025
simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request Jan 23, 2025
anson627 pushed a commit to anson627/ray that referenced this pull request Jan 31, 2025
anson627 pushed a commit to anson627/ray that referenced this pull request Jan 31, 2025
anyadontfly pushed a commit to anyadontfly/ray that referenced this pull request Feb 13, 2025