-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RLlib] Enhance node-failure tolerance. #50007
[RLlib] Enhance node-failure tolerance. #50007
Conversation
Signed-off-by: sven1977 <[email protected]>
def fetch_ready_async_reqs( | ||
self, | ||
*, | ||
timeout_seconds: Optional[float] = 0.0, | ||
return_obj_refs: bool = False, | ||
mark_healthy: bool = True, | ||
mark_healthy: bool = False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most important change here:
It's possible that an async result is still in the object store as the actor has crashed and been restarted. Fetching a valid result from such a restarted actor should NOT by default mark it as healthy. Only dedicated health checks, such as EnvRunnerGroup.probe_unhealthy_workers
should set mark_healthy=True
.
Signed-off-by: sven1977 <[email protected]>
…nce_spot_failure_support
# In this case, try each ref individually and collect only valid results. | ||
try: | ||
episodes = tree.flatten(ray.get(episode_refs)) | ||
except ray.exceptions.OwnerDiedError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice! Didn't know about this error type
Signed-off-by: sven1977 <[email protected]>
…to enhance_spot_failure_support
Signed-off-by: Anson Qian <[email protected]>
Signed-off-by: Anson Qian <[email protected]>
Signed-off-by: Puyuan Yao <[email protected]>
Enhance node-failure tolerance.
...worker...
to..env_runner..
for clarification.EnvRunnerGroup.fetch_ready_async_reqs(mark_healthy=)
fromTrue
toFalse
, which is crucial for EnvRunner restoration detection (and the respective callback) insideAlgorithm
.Learner
tolerant againstray.exceptions.OwnerNotFound
errors in case an entire node goes down and the object reference to an episode list passed to Learner (or AggregatorActor) cannot beray.get()
'd.Why are these changes needed?
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.