
[RLlib] Issue 18812: Torch multi-GPU stats not protected against race conditions. #18937

Conversation

sven1977
Contributor

@sven1977 sven1977 commented Sep 28, 2021

Issue 18812: Torch multi-GPU stats not protected against race conditions.

This PR:

  • Moves loss stats (which are computed per tower) onto the individual towers (the copies of the model).
  • In the stats_fn (run after all towers have computed their losses), these stats can now be averaged (or min/max'd) across the individual towers.
  • Unifies all stats with the already existing handling of td-errors (which need to remain per batch item).
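The per-tower pattern described above can be sketched as follows. This is a minimal illustration only, not RLlib's actual API: the `Tower` class, the `tower_stats` attribute name, and the `stats_fn` signature are assumptions for the sketch. The point is that each tower writes only to its own stats dict, so concurrent loss computations never race on shared state, and aggregation happens once, afterwards:

```python
class Tower:
    """Stand-in for one model copy on one GPU (hypothetical sketch)."""

    def __init__(self):
        # Each tower keeps its own stats dict, so concurrent loss
        # computations never write to a shared structure.
        self.tower_stats = {}

    def loss(self, batch):
        value = sum(batch) / len(batch)  # dummy per-tower loss
        self.tower_stats["loss"] = value  # stored on this tower only
        return value


def stats_fn(towers):
    # Runs after every tower has computed its loss: aggregate the
    # per-tower values (mean here; min/max work the same way).
    losses = [t.tower_stats["loss"] for t in towers]
    return {"mean_loss": sum(losses) / len(losses)}


towers = [Tower(), Tower()]
for tower, batch in zip(towers, [[1.0, 3.0], [5.0, 7.0]]):
    tower.loss(batch)
print(stats_fn(towers))  # {'mean_loss': 4.0}
```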

Why are these changes needed?

#18812

Related issue number

Closes #18812

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@sven1977
Contributor Author

@mvindiola1 ^^

…e_18812_torch_multi_gpu_stats_race_condition

# Conflicts:
#	rllib/agents/dqn/r2d2_torch_policy.py
#	rllib/agents/sac/rnnsac_torch_policy.py
@@ -279,7 +291,7 @@ def extra_action_out_fn(policy: Policy, input_dict, state_batches, model,
     postprocess_fn=postprocess_nstep_and_prio,
     optimizer_fn=adam_optimizer,
     extra_grad_process_fn=grad_process_and_td_error_fn,
-    extra_learn_fetches_fn=lambda policy: {"td_error": policy._td_error},
+    extra_learn_fetches_fn=concat_multi_gpu_td_errors,
Contributor

It might be better to hardcode the lambda function directly as a function in the R2D2 policy class.

@@ -16,7 +16,7 @@
 from ray.rllib.policy.torch_policy import TorchPolicy
 from ray.rllib.utils.annotations import override
 from ray.rllib.utils.framework import try_import_torch
-from ray.rllib.utils.torch_ops import huber_loss
+from ray.rllib.utils.torch_ops import concat_multi_gpu_td_errors, huber_loss
Contributor

I think it's better not to abstract td_errors away.

…e_18812_torch_multi_gpu_stats_race_condition

# Conflicts:
#	rllib/agents/impala/vtrace_torch_policy.py
#	rllib/policy/tf_policy_template.py
#	rllib/policy/torch_policy.py
@sven1977 sven1977 merged commit b4300dd into ray-project:master Oct 4, 2021
@sven1977 sven1977 deleted the issue_18812_torch_multi_gpu_stats_race_condition branch June 2, 2023 20:16
Successfully merging this pull request may close these issues.

[Bug] [RLLIB] Race condition in stats_fn when using multi-gpu
2 participants