
[RLlib] Issue 18812: Torch multi-GPU stats not protected against race conditions. #18937

Conversation

sven1977
Contributor

@sven1977 sven1977 commented Sep 28, 2021

Issue 18812: Torch multi-GPU stats not protected against race conditions.

This PR:

  • Moves loss stats (which are computed per tower) onto the individual towers (the copies of the model).
  • In the stats_fn (run after all towers have computed their losses), these stats can now be averaged (or min/max'd) across the individual towers.
  • Unifies all stats with the already existing handling of td-errors (which need to remain per batch item).
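The per-tower pattern described above can be sketched as follows. This is a minimal illustration only, not RLlib's actual API: the `Tower` class, the `tower_stats` attribute name, and the `stats_fn` signature are assumptions for the sketch. The point is that each tower writes only to its own stats dict, so concurrent loss computations never race on shared state, and aggregation happens once, afterwards:

```python
class Tower:
    """Stand-in for one model copy on one GPU (hypothetical sketch)."""

    def __init__(self):
        # Each tower keeps its own stats dict, so concurrent loss
        # computations never write to a shared structure.
        self.tower_stats = {}

    def loss(self, batch):
        value = sum(batch) / len(batch)  # dummy per-tower loss
        self.tower_stats["loss"] = value  # stored on this tower only
        return value


def stats_fn(towers):
    # Runs after every tower has computed its loss: aggregate the
    # per-tower values (mean here; min/max work the same way).
    losses = [t.tower_stats["loss"] for t in towers]
    return {"mean_loss": sum(losses) / len(losses)}


towers = [Tower(), Tower()]
for tower, batch in zip(towers, [[1.0, 3.0], [5.0, 7.0]]):
    tower.loss(batch)
print(stats_fn(towers))  # {'mean_loss': 4.0}
```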

Why are these changes needed?

#18812

Related issue number

Closes #18812

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@sven1977
Contributor Author

@mvindiola1 ^^

…e_18812_torch_multi_gpu_stats_race_condition

# Conflicts:
#	rllib/agents/dqn/r2d2_torch_policy.py
#	rllib/agents/sac/rnnsac_torch_policy.py
@@ -279,7 +291,7 @@ def extra_action_out_fn(policy: Policy, input_dict, state_batches, model,
     postprocess_fn=postprocess_nstep_and_prio,
     optimizer_fn=adam_optimizer,
     extra_grad_process_fn=grad_process_and_td_error_fn,
-    extra_learn_fetches_fn=lambda policy: {"td_error": policy._td_error},
+    extra_learn_fetches_fn=concat_multi_gpu_td_errors,
Contributor

It might be better to hardcode the lambda function directly as a function in the R2D2 policy class.

@@ -16,7 +16,7 @@
 from ray.rllib.policy.torch_policy import TorchPolicy
 from ray.rllib.utils.annotations import override
 from ray.rllib.utils.framework import try_import_torch
-from ray.rllib.utils.torch_ops import huber_loss
+from ray.rllib.utils.torch_ops import concat_multi_gpu_td_errors, huber_loss
Contributor

I think it's better not to abstract td_errors away.

…e_18812_torch_multi_gpu_stats_race_condition

# Conflicts:
#	rllib/agents/impala/vtrace_torch_policy.py
#	rllib/policy/tf_policy_template.py
#	rllib/policy/torch_policy.py
@sven1977 sven1977 merged commit b4300dd into ray-project:master Oct 4, 2021
@sven1977 sven1977 deleted the issue_18812_torch_multi_gpu_stats_race_condition branch June 2, 2023 20:16
Successfully merging this pull request may close these issues.

[Bug] [RLLIB] Race condition in stats_fn when using multi-gpu
2 participants