[RLlib] Fix train_batch_size_per_learner problems. #49715

Merged

Conversation

@sven1977 sven1977 commented Jan 8, 2025

Fix train_batch_size_per_learner problems.

  • Make config.train_batch_size_per_learner a property (with an underlying self._train_batch_size_per_learner private value holder); a minimal sketch of this pattern follows below.
  • Users can set this property, but - by default - it takes the value of config.train_batch_size / num_learners.

This fixes a couple of gotchas where users don't explicitly set this property in their configs (or think that setting train_batch_size is enough, even though the migration guide explains how to translate this setting) and then run into errors because the default value of this property used to be None.

Note: Users should not notice these changes because they are seamless. They can still configure train_batch_size_per_learner (or not) without any change in RLlib's behavior, other than that it no longer crashes.
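
The following is a minimal, hypothetical sketch of the property pattern described above. It is not RLlib's actual implementation; the class body, defaults, and fallback arithmetic are assumptions based on this description.

```python
class AlgorithmConfig:
    """Stand-in config class, only to illustrate the property pattern (not RLlib's real class)."""

    def __init__(self):
        # Old API stack setting, kept only for backward compatibility.
        self.train_batch_size = 4000
        self.num_learners = 0
        # Private holder; None means the user never set the per-learner value.
        self._train_batch_size_per_learner = None

    @property
    def train_batch_size_per_learner(self):
        # Fall back to the old setting divided by the number of learners
        # (treating 0 learners as 1, i.e. the local learner).
        if self._train_batch_size_per_learner is None:
            return self.train_batch_size // (self.num_learners or 1)
        return self._train_batch_size_per_learner

    @train_batch_size_per_learner.setter
    def train_batch_size_per_learner(self, value):
        self._train_batch_size_per_learner = value

    @property
    def total_train_batch_size(self):
        # Effective overall batch size across all learners.
        return self.train_batch_size_per_learner * (self.num_learners or 1)
```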

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: sven1977 <[email protected]>

@simonsays1980 simonsays1980 left a comment

LGTM. It is not yet clear why we need all three: _train_batch_size, total_train_batch_size, and train_batch_size_per_learner. Also, total_train_batch_size depends on train_batch_size_per_learner, which could be None, as I understood it?

and not self.in_evaluation
and self.total_train_batch_size > 0
):
if self.rollout_fragment_length != "auto" and not self.in_evaluation:

Again, is "auto" now the only configuration possible?

@OldAPIStack: User never touches `train_batch_size_per_learner` or
`num_learners`) -> `train_batch_size`.
"""
return self.train_batch_size_per_learner * (self.num_learners or 1)

Why do we not return the private attribute self.train_batch_size here?

@sven1977 (PR author) replied:

Because self.train_batch_size is old API stack, we should no longer reference it anywhere in the new API stack logic.

@sven1977 sven1977 commented Jan 8, 2025

> LGTM. It is not yet clear why we need all three: _train_batch_size, total_train_batch_size, and train_batch_size_per_learner. Also, total_train_batch_size depends on train_batch_size_per_learner, which could be None, as I understood it?

  • train_batch_size is deprecated (old API stack) and should no longer be used (not by users and not by RLlib internally).
  • train_batch_size_per_learner is the new user-facing setting. However, since many users will still use train_batch_size in their old configs, we make it fall back to that value in case the user left train_batch_size_per_learner=None.
  • total_train_batch_size is a convenience property that lets users quickly check in their code what the effective, overall batch size would be, given train_batch_size_per_learner AND num_learners (see the usage sketch below).

I agree it's a little confusing right now. We need to get rid of train_batch_size on the new API stack next (maybe produce a warning now, then an error).
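
To make the interplay of the three settings concrete, here is a short usage sketch. It reuses the hypothetical AlgorithmConfig stand-in from the snippet in the PR description above, not the real RLlib API surface.

```python
config = AlgorithmConfig()
config.num_learners = 2

# Case 1: the user only sets the old, deprecated setting.
config.train_batch_size = 4000
assert config.train_batch_size_per_learner == 2000  # falls back to 4000 / 2
assert config.total_train_batch_size == 4000

# Case 2: the user sets the new per-learner setting explicitly; it takes precedence.
config.train_batch_size_per_learner = 512
assert config.train_batch_size_per_learner == 512
assert config.total_train_batch_size == 1024  # 512 per learner * 2 learners
```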

@sven1977 sven1977 enabled auto-merge (squash) January 8, 2025 16:22
@github-actions github-actions bot added the go (add ONLY when ready to merge, run all tests) label Jan 8, 2025
@sven1977 sven1977 merged commit afd024c into ray-project:master Jan 9, 2025
6 of 7 checks passed
@sven1977 sven1977 deleted the fix_train_batch_size_per_learner_setting branch January 9, 2025 09:51
dayshah pushed a commit to dayshah/ray that referenced this pull request Jan 10, 2025
HYLcool pushed a commit to HYLcool/ray that referenced this pull request Jan 13, 2025
anyadontfly pushed a commit to anyadontfly/ray that referenced this pull request Feb 13, 2025
Labels
go (add ONLY when ready to merge, run all tests), rllib (RLlib related issues), rllib-newstack