[RLlib] Fix train_batch_size_per_learner problems. #49715

Merged

Conversation

@sven1977 sven1977 commented Jan 8, 2025

Fix train_batch_size_per_learner problems.

  • Make config.train_batch_size_per_learner a property (with an underlying self._train_batch_size_per_learner private value holder); a minimal sketch of this pattern follows below.
  • Users can set this property, but - by default - it takes the value of config.train_batch_size / num_learners.

This fixes a couple of gotchas where users don't explicitly set this property in their configs (or think that setting train_batch_size is enough, even though the migration guide explains how to translate this setting) and then run into errors because the default value of this property used to be None.

Note: Users should not notice these changes because they are seamless. They can still configure train_batch_size_per_learner (or not) without any change in RLlib's behavior, other than that it no longer crashes.
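
The following is a minimal, hypothetical sketch of the property pattern described above. It is not RLlib's actual implementation; the class body, defaults, and fallback arithmetic are assumptions based on this description.

```python
class AlgorithmConfig:
    """Stand-in config class, only to illustrate the property pattern (not RLlib's real class)."""

    def __init__(self):
        # Old API stack setting, kept only for backward compatibility.
        self.train_batch_size = 4000
        self.num_learners = 0
        # Private holder; None means the user never set the per-learner value.
        self._train_batch_size_per_learner = None

    @property
    def train_batch_size_per_learner(self):
        # Fall back to the old setting divided by the number of learners
        # (treating 0 learners as 1, i.e. the local learner).
        if self._train_batch_size_per_learner is None:
            return self.train_batch_size // (self.num_learners or 1)
        return self._train_batch_size_per_learner

    @train_batch_size_per_learner.setter
    def train_batch_size_per_learner(self, value):
        self._train_batch_size_per_learner = value

    @property
    def total_train_batch_size(self):
        # Effective overall batch size across all learners.
        return self.train_batch_size_per_learner * (self.num_learners or 1)
```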

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: sven1977 <[email protected]>

@simonsays1980 simonsays1980 left a comment

LGTM. It is not yet clear why we need all three: _train_batch_size, total_train_batch_size, and train_batch_size_per_learner. Also, total_train_batch_size depends on train_batch_size_per_learner, which could be None, as I understood it?

and not self.in_evaluation
and self.total_train_batch_size > 0
):
if self.rollout_fragment_length != "auto" and not self.in_evaluation:

Again, is "auto" now the only configuration possible?

@OldAPIStack: User never touches `train_batch_size_per_learner` or
`num_learners`) -> `train_batch_size`.
"""
return self.train_batch_size_per_learner * (self.num_learners or 1)

Why do we not return the private attribute self.train_batch_size here?

@sven1977 (PR author) replied:

Because self.train_batch_size is old API stack, we should no longer reference it anywhere in the new API stack logic.

@sven1977 sven1977 commented Jan 8, 2025

> LGTM. It is not yet clear why we need all three: _train_batch_size, total_train_batch_size, and train_batch_size_per_learner. Also, total_train_batch_size depends on train_batch_size_per_learner, which could be None, as I understood it?

  • train_batch_size is deprecated (old API stack) and should no longer be used (not by users and not by RLlib internally).
  • train_batch_size_per_learner is the new user-facing setting. However, since many users will still use train_batch_size in their old configs, we make it fall back to that value in case the user left train_batch_size_per_learner=None.
  • total_train_batch_size is a convenience property that lets users quickly check in their code what the effective, overall batch size would be, given train_batch_size_per_learner AND num_learners (see the usage sketch below).

I agree it's a little confusing right now. We need to get rid of train_batch_size on the new API stack next (maybe produce a warning now, then an error).
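
To make the interplay of the three settings concrete, here is a short usage sketch. It reuses the hypothetical AlgorithmConfig stand-in from the snippet in the PR description above, not the real RLlib API surface.

```python
config = AlgorithmConfig()
config.num_learners = 2

# Case 1: the user only sets the old, deprecated setting.
config.train_batch_size = 4000
assert config.train_batch_size_per_learner == 2000  # falls back to 4000 / 2
assert config.total_train_batch_size == 4000

# Case 2: the user sets the new per-learner setting explicitly; it takes precedence.
config.train_batch_size_per_learner = 512
assert config.train_batch_size_per_learner == 512
assert config.total_train_batch_size == 1024  # 512 per learner * 2 learners
```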

@sven1977 sven1977 enabled auto-merge (squash) January 8, 2025 16:22
@github-actions github-actions bot added the go (add ONLY when ready to merge, run all tests) label Jan 8, 2025
@sven1977 sven1977 merged commit afd024c into ray-project:master Jan 9, 2025
6 of 7 checks passed
@sven1977 sven1977 deleted the fix_train_batch_size_per_learner_setting branch January 9, 2025 09:51
dayshah pushed a commit to dayshah/ray that referenced this pull request Jan 10, 2025
HYLcool pushed a commit to HYLcool/ray that referenced this pull request Jan 13, 2025
anyadontfly pushed a commit to anyadontfly/ray that referenced this pull request Feb 13, 2025
Labels
go (add ONLY when ready to merge, run all tests), rllib (RLlib related issues), rllib-newstack