Feature/memory error sampling #280

belerico · 2024-05-05T17:50:47Z

Summary

This PR fixes several problems with resuming from a checkpoint. In particular:

When resuming from a checkpoint, algo.learning_starts was read from the old config. If the buffer had not been saved in the checkpoint, then at the first training iteration the Ratio class, in order to catch up with the replay ratio after the algo.learning_starts steps, would output a huge number of per_rank_gradient_steps to run in that single iteration, which could lead to an OOM error or very slow training.
The user must now specify algo.learning_starts even when resuming from a checkpoint. Set algo.learning_starts=0 to disable the buffer pre-fill.
The algo.learning_starts steps are excluded from the replay-ratio computation, including when resuming from a checkpoint.
Updated the checkpoint how-to.
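
To make the first point concrete, here is a minimal sketch of the arithmetic behind the spike. It is not the sheeprl Ratio class: the helper `gradient_steps_due`, the variable names, and all the numbers are hypothetical and chosen only for illustration. It assumes that the number of gradient steps due is the replay ratio multiplied by the policy steps not yet accounted for.

```python
def gradient_steps_due(replay_ratio: float, policy_steps: int, accounted_steps: int) -> int:
    """Gradient steps needed to catch up with the replay ratio (hypothetical helper)."""
    return int((policy_steps - accounted_steps) * replay_ratio)


# Illustrative numbers only, not taken from the PR.
replay_ratio = 0.5
learning_starts = 65_536      # pre-fill steps read from the old config
resumed_at = 1_000_000        # policy steps stored in the checkpoint
policy_steps_per_update = 4   # hypothetical number of policy steps per iteration

# Behaviour described as the problem: the pre-fill steps count toward the ratio,
# so the first training iteration must catch up on all of them at once.
first_update_old = gradient_steps_due(
    replay_ratio, resumed_at + learning_starts, resumed_at
)

# Behaviour described as the fix: the pre-fill steps are excluded,
# so only the steps collected after the pre-fill are counted.
first_update_new = gradient_steps_due(
    replay_ratio,
    resumed_at + learning_starts + policy_steps_per_update,
    resumed_at + learning_starts,
)

print(first_update_old, first_update_new)
```

Under these assumed numbers the first iteration after resuming goes from tens of thousands of gradient steps to a handful, which is the difference between the failure mode and the intended behaviour described above.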

Type of Change

Please select the one relevant option below:

Bug fix (non-breaking change that solves an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Other (please describe):

Checklist

Please confirm that the following tasks have been completed:

I have tested my changes locally and they work as expected. (Please describe the tests you performed.)
I have added unit tests for my changes, or updated existing tests if necessary.
I have updated the documentation, if applicable.
I have installed pre-commit and run locally for my code changes.

belerico added 4 commits May 5, 2024 19:29

Always pre-fill the buffer for algo.learning_starts steps

8213b3b

Update howto

5801507

Fix prefill_steps

6c37154

Check for algo.learning_starts to be greater than 0

8d3b258

belerico requested review from DavideTr8, michele-milesi and rcmalli as code owners May 5, 2024 17:50

This was linked to issues May 5, 2024

dreamer v3 resuming problem #273

Closed

Dreamer v3 resuming #277

Closed

belerico added 6 commits May 5, 2024 19:52

Merge branch 'main' of https://github.com/Eclectic-Sheep/sheeprl into…

9b521dc

… feature/memory-error-sampling

Fix check algo.learning_starts

56d25f2

Fix replay-ratio steps to account once for policy_steps_per_update

51d6771

Do not compute policy_steps_per_update every time

f09519a

Fix typo

f1fa8c0

ratio_steps account every time for the policy_steps_per_update

ef7ef30

michele-milesi reviewed May 9, 2024

howto/logs_and_checkpoints.md (outdated, resolved)

howto/logs_and_checkpoints.md (outdated, resolved)

howto/logs_and_checkpoints.md (outdated, resolved)

sheeprl/configs/buffer/default.yaml (outdated, resolved)

belerico added 2 commits May 9, 2024 08:48

Merge branch 'main' of https://github.com/Eclectic-Sheep/sheeprl into…

9e4cce2

… feature/memory-error-sampling

Buffer is checkpointed by default

5ec76bc

michele-milesi approved these changes May 9, 2024

belerico merged commit 304e931 into main May 9, 2024
12 checks passed

belerico deleted the feature/memory-error-sampling branch May 9, 2024 07:14
