PyTorchFileRecorder memory usage #1270
Comments
yurzhang has reported the following on the Discord channel:
I made a minimal reproducible example; I hope it helps.
Thank you. I will check it out.
Based on the information provided by @Nikaidou-Shinku on Discord and the MRE, I found that I had a pattern that did not cover all layers. My fix for the pattern chain in #1269 did not reuse the new name, so some names would still not match, and I'd run into the OOM issue we described; I'll open another PR for that (a sketch of the fix is below). I'm still running into an issue, though it seems to be related to something else.
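A minimal sketch of that pattern-chain fix, using the regex crate; the rule list and function name are illustrative, not the actual burn-import internals. The point is that each rule must run against the output of the previous rule, not the original key:

```rust
use regex::Regex;

/// Apply a chain of (pattern, replacement) rules to a tensor key.
/// The bug: each rule was matched against the *original* name, so a
/// rename made by an earlier rule was lost and later rules missed.
fn remap_key(original: &str, rules: &[(Regex, &str)]) -> String {
    let mut name = original.to_string();
    for (pattern, replacement) in rules {
        // Reuse `name` (the possibly already-renamed key), not `original`.
        name = pattern.replace_all(&name, *replacement).into_owned();
    }
    name
}

fn main() {
    let rules = vec![
        (Regex::new(r"^model\.").unwrap(), ""),
        // This rule only matches after the first one has stripped the prefix.
        (Regex::new(r"^conv(\d+)").unwrap(), "conv2d_$1"),
    ];
    assert_eq!(remap_key("model.conv1.weight", &rules), "conv2d_1.weight");
}
```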
I will be investigating the root cause, but now that @Nikaidou-Shinku has pointed to another problem (wrong keys), I suspect the NestedValue creation blows up because of a loop or something similar during unflattening. I will try to fix both issues: the memory doubling and the recursion.
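For context, unflattening turns dot-separated tensor names such as `conv1.weight` into a nested structure. Below is a rough, self-contained sketch of the idea, not the actual NestedValue code; the conflict check marks the spot where a key mismatch can be reported instead of letting the structure grow without bound:

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for the recorder's nested value type.
#[derive(Debug)]
enum Nested {
    Map(BTreeMap<String, Nested>),
    Leaf(String), // tensor payload elided; the name is enough here
}

/// Insert one dot-separated key, e.g. "conv1.weight", into the tree.
fn insert(node: &mut BTreeMap<String, Nested>, path: &[&str], value: String) {
    match path {
        [] => {}
        [leaf] => {
            node.insert((*leaf).to_string(), Nested::Leaf(value));
        }
        [head, rest @ ..] => {
            let child = node
                .entry((*head).to_string())
                .or_insert_with(|| Nested::Map(BTreeMap::new()));
            match child {
                Nested::Map(map) => insert(map, rest, value),
                // A key that is both a leaf and a prefix of another key is a
                // mismatch; report it instead of recursing further.
                Nested::Leaf(_) => panic!("conflicting key at '{head}'"),
            }
        }
    }
}

fn main() {
    let mut root = BTreeMap::new();
    for key in ["conv1.weight", "conv1.bias"] {
        let parts: Vec<&str> = key.split('.').collect();
        insert(&mut root, &parts, key.to_string());
    }
    println!("{root:?}");
}
```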
I have figured out the root cause of the issue: a struct field in the model was not present in the .pt file, which coincided with incorrectly renamed fields. Thanks to @Nikaidou-Shinku for catching it. PR #1286 fixes it; please see the changes section there for how I fixed it.
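The actual change is in PR #1286; as a rough illustration of the kind of guard involved, checking the target struct's expected fields against the keys actually loaded from the file turns a silent blow-up into a reportable error. The helper below is made up for this sketch:

```rust
use std::collections::HashSet;

/// Before materializing the record, verify that every field the target
/// struct expects is present in the loaded checkpoint. Illustrative only;
/// the real fix lives in PR #1286.
fn check_fields(expected: &[&str], loaded: &HashSet<String>) -> Result<(), String> {
    let missing: Vec<&str> = expected
        .iter()
        .copied()
        .filter(|name| !loaded.contains(*name))
        .collect();
    if missing.is_empty() {
        Ok(())
    } else {
        Err(format!("fields missing from the .pt file: {missing:?}"))
    }
}
```

Failing fast like this is what turns the OOM into the caught-and-reported error asked for under "Expected behavior" below.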
Fixed with PR #1286 |
Updated issue description:
Describe the bug
When importing weights from a saved PyTorch model, incorrectly mapped keys lead to steadily increasing memory usage until the process is killed.
To Reproduce
Check out this simple MRE: https://github.com/Nikaidou-Shinku/burn-1270-mre
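For orientation, loading PyTorch weights with a key remap looks roughly like the sketch below. `LoadArgs::with_key_remap` comes from burn-import's documented PyTorch support, but the record type, backend, file path, and the exact `load` signature are assumptions that vary across Burn versions; the MRE above has the authoritative code. A remap pattern that fails to match every layer name is the trigger for the behavior described here:

```rust
use burn::record::{FullPrecisionSettings, Recorder};
use burn_import::pytorch::{LoadArgs, PyTorchFileRecorder};

let device = Default::default();

// Rename PyTorch's `conv.*` tensors to the Burn module's `conv2d.*` fields.
// A pattern that misses some layer names is what triggers this issue.
let load_args = LoadArgs::new("model.pt".into())
    .with_key_remap("conv\\.(.*)", "conv2d.$1");

// `NetRecord<MyBackend>` is a placeholder for the MRE's record type.
let record: NetRecord<MyBackend> = PyTorchFileRecorder::<FullPrecisionSettings>::default()
    .load(load_args, &device)
    .expect("should load PyTorch weights");
```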
Expected behavior
The user experience could be improved: an incorrect key mapping should not lead to OOM; instead, the error should be caught and reported.