Load model and optimizer states on CPU to avoid OOMs #299

Merged
sgugger merged 1 commit into main from load_opt_cpu on Mar 29, 2022

sgugger
Collaborator

@sgugger sgugger commented Mar 29, 2022

This PR makes sure we load the model and optimizer state dicts (in the load_state method) on the CPU. Otherwise, the loaded tensors may land on GPU 0 (since save_state only saves from process 0), which can result in an OOM: at some point the optimizer state may be properly loaded on GPU 0 by process 0 but also loaded a second time on GPU 0 by process 1 during load_state_dict, so twice that size on poor GPU 0.
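For readers who want to see the idea in isolation, here is a minimal sketch in plain PyTorch, not the actual accelerate implementation. The helper names save_checkpoint and load_checkpoint and the file names model.bin and optimizer.bin are hypothetical, chosen only for illustration. The key piece is map_location="cpu" in torch.load, which keeps the deserialized tensors off the GPU they were saved from; load_state_dict then copies values into each process's own parameters and optimizer state.

```python
import os
import tempfile

import torch
from torch import nn


def save_checkpoint(model, optimizer, directory):
    # Hypothetical helper: write both state dicts to disk.
    torch.save(model.state_dict(), os.path.join(directory, "model.bin"))
    torch.save(optimizer.state_dict(), os.path.join(directory, "optimizer.bin"))


def load_checkpoint(model, optimizer, directory):
    # map_location="cpu" deserializes every tensor on the CPU, whatever
    # device it lived on when it was saved.
    model_state = torch.load(os.path.join(directory, "model.bin"), map_location="cpu")
    optimizer_state = torch.load(os.path.join(directory, "optimizer.bin"), map_location="cpu")
    # load_state_dict copies the values into the existing parameters and
    # optimizer state, which live on this process's own device.
    model.load_state_dict(model_state)
    optimizer.load_state_dict(optimizer_state)
    return model_state, optimizer_state


# Small demo: take one optimizer step so Adam has state worth saving.
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
model(torch.randn(8, 4)).sum().backward()
optimizer.step()

with tempfile.TemporaryDirectory() as tmp:
    save_checkpoint(model, optimizer, tmp)
    restored = nn.Linear(4, 2)
    restored_opt = torch.optim.Adam(restored.parameters(), lr=1e-3)
    loaded_model_state, _ = load_checkpoint(restored, restored_opt, tmp)
    # Every loaded tensor is on the CPU.
    print({name: t.device.type for name, t in loaded_model_state.items()})
```

Without map_location, torch.load restores each tensor to the device recorded in the file, which is how every process can end up allocating a copy on GPU 0.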

@sgugger sgugger requested a review from muellerzr March 29, 2022 17:46
@muellerzr
Collaborator

LG2M! Nice catch! Looks like we might be able to link #296?

@sgugger
Collaborator Author

sgugger commented Mar 29, 2022

That's not linked per se, since in that issue the user was using optimizer.load_state_dict and not accelerator.load_state.

@sgugger sgugger merged commit 1e0b96f into main Mar 29, 2022
@sgugger sgugger deleted the load_opt_cpu branch March 29, 2022 18:49