Load model and optimizer states on CPU to avoid OOMs #299

sgugger · 2022-03-29T17:46:24Z

This PR makes sure we load the model and optimizer state dicts on the CPU in the load_state method. Otherwise, the loaded tensors may end up on GPU 0 (since save_state only saves on process 0), which can result in an OOM: at some point the optimizer state may be properly loaded on GPU 0 by process 0 while process 1 also loads a second copy onto GPU 0 during load_state_dict, so poor GPU 0 holds twice that size.
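As a rough illustration of the idea (not the actual diff of this PR), the sketch below deserializes checkpoints with torch.load(..., map_location="cpu") before calling load_state_dict, which then copies the values onto whatever device each process's parameters live on. The helper names save_checkpoint and load_checkpoint and the file names model.bin and optimizer.bin are hypothetical and chosen only for this example.

```python
import os
import tempfile

import torch


def save_checkpoint(model, optimizer, directory):
    """Write both state dicts to `directory` (file names are hypothetical)."""
    torch.save(model.state_dict(), os.path.join(directory, "model.bin"))
    torch.save(optimizer.state_dict(), os.path.join(directory, "optimizer.bin"))


def load_checkpoint(model, optimizer, directory):
    """Deserialize onto CPU first, then copy into the live model and optimizer.

    Without map_location, tensors saved from GPU 0 are restored onto GPU 0
    by every process that calls torch.load.
    """
    model_state = torch.load(
        os.path.join(directory, "model.bin"), map_location="cpu"
    )
    optimizer_state = torch.load(
        os.path.join(directory, "optimizer.bin"), map_location="cpu"
    )
    # load_state_dict copies values to the device of each existing parameter.
    model.load_state_dict(model_state)
    optimizer.load_state_dict(optimizer_state)
    return model_state, optimizer_state
```

With this pattern, the only GPU copy of the state is the one each process creates on its own device; the intermediate tensors stay in host memory and can be freed once load_state_dict returns.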

HuggingFaceDocBuilderDev · 2022-03-29T17:54:06Z

The documentation is not available anymore as the PR was closed or merged.

muellerzr · 2022-03-29T18:31:10Z

LG2M! Nice catch! Looks like we might be able to link #296?

sgugger · 2022-03-29T18:49:38Z

That's not linked per se, since in that issue the user was using optimizer.load_state_dict and not accelerator.load_state.

Load model and optimizer states on CPU to avoid OOMs

8b623d8

sgugger requested a review from muellerzr March 29, 2022 17:46

sgugger merged commit 1e0b96f into main Mar 29, 2022

sgugger deleted the load_opt_cpu branch March 29, 2022 18:49
