[BUG] Offload memory usage not performing as expected #1437
Comments
@murnanedaniel, thanks for the report. Can you please add DeepSpeed's memory usage profiler to your model and share the resulting log? I am particularly interested in the memory usage before the forward, backward, and step calls. Thanks.
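For reference, a minimal sketch of where such profiling calls might be placed, assuming the profiler in question is `deepspeed.runtime.utils.see_memory_usage`; the loop, `model_engine`, `criterion`, and `train_loader` are placeholders rather than the reporter's actual code:

```python
from deepspeed.runtime.utils import see_memory_usage

# Placeholder loop: model_engine, criterion, and train_loader stand in for
# the user's own DeepSpeed engine, loss function, and data loader.
for step, (batch, labels) in enumerate(train_loader):
    see_memory_usage(f"step {step}: before forward", force=True)
    loss = criterion(model_engine(batch), labels)

    see_memory_usage(f"step {step}: before backward", force=True)
    model_engine.backward(loss)

    see_memory_usage(f"step {step}: before step", force=True)
    model_engine.step()
```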
Thanks for the quick reply @tjruwase! Here are 3 batches with no offloading:
Here are 3 batches with optimizer offloading:
And here are 3 batches with optimizer + parameter offloading:
I'm not familiar with the abbreviations. What stands out to you from these metrics?
@murnanedaniel, thanks for sharing these logs. Let me quickly answer the metrics question. Below is the mapping of abbreviations to the CUDA memory management functions defined here.
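Assuming the abbreviations are the MA/Max_MA/CA/Max_CA ones printed by `see_memory_usage`, the mapping is roughly the following (values are reported in GB in the logs):

```python
import torch

# Approximate mapping of the log abbreviations to torch.cuda queries.
MA     = torch.cuda.memory_allocated()      # memory currently held by live tensors
Max_MA = torch.cuda.max_memory_allocated()  # peak allocated since the last reset
CA     = torch.cuda.memory_reserved()       # memory reserved by the caching allocator
Max_CA = torch.cuda.max_memory_reserved()   # peak reserved since the last reset
```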
Above is a summary from the memory usage logs, showing the memory utilization (in GB) before the first forward, backward, and optimizer calls for the different offloading settings. How did you obtain the 1.9 GB and 4.1 GB memory usage figures that you attributed to no-offload and offload, respectively?
One more point: in general, ZeRO is not advisable for models or batch sizes that can fit without it. In such cases, ZeRO can reduce throughput and bloat memory usage.
Thanks for summarizing these values. I agree that this case of a small model and a small graph (i.e. a single-edged graph) is not really what ZeRO is built for, but when scaling up to a larger model and a larger graph I see similar behaviour. I am using the
Got it. Can you please share the memory usage logs for the OOM case? Please share the corresponding log for non-offloaded as well. Thanks. |
@murnanedaniel, are you still interested in further debugging this issue? Thanks!
Closing for lack of activity. Please re-open as needed.
Does this mean it's a PyTorch issue, in that it is not letting go of the inactive memory? Is there a way to keep PyTorch from reserving so much memory, or to free some of it?
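For what it's worth, a minimal sketch of the standard PyTorch knobs for the allocator's cached (reserved but inactive) memory; note that `empty_cache` only returns unused cached blocks to the driver and does not shrink memory held by live tensors:

```python
import torch

# Release cached blocks that no live tensor is using back to the driver.
torch.cuda.empty_cache()

# Compare what is actually allocated vs. merely reserved by the caching allocator.
print(f"allocated: {torch.cuda.memory_allocated() / 2**30:.2f} GB")
print(f"reserved:  {torch.cuda.memory_reserved() / 2**30:.2f} GB")
```

Fragmentation of the reserved pool can also be limited via the `PYTORCH_CUDA_ALLOC_CONF` environment variable (e.g. `max_split_size_mb`), although that does not change what DeepSpeed itself allocates.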
I'm working with graph neural networks for particle physics, where we often have large graphs whose gradients cannot all fit on the GPU simultaneously. I'm hoping to use stage 3 offloading to move parameters/gradients off the GPU and train on these graphs. However, after also trying this with FairScale, I'm settling for a simpler toy model to understand the memory behaviour. Here is the toy:
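A minimal sketch of a toy along these lines, with made-up input width, hidden width, and depth:

```python
import torch.nn as nn

# Illustrative only: a wide fully connected binary classifier (all sizes are
# made-up numbers, not the reporter's actual values).
def make_toy_model(in_features=1024, hidden=4096, depth=8):
    layers = [nn.Linear(in_features, hidden), nn.ReLU()]
    for _ in range(depth):
        layers += [nn.Linear(hidden, hidden), nn.ReLU()]
    layers.append(nn.Linear(hidden, 1))  # single logit for binary classification
    return nn.Sequential(*layers)
```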
To Reproduce
Obviously this is not a GNN in any sense; it is just a binary classifier sequential model. I train this with:
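A minimal sketch of the training step, assuming the `make_toy_model` helper above, a hypothetical `ds_config.json` holding the ZeRO-3 offload settings, and random tensors in place of real data:

```python
import torch
import deepspeed

model = make_toy_model()
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",  # hypothetical path to the ZeRO-3 offload config
)

criterion = torch.nn.BCEWithLogitsLoss()
for step in range(3):
    x = torch.randn(64, 1024, device=model_engine.device)
    y = torch.randint(0, 2, (64, 1), device=model_engine.device).float()
    loss = criterion(model_engine(x), y)
    model_engine.backward(loss)   # backward through the DeepSpeed engine
    model_engine.step()           # optimizer step through the engine
```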
Expected behavior
I would expect this large model to consume a lot of memory with no offloading, but to be sharded down to a much smaller memory footprint with stage 3 offloading. In fact, the opposite happens: with no offloading the model requires around 1.9 GB, and with offloading it requires 4.1 GB. Setting aside other memory-saving techniques (mixed precision, activation checkpointing), I'm hoping to understand why offloading by itself is not delivering a smaller memory footprint.
Am I missing something obvious here? Do I have the wrong idea of how ZeRO-3 offload is meant to work? If I can't get this toy to use less memory, I don't see how the more complicated GNN architecture would benefit.
ds_report output
System info (please complete the following information):
1 GPU - V100
FYI: Config file
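For reference, a minimal sketch of what a ZeRO stage-3 CPU-offload configuration of this kind usually contains (values are made up and the reporter's actual file may differ), written as the dict form accepted by `deepspeed.initialize`:

```python
ds_config = {
    "train_batch_size": 64,  # made-up value
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}
```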