[BUG] Offload memory usage not performing as expected #1437
Comments
@murnanedaniel, thanks for the report. Can you please add DeepSpeed's memory usage profiler to your model and share the resulting log? I am particularly interested in the memory usage before the forward, backward, and step calls. Thanks.
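For reference, a minimal sketch of where such profiling calls might be placed, assuming the profiler in question is `deepspeed.runtime.utils.see_memory_usage`; the loop, `model_engine`, `criterion`, and `train_loader` are placeholders rather than the reporter's actual code:

```python
from deepspeed.runtime.utils import see_memory_usage

# Placeholder loop: model_engine, criterion, and train_loader stand in for
# the user's own DeepSpeed engine, loss function, and data loader.
for step, (batch, labels) in enumerate(train_loader):
    see_memory_usage(f"step {step}: before forward", force=True)
    loss = criterion(model_engine(batch), labels)

    see_memory_usage(f"step {step}: before backward", force=True)
    model_engine.backward(loss)

    see_memory_usage(f"step {step}: before step", force=True)
    model_engine.step()
```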
Thanks for the quick reply @tjruwase! Here are 3 batches with no offloading:
Here are 3 batches with optimizer offloading:
And here are 3 batches with optimizer + parameter offloading:
I'm not familiar with the abbreviations. What stands out to you from these metrics?
@murnanedaniel, thanks for sharing these logs. Let me quickly answer the metrics question. Below is the mapping of abbreviations to the CUDA memory management functions defined here.
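Assuming the abbreviations are the MA/Max_MA/CA/Max_CA ones printed by `see_memory_usage`, the mapping is roughly the following (values are reported in GB in the logs):

```python
import torch

# Approximate mapping of the log abbreviations to torch.cuda queries.
MA     = torch.cuda.memory_allocated()      # memory currently held by live tensors
Max_MA = torch.cuda.max_memory_allocated()  # peak allocated since the last reset
CA     = torch.cuda.memory_reserved()       # memory reserved by the caching allocator
Max_CA = torch.cuda.max_memory_reserved()   # peak reserved since the last reset
```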
Above is a summary from the memory usage logs, showing the memory utilization (in GB) before the first forward, backward, and optimizer calls for the different offloading settings. How did you obtain the 1.9 GB and 4.1 GB memory usage figures that you attributed to no-offload and offload, respectively?
One more point: in general, ZeRO is not advisable for models or batch sizes that can fit without it. In such cases, ZeRO can reduce throughput and bloat memory usage.
Thanks for summarizing these values. I agree that this case of a small model and a small graph (i.e. a single-edged graph) is not really what ZeRO is built for, but when scaling up to a larger model and a larger graph I see similar behaviour. I am using the
Got it. Can you please share the memory usage logs for the OOM case? Please share the corresponding log for non-offloaded as well. Thanks. |
@murnanedaniel, are you still interested in further debugging this issue? Thanks!
Closing for lack of activity. Please re-open as needed.
Does this mean it's a PyTorch issue, in that it is not letting go of the inactive memory? Is there a way to keep PyTorch from reserving so much memory, or to free some of it?
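For what it's worth, a minimal sketch of the standard PyTorch knobs for the allocator's cached (reserved but inactive) memory; note that `empty_cache` only returns unused cached blocks to the driver and does not shrink memory held by live tensors:

```python
import torch

# Release cached blocks that no live tensor is using back to the driver.
torch.cuda.empty_cache()

# Compare what is actually allocated vs. merely reserved by the caching allocator.
print(f"allocated: {torch.cuda.memory_allocated() / 2**30:.2f} GB")
print(f"reserved:  {torch.cuda.memory_reserved() / 2**30:.2f} GB")
```

Fragmentation of the reserved pool can also be limited via the `PYTORCH_CUDA_ALLOC_CONF` environment variable (e.g. `max_split_size_mb`), although that does not change what DeepSpeed itself allocates.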
I'm working with graph neural networks for particle physics, where we often have large graphs whose gradients cannot all fit on the GPU simultaneously. I'm hoping to use stage 3 offloading to move parameters/gradients off the GPU and train on these graphs. However, after also trying this with FairScale, I'm settling for a simpler toy model to understand the memory behaviour. Here is the toy:
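A minimal sketch of a toy along these lines, with made-up input width, hidden width, and depth:

```python
import torch.nn as nn

# Illustrative only: a wide fully connected binary classifier (all sizes are
# made-up numbers, not the reporter's actual values).
def make_toy_model(in_features=1024, hidden=4096, depth=8):
    layers = [nn.Linear(in_features, hidden), nn.ReLU()]
    for _ in range(depth):
        layers += [nn.Linear(hidden, hidden), nn.ReLU()]
    layers.append(nn.Linear(hidden, 1))  # single logit for binary classification
    return nn.Sequential(*layers)
```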
To Reproduce
Obviously this is not a GNN in any sense; it is just a binary classifier sequential model. I train this with:
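A minimal sketch of the training step, assuming the `make_toy_model` helper above, a hypothetical `ds_config.json` holding the ZeRO-3 offload settings, and random tensors in place of real data:

```python
import torch
import deepspeed

model = make_toy_model()
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",  # hypothetical path to the ZeRO-3 offload config
)

criterion = torch.nn.BCEWithLogitsLoss()
for step in range(3):
    x = torch.randn(64, 1024, device=model_engine.device)
    y = torch.randint(0, 2, (64, 1), device=model_engine.device).float()
    loss = criterion(model_engine(x), y)
    model_engine.backward(loss)   # backward through the DeepSpeed engine
    model_engine.step()           # optimizer step through the engine
```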
Expected behavior
I would expect this large model to consume a lot of memory with no offloading, but to be sharded down to a much smaller memory footprint with stage 3 offloading. In fact, the opposite happens: with no offloading the model requires around 1.9 GB, and with offloading it requires 4.1 GB. Setting aside other memory-saving techniques (mixed precision, activation checkpointing), I'm hoping to understand why offloading by itself is not delivering a smaller memory footprint.
Am I missing something obvious here? Do I have the wrong idea of how ZeRO-3 offload is meant to work? If I can't get this toy to use less memory, I don't see how the more complicated GNN architecture would benefit.
ds_report output
System info (please complete the following information):
1 GPU - V100
FYI: Config file
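For reference, a minimal sketch of what a ZeRO stage-3 CPU-offload configuration of this kind usually contains (values are made up and the reporter's actual file may differ), written as the dict form accepted by `deepspeed.initialize`:

```python
ds_config = {
    "train_batch_size": 64,  # made-up value
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}
```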