OOM during training #4

Open
yejr0229 opened this issue Jul 5, 2024 · 4 comments


yejr0229 commented Jul 5, 2024

I ran into an OOM error during refining. Here is the full traceback:
Refining...: 0%| | 0/1000 [00:04<?, ?it/s]
Traceback (most recent call last):
File "/home/yejr/AIGC/Director3D-main/inference.py", line 93, in
result = system_gm_ldm.inference(sparse_cameras, text, dense_cameras=cameras, use_3d_mode_every_m_steps=args.use_3d_mode_every_m_steps, refiner=refiner)
File "/media/data4/yejr/conda_env/director3d/lib/python3.9/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
return func(*args, **kwargs)
File "/media/data4/yejr/conda_env/director3d/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/yejr/AIGC/Director3D-main/system_gm_ldm.py", line 112, in inference
gaussians = refiner.refine_gaussians(result['gaussians'], text, dense_cameras=dense_cameras)
File "/media/data4/yejr/conda_env/director3d/lib/python3.9/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
return func(*args, **kwargs)
File "/media/data4/yejr/conda_env/director3d/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/yejr/AIGC/Director3D-main/modules/refiners/sds_pp_refiner.py", line 242, in refine_gaussians
loss_latent_sds, loss_img_sds, loss_embedding = self.train_step(images_pred.squeeze(0), t, text_embeddings, uncond_text_embeddings, learnable_text_embeddings)
File "/home/yejr/AIGC/Director3D-main/modules/refiners/sds_pp_refiner.py", line 175, in train_step
images_pred = self.decode_latent(latents_pred).clamp(-1, 1)
File "/home/yejr/AIGC/Director3D-main/modules/refiners/sds_pp_refiner.py", line 126, in decode_latent
images = self.vae.decode(latents).sample
File "/media/data4/yejr/conda_env/director3d/lib/python3.9/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
return method(self, *args, **kwargs)
File "/media/data4/yejr/conda_env/director3d/lib/python3.9/site-packages/diffusers/models/autoencoders/autoencoder_kl.py", line 314, in decode
decoded = self._decode(z).sample
File "/media/data4/yejr/conda_env/director3d/lib/python3.9/site-packages/diffusers/models/autoencoders/autoencoder_kl.py", line 285, in _decode
dec = self.decoder(z)
File "/media/data4/yejr/conda_env/director3d/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/media/data4/yejr/conda_env/director3d/lib/python3.9/site-packages/diffusers/models/autoencoders/vae.py", line 337, in forward
sample = up_block(sample, latent_embeds)
File "/media/data4/yejr/conda_env/director3d/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/media/data4/yejr/conda_env/director3d/lib/python3.9/site-packages/diffusers/models/unets/unet_2d_blocks.py", line 2746, in forward
hidden_states = resnet(hidden_states, temb=temb)
File "/media/data4/yejr/conda_env/director3d/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/media/data4/yejr/conda_env/director3d/lib/python3.9/site-packages/diffusers/models/resnet.py", line 327, in forward
hidden_states = self.norm1(hidden_states)
File "/media/data4/yejr/conda_env/director3d/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/media/data4/yejr/conda_env/director3d/lib/python3.9/site-packages/torch/nn/modules/normalization.py", line 273, in forward
return F.group_norm(
File "/media/data4/yejr/conda_env/director3d/lib/python3.9/site-packages/torch/nn/functional.py", line 2530, in group_norm
return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 23.69 GiB total capacity; 22.04 GiB already allocated; 168.88 MiB free; 22.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
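
For reference, the max_split_size_mb hint at the end of the message is controlled by the PYTORCH_CUDA_ALLOC_CONF environment variable, which has to be set before CUDA is initialized; a minimal sketch (128 MiB is just an arbitrary starting value):

```python
import os

# Allocator hint from the OOM message: cap the split size to reduce fragmentation.
# This must happen before CUDA is initialized, i.e. before the first torch import.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the variable so the allocator picks it up
```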

I'm running this command on a single RTX 3090:
python inference.py --export_all --text "a delicious hamburger on a wooden table."
Could you please tell me how to solve this problem?

imlixinyang (Owner) commented:

We have tested some cases on a single RTX 3090; the memory cost is very close to the limit, so here is a simple workaround: run the refining separately to avoid the OOM. For example:

  1. Generate the cameras and 3DGS without refining:
python inference.py --export_all --text '{text}' --num_refine_steps 0 --num_samples 4
  2. Check the results in exps/tmp/videos and choose a sample (filename) for separate refining:
python refine.py --ply 'exps/tmp/ply/{filename}.ply' --camera 'exps/tmp/camera/{filename}.npy' --export_all --text '{text}' --num_refine_steps 1000

This has been tested on a single T4 GPU (16 GB). Let me know if it works!
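
Since the crash above happens inside vae.decode, sliced VAE decoding is another generic diffusers knob that may help. This is only a sketch, not something wired into the repo: it assumes a diffusers AutoencoderKL like the one in the traceback, and stabilityai/sd-vae-ft-mse is just a stand-in checkpoint:

```python
import torch
from diffusers import AutoencoderKL

# enable_slicing() decodes one latent at a time within a batch, trading speed
# for peak memory. It only helps when several latents are decoded per call.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").to("cuda")
vae.enable_slicing()

latents = torch.randn(4, 4, 64, 64, device="cuda")  # 4 latents -> 4 512x512 images
with torch.no_grad():
    images = vae.decode(latents / 0.18215).sample  # 0.18215 = SD VAE scaling factor
```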

yejr0229 (Author) commented Jul 5, 2024

Thanks for replying, it works just fine!

githubnameoo commented:

> You can try running the refining separately to avoid OOM. For example: … (quoting the owner's reply above)

How much GPU memory is actually needed to run the test?

imlixinyang (Owner) commented:

There isn't an exact number: the number of Gaussian points varies across scenes during refining, so the GPU memory cost varies too. Typically, 28 GB is enough to run the refining jointly with generation, and 16 GB is enough to run the refining separately.
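
If you want a concrete number for your own scenes, plain PyTorch can report the peak usage of a run; nothing in this sketch is specific to this repo:

```python
import torch

torch.cuda.reset_peak_memory_stats()

# ... run the generation / refining you want to profile here ...

allocated = torch.cuda.max_memory_allocated() / 1024**3  # peak tensor memory, GiB
reserved = torch.cuda.max_memory_reserved() / 1024**3    # peak incl. allocator cache
print(f"peak allocated: {allocated:.2f} GiB, peak reserved: {reserved:.2f} GiB")
```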
