Update pyramid_dit_for_video_gen_pipeline.py #100

Open
Quasimondo wants to merge 2 commits into main

Conversation

Quasimondo
Contributor

Several optimizations that try to reduce memory allocations (so far only implemented for image-to-video).
Tested locally on my RTX 3090; it seemed to reduce memory leakage, so that subsequent runs were possible without the machine locking up.

Several optimizations that try to reduce memory allocations (so far only implemented for image-to-video)
@feifeiobama
Collaborator

feifeiobama commented Oct 14, 2024

Thank you for your contribution. I noticed that there are several changes in the file. Could you help me identify which are the critical ones related to memory leakage? I will merge them into the main branch.

@Quasimondo
Contributor Author

Oh yeah, I realize I should have made this in smaller steps.

There is one main improvement: the changes inside generate_i2v(), which pre-allocate the generated_latents tensor before the loop and thus avoid creating a list that then has to be concatenated.

In there I also delete a few objects after their use. I'm not sure it makes a difference, since garbage collection should take care of them, but I don't think it makes anything worse either.
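
For anyone reading along, here is a minimal sketch of the idea (shapes, names and the make_unit stand-in are illustrative, not the repo's actual code):

import torch

def make_unit(b, c, t, h, w, device, dtype):
    # Stand-in for one unit of generated latents; the real code gets this
    # from generate_one_unit() / the DiT.
    return torch.randn(b, c, t, h, w, device=device, dtype=dtype)

device, dtype = "cpu", torch.float32      # "cuda" / torch.bfloat16 in the real pipeline
B, C, H, W, t_unit, num_units = 1, 16, 32, 32, 8, 4

# Before: collect per-unit tensors in a list, then concatenate. Every unit stays
# alive until the final cat, and the cat allocates one more full-size buffer.
units = [make_unit(B, C, t_unit, H, W, device, dtype) for _ in range(num_units)]
latents_from_cat = torch.cat(units, dim=2)

# After: pre-allocate the output once and copy each unit into its temporal slice,
# releasing the per-unit tensor right away.
generated_latents = torch.empty(B, C, t_unit * num_units, H, W, device=device, dtype=dtype)
for i in range(num_units):
    unit = make_unit(B, C, t_unit, H, W, device, dtype)
    generated_latents[:, :, i * t_unit:(i + 1) * t_unit].copy_(unit)
    del unit    # not strictly required, but lets the allocator reuse the block sooner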

The other, smaller change is to sample_block_noise(), which now generates that tensor directly on the GPU. Unfortunately it has to do it in float, since "cholesky_cusolver" is not implemented for 'BFloat16'.
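
Roughly the pattern, with a placeholder covariance (not necessarily the exact one sample_block_noise uses):

import torch

def sample_correlated_noise(num_blocks, gamma, device, out_dtype=torch.bfloat16):
    # Correlated 4-dim noise per 2x2 block (placeholder covariance).
    cov = torch.eye(4, device=device) * (1 + gamma) - torch.full((4, 4), gamma, device=device)
    # Build the distribution in float32 directly on the target device: the internal
    # Cholesky factorization (cholesky_cusolver) has no BFloat16 kernel.
    dist = torch.distributions.MultivariateNormal(
        torch.zeros(4, device=device), covariance_matrix=cov
    )
    noise = dist.sample((num_blocks,))    # float32, already on the GPU
    return noise.to(out_dtype)            # cast once at the end

device = "cuda" if torch.cuda.is_available() else "cpu"
noise = sample_correlated_noise(num_blocks=1024, gamma=0.3, device=device)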

There are several places where I replaced torch.cat([xy]*2) with repeat_interleave(2, dim=0). I'm not sure that does much, but it also does not seem to hurt.
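
Quick illustration of that swap (for a batch of one the two are identical; for larger batches the row ordering differs, so any paired tensors such as prompt embeddings have to follow the same ordering):

import torch

x = torch.arange(6).reshape(2, 3)              # toy "batch" of 2 rows

cat_version = torch.cat([x] * 2, dim=0)        # rows ordered 0, 1, 0, 1
ril_version = x.repeat_interleave(2, dim=0)    # rows ordered 0, 0, 1, 1

# With a single-row batch both calls produce the same tensor:
print(torch.equal(torch.cat([x[:1]] * 2, dim=0), x[:1].repeat_interleave(2, dim=0)))  # True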

And there are one or two places where I changed a calculation to run in-place: latents.mul_(alpha).add_(noise, alpha=beta)
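
i.e. something along these lines (toy shapes, purely illustrative):

import torch

latents = torch.randn(1, 16, 8, 32, 32)
noise = torch.randn_like(latents)
alpha, beta = 0.7, 0.3

# Out-of-place version allocates temporaries for alpha * latents, beta * noise
# and the result:
#   latents = alpha * latents + beta * noise
# In-place version reuses the existing latents storage:
latents.mul_(alpha).add_(noise, alpha=beta)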

@dillfrescott

No good, sadly. Can't even make it past step 17 with this PR.

Traceback (most recent call last):
  File "text.py", line 23, in <module>
    frames = model.generate(
  File "C:\Users\cross\miniconda3\envs\pyramid\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\cross\Downloads\Pyramid-Flow\pyramid_dit\pyramid_dit_for_video_gen_pipeline.py", line 737, in generate
    intermed_latents = self.generate_one_unit(
  File "C:\Users\cross\miniconda3\envs\pyramid\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\cross\Downloads\Pyramid-Flow\pyramid_dit\pyramid_dit_for_video_gen_pipeline.py", line 288, in generate_one_unit
    noise_pred = self.dit(
  File "C:\Users\cross\miniconda3\envs\pyramid\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\cross\miniconda3\envs\pyramid\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\cross\Downloads\Pyramid-Flow\pyramid_dit\modeling_pyramid_mmdit.py", line 479, in forward
    encoder_hidden_states, hidden_states = block(
  File "C:\Users\cross\miniconda3\envs\pyramid\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\cross\miniconda3\envs\pyramid\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\cross\Downloads\Pyramid-Flow\pyramid_dit\modeling_mmdit_block.py", line 640, in forward
    attn_output, context_attn_output = self.attn(
  File "C:\Users\cross\miniconda3\envs\pyramid\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\cross\miniconda3\envs\pyramid\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\cross\Downloads\Pyramid-Flow\pyramid_dit\modeling_mmdit_block.py", line 548, in forward
    hidden_states, encoder_hidden_states = self.var_len_attn(
  File "C:\Users\cross\Downloads\Pyramid-Flow\pyramid_dit\modeling_mmdit_block.py", line 308, in __call__
    stage_hidden_states = F.scaled_dot_product_attention(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 13.68 GiB. GPU 0 has a total capacty of 23.99 GiB of which 12.46 GiB is free. Of the allocated memory 6.10 GiB is allocated by PyTorch, and 3.78 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

It's using more and more memory every step until it uses all 24 GB and I run out.
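
Side note: the max_split_size_mb hint from the error message can be tried like this (the 128 value is just an example; it only mitigates fragmentation and will not fix an actual leak):

import os
# Must be set before the first CUDA allocation, so put it at the very top of text.py.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
import torch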

@Quasimondo
Contributor Author

Well, if you run it without the patch, does it work on your machine?

@dillfrescott

Yes, it runs with or without the patch, but in both cases it eventually runs out of memory and crashes.

Implemented the pre-allocation of generated_latents also in the generate() method
@Quasimondo
Contributor Author

Okay, it sounded like it did not work at all with the patch. Unfortunately this fix cannot work wonders: on my 24 GB card I can do 31 frames at 384p, but I cannot do 768p at all (with or without the patch).

@dillfrescott

Oh. Gotcha!
