
[core] Support VideoToVideo with CogVideoX #9333

Merged: 8 commits merged into main on Sep 2, 2024

Conversation

@a-r-r-o-w (Member) commented Aug 31, 2024

What does this PR do?

Adds support for the simplest vid2vid pipeline with CogVideoX, where noise is added to the latents based on strength. I'm actively testing other noising ideas in parallel to take into account that the latents are encoded/compressed temporally. Mostly inspired by posts on X that seem to use the same noising process.
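For intuition, here is a minimal sketch of the strength-based noising idea. This is only an illustration assuming the same partial-schedule approach used by the existing image-to-image pipelines, not the exact code added in this PR:

import torch

def prepare_vid2vid_latents(video_latents, scheduler, num_inference_steps, strength, generator=None):
    # Assumes scheduler.set_timesteps(num_inference_steps) has already been called.
    # `strength` controls how much of the denoising schedule is actually run:
    # strength=1.0 starts from pure noise, smaller values keep more of the input video.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    timesteps = scheduler.timesteps[t_start * scheduler.order :]

    # Add noise to the encoded (temporally compressed) video latents at the first
    # timestep we will denoise from.
    noise = torch.randn(
        video_latents.shape, generator=generator, device=video_latents.device, dtype=video_latents.dtype
    )
    latents = scheduler.add_noise(video_latents, noise, timesteps[:1])
    return latents, timesteps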

Usage:

import torch
from diffusers import CogVideoXDPMScheduler, CogVideoXVideoToVideoPipeline
from diffusers.utils import export_to_video, load_video

# Models: "THUDM/CogVideoX-2b" or "THUDM/CogVideoX-5b"
pipe = CogVideoXVideoToVideoPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.to("cuda")
pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config)

input_video = load_video("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/hiker.mp4")
prompt = (
    "An astronaut stands triumphantly at the peak of a towering mountain. Panorama of rugged peaks and "
    "valleys. Very futuristic vibe and animated aesthetic. Highlights of purple and golden colors in "
    "the scene. The sky is looks like an animated/cartoonish dream of galaxies, nebulae, stars, planets, "
    "moons, but the remainder of the scene is mostly realistic."
)

video = pipe(
    video=input_video, prompt=prompt, strength=0.8, guidance_scale=6, num_inference_steps=50
).frames[0]
export_to_video(video, "output.mp4", fps=8)

Results:

Input video: A-solitary-hiker--cl.mp4
`strength = 0.2`: cogvideox_vid2vid_2.mp4
`strength = 0.4`: cogvideox_vid2vid_4.mp4
`strength = 0.6`: cogvideox_vid2vid_6.mp4
`strength = 0.8`: cogvideox_vid2vid_8.mp4
`strength = 0.9`: cogvideox_vid2vid_9.mp4

I'm okay with sitting on this for a few days/weeks to see if the community discovers something better, or until this is more extensively tested by others :)

TODO:

  • Upload input video to HF docs repo
  • Tests
  • Implement fake context parallel cache and tiled encoding to reduce memory requirements in vae.encode (prefer to do in a separate PR since CogVideoX Lora also requires these changes)

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@DN6 @yiyixuxu @asomoza

cc @zRzRzRzRzRzRzR

cc @tin2tin too since you seemed interested in this

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@tin2tin commented Aug 31, 2024

I'm so happy to see you taking a stab at this. Thank you!

Unfortunately, with the presented code running on a 4090, the first part of the process uses up more than 40 GB of VRAM and it crashes.

Adding this:

import torch
from diffusers import CogVideoXDPMScheduler, CogVideoXVideoToVideoPipeline
from diffusers.utils import export_to_video, load_video

# Models: "THUDM/CogVideoX-2b" or "THUDM/CogVideoX-5b"
pipe = CogVideoXVideoToVideoPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config)

input_video = load_video("-824744897_Drone_flying_through_the_stunning_ice_c.mp4")[:49]  # TODO: update with HF docs URL
prompt = (
    "drone-shot, A deep hole in the desert, vast expanse of sand dunes stretching out to the horizon, intense sunlight beating down, hole's edges worn smooth by erosion, steep walls plunging into darkness, mysterious shadows lurking within, 8K resolution, Hasselblad H6D, 24mm lens, f/5.6 aperture, 1/60s shutter speed, ISO 200."
)

video = pipe(
    video=input_video, prompt=prompt, strength=0.8, guidance_scale=6, num_inference_steps=50
).frames[0]
export_to_video(video, "output.mp4", fps=8)

Will have spikes above 35 GB VRAM, and when the status bar starts moving in the console, it is using around 6 GB VRAM.

So, the question is whether the first part of the inference process can be optimized so that lower-spec hardware will be able to run it?

As for the results, they are super blurry unless I use a strength of 1 (basically replacing the input image). Could this be caused by e.g. enable_sequential_cpu_offload?

Btw, this file fails when used as input, even though it was generated by CogVideoX-5b:
https://github.com/user-attachments/assets/c01949ca-fb05-40a8-b966-d5d8d3f829d7

@a-r-r-o-w (Member, Author)

> Unfortunately, with the presented code running on a 4090, the first part of the process uses up more than 40 GB of VRAM and it crashes.

Yes... As mentioned in the description, we'd have to implement the fake context parallel cache (as done in decode) and tiling in vae.encode. I'm working on it at the moment

> As for the results, they are super blurry unless I use a strength of 1 (basically replacing the input image). Could this be caused by e.g. enable_sequential_cpu_offload?

> Btw, this file fails when used as input, even though it was generated by CogVideoX-5b:
> https://github.com/user-attachments/assets/c01949ca-fb05-40a8-b966-d5d8d3f829d7

Not sure what causes this; I'll take a look once I'm done with the encode memory optimizations.

@a-r-r-o-w (Member, Author)

@tin2tin Pushed some memory optimizations. I don't think any changes would be required to your testing code as such. If you're using vae.enable_tiling(), it should kick in during the video encode step too. Would you be able to test this now?

I'm looking into other noising options that could help improve generation quality over the one used in this PR. Let me know if you're still getting blurry results.

Previously:

memory=0.52 GB
max_memory=29.08 GB
max_reserved=30.88 GB

Fake context parallel cache:

memory=0.51 GB
max_memory=12.45 GB
max_reserved=20.91 GB

Fake CP cache + Tiling:

memory=0.51 GB
max_memory=3.59 GB
max_reserved=5.41 GB
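(For reference, a minimal sketch of how numbers like these can be collected with PyTorch's CUDA memory statistics; this is my assumption about the measurement, not necessarily the actual script used.)

import torch

# Reset peak statistics, run the pipeline call, then read the counters back.
torch.cuda.reset_peak_memory_stats()

# ... run pipe(...) here ...

memory = torch.cuda.memory_allocated() / 1024**3
max_memory = torch.cuda.max_memory_allocated() / 1024**3
max_reserved = torch.cuda.max_memory_reserved() / 1024**3
print(f"{memory=:.2f} GB")
print(f"{max_memory=:.2f} GB")
print(f"{max_reserved=:.2f} GB")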

@tin2tin commented Sep 1, 2024

I have it running already, and it's looking good VRAM-wise (but much slower of course; still waiting, but it will probably finish at around 10 min and max 8 GB VRAM in spikes):

But this is with the full pack:

pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

Will try to disable some of that next time.

Would you recommend this instead?

pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

Btw, num_frames seems to have been disabled in the latest patch?

@a-r-r-o-w (Member, Author)

Maybe I can help with that.

  • pipe.vae.enable_slicing doesn't do anything at the moment; it is only useful when generating more than one video at a time, i.e. when using multiple prompts (we fix num_videos_per_prompt to 1, so a single prompt will never trigger the slicing optimization).
  • pipe.enable_sequential_cpu_offload will save you a lot of memory but make generation much slower. I would recommend using an int8 quantized model with pipe.enable_model_cpu_offload, which should also run in under 16 GB while giving you much faster generations (see the sketch just below this list).
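A minimal sketch of that lighter setup (int8 weight-only quantization of the transformer plus model CPU offload and VAE tiling); this is my own combination of the pieces discussed in this thread, not a snippet from the PR:

import torch
from diffusers import CogVideoXTransformer3DModel, CogVideoXVideoToVideoPipeline
from torchao.quantization import quantize_, int8_weight_only

model_id = "THUDM/CogVideoX-5b"

# Quantize only the transformer (the largest component) to int8 weights.
transformer = CogVideoXTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
quantize_(transformer, int8_weight_only())

pipe = CogVideoXVideoToVideoPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # offload whole sub-models instead of layer-wise sequential offload
pipe.vae.enable_tiling()         # tiled VAE encode/decode keeps the video encode step within budget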

@tin2tin commented Sep 1, 2024

The inference still takes some time here, so while waiting:

  • num_frames can't be defined after the latest patch.

Using only pipe.enable_model_cpu_offload, 22 GB of VRAM is needed.

> int8 quantized model

Are there any for CogVideoX? And how do you load them?

(My troubles with blur were caused by my video loading code, sorry. All good now!)

@tin2tin commented Sep 1, 2024

It could probably be tweaked for a better outcome, but it works as a proof of concept of this patch (thank you!):

Muybridge_vs_CogVideoX.mp4

@tin2tin commented Sep 1, 2024

Some more testing:

Part_2_Muybridge_vs_CogVideoX.mp4

@a-r-r-o-w (Member, Author)

> num_frames can't be defined after the latest patch.

Yes, this parameter was removed to keep the pipeline similar to our other video-to-video pipelines. Since the pipeline takes an input video, it would be weird if the length of the list of video frames did not match num_frames, and we would simply have to raise an error. So it is expected that the input video already has the correct number of frames.
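In practice that just means trimming (or padding) the loaded frames yourself before calling the pipeline; a minimal sketch, assuming the 49-frame length used in the snippets above and a hypothetical local file name:

from diffusers.utils import load_video

# The pipeline infers the number of frames from the input video itself,
# so make sure the frame list already has the length you want to generate.
input_video = load_video("my_input.mp4")[:49]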

> Are there any for CogVideoX? And how do you load them?

Currently, we're working on hosting the quantized checkpoints. It's not necessary to have a quantized checkpoint though, since it can also be created on the fly. You could follow the torchao quantized inference guides here for more detailed examples. A concise example would be something like:

import torch
from diffusers import CogVideoXTransformer3DModel, CogVideoXPipeline
from diffusers.utils import export_to_video
from torchao.quantization import (
    quantize_,
    int4_weight_only,
    int8_dynamic_activation_int4_weight,
    int8_weight_only,
    int8_dynamic_activation_int8_weight,
)

# Either "THUDM/CogVideoX-2b" or "THUDM/CogVideoX-5b"
model_id = "THUDM/CogVideoX-5b"

# 1. Quantize models
transformer = CogVideoXTransformer3DModel.from_pretrained(model_id, subfolder="transformer", torch_dtype=torch.bfloat16)
quantize_(transformer, int8_weight_only())

# 2. Create pipeline
pipe = CogVideoXPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16).to("cuda")

# 3. Inference
prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

video = pipe(
    prompt=prompt,
    guidance_scale=6,
    use_dynamic_cfg=True,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=8)

You would need torchao installed from source and pytorch nightly for this to work until the next release. Full table of benchmarks here.

@tin2tin
Copy link

tin2tin commented Sep 2, 2024

Thank you.

Previously, I tried to get torchao to work on Windows, but unsuccessfully, so please do a test run to check that this works on Windows too, and if not, leave a note about it in the docs.

@a-r-r-o-w (Member, Author) commented Sep 2, 2024

If you can, could you please open an issue on the torchao repo with what you tried and the error stack trace when trying to install? I would do it myself, but unfortunately I don't own any device with Windows installed to test...

@a-r-r-o-w requested a review from DN6 on September 2, 2024, 10:21
@DN6 (Collaborator) left a comment

LGTM

@a-r-r-o-w merged commit 0e6a840 into main on Sep 2, 2024 (18 checks passed)
@a-r-r-o-w deleted the cogvideox/vid2vid branch on September 2, 2024, 11:25
@a-r-r-o-w (Member, Author)

Thank you very much @tin2tin here! Your feedback, testing and demos on social media were really useful and awesome ❤️

Comment on lines +380 to +385
    init_latents = [
        retrieve_latents(self.vae.encode(video[i].unsqueeze(0)), generator[i]) for i in range(batch_size)
    ]
else:
    init_latents = [retrieve_latents(self.vae.encode(vid.unsqueeze(0)), generator) for vid in video]

A Member left a review comment:

I guess tiled encoding could be enabled before making the pipeline call here, right?

if self.use_tiling and (width > self.tile_sample_min_width or height > self.tile_sample_min_height):

Could be nicely highlighted in our docs.
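(Concretely, something along these lines before invoking the pipeline; a minimal sketch reusing the variables from the usage example above:)

# Enable tiled VAE encoding so vae.encode of the input video stays within memory limits.
pipe.vae.enable_tiling()

video = pipe(video=input_video, prompt=prompt, strength=0.8, num_inference_steps=50).frames[0]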

a-r-r-o-w added a commit that referenced this pull request Sep 17, 2024
* add vid2vid pipeline for cogvideox

* make fix-copies

* update docs

* fake context parallel cache, vae encode tiling

* add test for cog vid2vid

* use video link from HF docs repo

* add copied from comments; correctly rename test class
sayakpaul pushed a commit that referenced this pull request Dec 23, 2024
(same commit message as above)