
[core] Support VideoToVideo with CogVideoX #9333

Merged: 8 commits merged into main on Sep 2, 2024

Conversation

@a-r-r-o-w (Member) commented Aug 31, 2024

What does this PR do?

Adds support for the simplest vid2vid pipeline with CogVideoX, where noise is added to the latents based on strength. I'm actively testing other noising ideas in parallel to take into account that the latents are encoded/compressed temporally. Mostly inspired by posts on X that seem to use the same noising process.
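For intuition, here is a minimal sketch of the strength-based noising idea. This is only an illustration assuming the same partial-schedule approach used by the existing image-to-image pipelines, not the exact code added in this PR:

import torch

def prepare_vid2vid_latents(video_latents, scheduler, num_inference_steps, strength, generator=None):
    # Assumes scheduler.set_timesteps(num_inference_steps) has already been called.
    # `strength` controls how much of the denoising schedule is actually run:
    # strength=1.0 starts from pure noise, smaller values keep more of the input video.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    timesteps = scheduler.timesteps[t_start * scheduler.order :]

    # Add noise to the encoded (temporally compressed) video latents at the first
    # timestep we will denoise from.
    noise = torch.randn(
        video_latents.shape, generator=generator, device=video_latents.device, dtype=video_latents.dtype
    )
    latents = scheduler.add_noise(video_latents, noise, timesteps[:1])
    return latents, timesteps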

Usage:

import torch
from diffusers import CogVideoXDPMScheduler, CogVideoXVideoToVideoPipeline
from diffusers.utils import export_to_video, load_video

# Models: "THUDM/CogVideoX-2b" or "THUDM/CogVideoX-5b"
pipe = CogVideoXVideoToVideoPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.to("cuda")
pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config)

input_video = load_video("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/hiker.mp4")
prompt = (
    "An astronaut stands triumphantly at the peak of a towering mountain. Panorama of rugged peaks and "
    "valleys. Very futuristic vibe and animated aesthetic. Highlights of purple and golden colors in "
    "the scene. The sky is looks like an animated/cartoonish dream of galaxies, nebulae, stars, planets, "
    "moons, but the remainder of the scene is mostly realistic."
)

video = pipe(
    video=input_video, prompt=prompt, strength=0.8, guidance_scale=6, num_inference_steps=50
).frames[0]
export_to_video(video, "output.mp4", fps=8)

Results:

Input video: A-solitary-hiker--cl.mp4
`strength = 0.2`: cogvideox_vid2vid_2.mp4
`strength = 0.4`: cogvideox_vid2vid_4.mp4
`strength = 0.6`: cogvideox_vid2vid_6.mp4
`strength = 0.8`: cogvideox_vid2vid_8.mp4
`strength = 0.9`: cogvideox_vid2vid_9.mp4

I'm okay with sitting on this for a few days/weeks to see if the community discovers something better, or until this is more extensively tested by others :)

TODO:

  • Upload input video to HF docs repo
  • Tests
  • Implement fake context parallel cache and tiled encoding to reduce memory requirements in vae.encode (prefer to do in a separate PR since CogVideoX Lora also requires these changes)

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@DN6 @yiyixuxu @asomoza

cc @zRzRzRzRzRzRzR

cc @tin2tin too since you seemed interested in this

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@tin2tin commented Aug 31, 2024

I'm so happy to see you taking a stab at this. Thank you!

Unfortunately, with the presented code running on a 4090, the first part of the process uses up more than 40 GB of VRAM and it crashes.

Adding this:

import torch
from diffusers import CogVideoXDPMScheduler, CogVideoXVideoToVideoPipeline
from diffusers.utils import export_to_video, load_video

# Models: "THUDM/CogVideoX-2b" or "THUDM/CogVideoX-5b"
pipe = CogVideoXVideoToVideoPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config)

input_video = load_video("-824744897_Drone_flying_through_the_stunning_ice_c.mp4")[:49]  # TODO: update with HF docs URL
prompt = (
    "drone-shot, A deep hole in the desert, vast expanse of sand dunes stretching out to the horizon, intense sunlight beating down, hole's edges worn smooth by erosion, steep walls plunging into darkness, mysterious shadows lurking within, 8K resolution, Hasselblad H6D, 24mm lens, f/5.6 aperture, 1/60s shutter speed, ISO 200."
)

video = pipe(
    video=input_video, prompt=prompt, strength=0.8, guidance_scale=6, num_inference_steps=50
).frames[0]
export_to_video(video, "output.mp4", fps=8)

Will have spikes above 35 GB VRAM, and when the status bar starts moving in the console, it is using around 6 GB VRAM.

So, the question is whether the first part of the inference process can be optimized so that lower-spec hardware will be able to run it?

As for the results, they are super blurry unless I use a strength of 1 (basically replacing the input image). Could this be caused by e.g. enable_sequential_cpu_offload?

Btw, this file fails when used as input, even though it was generated by CogVideoX-5b:
https://github.com/user-attachments/assets/c01949ca-fb05-40a8-b966-d5d8d3f829d7

@a-r-r-o-w (Member, Author)

> Unfortunately, with the presented code running on a 4090, the first part of the process uses up more than 40 GB of VRAM and it crashes.

Yes... As mentioned in the description, we'd have to implement the fake context parallel cache (as done in decode) and tiling in vae.encode. I'm working on it at the moment

> As for the results, they are super blurry unless I use a strength of 1 (basically replacing the input image). Could this be caused by e.g. enable_sequential_cpu_offload?

> Btw, this file fails when used as input, even though it was generated by CogVideoX-5b:
> https://github.com/user-attachments/assets/c01949ca-fb05-40a8-b966-d5d8d3f829d7

Not sure what causes this; I'll take a look once I'm done with the encode memory optimizations.

@a-r-r-o-w (Member, Author)

@tin2tin Pushed some memory optimizations. I don't think any changes would be required to your testing code as such. If you're using vae.enable_tiling(), it should kick in during the video encode step too. Would you be able to test this now?

I'm looking into other noising options that could help improve generation quality over the one used in this PR. Let me know if you're still getting blurry results.

Previously:

memory=0.52 GB
max_memory=29.08 GB
max_reserved=30.88 GB

Fake context parallel cache:

memory=0.51 GB
max_memory=12.45 GB
max_reserved=20.91 GB

Fake CP cache + Tiling:

memory=0.51 GB
max_memory=3.59 GB
max_reserved=5.41 GB
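(For reference, a minimal sketch of how numbers like these can be collected with PyTorch's CUDA memory statistics; this is my assumption about the measurement, not necessarily the actual script used.)

import torch

# Reset peak statistics, run the pipeline call, then read the counters back.
torch.cuda.reset_peak_memory_stats()

# ... run pipe(...) here ...

memory = torch.cuda.memory_allocated() / 1024**3
max_memory = torch.cuda.max_memory_allocated() / 1024**3
max_reserved = torch.cuda.max_memory_reserved() / 1024**3
print(f"{memory=:.2f} GB")
print(f"{max_memory=:.2f} GB")
print(f"{max_reserved=:.2f} GB")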

@tin2tin commented Sep 1, 2024

I have it running already, and it's looking good VRAM-wise (but much slower of course; still waiting, but it will probably finish at around 10 min and max 8 GB VRAM in spikes):

But this is with the full pack:

pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

Will try to disable some of that next time.

Would you recommend this instead?

pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

Btw, num_frames seems to have been disabled in the latest patch?

@a-r-r-o-w (Member, Author)

Maybe I can help with that.

  • pipe.vae.enable_slicing doesn't do anything at the moment; it is only useful when generating more than one video at a time, i.e. when using multiple prompts (we fix num_videos_per_prompt to 1, so a single prompt will never trigger the slicing optimization).
  • pipe.enable_sequential_cpu_offload will save you a lot of memory but make generation much slower. I would recommend using an int8 quantized model with pipe.enable_model_cpu_offload, which should also run in under 16 GB while giving you much faster generations (see the sketch just below this list).
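A minimal sketch of that lighter setup (int8 weight-only quantization of the transformer plus model CPU offload and VAE tiling); this is my own combination of the pieces discussed in this thread, not a snippet from the PR:

import torch
from diffusers import CogVideoXTransformer3DModel, CogVideoXVideoToVideoPipeline
from torchao.quantization import quantize_, int8_weight_only

model_id = "THUDM/CogVideoX-5b"

# Quantize only the transformer (the largest component) to int8 weights.
transformer = CogVideoXTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
quantize_(transformer, int8_weight_only())

pipe = CogVideoXVideoToVideoPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # offload whole sub-models instead of layer-wise sequential offload
pipe.vae.enable_tiling()         # tiled VAE encode/decode keeps the video encode step within budget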

@tin2tin commented Sep 1, 2024

The inference still takes some time here, so while waiting:

  • num_frames can't be defined after the latest patch.

Using only pipe.enable_model_cpu_offload, 22 GB of VRAM is needed.

> int8 quantized model

Are there any for CogVideoX? And how do you load them?

(My troubles with blur were caused by my video loading code, sorry. All good now!)

@tin2tin commented Sep 1, 2024

It could probably be tweaked for a better outcome, but it works as a proof of concept of this patch (thank you!):

Muybridge_vs_CogVideoX.mp4

@tin2tin commented Sep 1, 2024

Some more testing:

Part_2_Muybridge_vs_CogVideoX.mp4

@a-r-r-o-w (Member, Author)

> num_frames can't be defined after the latest patch.

Yes, this parameter was removed to keep the pipeline similar to our other video-to-video pipelines. Since the pipeline takes an input video, it would be weird if the length of the list of video frames did not match num_frames, and we would simply have to raise an error. So it is expected that the input video already has the correct number of frames.
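In practice that just means trimming (or padding) the loaded frames yourself before calling the pipeline; a minimal sketch, assuming the 49-frame length used in the snippets above and a hypothetical local file name:

from diffusers.utils import load_video

# The pipeline infers the number of frames from the input video itself,
# so make sure the frame list already has the length you want to generate.
input_video = load_video("my_input.mp4")[:49]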

> Are there any for CogVideoX? And how do you load them?

Currently, we're working on hosting the quantized checkpoints. It's not necessary to have a quantized checkpoint though, since it can also be created on the fly. You could follow the torchao quantized inference guides here for more detailed examples. A concise example would be something like:

import torch
from diffusers import CogVideoXTransformer3DModel, CogVideoXPipeline
from diffusers.utils import export_to_video
from torchao.quantization import (
    quantize_,
    int4_weight_only,
    int8_dynamic_activation_int4_weight,
    int8_weight_only,
    int8_dynamic_activation_int8_weight,
)

# Either "THUDM/CogVideoX-2b" or "THUDM/CogVideoX-5b"
model_id = "THUDM/CogVideoX-5b"

# 1. Quantize models
transformer = CogVideoXTransformer3DModel.from_pretrained(model_id, subfolder="transformer", torch_dtype=torch.bfloat16)
quantize_(transformer, int8_weight_only())

# 2. Create pipeline
pipe = CogVideoXPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16).to("cuda")

# 3. Inference
prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

video = pipe(
    prompt=prompt,
    guidance_scale=6,
    use_dynamic_cfg=True,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=8)

You would need torchao installed from source and pytorch nightly for this to work until the next release. Full table of benchmarks here.

@tin2tin
Copy link

tin2tin commented Sep 2, 2024

Thank you.

Previously, I tried to get torchao to work on Windows, but unsuccessfully, so please do a test run to check that this works on Windows too, and if not, leave a note about it in the docs.

@a-r-r-o-w (Member, Author) commented Sep 2, 2024

If you can, could you please open an issue on the torchao repo with what you tried and the error stack trace when trying to install? I would do it myself, but unfortunately I don't own any device with Windows installed to test...

@a-r-r-o-w requested a review from DN6 on September 2, 2024, 10:21
@DN6 (Collaborator) left a comment

LGTM

@a-r-r-o-w merged commit 0e6a840 into main on Sep 2, 2024 (18 checks passed)
@a-r-r-o-w deleted the cogvideox/vid2vid branch on September 2, 2024, 11:25
@a-r-r-o-w (Member, Author)

Thank you very much @tin2tin here! Your feedback, testing and demos on social media were really useful and awesome ❤️

Comment on lines +380 to +385
    init_latents = [
        retrieve_latents(self.vae.encode(video[i].unsqueeze(0)), generator[i]) for i in range(batch_size)
    ]
else:
    init_latents = [retrieve_latents(self.vae.encode(vid.unsqueeze(0)), generator) for vid in video]

A Member left a review comment:

I guess tiled encoding could be enabled before making the pipeline call here, right?

if self.use_tiling and (width > self.tile_sample_min_width or height > self.tile_sample_min_height):

Could be nicely highlighted in our docs.
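(Concretely, something along these lines before invoking the pipeline; a minimal sketch reusing the variables from the usage example above:)

# Enable tiled VAE encoding so vae.encode of the input video stays within memory limits.
pipe.vae.enable_tiling()

video = pipe(video=input_video, prompt=prompt, strength=0.8, num_inference_steps=50).frames[0]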

a-r-r-o-w added a commit that referenced this pull request Sep 17, 2024
* add vid2vid pipeline for cogvideox

* make fix-copies

* update docs

* fake context parallel cache, vae encode tiling

* add test for cog vid2vid

* use video link from HF docs repo

* add copied from comments; correctly rename test class
sayakpaul pushed a commit that referenced this pull request Dec 23, 2024
(same commit message as above)