[core] Support VideoToVideo with CogVideoX #9333
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I'm so happy to see you taking a stab at this. Thank you! Unfortunately, with the presented code, running it on a 4090, the first part of the process uses more than 40 GB VRAM and it crashes. Adding this:
With that added, it will still have spikes above 35 GB VRAM, and when the progress bar starts moving in the console it is using around 6 GB VRAM. So the question is whether the first part of the inference process can be optimized, so that lower-spec hardware will be able to run it. As for the results, they are super blurry unless I use a strength of 1 (basically replacing the input). Could this be caused by e.g. enable_sequential_cpu_offload? Btw, this file fails to be used as input, even though it was generated by CogVideoX-5b:
Yes... As mentioned in the description, we'd have to implement the fake context parallel cache (as done in decode) and tiling in vae.encode. I'm working on it at the moment
Not sure what causes this; I'll take a look once I'm done with the encode memory optimizations.
@tin2tin Pushed some memory optimizations. I don't think any changes would be required to your testing code as such. I'm looking into other noising options that could help improve generation quality over the one used in this PR. Let me know if you're still getting blurry results.
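The exact snippet being discussed isn't shown above; as a point of reference, here is a minimal sketch of the memory-saving switches diffusers exposes, assuming the vid2vid pipeline added in this PR (CogVideoXVideoToVideoPipeline):

import torch
from diffusers import CogVideoXVideoToVideoPipeline

# Illustrative only; not the exact snippet from the comment above.
pipe = CogVideoXVideoToVideoPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Offload submodules to CPU and move them to GPU only when needed.
# enable_sequential_cpu_offload is slower but saves the most VRAM; use one of the two.
pipe.enable_model_cpu_offload()
# pipe.enable_sequential_cpu_offload()

# Tile/slice the VAE so encode/decode memory peaks stay low.
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()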
Maybe I can help with that.
The inference still takes some time here, so while waiting:
Using only pipe.enable_model_cpu_offload, 22 GB VRAM is needed.
Are there any quantized checkpoints for CogVideoX? And how do you load them? (My trouble with blur was caused by my video loading code, sorry. All good now!)
It could probably be tweaked for a better outcome, but it works as a proof of concept for this patch (thank you!): Muybridge_vs_CogVideoX.mp4
Some more testing: Part_2_Muybridge_vs_CogVideoX.mp4
Yes, this parameter was removed to keep it similar to our other video-to-video pipelines. Since the pipeline takes an input video, it would be weird if it did not match the length of the list of video frames.
Currently, we're working on hosting the quantized checkpoints. It's not necessary to have a quantized checkpoint, though, since it can also be created on the fly. You could follow the torchao quantized inference guides here for more detailed examples. A concise example would be something like:
import torch
from diffusers import CogVideoXTransformer3DModel, CogVideoXPipeline
from diffusers.utils import export_to_video
from torchao.quantization import (
quantize_,
int4_weight_only,
int8_dynamic_activation_int4_weight,
int8_weight_only,
int8_dynamic_activation_int8_weight,
)
# Either "THUDM/CogVideoX-2b" or "THUDM/CogVideoX-5b"
model_id = "THUDM/CogVideoX-5b"
# 1. Quantize models
transformer = CogVideoXTransformer3DModel.from_pretrained(model_id, subfolder="transformer", torch_dtype=torch.bfloat16)
quantize_(transformer, int8_weight_only())
# 2. Create pipeline
pipe = CogVideoXPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16).to("cuda")
# 3. Inference
prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
video = pipe(
prompt=prompt,
guidance_scale=6,
use_dynamic_cfg=True,
num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=8)
You would need torchao installed from source and PyTorch nightly for this to work until the next release. Full table of benchmarks here.
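For the vid2vid case this thread is about, the quantized transformer can presumably be passed to the new pipeline in the same way; a short sketch continuing the example above (the input path and strength value are placeholders, not from the thread):

from diffusers import CogVideoXVideoToVideoPipeline
from diffusers.utils import load_video

# Reuse `model_id`, the int8-quantized `transformer`, and `prompt` from the
# previous snippet; only the pipeline class and the video input change.
pipe = CogVideoXVideoToVideoPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.bfloat16
).to("cuda")
input_video = load_video("input.mp4")  # placeholder path
frames = pipe(video=input_video, prompt=prompt, strength=0.8, guidance_scale=6).frames[0]
export_to_video(frames, "output_vid2vid.mp4", fps=8)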
Thank you. Previously, I tried to get torchao to work on Windows, but without success, so please do a test run to check whether this works on Windows too, and if not, leave a note about that in the docs.
If you can, could you please open an issue on the torchao repo with what you tried and the error stack trace when trying to install? I would do it myself, but unfortunately I don't own any device with Windows installed to test...
LGTM
Thank you very much @tin2tin! Your feedback, testing, and demos on social media were really useful and awesome ❤️
if isinstance(generator, list):
    init_latents = [
        retrieve_latents(self.vae.encode(video[i].unsqueeze(0)), generator[i]) for i in range(batch_size)
    ]
else:
    init_latents = [retrieve_latents(self.vae.encode(vid.unsqueeze(0)), generator) for vid in video]
I guess tiled encoding could be enabled before making the pipeline call here, right?
if self.use_tiling and (width > self.tile_sample_min_width or height > self.tile_sample_min_height):
Could be nicely highlighted in our docs.
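If done from user code rather than inside the pipeline, that would presumably look something like the sketch below (assuming a `pipe`, `video`, and `prompt` set up as in the earlier examples; tile sizes left at their defaults):

# Turn on tiled VAE encode/decode so the width/height check quoted above
# routes vae.encode of the input video through the tiled path.
pipe.vae.enable_tiling()
frames = pipe(video=video, prompt=prompt, strength=0.8).frames[0]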
* add vid2vid pipeline for cogvideox
* make fix-copies
* update docs
* fake context parallel cache, vae encode tiling
* add test for cog vid2vid
* use video link from HF docs repo
* add copied from comments; correctly rename test class
What does this PR do?
Adds support for the simplest vid2vid pipeline with CogVideoX, where noise is added to the latents based on strength. Actively testing other noising ideas in parallel, to take into account that the latents are encoded/compressed temporally. Mostly inspired by posts on X that seem to use the same noising process.
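To make the noising scheme concrete, here is a rough sketch of the standard strength-based approach used by diffusers img2img-style pipelines (a generic illustration with a generic DDIMScheduler and placeholder shapes, not code taken from this PR): with strength s and N inference steps, only the last ~N*s timesteps are run, and the encoded video latents are noised to the first retained timestep before denoising begins.

import torch
from diffusers import DDIMScheduler

# Illustrative only: names, shapes, and the scheduler config are placeholders,
# not the exact ones CogVideoX uses.
scheduler = DDIMScheduler(num_train_timesteps=1000)
num_inference_steps, strength = 50, 0.8
scheduler.set_timesteps(num_inference_steps)

# Keep only the last `strength` fraction of the timestep schedule.
init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
t_start = max(num_inference_steps - init_timestep, 0)
timesteps = scheduler.timesteps[t_start:]

# Noise the VAE-encoded video latents to the first retained timestep.
video_latents = torch.randn(1, 13, 16, 60, 90)  # placeholder for vae.encode output
noise = torch.randn_like(video_latents)
noisy_latents = scheduler.add_noise(video_latents, noise, timesteps[:1])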
Usage:
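The usage snippet itself did not survive here; a minimal sketch of how the new pipeline is called (the input path, prompt, and strength are placeholder values, not the author's original example):

import torch
from diffusers import CogVideoXVideoToVideoPipeline
from diffusers.utils import export_to_video, load_video

pipe = CogVideoXVideoToVideoPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

# load_video returns a list of PIL frames; the path is a placeholder.
video = load_video("input.mp4")
prompt = "A panda playing an acoustic guitar in a bamboo forest"  # placeholder prompt
output = pipe(
    video=video,
    prompt=prompt,
    strength=0.8,  # how strongly to deviate from the input video
    guidance_scale=6,
    num_inference_steps=50,
).frames[0]
export_to_video(output, "cogvideox_vid2vid.mp4", fps=8)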
Results:
A-solitary-hiker--cl.mp4
cogvideox_vid2vid_2.mp4
cogvideox_vid2vid_4.mp4
cogvideox_vid2vid_6.mp4
cogvideox_vid2vid_8.mp4
cogvideox_vid2vid_9.mp4
I'm okay with sitting on this for a few days/weeks to see if the community discovers something better, or this is more extensively tested by others :)
TODO:
* fake context parallel cache and tiling in vae.encode (prefer to do in a separate PR since CogVideoX LoRA also requires these changes)

Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@DN6 @yiyixuxu @asomoza
cc @zRzRzRzRzRzRzR
cc @tin2tin too since you seemed interested in this