
[no issue] Congratulations and a question: how to break the 64 / 61 frames? #9

Open
ibrainventures opened this issue Aug 23, 2024 · 1 comment

@ibrainventures

ibrainventures commented Aug 23, 2024

Hi,
thank you very much for your work and for making this accessible to the world.
Over the last 20 hours I have tested many generations (mostly realistic), and I am very impressed with
the results. The results from prompts dedicated to "people action" look absolutely realistic.

No morph-style artifacts, so it's great to see what can be squeezed out of the (good old) SD 1.5 on end-user hardware (okay, my 4090s rented on vast.ai are not typical for an EU end user :-) )...

I tried to understand your paper and saw the interpolation and 3D UNet related solutions / experiments.

Questions:

A) How would you estimate the possibility of > 10 sec (250+ frames) or longer generations?
B) If staying with the 64 / 61 frames -> by offloading and seamless merging / stitching pipelines?
C) By wider-stepped interpolation? - with less quality but longer "action"?

It would be great to get some feedback, and chapeau to the team! Great work!!

@MaAo
Collaborator

MaAo commented Aug 24, 2024


Thank you for your attention and recognition of our work. Here are the answers to your questions:

A) To obtain more frames in a video, you have two options:

a) Using the current 61-frame video generation model, you can iteratively extend a video by taking the end frame of the previous clip as the reference image for the next clip (a minimal sketch of this chaining idea follows after this answer).

b) We will release models in the future with more frames, such as 125 or more. However, keep in mind that these models will require more memory for inference.
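For illustration only, here is a minimal sketch of the chaining idea from a). The `generate_clip` function and its signature are hypothetical placeholders for whatever image-to-video pipeline is used, not this project's actual API:

```python
# Hypothetical iterative chaining: the last frame of each generated clip becomes
# the reference image for the next clip, and the clips are concatenated.
def extend_video(generate_clip, reference_image, prompt, num_segments=4):
    """generate_clip(reference_image, prompt) -> list of frames (placeholder signature)."""
    all_frames = []
    for _ in range(num_segments):
        clip = generate_clip(reference_image, prompt)   # e.g. 61 frames per call
        # Skip the first frame of later clips so the shared boundary frame is not duplicated.
        all_frames.extend(clip if not all_frames else clip[1:])
        reference_image = clip[-1]                      # last frame seeds the next segment
    return all_frames
```

With 61-frame segments, four passes would give roughly 61 + 3 x 60 = 241 frames, i.e. about 10 seconds at 24 fps, at the cost of possible content drift between segments.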

B) I didn't fully understand your question B). If my response does not completely address your issue, please clarify further, and I will do my best to assist you.

C) We are currently using a Video VAE, which obviates the need for frame interpolation during generation. For instance, a latent of shape (1, 4, 16, 64, 64) is decoded by the Video VAE into a video of shape (1, 3, 61, 512, 512). The temporal dimension is computed as 4n - 3, where n is the number of frames in the latent space. Our research indicates that current Video VAEs are constrained by the number of channels, so we plan to train 16-channel Video VAEs and integrate them into our project in the future.
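As a quick sanity check of the 4n - 3 relationship (plain arithmetic on the shapes mentioned above, not project code):

```python
# Temporal mapping of the Video VAE decoder: n latent frames -> 4n - 3 video frames.
def decoded_frames(latent_frames: int) -> int:
    return 4 * latent_frames - 3

latent_shape = (1, 4, 16, 64, 64)   # (batch, channels, n, height, width)
n = latent_shape[2]
print(decoded_frames(n))            # 61, matching the (1, 3, 61, 512, 512) video
print(decoded_frames(32))           # 125, the frame count mentioned for future models
```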
