CogVideoX-5B Model adapter change #9203
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
thanks!
I left some comments, but the code changes look good to me to merge if we are in a hurry (we can do some follow-up refactoring).
dim_w = embed_dim // 8 * 3

# Temporal frequencies
freqs_t = 1.0 / (theta ** (torch.arange(0, dim_t, 2).float() / dim_t))
let's refactor with get_1d_rotary_pos_embed (src/diffusers/models/embeddings.py, line 441 in dc07fc2):

def get_1d_rotary_pos_embed(
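For context, a minimal self-contained sketch of what a 1D rotary positional embedding with real-valued outputs computes, in the spirit of diffusers' `get_1d_rotary_pos_embed` with `use_real=True` (the simplified signature and function name here are illustrative, not the exact library API):

```python
import torch

def get_1d_rotary_pos_embed_sketch(dim, positions, theta=10000.0):
    # One frequency per pair of channels: theta^(-2i/dim)
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    # Outer product gives one rotation angle per (position, frequency) pair
    angles = torch.outer(positions.float(), freqs)
    # use_real=True style: return cos/sin tables instead of complex exponentials
    return angles.cos(), angles.sin()

cos, sin = get_1d_rotary_pos_embed_sketch(8, torch.arange(4))
print(cos.shape)  # torch.Size([4, 4])
```

At position 0 all angles are zero, so `cos` is all ones and `sin` all zeros there, which is a quick sanity check for any implementation of this pattern.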
src/diffusers/models/embeddings.py (outdated)

@@ -532,7 +617,10 @@ def apply_rotary_emb(
    else:
        raise ValueError(f"`use_real_unbind_dim={use_real_unbind_dim}` but should be -1 or -2.")

    if upcast:
        out = (x.float() * cos + x_rotated.float() * sin).to(x.dtype)
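To make the flag under discussion concrete, here is a hedged, self-contained sketch of the rotation step: compute `x * cos + rotate_half(x) * sin`, optionally doing the multiply-add in float32 and casting back to the input dtype. The function names and signature are illustrative, not the exact diffusers API:

```python
import torch

def rotate_half(x):
    # Swap the two halves of the last dimension and negate the second
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rotary_emb_sketch(x, cos, sin, upcast=True):
    x_rotated = rotate_half(x)
    if upcast:
        # Arithmetic in float32 for accuracy, result cast back to input dtype
        return (x.float() * cos + x_rotated.float() * sin).to(x.dtype)
    return x * cos + x_rotated * sin

x = torch.randn(2, 4, dtype=torch.bfloat16)
cos, sin = torch.ones(4), torch.zeros(4)  # identity rotation for the check
out = apply_rotary_emb_sketch(x, cos, sin)
print(out.dtype)  # torch.bfloat16
```

With the identity rotation (`cos=1`, `sin=0`) the output equals the input exactly, regardless of the `upcast` setting, which makes the branch easy to unit-test.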
OK, but I'm wondering why upcasting would prevent CogVideoX from generating videos. Maybe we can test not downcasting the rotary embedding instead (all our other pipelines keep it in float32).
I mainly added this to numerically match the output of apply_rotary_emb with the original implementation. It seems that even if we compute in float32, the results are great and the small difference does not impact quality, now that the other bugs have been resolved. We can undo this change.
@yiyixuxu @zRzRzRzRzRzRzR Would it be okay to remove the following limit in the follow-up PR? (src/diffusers/pipelines/cogvideo/pipeline_cogvideox.py, lines 527 to 530 in dc07fc2)
I tested the 5B model with generations of 57, 65, and 73 frames, and they all turn out well; maybe the RoPE embeddings help the model generalize better. For the 2B model, the outputs are bad at those frame counts, probably due to limitations of the normal positional embeddings. We could add a recommendation in the docs that 49 frames or fewer is the good setting for 2B. In the refactor, I'd also like to create the normal positional embeddings in the pipeline instead of the transformer, similar to the RoPE embeds, because it does not make sense to create them for the 5B model (currently they are created and saved to the module with a call to …)
    use_real=True,
)

freqs_cos = freqs_cos.to(device=device, dtype=dtype)
If we're going to undo the upcast change in apply_rotary_emb, I think it is better to not downcast freqs_cos and freqs_sin here either, i.e. change this to freqs_cos = freqs_cos.to(device=device). Otherwise the rotary embedding is created in float32, downcast to bfloat16 here, and then upcast back to float32 when applied; this cannot be good, even if it does not have a noticeable impact. Can you test it out?
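The round-trip the reviewer describes is easy to demonstrate in isolation: a float32 table downcast to bfloat16 and upcast back does not recover the original values, so keeping the cos/sin tables in float32 end to end avoids a needless precision hit. A small standalone check (the variable names are illustrative):

```python
import torch

# Stand-in for a rotary embedding table built in float32
freqs = torch.linspace(0, 1, 1000)

# The round-trip under discussion: downcast to bfloat16, upcast when applied
round_trip = freqs.to(torch.bfloat16).float()

max_err = (freqs - round_trip).abs().max()
print(max_err.item() > 0)  # True
```

bfloat16 keeps only 8 bits of mantissa, so most of these values cannot be represented exactly, and the lost bits are not restored by the later upcast.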
Ah sorry, I overlooked this. Testing
Also just noticed that apply_rotary_emb moves the embeddings to the correct device, so we can just remove these two lines entirely.
* draft of embedding --------- Co-authored-by: Aryan <[email protected]>
What does this PR do?
This is the draft PR for CogVideoX-5B (CogVideoX-Pro), including:
This is still a draft and needs more adaptation to run.
@a-r-r-o-w