CogVideoX-5B Model adapter change #9203

Merged: 24 commits merged into huggingface:main from the cogvideox-5b branch on Aug 23, 2024

Conversation

zRzRzRzRzRzRzR (Contributor):

What does this PR do?

This is the draft PR for CogVideoX-5B (CogVideoX-Pro), including:

  1. using get_3d_rotary_pos_embed
  2. convert_transformer needs to remove some parameters
  3. using CogVideoXAttnProcessor2_0 with image_rotary_emb

This is still a draft and needs more adaptation before it can run.
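For context, a minimal sketch of how a pipeline might call the new helper; the argument names follow get_3d_rotary_pos_embed as referenced in this PR, but the concrete values are illustrative assumptions (a 480x720, 49-frame generation with patch size 2), not code from the diff:

from diffusers.models.embeddings import get_3d_rotary_pos_embed

# 3D RoPE over the video latent grid: returns (cos, sin) frequency tables
# covering the temporal, height, and width axes when use_real=True
freqs_cos, freqs_sin = get_3d_rotary_pos_embed(
    embed_dim=64,                     # attention head dim (assumed value)
    crops_coords=((0, 0), (30, 45)),  # crop region in patch coordinates (assumed)
    grid_size=(30, 45),               # latent height x width in patches (assumed)
    temporal_size=13,                 # number of latent frames (assumed)
    use_real=True,
)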

@a-r-r-o-w

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@a-r-r-o-w requested a review from yiyixuxu August 22, 2024 17:07
@yiyixuxu (Collaborator) left a comment:

thanks!
I left some comments, but the code changes look good to me to merge if we are in a hurry (we can do some follow-up refactoring)

dim_w = embed_dim // 8 * 3

# Temporal frequencies
freqs_t = 1.0 / (theta ** (torch.arange(0, dim_t, 2).float() / dim_t))
Collaborator:

let's refactor with get_1d_rotary_pos_embed (def get_1d_rotary_pos_embed) in a follow-up PR
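To make the suggestion concrete, a self-contained sketch of what a get_1d_rotary_pos_embed-style helper computes; a simplified illustration with assumed names, not diffusers' actual implementation:

import torch

def rotary_1d(dim: int, positions: torch.Tensor, theta: float = 10000.0):
    # inverse frequencies: the same computation as the inlined freqs_t line above
    inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    freqs = torch.outer(positions.float(), inv_freq)  # (num_positions, dim // 2)
    return freqs.cos(), freqs.sin()

# temporal axis, e.g. dim_t = 16 with 13 latent frames (illustrative values)
freqs_t_cos, freqs_t_sin = rotary_1d(dim=16, positions=torch.arange(13))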

@@ -532,7 +617,10 @@ def apply_rotary_emb(
         else:
             raise ValueError(f"`use_real_unbind_dim={use_real_unbind_dim}` but should be -1 or -2.")

-        out = (x.float() * cos + x_rotated.float() * sin).to(x.dtype)
+        if upcast:
Collaborator:

ok, but I'm wondering why upcasting would make CogVideoX unable to generate videos
maybe we can test not downcasting the rotary embedding instead (all our other pipelines keep it in float32)

Member:

I mainly added this to numerically match the output of apply_rotary_emb with the original implementation. It seems that even if we compute in float32, the results are great and the small diff does not impact quality, now that other bugs have been resolved. We can undo this change.
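For readers following along, a minimal sketch of the rotation being discussed, assuming the interleaved real/imag layout (the use_real_unbind_dim=-1 branch); a simplified stand-in, not the exact diffusers function:

import torch

def apply_rotary_emb_sketch(x, cos, sin, upcast=True):
    # x: (batch, heads, seq, dim); cos/sin: (seq, dim)
    x_real, x_imag = x.reshape(*x.shape[:-1], -1, 2).unbind(-1)
    x_rotated = torch.stack([-x_imag, x_real], dim=-1).flatten(3)
    if upcast:
        # rotate in float32, then cast back to the input dtype (the flag under debate)
        return (x.float() * cos + x_rotated.float() * sin).to(x.dtype)
    return x * cos + x_rotated * sin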

@a-r-r-o-w (Member):

@yiyixuxu @zRzRzRzRzRzRzR Would it be okay to remove the following limit in the follow-up PR?

if num_frames > 49:
    raise ValueError(
        "The number of frames must be less than 49 for now due to static positional embeddings. This will be updated in the future to remove this limitation."
    )

I tested the 5B model with generations of 57, 65, and 73 frames and they all turned out well - maybe the RoPE embeddings help the model generalize better. For the 2B model, the outputs are bad at those frame counts, probably due to the limitations of normal positional embeddings. We could add a recommendation in the docs mentioning that 49 frames and below is the good setting for 2B.
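If the limit is removed, usage would look roughly like this hypothetical example (CogVideoXPipeline and the THUDM/CogVideoX-5b checkpoint are the public names; the prompt is made up, and num_frames=73 comes from the tests above):

import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# 73 frames: above the current 49-frame cap, reported to work well for 5B
video = pipe(prompt="a panda playing a guitar in a bamboo forest", num_frames=73).frames[0]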

In the refactor, I'd also like to create the normal positional embeddings in the pipeline instead of the transformer, similar to the RoPE embeds, because it does not make sense to create them for the 5B model (currently they are created and saved to the module with a call to register_buffer regardless of whether the model is 2B or 5B).
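As a toy illustration of the pattern being criticized (names and shapes are made up, not the actual transformer module):

import torch
import torch.nn as nn

class ToyTransformer(nn.Module):
    def __init__(self, num_patches: int, inner_dim: int, use_rotary: bool):
        super().__init__()
        self.use_rotary = use_rotary
        # registered unconditionally, even though the RoPE (5B) path never
        # reads it; creating it in the pipeline instead, like the rotary
        # embeds, would avoid carrying this unused buffer around
        pos_embedding = torch.zeros(1, num_patches, inner_dim)
        self.register_buffer("pos_embedding", pos_embedding, persistent=False)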

use_real=True,
)

freqs_cos = freqs_cos.to(device=device, dtype=dtype)
@yiyixuxu (Collaborator), Aug 23, 2024:

@a-r-r-o-w

if we're going to undo the upcast change in apply_rotary_emb, I think it is better not to downcast freqs_cos and freqs_sin here either, i.e. change this line to freqs_cos = freqs_cos.to(device=device)

otherwise the rotary embedding is created in float32, downcast to bfloat16 here, and then upcast back to float32 when applied - this cannot be good, even if it does not have a noticeable impact

can you test it out?
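The round trip described above is easy to check in isolation; a standalone snippet (values are arbitrary):

import torch

freqs = torch.linspace(0, 1, 64, dtype=torch.float32)
round_trip = freqs.to(torch.bfloat16).float()  # downcast here, upcast when applying
print((freqs - round_trip).abs().max())  # nonzero: the bfloat16 hop loses precision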

Member:

Ah sorry, I overlooked this. Testing

Member:

Also just noticed that apply_rotary_emb moves the embeddings to the correct device, so we can just remove these 2 lines entirely

@yiyixuxu merged commit 960c149 into huggingface:main Aug 23, 2024
14 of 15 checks passed
yiyixuxu pushed a commit that referenced this pull request Aug 24, 2024
* draft of embedding

---------

Co-authored-by: Aryan <[email protected]>
sayakpaul pushed a commit that referenced this pull request Dec 23, 2024
* draft of embedding

---------

Co-authored-by: Aryan <[email protected]>
@zRzRzRzRzRzRzR deleted the cogvideox-5b branch January 14, 2025 06:47