Use CLIP as text encoder #8

Open
Espere-1119-Song opened this issue Mar 27, 2024 · 2 comments

Comments

@Espere-1119-Song

Thanks for your great contribution to the community.

I found that the paper includes an experiment using CLIP as the text encoder, but I couldn't find the corresponding code. Will you release the CLIP version of the code? I also wonder how to handle the linear layers of the attention layers in the CLIP text encoder, since they appear to be NonDynamicallyQuantizableLinear rather than ordinary nn.Linear.
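A minimal sketch of what I am seeing (assuming the attention is built on torch.nn.MultiheadAttention, as in the original OpenAI CLIP implementation):

```python
import torch.nn as nn

# torch.nn.MultiheadAttention exposes its output projection as
# NonDynamicallyQuantizableLinear rather than a plain nn.Linear.
mha = nn.MultiheadAttention(embed_dim=768, num_heads=12)
print(type(mha.out_proj))
# <class 'torch.nn.modules.linear.NonDynamicallyQuantizableLinear'>

# It is a thin subclass of nn.Linear, so isinstance checks still pass:
print(isinstance(mha.out_proj, nn.Linear))  # True
```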

@ShihaoZhaoZSH
Owner

Thank you for your interest in our LaVi-Bridge! We will schedule the release of the code for the CLIP text encoder. In the meantime, you can refer to test/t5_unet.py. The main difference is switching the text encoder from transformers.T5EncoderModel and AutoTokenizer to transformers.CLIPTextModel and CLIPTokenizer. The pre-trained model is the "CompVis/stable-diffusion-v1-4" repository on Hugging Face. Additionally, you can refer to the standard Stable Diffusion 1.4 pipeline, which also uses CLIP as the language model.
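A minimal sketch of that swap (not the repository's code; it assumes the standard Hugging Face layout of the CompVis/stable-diffusion-v1-4 repo, where the tokenizer and text encoder live in the tokenizer and text_encoder subfolders):

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# Load CLIP's tokenizer and text encoder from the Stable Diffusion 1.4 repo,
# in place of the T5EncoderModel/AutoTokenizer used in test/t5_unet.py.
pretrained = "CompVis/stable-diffusion-v1-4"
tokenizer = CLIPTokenizer.from_pretrained(pretrained, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(pretrained, subfolder="text_encoder")

prompt = "a corgi wearing a red bow tie"  # example prompt
tokens = tokenizer(
    prompt,
    padding="max_length",
    max_length=tokenizer.model_max_length,  # 77 tokens for CLIP
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    # Per-token text embeddings used for conditioning.
    text_embeddings = text_encoder(tokens.input_ids).last_hidden_state
print(text_embeddings.shape)  # torch.Size([1, 77, 768])

# Note: the attention projections inside this CLIPTextModel (q_proj, k_proj,
# v_proj, out_proj) are ordinary nn.Linear modules, so the
# NonDynamicallyQuantizableLinear issue from torch.nn.MultiheadAttention does
# not arise here.
print(type(text_encoder.text_model.encoder.layers[0].self_attn.out_proj))
# <class 'torch.nn.modules.linear.Linear'>
```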

@Espere-1119-Song
Author

Thanks a lot for your help! I will follow the instructions you provided, and I really look forward to the release of the CLIP text encoder version :)
