Use CLIP as text encoder #8

Open
Espere-1119-Song opened this issue Mar 27, 2024 · 2 comments

Comments

@Espere-1119-Song

Thanks for your great contribution to the community.

I found that the paper includes an experiment using CLIP as the text encoder, but I couldn't find the corresponding code. Will you release the CLIP version of the code? I also wonder how to handle the linear layers of the attention layers in the CLIP text encoder, since they appear to be NonDynamicallyQuantizableLinear rather than ordinary nn.Linear.
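A minimal sketch of what I am seeing (assuming the attention is built on torch.nn.MultiheadAttention, as in the original OpenAI CLIP implementation):

```python
import torch.nn as nn

# torch.nn.MultiheadAttention exposes its output projection as
# NonDynamicallyQuantizableLinear rather than a plain nn.Linear.
mha = nn.MultiheadAttention(embed_dim=768, num_heads=12)
print(type(mha.out_proj))
# <class 'torch.nn.modules.linear.NonDynamicallyQuantizableLinear'>

# It is a thin subclass of nn.Linear, so isinstance checks still pass:
print(isinstance(mha.out_proj, nn.Linear))  # True
```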

@ShihaoZhaoZSH
Owner

Thank you for your interest in our LaVi-Bridge! We will schedule the release of the code for the CLIP text encoder. In the meantime, you can refer to test/t5_unet.py. The main difference is switching the text encoder from transformers.T5EncoderModel and AutoTokenizer to transformers.CLIPTextModel and CLIPTokenizer. The pre-trained model is the "CompVis/stable-diffusion-v1-4" repository on Hugging Face. Additionally, you can refer to the standard Stable Diffusion 1.4 pipeline, which also uses CLIP as the language model.
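A minimal sketch of that swap (not the repository's code; it assumes the standard Hugging Face layout of the CompVis/stable-diffusion-v1-4 repo, where the tokenizer and text encoder live in the tokenizer and text_encoder subfolders):

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# Load CLIP's tokenizer and text encoder from the Stable Diffusion 1.4 repo,
# in place of the T5EncoderModel/AutoTokenizer used in test/t5_unet.py.
pretrained = "CompVis/stable-diffusion-v1-4"
tokenizer = CLIPTokenizer.from_pretrained(pretrained, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(pretrained, subfolder="text_encoder")

prompt = "a corgi wearing a red bow tie"  # example prompt
tokens = tokenizer(
    prompt,
    padding="max_length",
    max_length=tokenizer.model_max_length,  # 77 tokens for CLIP
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    # Per-token text embeddings used for conditioning.
    text_embeddings = text_encoder(tokens.input_ids).last_hidden_state
print(text_embeddings.shape)  # torch.Size([1, 77, 768])

# Note: the attention projections inside this CLIPTextModel (q_proj, k_proj,
# v_proj, out_proj) are ordinary nn.Linear modules, so the
# NonDynamicallyQuantizableLinear issue from torch.nn.MultiheadAttention does
# not arise here.
print(type(text_encoder.text_model.encoder.layers[0].self_attn.out_proj))
# <class 'torch.nn.modules.linear.Linear'>
```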

@Espere-1119-Song
Author

Thanks a lot for your help! I will follow the instructions you provided, and I really look forward to the release of the CLIP text encoder version :)
