Clarification Needed: `TextImageProjection` Function #7725

oooolga · 2024-04-19T15:12:09Z

Hello everyone,

I am currently exploring the Hugging Face Diffusers library and came across the `TextImageProjection` function (from `diffusers.models.embeddings`). I'm having difficulty understanding the method it uses, in particular how it projects text embeddings together with image embeddings.

From what I gather, the function uses a concatenation method to combine text and image embeddings. I'm curious about the details of this process:

- What are the expected input dimensions of this function?
- What is the reasoning behind calling this combination of text and image data into one representation (feature fusion) a projection? Or have I misunderstood the purpose of the function?

I would greatly appreciate it if someone could provide a detailed explanation or point me toward any resources or documentation that might help clarify these aspects.

Thank you in advance for your help!

tolgacangoz · 2024-04-20T15:31:43Z

Hello @oooolga!
`TextImageProjection` is used by Kandinsky 2.1. That model conditions not only on the text embedding but also on an image embedding that is itself predicted from the text embedding, hence the concatenation. The authors argue that this is one of the reasons for the better quality compared with Kandinsky 2.0. `TextImageProjection` expects `text_embed_dim` to match the output of MultilingualCLIP for the prompt, and `image_embed_dim` to match the output of Diffusion Mapping, which in turn takes that MultilingualCLIP output as its input. Diffusion Mapping in the image is `KandinskyPriorPipeline` in diffusers. See their blog post.
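To make the shapes concrete, here is a minimal, dependency-free sketch of the idea as I read it. It is not the diffusers source: the real module is a PyTorch `nn.Module` with learned linear layers (including biases, which are left out here). The helper names `linear`, `random_matrix`, and `text_image_projection`, the weight arguments, and the toy sizes are all assumptions chosen for illustration. The text input is taken to be `[batch][seq_len][text_embed_dim]` and the image input `[batch][image_embed_dim]`; both are projected to a shared width (`cross_attention_dim`) before being concatenated along the sequence axis.

```python
import random


def linear(vec, weight):
    """Bias-free linear map; weight is a list of rows with shape (out_dim, in_dim)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def text_image_projection(text_embeds, image_embeds, w_text, w_image, num_image_text_embeds):
    """Project both inputs to a common width, then concatenate along the sequence axis.

    text_embeds:  [batch][seq_len][text_embed_dim]
    image_embeds: [batch][image_embed_dim]
    returns:      [batch][num_image_text_embeds + seq_len][cross_attention_dim]
    """
    out = []
    for text_seq, image_vec in zip(text_embeds, image_embeds):
        # Image branch: one vector is expanded into several tokens of the common width.
        flat = linear(image_vec, w_image)
        cross_dim = len(flat) // num_image_text_embeds
        image_tokens = [flat[i * cross_dim:(i + 1) * cross_dim] for i in range(num_image_text_embeds)]
        # Text branch: every text token is projected to the same common width.
        text_tokens = [linear(token, w_text) for token in text_seq]
        # Concatenation only works because both branches now share cross_dim.
        out.append(image_tokens + text_tokens)
    return out


# Toy sizes (assumptions, far smaller than in any real model).
batch, seq_len = 2, 5
text_embed_dim, image_embed_dim = 8, 6
cross_attention_dim, num_image_text_embeds = 4, 3

rng = random.Random(0)
w_text = random_matrix(cross_attention_dim, text_embed_dim, rng)
w_image = random_matrix(num_image_text_embeds * cross_attention_dim, image_embed_dim, rng)

text_embeds = [[[rng.uniform(-1, 1) for _ in range(text_embed_dim)] for _ in range(seq_len)] for _ in range(batch)]
image_embeds = [[rng.uniform(-1, 1) for _ in range(image_embed_dim)] for _ in range(batch)]

out = text_image_projection(text_embeds, image_embeds, w_text, w_image, num_image_text_embeds)
shape = (len(out), len(out[0]), len(out[0][0]))
print(shape)  # (batch, num_image_text_embeds + seq_len, cross_attention_dim)
```

So the "projection" part is the two linear maps into a shared width, and the concatenation is what comes after it: the image vector contributes a few extra tokens next to the projected text tokens.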

Is this answer satisfactory, or do you need further clarification?

1 reply

tolgacangoz Apr 20, 2024

Or are you asking why we don't concatenate them directly?
