-
I'll try and describe the basics assuming the default CompVis version, as it's a bit simpler than a repo like this (which implements longer prompts and other stuff).

The string prompt first gets converted into tokens; each token can represent some number of characters. Many tokens represent a full word, and there are around 49k possible tokens. If a word or part of a string can't be made from one token, it gets built by combining several. 77 of these tokens can be fed into the model at any one time, and that's what sets the limit on the length of prompts.

The tokens start off as just integer ids, e.g. 12871 represents "emails " and 47635 represents "cope". They are then converted into a different representation: each token id becomes a vector of 768 numbers, which is the token embedding. So now you have an array of 77 * 768 numbers. Next a positional embedding is added to that array; each position in the sequence has its own embedding value. The result of that is fed through a transformer. At this point you still have an array of shape (77, 768), and this is the 'cond' that gets returned from model.get_learned_conditioning().

This is now ready to be fed into the model, but it's at this point that the emphasis is performed. If you had put brackets around the word "cope" in your prompt, the code would look through the list of token ids that were created to find the position 47635 is at. That position in the cond is then multiplied by 1.1 ** num_brackets, so if it was ((cope)), that part of the array gets multiplied by 1.21. For [] it's a divide.

So the basic idea is that the part of the embedding that represents your emphasised word gets multiplied to make it larger. Sorry if I didn't explain it too well, but it's not something I can explain in extremely simple terms. Hope this helps. If you want to see some actual code for implementing this feature on top of the basic CompVis version, check my commit here:
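To make that last multiplication step concrete, here's a minimal numpy sketch (toy data only, not the actual CompVis or webui code; `apply_emphasis` and the fake cond/token ids are made up for illustration):

```python
import numpy as np

def apply_emphasis(cond: np.ndarray, token_ids, emphasised_id: int, num_brackets: int) -> np.ndarray:
    """Scale the rows of `cond` belonging to the emphasised token by 1.1 ** num_brackets."""
    cond = cond.copy()
    scale = 1.1 ** num_brackets              # ((cope)) -> 2 brackets -> 1.21; a negative count models []
    for pos, tid in enumerate(token_ids):
        if tid == emphasised_id:
            cond[pos] *= scale               # multiply that (768,) slice of the (77, 768) cond
    return cond

# Toy usage: fake cond and token ids, emphasising token 47635 ("cope") with (( ))
cond = np.random.randn(77, 768).astype(np.float32)
token_ids = [49406, 12871, 47635] + [49407] * 74   # start token, "emails ", "cope", then padding
boosted = apply_emphasis(cond, token_ids, emphasised_id=47635, num_brackets=2)
```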
-
For the implementation used in AUTOMATIC1111, I have found that the process_text function in modules/sd_hijack.py does a similar process to what @zwishenzug described, with help from the parse_prompt_attention function in modules/prompt_parser.py.
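As a rough idea of what a parse_prompt_attention-style pass produces (this is a simplified sketch, not the real modules/prompt_parser.py, which also handles things like explicit (word:1.5) weights and escaped brackets), the prompt gets split into (text, weight) chunks:

```python
def parse_attention(prompt: str, step: float = 1.1):
    """Split a prompt into (text, weight) chunks: each '(' level multiplies
    the weight by `step`, each '[' level divides it."""
    chunks, buffer, weight = [], "", 1.0
    for ch in prompt:
        if ch in "()[]":
            if buffer:
                chunks.append((buffer, weight))
                buffer = ""
            if ch == "(":
                weight *= step
            elif ch == ")":
                weight /= step
            elif ch == "[":
                weight /= step
            else:  # "]"
                weight *= step
        else:
            buffer += ch
    if buffer:
        chunks.append((buffer, weight))
    return chunks

print(parse_attention("a ((cope)) photo, [blurry]"))
# [('a ', 1.0), ('cope', 1.21...), (' photo, ', 1.0), ('blurry', 0.909...)]
```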
-
For the super simple explanation - each word has an associated vector with 768 values, and that vector 'points' in the direction of the concept (in a 768-dimensional space). If you scale that vector, the concept gets stronger or weaker.
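A toy illustration of that intuition (random numbers standing in for a real embedding): scaling changes the vector's length but not the direction it points in.

```python
import numpy as np

vec = np.random.randn(768)     # stand-in for one token's 768-value embedding
boosted = vec * 1.21           # ((word)): stronger pull toward the concept
weakened = vec / 1.21          # [[word]]: weaker pull

# Same direction, different magnitude: cosine similarity stays ~1.0
cos = vec @ boosted / (np.linalg.norm(vec) * np.linalg.norm(boosted))
print(cos, np.linalg.norm(vec), np.linalg.norm(boosted))
```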
-
I would like to understand how the extra ( ) and [ ] characters in a prompt actually interact with the model to achieve extra attention/emphasis. I had thought that the model only took a "simple" string prompt as the input. If I was to try and implement the attention/emphasis, would it be as simple as converting the "simple" string prompt to an array of strings with weights somehow? Must one go deeper into the model to achieve this? Any information on this would be greatly appreciated.