-
I'll try and describe the basics assuming the default CompVis version, as it's a bit simpler than a repo like this (which implements longer prompts and other stuff).

The string prompt first gets converted into tokens; each token can represent some number of characters. Many tokens represent a full word, and there are around 49k possible tokens. If a word or part of a string can't be made from one token, it gets built by combining several. 77 of these tokens can be fed into the model at any one time, and that's what sets the limit on the length of prompts.

The tokens start off as just integer ids, e.g. 12871 represents "emails " and 47635 represents "cope". They are then converted into a different representation: each token id becomes a vector of 768 numbers, which is the token embedding. So now you have an array of 77 * 768 numbers. Next a positional embedding is added to that array; each position in the sequence has its own embedding value. The result of that is fed through a transformer. At this point you still have an array of shape (77, 768), and this is the 'cond' that gets returned from model.get_learned_conditioning().

This is now ready to be fed into the model, but it's at this point that the emphasis is performed. If you had put brackets around the word "cope" in your prompt, the code would look through the list of token ids that were created to find the position 47635 is at. That position in the cond is then multiplied by 1.1 ** num_brackets, so if it was ((cope)), that part of the array gets multiplied by 1.21. For [] it's a divide.

So the basic idea is that the part of the embedding that represents your emphasised word gets multiplied to make it larger. Sorry if I didn't explain it too well, but it's not something I can explain in extremely simple terms. Hope this helps. If you want to see some actual code for implementing this feature on top of the basic CompVis version, check my commit here:
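To make that last multiplication step concrete, here's a minimal numpy sketch (toy data only, not the actual CompVis or webui code; `apply_emphasis` and the fake cond/token ids are made up for illustration):

```python
import numpy as np

def apply_emphasis(cond: np.ndarray, token_ids, emphasised_id: int, num_brackets: int) -> np.ndarray:
    """Scale the rows of `cond` belonging to the emphasised token by 1.1 ** num_brackets."""
    cond = cond.copy()
    scale = 1.1 ** num_brackets              # ((cope)) -> 2 brackets -> 1.21; a negative count models []
    for pos, tid in enumerate(token_ids):
        if tid == emphasised_id:
            cond[pos] *= scale               # multiply that (768,) slice of the (77, 768) cond
    return cond

# Toy usage: fake cond and token ids, emphasising token 47635 ("cope") with (( ))
cond = np.random.randn(77, 768).astype(np.float32)
token_ids = [49406, 12871, 47635] + [49407] * 74   # start token, "emails ", "cope", then padding
boosted = apply_emphasis(cond, token_ids, emphasised_id=47635, num_brackets=2)
```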
-
For the implementation used in AUTOMATIC1111, I have found that the process_text function in modules/sd_hijack.py does a similar process to what @zwishenzug described, with help from the parse_prompt_attention function in modules/prompt_parser.py.
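As a rough idea of what a parse_prompt_attention-style pass produces (this is a simplified sketch, not the real modules/prompt_parser.py, which also handles things like explicit (word:1.5) weights and escaped brackets), the prompt gets split into (text, weight) chunks:

```python
def parse_attention(prompt: str, step: float = 1.1):
    """Split a prompt into (text, weight) chunks: each '(' level multiplies
    the weight by `step`, each '[' level divides it."""
    chunks, buffer, weight = [], "", 1.0
    for ch in prompt:
        if ch in "()[]":
            if buffer:
                chunks.append((buffer, weight))
                buffer = ""
            if ch == "(":
                weight *= step
            elif ch == ")":
                weight /= step
            elif ch == "[":
                weight /= step
            else:  # "]"
                weight *= step
        else:
            buffer += ch
    if buffer:
        chunks.append((buffer, weight))
    return chunks

print(parse_attention("a ((cope)) photo, [blurry]"))
# [('a ', 1.0), ('cope', 1.21...), (' photo, ', 1.0), ('blurry', 0.909...)]
```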
-
For the super simple explanation - each word has an associated vector with 768 values, and that vector 'points' in the direction of the concept (in a 768-dimensional space). If you scale that vector, the concept gets stronger or weaker.
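A toy illustration of that intuition (random numbers standing in for a real embedding): scaling changes the vector's length but not the direction it points in.

```python
import numpy as np

vec = np.random.randn(768)     # stand-in for one token's 768-value embedding
boosted = vec * 1.21           # ((word)): stronger pull toward the concept
weakened = vec / 1.21          # [[word]]: weaker pull

# Same direction, different magnitude: cosine similarity stays ~1.0
cos = vec @ boosted / (np.linalg.norm(vec) * np.linalg.norm(boosted))
print(cos, np.linalg.norm(vec), np.linalg.norm(boosted))
```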
-
I would like to understand how the extra ( ) and [ ] characters in a prompt actually interact with the model to achieve extra attention/emphasis. I had thought that the model only took a "simple" string prompt as the input. If I was to try and implement the attention/emphasis, would it be as simple as converting the "simple" string prompt to an array of strings with weights somehow? Must one go deeper into the model to achieve this? Any information on this would be greatly appreciated.