Learn TI Embedding and LoRA both at the same time #635
Comments
Related: https://huggingface.co/blog/dreambooth#epilogue-textual-inversion--dreambooth (last chapter)
At the very least, it would be nice to add TI loading in train_network.py, so that a TI could be trained first and a UNet LoRA trained afterwards.
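A minimal sketch of what "TI loading" could mean at the tensor level (this is an illustration, not the train_network.py API; the function name and toy sizes are hypothetical): a learned Textual Inversion vector is appended as a new row of the text encoder's token-embedding table, and the returned token id can then be referenced in captions while the LoRA trains.

```python
import torch
import torch.nn as nn

def load_ti_embedding(embedding_layer: nn.Embedding, ti_vector: torch.Tensor) -> int:
    """Append one learned TI vector as a new token row; return its token id."""
    old_weight = embedding_layer.weight.data
    new_weight = torch.cat([old_weight, ti_vector.unsqueeze(0)], dim=0)
    embedding_layer.weight = nn.Parameter(new_weight)
    embedding_layer.num_embeddings += 1
    return new_weight.shape[0] - 1

# Toy usage: a 10-token vocabulary with 4-dim embeddings.
emb = nn.Embedding(10, 4)
vec = torch.randn(4)
token_id = load_ti_embedding(emb, vec)
print(token_id)  # -> 10
```

In a real trainer the new token would also be registered with the tokenizer, and the embedding row frozen (or not) depending on whether the TI should keep training alongside the LoRA.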
HCP Diffusion supports this, but I have not yet been able to actually get it to work. I have seen others using it, however.

I have been thinking about this approach a lot as well, because I don't think the current method is that good. If you just train the text encoder, you can get decent results. If you train both the text encoder and the UNet, the results are better, but if you try to disable the UNet part of it, the results are really poor. This indicates that the text encoder is not being fully taken advantage of.

I have two big motivations for looking for a better approach. First, I think better exploiting the existing capabilities of the base model will lead to better flexibility of the resulting LoRA (certain prompts, like a specific pose, that work fine without the LoRA can become unreliable or completely break with the LoRA). What I would really like to see, however, is better composability with other LoRAs and base models. With normal LoRA training, the entire text encoder is affected instead of just the trigger tag we are trying to add. When I tested how much other tags in the text encoder were affected, I saw numbers around 20-40% compared to the main trigger tag. I haven't messed with dropout or anything like that, but for completely unrelated tokens to be so affected was quite surprising to me.

In order to actually have "trigger words", I do think training the TI and UNet together will be necessary, to create a link between the tag and the UNet doing something different. But pretraining the TI could still be useful, and it would be a nice first step. I train using anime screenshots as a base, and I wonder if you could train a base style TI to reduce the influence of the common style of the training images.
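One way the "20-40% compared to the main trigger tag" measurement above could be computed (a sketch under my own assumptions, not the commenter's actual script): compare each token's embedding before and after training and express its movement relative to the trigger token's movement.

```python
import torch

def relative_drift(before: torch.Tensor, after: torch.Tensor, trigger_id: int) -> torch.Tensor:
    """Per-token embedding movement, normalized by the trigger token's movement."""
    delta = (after - before).norm(dim=1)              # L2 movement of each token
    return delta / delta[trigger_id].clamp_min(1e-8)  # ratio vs. the trigger token

# Toy example: 5 tokens, 3-dim embeddings.
before = torch.zeros(5, 3)
after = before.clone()
after[0] += torch.tensor([1.0, 0.0, 0.0])  # trigger token moves by 1.0
after[3] += torch.tensor([0.0, 0.3, 0.0])  # an "unrelated" token moves by 0.3
ratios = relative_drift(before, after, trigger_id=0)
print(ratios)  # token 3 shows 30% of the trigger's movement
```

A ratio of 0.2-0.4 on tokens that never appear in the captions would match the surprising drift described above.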
This paper – https://omriavrahami.com/the-chosen-one/ – features training two textual inversion embeddings for SDXL along with a LoRA simultaneously:
I'm reading up on how these models work and I still have only a superficial understanding, but I noticed this section in the original LoRA paper:
https://arxiv.org/abs/2106.09685
Isn't this "prefix-embedding tuning" the same as textual inversion?
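The two ideas are close but not identical, as this toy contrast sketches (my reading of the LoRA paper's terminology, not anything from this repo): prefix-embedding tuning prepends trainable vectors to every prompt's embedding sequence, while textual inversion optimizes the embedding row of one specific (new) token inside the sequence.

```python
import torch

seq = torch.randn(7, 4)  # token embeddings for a 7-token prompt, dim 4

# Prefix-embedding tuning: prepend trainable vectors, lengthening the input.
prefix = torch.zeros(2, 4, requires_grad=True)
prefix_tuned = torch.cat([prefix, seq], dim=0)
print(prefix_tuned.shape)  # -> torch.Size([9, 4])

# Textual inversion: optimize the embedding of one token in place of token 0.
ti_token = torch.zeros(4, requires_grad=True)
seq_ti = torch.cat([ti_token.unsqueeze(0), seq[1:]], dim=0)
print(seq_ti.shape)  # -> torch.Size([7, 4])
```

In both cases only the small new vectors receive gradients; the difference is whether they occupy extra positions or replace a real token that can be typed in a prompt.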
I'll clean up my code and PR it. It doesn't train both at once, but it loads a TI into the LoRA trainer and works quite well.
I've been messing with Poiuytrezay1's PR, and in my experience the TI overfits on style quite quickly, so you probably want to train them separately anyway.
I have an idea that I didn't have time to try.
A learning rate of 1.0 should be high. We don't care if the embedding breaks as-is.
I just used their other PR, which ports cloneofsimo's code to normalize during training (#993). A norm of 1 is probably already too high. IIRC the PTI authors found the embedding works best if it stays at least somewhat close to real token embeddings. In this case that means initializing with an existing token (init_word) and keeping the norm close to 0.4.
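The norm constraint described above can be sketched as a single in-place step applied after each optimizer update (a minimal illustration, assuming the 0.4 target cited above; the function name and 768-dim toy vector are hypothetical, not the PR's actual code):

```python
import torch

@torch.no_grad()
def renormalize_(embedding: torch.Tensor, target_norm: float = 0.4) -> None:
    """Rescale a trained embedding vector in place to a fixed L2 norm."""
    embedding.mul_(target_norm / embedding.norm().clamp_min(1e-8))

vec = torch.randn(768) * 5.0  # embedding that drifted to a large norm
renormalize_(vec)
print(round(vec.norm().item(), 3))  # -> 0.4
```

Because only the magnitude is rescaled, the embedding's direction (what it "means") is preserved while it is kept in the range of real token embeddings.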
I thought the normalization during training would compromise its speed.
Normalizing after training is not going to suddenly un-overfit it.
It will "disable" the embedding, as if it hadn't been trained at all. Which is what I want to try for LoRA instead of training CLIP or using a trigger word.
Is it possible to train a LoRA together with an embedding? Here are some thoughts that led to this, when training a LoRA for an object: we want it to learn `sks`, but not to learn `photo` and `forest` along with it. What do you think?

Otherwise, I'm not quite sure how to train a LoRA on something that is neither a character nor a style. For example, to train a LoRA for the "scar" concept: what captions should we choose? Should we say "sks over eye, 1boy, …"? If so, isn't it more logical to say "scar over eye, 1boy, …" directly? But then, how can we be sure that only the concept of "scar" is changed, and not the concept of "1boy"?