Training a new ControlNet with Img2Img active during training, or training other image variation models #6676
mayhemsloth asked this question in Q&A (Unanswered · 1 comment, 3 replies)
Hello everyone! This is going to be a long post, so I appreciate any response and the time and attention you spend on it.
Tl;dr: I want to train an image variation model that is guided by information in a conditional image instead of a conditional text prompt.
Background and Context: My overall goal is to produce a generative image model that, during inference, takes in (1) a starting image and (2) conditional information in the form of an image, and then outputs a new image (which I'll call the target image) that looks very similar to the starting image but has been changed by the conditional information, injected in the "correct" way taught during training.
I have arrived at the following plan to accomplish this goal. I plan to train a ControlNet from scratch, using a custom dataset that I will prepare, somewhat similar to the community training example of the circle filling dataset. Initially, for computational reasons, I will use SDv1.5, but will very likely want to migrate to SDXL after proving that it can be done well with the 512 x 512 resolution (64 x 64 x 4 latents) of SDv1.5.
However, I want to utilize a starting image, and not necessarily the text prompt, to heavily influence the output of the final trained model, and so my plan is to use img2img as the initial "starting point" for the latents, instead of true noise, during inference.
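As I understand it, that img2img "starting point" amounts to something like the sketch below (simplified, not the exact diffusers code; `starting_image` is assumed to be an already-preprocessed tensor):

```python
import torch

# Simplified sketch of how an img2img pipeline prepares its initial latents.
# Assumes `vae`, `noise_scheduler`, and a preprocessed `starting_image` tensor
# (shape [B, 3, H, W], values in [-1, 1]) already exist.
init_latents = vae.encode(starting_image).latent_dist.sample()
init_latents = init_latents * vae.config.scaling_factor

# `strength` controls how much noise is added: 1.0 starts from (almost) pure
# noise and ignores the starting image, small values stay close to it.
strength = 0.5
num_inference_steps = 50
noise_scheduler.set_timesteps(num_inference_steps)
init_timestep = int(num_inference_steps * strength)
timesteps = noise_scheduler.timesteps[num_inference_steps - init_timestep:]

# Noise the starting-image latents up to the first timestep we will denoise from.
noise = torch.randn_like(init_latents)
latents = noise_scheduler.add_noise(init_latents, noise, timesteps[:1])
# ...the usual denoising loop then runs over `timesteps`,
# with the ControlNet residuals fed into the UNet at every step.
```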
Problem: After tracing some of the `train_controlnet.py` code from the example above, the main denoising training loop starts here. The information in `"pixel_values"` gets encoded by the VAE to produce latents. Note that this information is the target image, and the purpose of the network is to predict the noise injected into these latents a few lines down, based on information from the `controlnet` output and the conditioning of the `encoder_hidden_states`, which is the text prompt embedding.

Remember that what I want is to have a starting image that heavily influences the output; the target image would be slightly different from the starting image, and the information needed to bridge that gap is contained primarily in the conditional information (in image form), and much less so in the text prompt. There seems to be a pipeline designed for img2img with a ControlNet, but that's only for inference, not for training. I searched the diffusers issues and found this question, but the answer didn't explain very well why this doesn't work.
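For reference, here is roughly what the core of that training step in `train_controlnet.py` does (paraphrased and heavily simplified; dtype/device handling, v-prediction, and gradient accumulation are omitted):

```python
import torch
import torch.nn.functional as F

# Paraphrased core of the existing training step. `vae`, `text_encoder`,
# `controlnet`, `unet`, and `noise_scheduler` come from the script.
latents = vae.encode(batch["pixel_values"]).latent_dist.sample()
latents = latents * vae.config.scaling_factor

# Corrupt the *target* image latents with noise at a random timestep.
noise = torch.randn_like(latents)
timesteps = torch.randint(
    0, noise_scheduler.config.num_train_timesteps,
    (latents.shape[0],), device=latents.device,
).long()
noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

# Text conditioning and the conditioning image for the ControlNet.
encoder_hidden_states = text_encoder(batch["input_ids"])[0]
controlnet_cond = batch["conditioning_pixel_values"]

down_res, mid_res = controlnet(
    noisy_latents, timesteps,
    encoder_hidden_states=encoder_hidden_states,
    controlnet_cond=controlnet_cond,
    return_dict=False,
)
model_pred = unet(
    noisy_latents, timesteps,
    encoder_hidden_states=encoder_hidden_states,
    down_block_additional_residuals=down_res,
    mid_block_additional_residual=mid_res,
).sample

# With epsilon prediction, the model learns to recover the injected noise.
loss = F.mse_loss(model_pred.float(), noise.float(), reduction="mean")
```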
After looking into the forward pass of that pipeline, it looks like the initial latents from the starting image are prepared here, after the starting image is sent through the VAE here. Then in the denoising loop, the latents get pushed through the `controlnet` and the `unet`, and then the predicted noise is subtracted from the latents here.

I want to change the code in `train_controlnet.py` such that it also accepts a `"starting_image"` and behaves similarly to the img2img pipeline. However, I feel like this might not work the way I want due to the math behind why diffusion works in the first place (corrupting the image data distribution into the noise distribution, and undoing that process based on text conditioning). I don't quite understand how I would have the model learn that it's supposed to find the "noise" that exists "between" the `"starting_image"` and the `"target_image"`, if that makes sense. It seems like I would have to inject some amount of noise into the `"starting_image"` to get it into the "noisy distribution", and then have the denoising process transform it (by predicting and subtracting some noise) into the `"target_image"`, conditioned on the conditional information.
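To make the question concrete, here is a rough sketch of what I imagine that modified training step could look like. The batch key `"starting_image"` and the epsilon target below are my own invention, and I am not at all sure the target is mathematically sound; it is just the most literal reading of "the noise between the starting image and the target image" I can come up with:

```python
import torch
import torch.nn.functional as F

# Hypothetical modification: the dataloader now also provides a "starting_image".
start_latents = vae.encode(batch["starting_image"]).latent_dist.sample()
start_latents = start_latents * vae.config.scaling_factor
target_latents = vae.encode(batch["pixel_values"]).latent_dist.sample()
target_latents = target_latents * vae.config.scaling_factor

noise = torch.randn_like(start_latents)
timesteps = torch.randint(
    0, noise_scheduler.config.num_train_timesteps,
    (start_latents.shape[0],), device=start_latents.device,
).long()

# Noise the *starting-image* latents instead of the target-image latents,
# using the standard forward process x_t = sqrt(a_bar)*x_0 + sqrt(1-a_bar)*eps.
alpha_bar = noise_scheduler.alphas_cumprod.to(start_latents.device)[timesteps]
alpha_bar = alpha_bar.view(-1, 1, 1, 1)
noisy_latents = alpha_bar.sqrt() * start_latents + (1.0 - alpha_bar).sqrt() * noise

# Naive "noise between start and target": the epsilon whose removal (under the
# same forward-process formula) reconstructs the *target* latents, not the start.
eps_target = (noisy_latents - alpha_bar.sqrt() * target_latents) / (1.0 - alpha_bar).sqrt()

encoder_hidden_states = text_encoder(batch["input_ids"])[0]
down_res, mid_res = controlnet(
    noisy_latents, timesteps,
    encoder_hidden_states=encoder_hidden_states,
    controlnet_cond=batch["conditioning_pixel_values"],
    return_dict=False,
)
model_pred = unet(
    noisy_latents, timesteps,
    encoder_hidden_states=encoder_hidden_states,
    down_block_additional_residuals=down_res,
    mid_block_additional_residual=mid_res,
).sample

loss = F.mse_loss(model_pred.float(), eps_target.float(), reduction="mean")
```

Whether a ControlNet/UNet trained against this target still matches the scheduler's assumptions at inference time is exactly the part I'm unsure about.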
I thought img2img might be the best approach to solve this, but in the process of typing this out I think there may be some other options:
1. Utilize the `encoder_hidden_states` to inject starting-image information that has been transformed into a text embedding. Basically unCLIP, but with a ControlNet. I don't want to do this because the image-to-text-embedding-back-to-image transformation will necessarily "compress" the image detail, and I can't afford to lose that much information about the starting image.

2. Hijack the `encoder_hidden_states` to create my own conditional image encoder and inject information there. This seems impossible without a tremendous amount of compute (which I don't have), as the `encoder_hidden_states` have been trained on a specific embedding space of text.

3. Utilize two ControlNets: the first (StartControlNet) would condition the model on the starting image, and the second (ConditionControlNet) would condition the model on the conditional information. This is straightforwardly easy for me to understand, but ultimately inelegant IMO. Because the ControlNet inputs are the same spatial size as the VAE latent, you could give the ControlNet the direct VAE latent of the starting image in addition to any number of channels from whatever preferred encoded image space you want (whichever VGG19 layer is 8x smaller than the initial input, for example). StartControlNet would thus be trained by letting the starting image become the target image: corrupt the target image (now the starting image) with noise, then figure out how to denoise it based on the conditioning information coming from StartControlNet (which has been passed the starting image, so it should be very easy!). With StartControlNet I would then have a model that can, ideally, turn random noise into whatever image I give StartControlNet. Awesome. This net would effectively be a "bias" term to force the final model to be very close to the starting image (being sent into StartControlNet).

   The next step would be to train ConditionControlNet, which goes through basically the same process as StartControlNet, but with the StartControlNet outputs also being added during the training of ConditionControlNet. I claim this is inelegant because it's essentially two training stages (feelsbadman.jpg), and I think I would have to arbitrarily fix the weighting of each ControlNet relative to the other. I guess if I unfreeze the weights of StartControlNet during the training of ConditionControlNet, then the model would learn how to weight them relative to each other overall, AND I would retain control to change the weighting later if necessary. ConditionControlNet would thus be trained by having StartControlNet take as input the starting image (encoded/embedded properly), ConditionControlNet take as input the conditional image(s) (encoded/embedded properly), and the target be the proper target image, such that the overall model has to learn how to, ideally, change random noise into the target image while being conditioned by the starting image and the conditional image via StartControlNet and ConditionControlNet, respectively. The good thing about this is that you could technically use multiple ConditionControlNets during future inference if you have multiple conditional images you want to control with.

4. Do proposed 3) above, but stack channels into exactly one ControlNet (StartConditionControlNet), such that you combine both the starting image and the conditioning image into one input, and allow the zero-conv 1x1 layers to figure out which information is important to inject when (see the rough sketch after this list). During training, some percentage of the time you could corrupt the conditioning image and set the target image to the starting image, to force the model to learn that it needs to pay attention to both pieces of information and the context between them.

5. Do something else??? I mainly want to use cross attention to directly attend to the conditional image information from the starting image as a means of getting to the target image.
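For option 4, I think the mechanical change is small: concatenate the two images along the channel dimension and build the ControlNet with a wider conditioning input. A rough sketch, assuming the installed diffusers version exposes the `conditioning_channels` argument on `ControlNetModel` (please correct me if that assumption is wrong):

```python
import torch
from diffusers import ControlNetModel, UNet2DConditionModel

# Build a ControlNet whose conditioning input has 6 channels:
# 3 for the starting image + 3 for the conditional image.
# (Assumption: the installed diffusers version supports `conditioning_channels`.)
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
controlnet = ControlNetModel.from_unet(unet, conditioning_channels=6)

# In the training loop, stack both images along the channel dimension.
starting_image = torch.randn(1, 3, 512, 512)     # placeholder tensors for illustration
conditional_image = torch.randn(1, 3, 512, 512)
controlnet_cond = torch.cat([starting_image, conditional_image], dim=1)  # [B, 6, H, W]

# Some fraction of the time, blank out the conditional image (and set the target
# to the starting image) so the model learns to use both inputs and their context.
if torch.rand(()) < 0.1:
    controlnet_cond[:, 3:] = 0.0

# controlnet(noisy_latents, timesteps, encoder_hidden_states=...,
#            controlnet_cond=controlnet_cond, return_dict=False)
```

Option 3 would be the same idea but with two separate `ControlNetModel` instances whose residuals get summed before being passed to the UNet, which (as far as I can tell) is also how MultiControlNet combines them at inference.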
If you as a reader feel that I am being too vague, it's purposeful. At the moment, I don't want to give too much away publicly about what I'm doing. Thanks for any help!
Reply:

This might be a useful resource: https://huggingface.co/lambdalabs/sd-image-variations-diffusers