A small neural network to provide interoperability between the latents generated by the different Stable Diffusion models.
I wanted to see if it was possible to pass latents generated by the new SDXL model directly into SDv1.5 models without decoding and re-encoding them using a VAE first.
To install it, simply clone this repo to your custom_nodes folder using the following command:
git clone https://github.com/city96/SD-Latent-Interposer custom_nodes/SD-Latent-Interposer
Alternatively, you can download the comfy_latent_interposer.py file into your ComfyUI/custom_nodes folder. You may also need to install the huggingface-hub package by running `pip install huggingface-hub` inside your venv.
If you need the model weights for something else, they are hosted on HF under the same Apache2 license as the rest of the repo. The current files are in the "v4.0" subfolder.
Simply place it where you would normally place a VAE decode followed by a VAE encode. Set the denoise as appropriate to hide any artifacts while keeping the composition. See image below.
Without the interposer, the two latent spaces are incompatible:
The node pulls the required files from the Hugging Face Hub by default. You can create a models folder and place the models there if you have a flaky connection or prefer to use it completely offline. The custom node will prefer local files over HF when available. The path should be: ComfyUI/custom_nodes/SD-Latent-Interposer/models
Alternatively, just clone the entire HF repo to it:
git clone https://huggingface.co/city96/SD-Latent-Interposer custom_nodes/SD-Latent-Interposer/models
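The "prefer local files over HF" behaviour can be sketched roughly as follows. This is a minimal illustration, not the node's actual code: the function name `resolve_weights`, the directories it searches, and the file name in the usage note are assumptions. Only the repo id and the "v4.0" subfolder come from this README.

```python
from pathlib import Path


def resolve_weights(filename, local_dir,
                    repo_id="city96/SD-Latent-Interposer", subfolder="v4.0"):
    """Return a path to the weights, preferring a local copy over the hub."""
    local_dir = Path(local_dir)
    # Check models/<filename> and models/<subfolder>/<filename>; the second
    # layout is what cloning the whole HF repo into models/ would produce.
    for candidate in (local_dir / filename, local_dir / subfolder / filename):
        if candidate.is_file():
            return str(candidate)
    # Nothing local: fall back to the hub. Imported lazily so that fully
    # offline use does not require huggingface-hub to be installed.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=repo_id, subfolder=subfolder, filename=filename)
```

Calling `resolve_weights("some-file.safetensors", "ComfyUI/custom_nodes/SD-Latent-Interposer/models")` (with a placeholder file name) would then only touch the network when no local copy exists.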
Model names:

| code | name |
|---|---|
| v1 | Stable Diffusion v1.x |
| xl | SDXL |
| v3 | Stable Diffusion 3 |
| fx | Flux.1 |
| ca | Stable Cascade (Stage A/B) |
Available models:

| From | to v1 | to xl | to v3 | to fx | to ca |
|---|---|---|---|---|---|
| v1 | - | v4.0 | v4.0 | No | No |
| xl | v4.0 | - | v4.0 | No | No |
| v3 | v4.0 | v4.0 | - | No | No |
| fx | v4.0 | v4.0 | v4.0 | - | No |
| ca | v4.0 | v4.0 | v4.0 | No | - |
The training code initializes most training parameters from the provided config file. The dataset should be a single .bin file per latent version, saved with torch.save. The format should be [batch, channels, height, width], with "batch" being as large as the dataset, i.e. 88000.
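As a rough illustration of that dataset layout, the sketch below saves and reloads one tensor. The helper `save_latent_dataset`, the file name, and the tiny sizes are all made up for the example. Random values stand in for real encoded latents, and 4 channels is what SD v1.x latents use.

```python
import os
import tempfile

import torch


def save_latent_dataset(latents, path):
    """Save one latent version as a single file, checking the expected layout."""
    if latents.dim() != 4:
        raise ValueError("expected shape [batch, channels, height, width]")
    torch.save(latents, path)


# Small stand-in sizes; a real dataset would have batch == number of samples.
num_samples, channels, height, width = 8, 4, 64, 64
latents_v1 = torch.randn(num_samples, channels, height, width)

out_dir = tempfile.mkdtemp()
path_v1 = os.path.join(out_dir, "latents_v1.bin")  # hypothetical file name
save_latent_dataset(latents_v1, path_v1)

# Reload to confirm the file round-trips with the same layout.
loaded = torch.load(path_v1)
print(tuple(loaded.shape))
```

A second file with the same batch order would be needed for the other latent version, so that index i in both files refers to the same source image.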
The training code currently initializes two copies of the model, one in the target direction and one in the opposite. The losses are defined based on this.

- `p_loss` is the main criterion for the primary model.
- `b_loss` is the main criterion for the secondary one.
- `r_loss` is the output of the primary model back through the secondary model and checked against the source latent (basically a round trip through the two models).
- `h_loss` is the same as `r_loss` but for the secondary model.
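The four losses could be computed along these lines. This is only a sketch of the idea: the single conv layers are placeholders for the real architecture, the L1 criterion and the equal weighting are assumptions, and `h_loss` is read here as the secondary model's output passed back through the primary model and compared with the target latent.

```python
import torch
import torch.nn as nn

# Placeholder networks standing in for the real interposer models.
primary = nn.Conv2d(4, 4, kernel_size=3, padding=1)    # source -> target
secondary = nn.Conv2d(4, 4, kernel_size=3, padding=1)  # target -> source
criterion = nn.L1Loss()  # assumption: the actual criterion is not stated here


def interposer_losses(src, dst):
    """Compute the four losses for one batch of paired latents."""
    pred_dst = primary(src)    # source latent mapped into the target space
    pred_src = secondary(dst)  # target latent mapped into the source space
    return {
        "p_loss": criterion(pred_dst, dst),
        "b_loss": criterion(pred_src, src),
        # Round trip: source -> target -> source.
        "r_loss": criterion(secondary(pred_dst), src),
        # Round trip: target -> source -> target.
        "h_loss": criterion(primary(pred_src), dst),
    }


# Random stand-ins for a batch of paired latents of the same images.
src = torch.randn(2, 4, 32, 32)
dst = torch.randn(2, 4, 32, 32)
losses = interposer_losses(src, dst)
total = sum(losses.values())  # equal weighting is an assumption
total.backward()
```

Because the round-trip terms pass through both networks, each model receives gradients from the other direction's loss as well as its own.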
All models were trained for 50000 steps with either batch size 128 (xl/v1) or 48 (cascade). The training was done locally on an RTX 3080 and a Tesla V100S.
Interposer v3.1
This is basically a complete rewrite. Replaced the mediocre bunch of conv2d layers with something that looks more like a proper neural network. No VGG loss because I still don't have a better GPU.
Training was done on combined Flickr2K + DIV2K, with each image being processed into six 1024x1024 segments. Padded with some of my random images for a total of 22,000 source images in the dataset.
I think I got rid of most of the XL artifacts, but the color/hue/saturation shift issues are still there. I actually saved the optimizer state this time so I might be able to do 100K steps with visual loss on my P40s. Hopefully they won't burn up.
v3.0 was 500k steps at a constant LR of 1e-4, v3.1 was 1M steps using a CosineAnnealingLR to drop the learning rate towards the end. Both used AdamW.
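For reference, the shape of a single cosine annealing cycle can be written in closed form, which is also what PyTorch's CosineAnnealingLR follows when T_max equals the total step count. The starting LR of 1e-4 and the floor of 0 used below are assumptions for illustration. Only the v3.0 constant LR and the 1M step count are stated above.

```python
import math


def cosine_annealing_lr(step, total_steps, base_lr, eta_min=0.0):
    """Learning rate at `step` for one cosine annealing cycle."""
    progress = step / total_steps
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * progress))


total_steps = 1_000_000
base_lr = 1e-4  # assumption: same starting value as the constant LR used for v3.0

# Print the learning rate at a few points along the schedule.
for step in (0, 250_000, 500_000, 750_000, 1_000_000):
    lr = cosine_annealing_lr(step, total_steps, base_lr)
    print(f"step {step:>9,}: lr = {lr:.2e}")
```

The rate stays close to its starting value early on and falls fastest around the midpoint, so most of the drop happens in the second half of training.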
Interposer v1.1
This is the second release using the "spaceship" architecture. It was trained on the Flickr2K dataset and was continued from the v1.0 checkpoint. Overall, it seems to perform a lot better, especially for real life photos. I also investigated the odd v1->xl artifacts but in the end it seems inherent to the VAE decoder stage.
Interposer v1.0
Not sure why the training loss is so different, it might be due to the """highly curated""" dataset of 1000 random images from my Downloads folder that I used to train it.
I probably should've just grabbed LAION.
I also trained a v1-to-v2 model, before realizing v1 and v2 shared the same latent space. Oh well.