stable-diffusion.cpp

Inference of Stable Diffusion in pure C/C++

Features

Plain C/C++ implementation based on ggml, working in the same way as llama.cpp
Super lightweight and without external dependencies
SD1.x, SD2.x and SDXL support
- Note: the SDXL VAE produces NaNs under FP16, and ggml_conv_2d currently only operates in FP16. Use the --vae parameter to specify a VAE in which the FP16 NaN issue has been fixed. You can find one here: SDXL VAE FP16 Fix.
SD-Turbo and SDXL-Turbo support
16-bit, 32-bit float support
4-bit, 5-bit and 8-bit integer quantization support
Accelerated memory-efficient CPU inference
- Requires only ~2.3GB to generate a 512x512 image using txt2img with fp16 precision; with Flash Attention enabled, only ~1.8GB.
AVX, AVX2 and AVX512 support for x86 architectures
Full CUDA and Metal backend for GPU acceleration.
Can load ckpt, safetensors and diffusers models/checkpoints, as well as standalone VAE models
- No need to convert to .ggml or .gguf anymore!
Flash Attention for memory usage optimization (CPU only for now)
Original txt2img and img2img mode
Negative prompt
stable-diffusion-webui style tokenizer (not all the features, only token weighting for now)
LoRA support, same as stable-diffusion-webui
Latent Consistency Models support (LCM/LCM-LoRA)
Faster and memory efficient latent decoding with TAESD
Upscale images generated with ESRGAN
VAE tiling to reduce memory usage
Control Net support with SD 1.5
Sampling method
- Euler A
- Euler
- Heun
- DPM2
- DPM++ 2M
- DPM++ 2M v2
- DPM++ 2S a
- LCM
Cross-platform reproducibility (--rng cuda, consistent with the stable-diffusion-webui GPU RNG)
Embeds generation parameters into PNG output as a webui-compatible text string
Supported platforms
- Linux
- Mac OS
- Windows
- Android (via Termux)

TODO

More sampling methods
Make inference faster
- The current implementation of ggml_conv_2d is slow and has high memory usage
Continuing to reduce memory usage (quantizing the weights of ggml_conv_2d)
Implement Inpainting support
k-quants support

Usage

Most users can download a prebuilt executable from the latest release. If the prebuilt binary does not meet your requirements, you can build it manually.

Get the Code

git clone --recursive https://github.com/leejet/stable-diffusion.cpp
cd stable-diffusion.cpp

If you have already cloned the repository, you can use the following commands to update it to the latest code.

cd stable-diffusion.cpp
git pull origin master
git submodule init
git submodule update

Download weights

Download the original weights (.ckpt or .safetensors). For example:

Stable Diffusion v1.4 from https://huggingface.co/CompVis/stable-diffusion-v-1-4-original
Stable Diffusion v1.5 from https://huggingface.co/runwayml/stable-diffusion-v1-5
Stable Diffusion v2.1 from https://huggingface.co/stabilityai/stable-diffusion-2-1

curl -L -O https://huggingface.co/CompVis/stable-diffusion-v-1-4-original/resolve/main/sd-v1-4.ckpt
# curl -L -O https://huggingface.co/runwayml/stable-diffusion-v1-5/resolve/main/v1-5-pruned-emaonly.safetensors
# curl -L -O https://huggingface.co/stabilityai/stable-diffusion-2-1/resolve/main/v2-1_768-nonema-pruned.safetensors

Build

Build from scratch

mkdir build
cd build
cmake ..
cmake --build . --config Release

Using OpenBLAS

cmake .. -DGGML_OPENBLAS=ON
cmake --build . --config Release

Using CUBLAS

This provides BLAS acceleration using the CUDA cores of your Nvidia GPU. Make sure the CUDA toolkit is installed. You can install it from your Linux distro's package manager (e.g. apt install nvidia-cuda-toolkit) or from here: CUDA Toolkit. At least 4 GB of VRAM is recommended.

cmake .. -DSD_CUBLAS=ON
cmake --build . --config Release

Using HipBLAS

This provides BLAS acceleration using the ROCm cores of your AMD GPU. Make sure to have the ROCm toolkit installed.

Windows users: refer to docs/hipBLAS_on_Windows.md for a comprehensive guide.

cmake .. -G "Ninja" -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DSD_HIPBLAS=ON -DCMAKE_BUILD_TYPE=Release -DAMDGPU_TARGETS=gfx1100
cmake --build . --config Release

Using Metal

Using Metal makes the computation run on the GPU. Currently, Metal has some issues when operating on very large matrices, which makes it highly inefficient. Performance improvements are expected in the near future.

cmake .. -DSD_METAL=ON
cmake --build . --config Release

Using Flash Attention

Enabling flash attention reduces memory usage by at least 400 MB. At the moment, it is not supported when CUBLAS is enabled because the kernel implementation is missing.

cmake .. -DSD_FLASH_ATTN=ON
cmake --build . --config Release

Run

usage: ./build/bin/sd [arguments]

arguments:
  -h, --help                         show this help message and exit
  -M, --mode [MODE]                  run mode (txt2img or img2img or convert, default: txt2img)
  -t, --threads N                    number of threads to use during computation (default: -1).
                                     If threads <= 0, then threads will be set to the number of CPU physical cores
  -m, --model [MODEL]                path to model
  --vae [VAE]                        path to vae
  --taesd [TAESD_PATH]               path to taesd. Using Tiny AutoEncoder for fast decoding (low quality)
  --control-net [CONTROL_PATH]       path to control net model
  --embd-dir [EMBEDDING_PATH]        path to embeddings.
  --upscale-model [ESRGAN_PATH]      path to esrgan model. Upscale images after generation; only RealESRGAN_x4plus_anime_6B is supported for now.
  --upscale-repeats                  Run the ESRGAN upscaler this many times (default 1)
  --type [TYPE]                      weight type (f32, f16, q4_0, q4_1, q5_0, q5_1, q8_0)
                                     If not specified, the default is the type of the weight file.
  --lora-model-dir [DIR]             lora model directory
  -i, --init-img [IMAGE]             path to the input image, required by img2img
  --control-image [IMAGE]            path to image condition, control net
  -o, --output OUTPUT                path to write result image to (default: ./output.png)
  -p, --prompt [PROMPT]              the prompt to render
  -n, --negative-prompt PROMPT       the negative prompt (default: "")
  --cfg-scale SCALE                  unconditional guidance scale: (default: 7.0)
  --strength STRENGTH                strength for noising/unnoising (default: 0.75)
                                     1.0 corresponds to full destruction of information in init image
  --control-strength STRENGTH        strength to apply Control Net (default: 0.9)
  -H, --height H                     image height, in pixel space (default: 512)
  -W, --width W                      image width, in pixel space (default: 512)
  --sampling-method {euler, euler_a, heun, dpm2, dpm++2s_a, dpm++2m, dpm++2mv2, lcm}
                                     sampling method (default: "euler_a")
  --steps  STEPS                     number of sample steps (default: 20)
  --rng {std_default, cuda}          RNG (default: cuda)
  -s SEED, --seed SEED               RNG seed (default: 42, use random seed for < 0)
  -b, --batch-count COUNT            number of images to generate.
  --schedule {discrete, karras}      Denoiser sigma schedule (default: discrete)
  --clip-skip N                      ignore last layers of CLIP network; 1 ignores none, 2 ignores one layer (default: -1)
                                     <= 0 represents unspecified, will be 1 for SD1.x, 2 for SD2.x
  --vae-tiling                       process vae in tiles to reduce memory usage
  --control-net-cpu                  keep controlnet in cpu (for low vram)
  --canny                            apply canny preprocessor (edge detection)
  -v, --verbose                      print extra info
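The --schedule option selects how the noise levels (sigmas) are spaced across the sampling steps. As a rough illustration of what the karras choice means, the sketch below follows the spacing formula from Karras et al. (2022). The function name is hypothetical, and the values sigma_min=0.1, sigma_max=10 and rho=7 are assumptions based on common defaults, not values confirmed from this project's source.

```python
def karras_sigmas(n_steps, sigma_min=0.1, sigma_max=10.0, rho=7.0):
    """Return n_steps noise levels spaced as in Karras et al. (2022), plus a final 0.

    All default values here are assumptions for illustration only.
    """
    min_inv_rho = sigma_min ** (1.0 / rho)
    max_inv_rho = sigma_max ** (1.0 / rho)
    sigmas = []
    for i in range(n_steps):
        # t runs from 0 (largest sigma) to 1 (smallest sigma).
        t = i / (n_steps - 1) if n_steps > 1 else 0.0
        sigmas.append((max_inv_rho + t * (min_inv_rho - max_inv_rho)) ** rho)
    # Samplers conventionally end at sigma = 0 (a fully denoised image).
    sigmas.append(0.0)
    return sigmas


if __name__ == "__main__":
    for sigma in karras_sigmas(20):
        print(f"{sigma:.4f}")
```

Compared with an evenly spaced schedule, this spacing places more of the steps at low noise levels, where fine detail is resolved.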

Quantization

You can specify the model weight type using the --type parameter. The weights are automatically converted when loading the model.

f16 for 16-bit floating-point
f32 for 32-bit floating-point
q8_0 for 8-bit integer quantization
q5_0 or q5_1 for 5-bit integer quantization
q4_0 or q4_1 for 4-bit integer quantization
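To compare the storage cost of these types, the sketch below computes approximate bits per weight. It assumes ggml's usual block layout of 32 weights per block, with each block carrying its quantized values plus a small header (scale, and for some types a minimum and extra high bits). The byte counts are assumptions to verify against the ggml version in use, and the function name is hypothetical.

```python
# Weights per quantization block (assumed ggml convention).
BLOCK_SIZE = 32

# Assumed bytes per block of 32 weights for each quantized --type value.
BLOCK_BYTES = {
    "q4_0": 18,  # 2-byte scale + 16 bytes of 4-bit values
    "q4_1": 20,  # 2-byte scale + 2-byte minimum + 16 bytes of 4-bit values
    "q5_0": 22,  # 2-byte scale + 4 bytes of high bits + 16 bytes of low bits
    "q5_1": 24,  # 2-byte scale + 2-byte minimum + 4 bytes + 16 bytes
    "q8_0": 34,  # 2-byte scale + 32 bytes of 8-bit values
}


def bits_per_weight(type_name):
    """Approximate storage cost of one weight, in bits, for a --type value."""
    if type_name == "f32":
        return 32.0
    if type_name == "f16":
        return 16.0
    return BLOCK_BYTES[type_name] * 8 / BLOCK_SIZE


if __name__ == "__main__":
    for name in ("f32", "f16", "q8_0", "q5_1", "q5_0", "q4_1", "q4_0"):
        print(name, bits_per_weight(name))
```

These figures apply only to tensors that are actually quantized, so total memory use does not shrink in the same proportion.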

Convert to GGUF

You can also convert ckpt/safetensors/diffusers weights to gguf and quantize them in advance, which avoids re-quantizing every time the model is loaded.

For example:

./bin/sd -M convert -m ../models/v1-5-pruned-emaonly.safetensors -o ../models/v1-5-pruned-emaonly.q8_0.gguf -v --type q8_0
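If you want several quantized variants of the same model, the convert command can be generated per type. This is a minimal sketch that only assembles argument lists using the flags shown above. The helper name and the output naming scheme (model name, then type, then .gguf) are assumptions modeled on the example, not part of the tool.

```python
import shlex
from pathlib import PurePosixPath


def convert_command(sd_bin, model_path, weight_type):
    """Build the argument list for one `sd -M convert` invocation."""
    model = PurePosixPath(model_path)
    # e.g. v1-5-pruned-emaonly.safetensors -> v1-5-pruned-emaonly.q8_0.gguf
    output = model.with_name(f"{model.stem}.{weight_type}.gguf")
    return [
        sd_bin, "-M", "convert",
        "-m", str(model),
        "-o", str(output),
        "-v", "--type", weight_type,
    ]


if __name__ == "__main__":
    model = "../models/v1-5-pruned-emaonly.safetensors"
    for weight_type in ("q8_0", "q5_1", "q4_0"):
        # Print the commands; pass the list to subprocess.run to execute them.
        print(shlex.join(convert_command("./bin/sd", model, weight_type)))
```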

txt2img example

./bin/sd -m ../models/sd-v1-4.ckpt -p "a lovely cat"
# ./bin/sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat"
# ./bin/sd -m ../models/sd_xl_base_1.0.safetensors --vae ../models/sdxl_vae-fp16-fix.safetensors -H 1024 -W 1024 -p "a lovely cat" -v

Using formats of different precisions will yield results of varying quality.

[Image comparison: f32, f16, q8_0, q5_0, q5_1, q4_0, q4_1]

img2img example

./output.png is the image generated from the above txt2img pipeline

./bin/sd --mode img2img -m ../models/sd-v1-4.ckpt -p "cat with blue eyes" -i ./output.png -o ./img2img_output.png --strength 0.4
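To get a feel for --strength: in many Stable Diffusion img2img implementations, the init image is noised only partway, and roughly steps × strength of the sampling steps are then run on it. The sketch below shows that arithmetic. Whether this project computes it in exactly this way is an assumption, and the function name is hypothetical.

```python
def img2img_steps(steps, strength):
    """Number of denoising steps actually run on the noised init image.

    Assumes the common convention of truncating steps * strength.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    return int(steps * strength)


if __name__ == "__main__":
    # Default --steps is 20; the example above uses --strength 0.4.
    print(img2img_steps(20, 0.4))
```

Under this convention, a lower strength keeps the result closer to the init image, while 1.0 runs all steps and discards the init image's information.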

with LoRA

You can specify the directory where the lora weights are stored via --lora-model-dir. If not specified, the default is the current working directory.
LoRA is specified via prompt, just like stable-diffusion-webui.

Here's a simple example:

./bin/sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat<lora:marblesh:1>" --lora-model-dir ../models

../models/marblesh.safetensors or ../models/marblesh.ckpt will be applied to the model
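The prompt syntax can be illustrated with a small parser. This is a sketch of the <lora:name:weight> convention only, not the project's actual parsing code. The function names are hypothetical, and the file lookup simply mirrors the two file extensions mentioned above.

```python
import re

# Matches tags such as <lora:marblesh:1> or <lora:lcm-lora-sdv1-5:0.8>.
LORA_TAG = re.compile(r"<lora:([^:>]+):([^>]+)>")


def extract_loras(prompt):
    """Split a prompt into its plain text and a list of (name, weight) pairs."""
    loras = [(name, float(weight)) for name, weight in LORA_TAG.findall(prompt)]
    cleaned = LORA_TAG.sub("", prompt).strip()
    return cleaned, loras


def candidate_paths(lora_dir, name):
    """File names that would be looked up for a LoRA called `name`."""
    return [f"{lora_dir}/{name}{ext}" for ext in (".safetensors", ".ckpt")]


if __name__ == "__main__":
    text, loras = extract_loras("a lovely cat<lora:marblesh:1>")
    print(text, loras)
    for name, _weight in loras:
        print(candidate_paths("../models", name))
```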

LCM/LCM-LoRA

Download LCM-LoRA from https://huggingface.co/latent-consistency/lcm-lora-sdv1-5
Specify LCM-LoRA by adding <lora:lcm-lora-sdv1-5:1> to prompt
It's advisable to set --cfg-scale to 1.0 instead of the default 7.0. For --steps, a range of 2-8 steps is recommended. For --sampling-method, lcm/euler_a is recommended.

Here's a simple example:

./bin/sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat<lora:lcm-lora-sdv1-5:1>" --steps 4 --lora-model-dir ../models -v --cfg-scale 1

[Image comparison: without LCM-LoRA (--cfg-scale 7) vs. with LCM-LoRA (--cfg-scale 1)]

Using TAESD for faster decoding

You can use TAESD to accelerate the decoding of latent images by following these steps:

Download the model weights (diffusion_pytorch_model.safetensors), for example with curl:

curl -L -O https://huggingface.co/madebyollin/taesd/resolve/main/diffusion_pytorch_model.safetensors

Specify the model path using the --taesd PATH parameter. For example:

sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat" --taesd ../models/diffusion_pytorch_model.safetensors

Using ESRGAN to upscale results

You can use ESRGAN to upscale the generated images. At the moment, only the RealESRGAN_x4plus_anime_6B.pth model is supported. Support for more models of this architecture will be added soon.

Specify the model path using the --upscale-model PATH parameter. For example:

sd -m ../models/v1-5-pruned-emaonly.safetensors -p "a lovely cat" --upscale-model ../models/RealESRGAN_x4plus_anime_6B.pth

Docker

Building using Docker

docker build -t sd .

Run

docker run -v /path/to/models:/models -v /path/to/output/:/output sd [args...]
# For example
# docker run -v ./models:/models -v ./build:/output sd -m /models/sd-v1-4.ckpt -p "a lovely cat" -v -o /output/output.png

Memory Requirements

precision	f32	f16	q8_0	q5_0	q5_1	q4_0	q4_1
Memory (txt2img - 512 x 512)	~2.8G	~2.3G	~2.1G	~2.0G	~2.0G	~2.0G	~2.0G
Memory (txt2img - 512 x 512) with Flash Attention	~2.4G	~1.9G	~1.6G	~1.5G	~1.5G	~1.5G	~1.5G
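The table can be checked against the claim in the Flash Attention build section that enabling it saves at least 400 MB. The sketch below uses only the figures from the table. Treating 1G as 1000 MB is an assumption, and the table values are themselves approximate.

```python
# (without Flash Attention, with Flash Attention), in GB, copied from the table.
MEMORY_GB = {
    "f32": (2.8, 2.4),
    "f16": (2.3, 1.9),
    "q8_0": (2.1, 1.6),
    "q5_0": (2.0, 1.5),
    "q5_1": (2.0, 1.5),
    "q4_0": (2.0, 1.5),
    "q4_1": (2.0, 1.5),
}

# Savings from Flash Attention per precision, rounded to whole megabytes.
savings_mb = {
    precision: round((without - with_fa) * 1000)
    for precision, (without, with_fa) in MEMORY_GB.items()
}

if __name__ == "__main__":
    for precision, saved in savings_mb.items():
        print(f"{precision}: ~{saved} MB saved")
```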

Bindings

These projects wrap stable-diffusion.cpp for easier use in other languages/frameworks.

Golang: seasonjs/stable-diffusion
C#: DarthAffe/StableDiffusion.NET

Contributors

Thank you to all the people who have already contributed to stable-diffusion.cpp!
