feat: add SDXL support #117
Conversation
Amazing job!!! This model probably won't run on a GPU with 4GB of VRAM, and it may not even be loadable. I'll see what I can do to optimize VRAM usage by adding dynamic buffers and split-attention.
I think the inability to generate a 1024x1024 image might be due to issues in the ggml CUDA backend. I have a 24GB graphics card but still can't generate a 1024x1024 image, and at the point of failure it isn't even using the full 24GB of VRAM.
The issue arises at this step:
ggml_backend_tensor_get(gf->nodes[gf->n_nodes - 1], work_output->data, 0, ggml_nbytes(work_output));
I've modified the code like this, but it's still throwing the same error.
It's very strange.
Could you send me the output of the program? I cannot test it myself due to my low VRAM.
Something I don't understand is how little memory the UNet computation graph uses. It will be very difficult for me to debug this since I don't have the hardware for those tests. Anyway, I think I'll try to set up a Colab.
This makes sense. For the UNet, the parameters occupy a significant portion of memory, while the runtime tensor memory isn't extensive. The VAE works the other way around: its parameter memory consumption isn't large, but its runtime tensor memory usage is quite high. The UNet takes a 128x128 feature map as input and downsamples it further, whereas the VAE starts with a 128x128 feature map and gradually upsamples it to 1024x1024, with a large number of channels along the way.
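Some back-of-envelope arithmetic (assuming fp32 activations and SD-VAE-like channel counts; the exact shapes are an assumption for illustration) shows why the decoder's runtime memory blows up:

```cpp
// A single 1024x1024 feature map with 128 channels, stored as fp32,
// already costs 512 MiB before counting any temporary buffers.
#include <cstdio>

int main() {
    const long long w = 1024, h = 1024, channels = 128, bytes_per_elem = 4;
    const long long bytes = w * h * channels * bytes_per_elem;
    printf("one activation tensor: %lld MiB\n", bytes / (1024 * 1024)); // 512 MiB
    return 0;
}
```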
@leejet Perhaps processing the VAE in tiles could fix the issue of not being able to generate 1024x1024 images, but it would need to be tested.
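A minimal sketch of what tiled VAE decoding could look like (the names, tile size, and overlap here are all assumptions for illustration, not this PR's implementation): decode small overlapping latent tiles one at a time so the compute buffer only ever holds one tile's activations.

```cpp
#include <algorithm>
#include <functional>

struct Latent { int w, h; };
struct Image  { int w, h; };

// Decode a large latent in overlapping tiles; blending the overlaps
// hides the seams between neighboring tiles.
void decode_tiled(const Latent & latent, Image & /*out*/,
                  const std::function<void(int, int, int, int)> & decode_tile) {
    const int tile    = 32; // latent-space tile size (assumed; 32 latent px -> 256 image px)
    const int overlap = 4;  // latent-space overlap between tiles (assumed)
    for (int y = 0; y < latent.h; y += tile - overlap) {
        for (int x = 0; x < latent.w; x += tile - overlap) {
            const int tw = std::min(tile, latent.w - x);
            const int th = std::min(tile, latent.h - y);
            decode_tile(x, y, tw, th); // run the VAE decode graph on this tile only
            // blending the overlap region into the output image omitted for brevity
        }
    }
}
```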
Running it with
Does CUDA support unsigned ints? Perhaps changing some of the int offsets to unsigned int could solve the problem. Could you print the tensor shape and run a test in test-backend-ops?
@FSSRepo I found these im2cols with 1024x1024:
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {1024,1024,256,1}, {1,1,256,128}, 1, 1, 0, 0, 1, 1, true));
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {128,128,4,1}, {1,1,4,4}, 1, 1, 0, 0, 1, 1, true));
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {512,512,512,1}, {1,1,512,256}, 1, 1, 0, 0, 1, 1, true));
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {128,128,512,1}, {1,1,512,512}, 1, 1, 0, 0, 1, 1, true));
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {1024,1024,128,1}, {3,3,128,128}, 1, 1, 1, 1, 1, 1, true));
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {1024,1024,128,1}, {3,3,128,3}, 1, 1, 1, 1, 1, 1, true));
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {1024,1024,256,1}, {3,3,256,128}, 1, 1, 1, 1, 1, 1, true));
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {1024,1024,256,1}, {3,3,256,256}, 1, 1, 1, 1, 1, 1, true));
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {512,512,256,1}, {3,3,256,256}, 1, 1, 1, 1, 1, 1, true));
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {128,128,4,1}, {3,3,4,512}, 1, 1, 1, 1, 1, 1, true));
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {512,512,512,1}, {3,3,512,256}, 1, 1, 1, 1, 1, 1, true));
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {128,128,512,1}, {3,3,512,512}, 1, 1, 1, 1, 1, 1, true));
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {256,256,512,1}, {3,3,512,512}, 1, 1, 1, 1, 1, 1, true));
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {512,512,512,1}, {3,3,512,512}, 1, 1, 1, 1, 1, 1, true));
Some offset integer is overflowing; I will run tests on my computer.
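A quick sanity check supports that (assuming the im2col destination is indexed over OH*OW x C*KH*KW, which matches the shapes in the test cases above):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // worst case from the list above: 1024x1024x256 input, 3x3 kernel, stride 1, pad 1
    const int64_t OW = 1024, OH = 1024, C = 256, KW = 3, KH = 3;
    const int64_t max_offset = OW * OH * C * KW * KH; // 2,415,919,104
    printf("max dst offset = %lld, INT32_MAX = %d\n",
           (long long) max_offset, INT32_MAX);
    // 2,415,919,104 > 2,147,483,647, so a 32-bit int offset wraps negative.
    return 0;
}
```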
@leejet Using Winograd with 512x512: at the moment the performance is very poor, but that's because the kernel transform is processed in a single thread (it can be multi-threaded). This Winograd operation must be carried out in two stages on the CPU, and it only works with 3x3 kernels, stride 1, and dilation 1. It reduces memory consumption by 46%. In the UNet we could continue using im2col, and in the VAE use Winograd to avoid memory overload. (A minimal sketch of the underlying transform follows the log below.)

[DEBUG] stable-diffusion.cpp:4986 - Using CPU backend
[INFO] stable-diffusion.cpp:4996 - loading model from 'models/kotosmix_v10-f16.gguf'
[INFO] model.cpp:624 - load models/kotosmix_v10-f16.gguf using gguf format
[DEBUG] model.cpp:641 - init from 'models/kotosmix_v10-f16.gguf'
[INFO] stable-diffusion.cpp:5019 - Stable Diffusion 1.x
[INFO] stable-diffusion.cpp:5025 - Stable Diffusion weight type: f16
[DEBUG] stable-diffusion.cpp:5027 - loading vocab
[DEBUG] stable-diffusion.cpp:5038 - ggml tensor size = 448 bytes
[DEBUG] stable-diffusion.cpp:1059 - clip params backend buffer size = 236.18 MB (449 tensors)
[DEBUG] stable-diffusion.cpp:2153 - unet params backend buffer size = 1641.16 MB (706 tensors)
[DEBUG] stable-diffusion.cpp:3176 - vae params backend buffer size = 95.47 MB (164 tensors)
[DEBUG] stable-diffusion.cpp:5050 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:5081 - loading weights
[DEBUG] model.cpp:1219 - loading tensors from models/kotosmix_v10-f16.gguf
[DEBUG] stable-diffusion.cpp:5169 - model size = 1969.67MB
[INFO] stable-diffusion.cpp:5179 - total memory buffer size = 1972.80MB (clip 236.18MB, unet 1641.16MB, vae 95.47MB)
[INFO] stable-diffusion.cpp:5181 - loading model from 'models/kotosmix_v10-f16.gguf' completed, taking 1.35s
[INFO] stable-diffusion.cpp:5195 - running in eps-prediction mode
[DEBUG] stable-diffusion.cpp:5222 - finished loaded file
[DEBUG] stable-diffusion.cpp:6029 - prompt after extract and remove lora: "beautiful anime girl, white hair, blue eyes, realistic, masterpiece, azur lane, 4k, high quality"
[INFO] stable-diffusion.cpp:6034 - apply_loras completed, taking 0.00s
[DEBUG] stable-diffusion.cpp:1292 - parse 'beautiful anime girl, white hair, blue eyes, realistic, masterpiece, azur lane, 4k, high quality' to [['beautiful anime girl, white hair, blue eyes, realistic, masterpiece, azur lane, 4k, high quality', 1], ]
[DEBUG] stable-diffusion.cpp:709 - split prompt "beautiful anime girl, white hair, blue eyes, realistic, masterpiece, azur lane, 4k, high quality" to tokens ["beautiful</w>", "anime</w>", "girl</w>", ",</w>", "white</w>", "hair</w>", ",</w>", "blue</w>", "eyes</w>", ",</w>", "realistic</w>", ",</w>", "masterpiece</w>", ",</w>", "<|endoftext|>", "lane</w>", ",</w>", "4</w>", "k</w>", ",</w>", "high</w>", "quality</w>", ]
[DEBUG] stable-diffusion.cpp:1218 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:5339 - computing condition graph completed, taking 127 ms
[DEBUG] stable-diffusion.cpp:1292 - parse 'bad quality, ugly, face malformed, bad anatomy' to [['bad quality, ugly, face malformed, bad anatomy', 1], ]
[DEBUG] stable-diffusion.cpp:709 - split prompt "bad quality, ugly, face malformed, bad anatomy" to tokens ["bad</w>", "quality</w>", ",</w>", "ugly</w>", ",</w>", "face</w>", "<|endoftext|>", ",</w>", "bad</w>", "anatomy</w>", ]
[DEBUG] stable-diffusion.cpp:1218 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:5339 - computing condition graph completed, taking 124 ms
[INFO] stable-diffusion.cpp:6063 - get_learned_condition completed, taking 253 ms
[INFO] stable-diffusion.cpp:6073 - sampling using DPM++ (2M) method
[INFO] stable-diffusion.cpp:6077 - generating image: 1/1 - seed 424354
[DEBUG] stable-diffusion.cpp:2491 - diffusion compute buffer size: 559.43 MB
|==================================================| 20/20 - 43.70s/it
[INFO] stable-diffusion.cpp:6089 - sampling completed, taking 1127.33s
[INFO] stable-diffusion.cpp:6097 - generating 1 latent images completed, taking 1127.39s
[INFO] stable-diffusion.cpp:6099 - decoding 1 latents
[DEBUG] stable-diffusion.cpp:3305 - vae compute buffer size: 770.00 MB
[DEBUG] stable-diffusion.cpp:5925 - computing vae [mode: DECODE] graph completed, taking 137.05s
[INFO] stable-diffusion.cpp:6108 - latent 1 decoded, taking 137.05s
[INFO] stable-diffusion.cpp:6112 - decode_first_stage completed, taking 137.05s
[INFO] stable-diffusion.cpp:6129 - txt2img completed in 1264.70s
[INFO] main.cpp:534 - save result image to 'output.png'
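For reference, here is a textbook sketch of the 1D Winograd F(2,3) transform that the 2D F(2x2, 3x3) scheme mentioned above is built from (this is a generic illustration, not the code in this branch): it produces two outputs of a 3-tap convolution using 4 multiplications instead of 6, and the memory saving relative to im2col comes from never materializing the large im2col buffer.

```cpp
#include <cstdio>

// 1D Winograd F(2,3): two outputs of a 3-tap convolution with 4 multiplies.
void winograd_f23(const float d[4], const float g[3], float y[2]) {
    const float m1 = (d[0] - d[2]) * g[0];
    const float m2 = (d[1] + d[2]) * 0.5f * (g[0] + g[1] + g[2]);
    const float m3 = (d[2] - d[1]) * 0.5f * (g[0] - g[1] + g[2]);
    const float m4 = (d[1] - d[3]) * g[2];
    y[0] = m1 + m2 + m3; // == d0*g0 + d1*g1 + d2*g2
    y[1] = m2 - m3 - m4; // == d1*g0 + d2*g1 + d3*g2
}

int main() {
    const float d[4] = {1, 2, 3, 4}, g[3] = {0.5f, -1, 2};
    float y[2];
    winograd_f23(d, g, y);
    printf("%f %f\n", y[0], y[1]); // expect 4.5 and 6.0
    return 0;
}
```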
Tried it on Colab; it works, but it crashes with LoRAs.
Great! Looking forward to your progress!
@diimdeep Pull the latest code and try it again. It should be fixed now. |
Ty. Now another problem: the output is identical with or without the LoRA, so it has no effect. Here is the code I use to run it on a Colab T4:
ALIBI(type=f32,ne=[10,10,10,10],n_past=512,n_head=10,bias_max=0.500000): OK
IM2COL(type_input=f32,type_kernel=f16,ne_input=[10,10,3,1],ne_kernel=[3,3,3,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
IM2COL(type_input=f32,type_kernel=f16,ne_input=[1024,1024,256,1],ne_kernel=[3,3,256,256],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):
D:\proyectos\ggml\build\bin\Release>
I can't even run test-backend-ops with the CPU backend on my computer, since I only have 16 GB of RAM, let alone on my 4GB graphics card. I think the problem must be an overload in the CUDA registers.
You can use any type in a CUDA kernel, like
I've implemented the following fixes, and now we can generate large 1024x1024 images without any issues.

diff --git a/src/ggml-cuda.cu b/src/ggml-cuda.cu
index 019648b..2e07bc6 100644
--- a/src/ggml-cuda.cu
+++ b/src/ggml-cuda.cu
@@ -5259,17 +5259,17 @@ static __global__ void im2col_f32_f16(
const int ky = (i - kd) / OW;
const int ix = i % OW;
- const int iiw = ix * s0 + kx * d0 - p0;
- const int iih = blockIdx.y * s1 + ky * d1 - p1;
+ const int64_t iiw = ix * s0 + kx * d0 - p0;
+ const int64_t iih = blockIdx.y * s1 + ky * d1 - p1;
- const int offset_dst =
+ const int64_t offset_dst =
(blockIdx.y * OW + ix) * CHW +
(blockIdx.z * (KW * KH) + ky * KW + kx);
if (iih < 0 || iih >= IH || iiw < 0 || iiw >= IW) {
dst[offset_dst] = __float2half(0.0f);
} else {
- const int offset_src = blockIdx.z * offset_delta;
+ const int64_t offset_src = blockIdx.z * offset_delta;
dst[offset_dst] = __float2half(x[offset_src + iih * IW + iiw]);
}
}

sdxl txt2img
It seems the current code isn't fully compatible with certain SDXL LoRA names. I'll make time to fix this.
The model produced the following warning, btw:
I tried using the LCM LoRA (for SDXL) and it does not fit into VRAM, even when I set the width and height to 64.
Nice. A T4 Colab could sample at -W 1600 -H 2304 with 13.6GB utilization, but failed later at decoding using taesdxl.
-W 1600 -H 2112 at 12.3GB worked.
@Green-Sky Where did you download the SDXL LCM LoRA from? I'm using the SDXL LCM LoRA, and it works fine.
I couldn't reproduce this issue using similar parameters. By the way, generating images with excessively large dimensions can result in strange outputs.
Question: how much RAM will it take for a 512x512 image with SDXL Turbo? Does it work with the DreamShaper SDXL Turbo version too? Also, will it work with taesdxl?
On my PC, it consumes 8 GB of RAM.
Thanks for the answer!
I suspect it might be due to your limited VRAM, as the 2070 GPU only has 8GB of memory. I'll think about how to optimize the VRAM usage.
Yeah, probably. Maybe add an option to merge the LoRA in on CPU/RAM. In any case, it's not a blocker. 🎉
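For what it's worth, merging the LoRA on the CPU would just mean folding the low-rank delta into the base weights once, up front, so no extra VRAM is needed at sampling time: W' = W + (alpha / rank) * (up x down). A minimal sketch, where the shapes and names are assumptions rather than this repo's code:

```cpp
#include <vector>

// Fold a LoRA delta into a base weight matrix on the CPU.
void merge_lora(std::vector<float> & W,          // base weight, out_dim x in_dim (row-major)
                const std::vector<float> & up,   // out_dim x rank
                const std::vector<float> & down, // rank x in_dim
                int out_dim, int in_dim, int rank, float alpha) {
    const float scale = alpha / rank;
    for (int o = 0; o < out_dim; ++o) {
        for (int i = 0; i < in_dim; ++i) {
            float delta = 0.0f;
            for (int r = 0; r < rank; ++r) {
                delta += up[o * rank + r] * down[r * in_dim + i];
            }
            W[o * in_dim + i] += scale * delta;
        }
    }
}
```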
@Green-Sky Does the latest code from this PR result in invalid images on your device? If it doesn't, would you be able to help by trying to switch ggml to the upstream
@leejet I guess this PR slipped through, since I basically only tested SDXL, and that works fine. But yeah, this PR and the upstreamed commit, both checked out, fail the same way on my GPU. (42, 43, 44)
@leejet I just discovered this issue seems to be rare (but it happens) when not batching. Not sure if that helps. Also keep in mind that the glitching looks different on every run, so there has to be some kind of inter-kernel sync issue.
At present, all major tasks have been completed. Support for the refiner and certain SDXL LoRA name accommodations will be addressed in separate PRs. This branch is now ready for merging. |
@leejet please test vae-tiling with SDXL |
@@ -4,6 +4,7 @@
#include <memory>
#include <string>
#include <vector>
#include "ggml/ggml.h" |
duped include
I'll delete it.
This PR still needs some work, such as handling text projection (although it has minimal impact on the result) and addressing issues with generating images >= 1024x1024. The VAE in SDXL encounters NaN issues under FP16, but unfortunately ggml_conv_2d only operates under FP16, so a parameter is needed to specify a VAE that has the FP16 NaN issue fixed. You can find it here: SDXL VAE FP16 Fix.
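A small illustration of the FP16 NaN mechanism (illustrative only; the assumption here is that some VAE activations exceed the fp16 range, which is what the fp16-fix VAE was reportedly finetuned to avoid):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const float FP16_MAX = 65504.0f; // largest finite fp16 value
    float act = 70000.0f;            // an activation that is fine in fp32
    // what storing it in fp16 effectively does:
    float as_fp16 = (act > FP16_MAX) ? INFINITY : act;
    // any later inf - inf (or 0 * inf) in the graph then produces NaN:
    printf("%.1f -> %f, inf - inf = %f\n", act, as_fp16, as_fp16 - as_fp16);
    return 0;
}
```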
SDXL base 1.0
SDXL-Turbo