
feat: add SDXL support #117

Merged (6 commits, Dec 28, 2023)

Conversation

leejet (Owner) commented Dec 13, 2023

This PR still needs some work, such as handling text projection (although it has minimal impact on the result) and addressing issues with generating images >= 1024x1024.

Note: the VAE in SDXL encounters NaN issues under FP16, but unfortunately ggml_conv_2d only operates under FP16. Hence, a parameter (--vae) is needed to specify a VAE that has the FP16 NaN issue fixed. You can find one here: SDXL VAE FP16 Fix.

SDXL base 1.0

sd.exe -m ..\models\sd_xl_base_1.0.safetensors --vae ..\models\sdxl_vae-fp16-fix.safetensors -H 1024 -W 1024 -p "a lovely cat" -v  

output

SDXL-Turbo

sd.exe -m ..\models\sd_xl_turbo_1.0_fp16.safetensors --vae ..\models\sdxl_vae-fp16-fix.safetensors -H 768 -W 768 --cfg-scale 1 --steps 1  -p "a lovely cat" -v 

output

FSSRepo (Contributor) commented Dec 13, 2023

AMAZING JOB!!!!, YEAHH, This model probably won't work on a 4GB RAM GPU, and it may not be loadable either. I'll see what I can do to optimize VRAM usage by adding dynamic buffers and split-attention.

leejet (Owner, Author) commented Dec 13, 2023

> AMAZING JOB!!!!, YEAHH, This model probably won't work on a 4GB RAM GPU, and it may not be loadable either. I'll see what I can do to optimize VRAM usage by adding dynamic buffers and split-attention.

I think the inability to generate a 1024x1024 image might be due to issues in the ggml cuda backend. I have a 24GB graphics card, but still can't generate a 1024x1024 image. At this point, it's not using the full 24GB of VRAM.

CUDA error 700 at E:\Code\sd.cpp\ggml\src\ggml-cuda.cu:8722: an illegal memory access was encountered

It's at this step that the issue arises.

ggml_backend_tensor_get(gf->nodes[gf->n_nodes - 1], work_output->data, 0, ggml_nbytes(work_output));

leejet (Owner, Author) commented Dec 13, 2023

I've modified the code like this, but it's still throwing the same error.

ggml_backend_tensor_get(gf->nodes[gf->n_nodes - 1], work_output->data, 0, 1);

FSSRepo (Contributor) commented Dec 13, 2023

It's very strange

FSSRepo (Contributor) commented Dec 13, 2023

Could you send me the output of the program, as I cannot test it due to my low VRAM?

leejet (Owner, Author) commented Dec 13, 2023

> Could you send me the output of the program, as I cannot test it due to my low VRAM?

> sd.exe -m ..\models\sd_xl_turbo_1.0_fp16.safetensors --vae ..\models\sdxl_vae-fp16-fix.safetensors -H 768 -W 768 --cfg-scale 1 --steps 1  -p "a lovely cat" -v 
...
[INFO]  stable-diffusion.cpp:4466 - Stable Diffusion XL
[INFO]  stable-diffusion.cpp:4472 - Stable Diffusion weight type: f16
[DEBUG] stable-diffusion.cpp:4474 - loading vocab
[DEBUG] stable-diffusion.cpp:4485 - ggml tensor size = 416 bytes
[DEBUG] stable-diffusion.cpp:1183 - clip params backend buffer size =  1559.41 MB (1638 tensors)
[DEBUG] stable-diffusion.cpp:2167 - unet params backend buffer size =  4909.43 MB (1705 tensors)
[DEBUG] stable-diffusion.cpp:3189 - vae params backend buffer size =  95.47 MB (164 tensors)
...
[DEBUG] stable-diffusion.cpp:1289 - learned condition compute buffer size: 2.63 MB
[DEBUG] stable-diffusion.cpp:4784 - computing condition graph completed, taking 60 ms
[INFO]  stable-diffusion.cpp:5454 - get_learned_condition completed, taking 61 ms
[INFO]  stable-diffusion.cpp:5464 - sampling using Euler A method
[INFO]  stable-diffusion.cpp:5468 - generating image: 1/1 - seed 42
[DEBUG] stable-diffusion.cpp:2510 - diffusion compute buffer size: 779.48 MB
  |==================================================| 1/1 - 1.87it/s
[INFO]  stable-diffusion.cpp:5480 - sampling completed, taking 0.61s
[INFO]  stable-diffusion.cpp:5488 - generating 1 latent images completed, taking 0.66s
[INFO]  stable-diffusion.cpp:5490 - decoding 1 latents
[DEBUG] stable-diffusion.cpp:3318 - vae compute buffer size: 6656.00 MB

CUDA error 700 at E:\Code\sd.cpp\ggml\src\ggml-cuda.cu:8722: an illegal memory access was encountered
current device: 0

FSSRepo (Contributor) commented Dec 13, 2023

Something I don't understand is the low memory usage that the computation graph of UNet achieves. I will find it very difficult to debug that since I don't have the hardware for those tests. Anyway, I think I'll try to set up a Colab.

leejet (Owner, Author) commented Dec 13, 2023

> Something I don't understand is the low memory usage that the computation graph of UNet achieves.

This makes sense. For the UNet, the parameters occupy most of the memory, while the runtime tensor memory isn't large. The VAE is the opposite: its parameter memory is small, but its runtime tensor memory usage is very high. The UNet takes a 128x128 feature map as input and downsamples it further, whereas the VAE starts from a 128x128 feature map and gradually upsamples it to 1024x1024, with many channels at each stage.
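To put rough numbers on this (my own back-of-the-envelope arithmetic, not figures from this PR; the 320-channel count for the first UNet stage is an assumption), a single f32 feature map at the VAE's pixel resolution is already far larger than one at latent resolution:

#include <cstdio>

int main() {
    const double MiB = 1024.0 * 1024.0;
    // UNet works at latent resolution: 128x128 for a 1024x1024 image (assumed ~320 channels)
    const double unet_fmap = 128.0 * 128.0 * 320.0 * 4.0 / MiB;   // ~20 MiB
    // the VAE decoder ends at pixel resolution with 128 channels
    const double vae_fmap  = 1024.0 * 1024.0 * 128.0 * 4.0 / MiB; // ~512 MiB
    printf("UNet feature map: %.0f MiB, VAE feature map: %.0f MiB\n", unet_fmap, vae_fmap);
    return 0;
}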

FSSRepo (Contributor) commented Dec 13, 2023

@leejet Perhaps processing the VAE in tiles could fix the issue of not being able to generate 1024x1024 images, but it would need to be tested.
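For reference, here is a minimal sketch of what tiled decoding could look like (my own illustration, not code from this repository; decode_tile stands in for running the VAE decoder graph on one latent tile, and the tile/overlap sizes are arbitrary). Overlapping tiles are decoded independently and averaged where they overlap, so the peak activation size depends on the tile size rather than the full image:

#include <algorithm>
#include <functional>
#include <vector>

struct Image {
    int w = 0, h = 0, c = 0;
    std::vector<float> data;  // planar, c * h * w
    float& at(int ch, int y, int x) { return data[(size_t(ch) * h + y) * w + x]; }
};

// decode_tile(latent, x, y, w, h) -> decoded pixel tile of size (w*scale) x (h*scale) x 3
using DecodeTileFn = std::function<Image(const Image&, int, int, int, int)>;

Image decode_latent_tiled(const Image& latent, DecodeTileFn decode_tile,
                          int tile = 32, int overlap = 8, int scale = 8) {
    Image out;
    out.w = latent.w * scale; out.h = latent.h * scale; out.c = 3;
    out.data.assign(size_t(out.w) * out.h * 3, 0.0f);
    std::vector<float> weight(size_t(out.w) * out.h, 0.0f);

    for (int ty = 0; ty < latent.h; ty += tile - overlap) {
        for (int tx = 0; tx < latent.w; tx += tile - overlap) {
            const int tw = std::min(tile, latent.w - tx);
            const int th = std::min(tile, latent.h - ty);
            Image dec = decode_tile(latent, tx, ty, tw, th);  // only this tile is resident at once
            for (int c = 0; c < 3; c++)
                for (int y = 0; y < dec.h; y++)
                    for (int x = 0; x < dec.w; x++) {
                        const int oy = ty * scale + y, ox = tx * scale + x;
                        out.at(c, oy, ox) += dec.at(c, y, x);
                        if (c == 0) weight[size_t(oy) * out.w + ox] += 1.0f;
                    }
        }
    }
    // average the regions where tiles overlap
    for (int c = 0; c < 3; c++)
        for (int y = 0; y < out.h; y++)
            for (int x = 0; x < out.w; x++)
                out.at(c, y, x) /= std::max(weight[size_t(y) * out.w + x], 1.0f);
    return out;
}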

slaren commented Dec 13, 2023

Running it with compute-sanitizer with 1024x1024 shows an out of bounds access in im2col:

========= Invalid __global__ write of size 2 bytes
=========     at im2col_f32_f16(const float *, __half *, int, int, int, int, int, int, int, int, int, int, int, int, int, int)+0x630
=========     by thread (32,0,0) in block (35,911,0)
=========     Address 0x1280104010 is out of bounds
=========     and is 2151137264 bytes before the nearest allocation at 0x1300480000 of size 65536 bytes

FSSRepo (Contributor) commented Dec 13, 2023

Does CUDA support unsigned ints? Perhaps changing some of the int offsets to unsigned int could solve the problem.

Could you print the tensor shape and run a test in test-backend-ops?

slaren commented Dec 14, 2023

@FSSRepo I found these im2cols with 1024x1024:

    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {1024,1024,256,1}, {1,1,256,128}, 1, 1, 0, 0, 1, 1, true));
    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {128,128,4,1}, {1,1,4,4}, 1, 1, 0, 0, 1, 1, true));
    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {512,512,512,1}, {1,1,512,256}, 1, 1, 0, 0, 1, 1, true));
    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {128,128,512,1}, {1,1,512,512}, 1, 1, 0, 0, 1, 1, true));
    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {1024,1024,128,1}, {3,3,128,128}, 1, 1, 1, 1, 1, 1, true));
    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {1024,1024,128,1}, {3,3,128,3}, 1, 1, 1, 1, 1, 1, true));
    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {1024,1024,256,1}, {3,3,256,128}, 1, 1, 1, 1, 1, 1, true));
    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {1024,1024,256,1}, {3,3,256,256}, 1, 1, 1, 1, 1, 1, true));
    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {512,512,256,1}, {3,3,256,256}, 1, 1, 1, 1, 1, 1, true));
    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {128,128,4,1}, {3,3,4,512}, 1, 1, 1, 1, 1, 1, true));
    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {512,512,512,1}, {3,3,512,256}, 1, 1, 1, 1, 1, 1, true));
    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {128,128,512,1}, {3,3,512,512}, 1, 1, 1, 1, 1, 1, true));
    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {256,256,512,1}, {3,3,512,512}, 1, 1, 1, 1, 1, 1, true));
    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {512,512,512,1}, {3,3,512,512}, 1, 1, 1, 1, 1, 1, true));
  IM2COL(type_input=f32,type_kernel=f16,ne_input=[1024,1024,256,1],ne_kernel=[1,1,256,128],s0=1,s1=1,p0=0,p1=0,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f16,ne_input=[128,128,4,1],ne_kernel=[1,1,4,4],s0=1,s1=1,p0=0,p1=0,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f16,ne_input=[512,512,512,1],ne_kernel=[1,1,512,256],s0=1,s1=1,p0=0,p1=0,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f16,ne_input=[128,128,512,1],ne_kernel=[1,1,512,512],s0=1,s1=1,p0=0,p1=0,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f16,ne_input=[1024,1024,128,1],ne_kernel=[3,3,128,128],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f16,ne_input=[1024,1024,128,1],ne_kernel=[3,3,128,3],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f16,ne_input=[1024,1024,256,1],ne_kernel=[3,3,256,128],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):
CUDA error 700 at /home/diego/code/ggml/src/ggml-cuda.cu:9615: an illegal memory access was encountered

FSSRepo (Contributor) commented Dec 14, 2023

Some offset integer is overflowing; I will run tests on my computer.
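A quick way to check that hypothesis (my own arithmetic, not taken from the PR): for the first failing case, ne_input=[1024,1024,256,1] with ne_kernel=[3,3,256,128], the im2col destination holds OH*OW*KH*KW*IC elements, which no longer fits in a signed 32-bit offset:

#include <cstdint>
#include <cstdio>

int main() {
    const int64_t OH = 1024, OW = 1024, KH = 3, KW = 3, IC = 256;
    const int64_t n = OH * OW * KH * KW * IC;  // 2,415,919,104 elements
    printf("dst elements = %lld, INT32_MAX = %d\n", (long long) n, INT32_MAX);
    // 2,415,919,104 > 2,147,483,647, so a 32-bit int offset into dst wraps around,
    // which is consistent with the out-of-bounds write compute-sanitizer reported.
    return 0;
}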

FSSRepo (Contributor) commented Dec 14, 2023

@leejet I tried Winograd with 512x512. At the moment the performance is very poor, but that is because kernel processing is done in a single thread (although it can be multi-threaded). This Winograd operation must be carried out in two stages on the CPU. It only works with 3x3 kernels, stride 1, and dilation 1, and it reduces memory consumption by 46%. In the UNet we could continue using im2col, and in the VAE use Winograd to avoid memory overload.
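(As an aside, here is a tiny 1D Winograd F(2,3) sketch, my own illustration of the 3x3/stride-1/dilation-1 restriction mentioned above rather than the kernel being discussed: two outputs of a 3-tap convolution use 4 data-filter multiplications instead of 6, and the 2D F(2x2,3x3) version extends this to 16 instead of 36 per 2x2 output tile.)

#include <cstdio>

// Winograd F(2,3): y[0..1] = correlation of d[0..3] with g[0..2] using 4 multiplies
// between transformed data and transformed filter (the filter-side factors are
// normally precomputed once per filter).
void winograd_f2_3(const float d[4], const float g[3], float y[2]) {
    const float m1 = (d[0] - d[2]) * g[0];
    const float m2 = (d[1] + d[2]) * 0.5f * (g[0] + g[1] + g[2]);
    const float m3 = (d[2] - d[1]) * 0.5f * (g[0] - g[1] + g[2]);
    const float m4 = (d[1] - d[3]) * g[2];
    y[0] = m1 + m2 + m3;
    y[1] = m2 - m3 - m4;
}

int main() {
    const float d[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    const float g[3] = {0.5f, -1.0f, 2.0f};
    float y[2];
    winograd_f2_3(d, g, y);
    const float r0 = d[0] * g[0] + d[1] * g[1] + d[2] * g[2];  // direct reference
    const float r1 = d[1] * g[0] + d[2] * g[1] + d[3] * g[2];
    printf("winograd: %.2f %.2f  direct: %.2f %.2f\n", y[0], y[1], r0, r1);
    return 0;
}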

[DEBUG] stable-diffusion.cpp:4986 - Using CPU backend
[INFO]  stable-diffusion.cpp:4996 - loading model from 'models/kotosmix_v10-f16.gguf'
[INFO]  model.cpp:624  - load models/kotosmix_v10-f16.gguf using gguf format
[DEBUG] model.cpp:641  - init from 'models/kotosmix_v10-f16.gguf'
[INFO]  stable-diffusion.cpp:5019 - Stable Diffusion 1.x
[INFO]  stable-diffusion.cpp:5025 - Stable Diffusion weight type: f16
[DEBUG] stable-diffusion.cpp:5027 - loading vocab
[DEBUG] stable-diffusion.cpp:5038 - ggml tensor size = 448 bytes
[DEBUG] stable-diffusion.cpp:1059 - clip params backend buffer size =  236.18 MB (449 tensors)
[DEBUG] stable-diffusion.cpp:2153 - unet params backend buffer size =  1641.16 MB (706 tensors)
[DEBUG] stable-diffusion.cpp:3176 - vae params backend buffer size =  95.47 MB (164 tensors)
[DEBUG] stable-diffusion.cpp:5050 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:5081 - loading weights
[DEBUG] model.cpp:1219 - loading tensors from models/kotosmix_v10-f16.gguf
[DEBUG] stable-diffusion.cpp:5169 - model size = 1969.67MB
[INFO]  stable-diffusion.cpp:5179 - total memory buffer size = 1972.80MB (clip 236.18MB, unet 1641.16MB, vae 95.47MB)
[INFO]  stable-diffusion.cpp:5181 - loading model from 'models/kotosmix_v10-f16.gguf' completed, taking 1.35s
[INFO]  stable-diffusion.cpp:5195 - running in eps-prediction mode
[DEBUG] stable-diffusion.cpp:5222 - finished loaded file
[DEBUG] stable-diffusion.cpp:6029 - prompt after extract and remove lora: "beautiful anime girl, white hair, blue eyes, realistic, masterpiece, azur lane, 4k, high quality"
[INFO]  stable-diffusion.cpp:6034 - apply_loras completed, taking 0.00s
[DEBUG] stable-diffusion.cpp:1292 - parse 'beautiful anime girl, white hair, blue eyes, realistic, masterpiece, azur lane, 4k, high quality' to [['beautiful anime girl, white hair, blue eyes, realistic, masterpiece, azur lane, 4k, high quality', 1], ]
[DEBUG] stable-diffusion.cpp:709  - split prompt "beautiful anime girl, white hair, blue eyes, realistic, masterpiece, azur lane, 4k, high quality" to tokens ["beautiful</w>", "anime</w>", "girl</w>", ",</w>", "white</w>", "hair</w>", ",</w>", "blue</w>", "eyes</w>", ",</w>", "realistic</w>", ",</w>", "masterpiece</w>", ",</w>", "<|endoftext|>", "lane</w>", ",</w>", "4</w>", "k</w>", ",</w>", "high</w>", "quality</w>", ]
[DEBUG] stable-diffusion.cpp:1218 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:5339 - computing condition graph completed, taking 127 ms
[DEBUG] stable-diffusion.cpp:1292 - parse 'bad quality, ugly, face malformed, bad anatomy' to [['bad quality, ugly, face malformed, bad anatomy', 1], ]
[DEBUG] stable-diffusion.cpp:709  - split prompt "bad quality, ugly, face malformed, bad anatomy" to tokens ["bad</w>", "quality</w>", ",</w>", "ugly</w>", ",</w>", "face</w>", "<|endoftext|>", ",</w>", "bad</w>", "anatomy</w>", ]
[DEBUG] stable-diffusion.cpp:1218 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:5339 - computing condition graph completed, taking 124 ms
[INFO]  stable-diffusion.cpp:6063 - get_learned_condition completed, taking 253 ms
[INFO]  stable-diffusion.cpp:6073 - sampling using DPM++ (2M) method
[INFO]  stable-diffusion.cpp:6077 - generating image: 1/1 - seed 424354
[DEBUG] stable-diffusion.cpp:2491 - diffusion compute buffer size: 559.43 MB
  |==================================================| 20/20 - 43.70s/it
[INFO]  stable-diffusion.cpp:6089 - sampling completed, taking 1127.33s
[INFO]  stable-diffusion.cpp:6097 - generating 1 latent images completed, taking 1127.39s
[INFO]  stable-diffusion.cpp:6099 - decoding 1 latents
[DEBUG] stable-diffusion.cpp:3305 - vae compute buffer size: 770.00 MB
[DEBUG] stable-diffusion.cpp:5925 - computing vae [mode: DECODE] graph completed, taking 137.05s
[INFO]  stable-diffusion.cpp:6108 - latent 1 decoded, taking 137.05s
[INFO]  stable-diffusion.cpp:6112 - decode_first_stage completed, taking 137.05s
[INFO]  stable-diffusion.cpp:6129 - txt2img completed in 1264.70s
[INFO]  main.cpp:534  - save result image to 'output.png'

diimdeep commented:

Tried it on Colab; it works, but it crashes with LoRAs.

ggml_new_object: not enough space in the context's memory pool (needed 459168, available 458752)
[DEBUG] stable-diffusion.cpp:4427 - Using CUDA backend
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5
[INFO]  stable-diffusion.cpp:4441 - loading model from 'models/juggernautXL_v7Rundiffusion.safetensors'
[INFO]  model.cpp:633  - load models/juggernautXL_v7Rundiffusion.safetensors using safetensors format
[DEBUG] model.cpp:699  - init from 'models/juggernautXL_v7Rundiffusion.safetensors'
[INFO]  stable-diffusion.cpp:4450 - loading vae from 'models/sdxl_vae-fp16-fix.safetensors'
[INFO]  model.cpp:633  - load models/sdxl_vae-fp16-fix.safetensors using safetensors format
[DEBUG] model.cpp:699  - init from 'models/sdxl_vae-fp16-fix.safetensors'
[INFO]  stable-diffusion.cpp:4466 - Stable Diffusion XL 
[INFO]  stable-diffusion.cpp:4472 - Stable Diffusion weight type: f16
[DEBUG] stable-diffusion.cpp:4474 - loading vocab
[DEBUG] stable-diffusion.cpp:4485 - ggml tensor size = 416 bytes
[DEBUG] stable-diffusion.cpp:1183 - clip params backend buffer size =  1559.41 MB (1638 tensors)
[DEBUG] stable-diffusion.cpp:2167 - unet params backend buffer size =  4909.43 MB (1705 tensors)
[DEBUG] stable-diffusion.cpp:3189 - vae params backend buffer size =  95.47 MB (164 tensors)
[DEBUG] stable-diffusion.cpp:4502 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:4533 - loading weights
[DEBUG] model.cpp:1230 - loading tensors from models/juggernautXL_v7Rundiffusion.safetensors
[WARN]  stable-diffusion.cpp:4557 - unknown tensor 'cond_stage_model.1.model.text_projection' in model file
[DEBUG] model.cpp:1230 - loading tensors from models/sdxl_vae-fp16-fix.safetensors
[DEBUG] stable-diffusion.cpp:4621 - model size = 6646.81MB
[INFO]  stable-diffusion.cpp:4627 - total memory buffer size = 6564.31MB (clip 1559.41MB, unet 4909.43MB, vae 95.47MB)
[INFO]  stable-diffusion.cpp:4633 - loading model from 'models/juggernautXL_v7Rundiffusion.safetensors' completed, taking 3.44s
[INFO]  stable-diffusion.cpp:4647 - running in eps-prediction mode
[DEBUG] stable-diffusion.cpp:4674 - finished loaded file
[DEBUG] stable-diffusion.cpp:5412 - lora SDXLFrosted:0.30
[DEBUG] stable-diffusion.cpp:5416 - prompt after extract and remove lora: "A Photograph, 8k uhd, dslr, soft lighting, high quality, film grain, Fujifilm XT3 "
[INFO]  stable-diffusion.cpp:4050 - loading LoRA from 'models/SDXLFrosted.safetensors'
[INFO]  model.cpp:633  - load models/SDXLFrosted.safetensors using safetensors format
[DEBUG] model.cpp:699  - init from 'models/SDXLFrosted.safetensors'
[DEBUG] stable-diffusion.cpp:4071 - calculating buffer size
[DEBUG] stable-diffusion.cpp:4073 - lora params backend buffer size =  435.30 MB
[DEBUG] model.cpp:1230 - loading tensors from models/SDXLFrosted.safetensors
ggml_new_object: not enough space in the context's memory pool (needed 459168, available 458752)

leejet (Owner, Author) commented Dec 14, 2023

> @leejet I tried Winograd with 512x512. At the moment the performance is very poor, but that is because kernel processing is done in a single thread (although it can be multi-threaded). This Winograd operation must be carried out in two stages on the CPU. It only works with 3x3 kernels, stride 1, and dilation 1, and it reduces memory consumption by 46%. In the UNet we could continue using im2col, and in the VAE use Winograd to avoid memory overload.

Great! Looking forward to your progress!

leejet (Owner, Author) commented Dec 14, 2023

@diimdeep Pull the latest code and try it again. It should be fixed now.
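(For context: the ggml_new_object error above means the ggml context holding the LoRA tensor metadata was created too small. As a generic illustration only, and not necessarily the actual change made here, a no-alloc ggml context for n tensors is typically sized like this:)

#include "ggml/ggml.h"

// A no-alloc context only stores object headers and tensor structs; the tensor
// data itself lives in a backend buffer. Undersizing it produces exactly the
// "not enough space in the context's memory pool" error seen above.
// make_lora_ctx and n_tensors are hypothetical names for illustration.
struct ggml_context * make_lora_ctx(size_t n_tensors) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ n_tensors * ggml_tensor_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    return ggml_init(params);
}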

diimdeep commented:

Ty, now another problem is that output is identical with or without Lora, no effect.
here is log

log.txt

Here is the code I use to run it on Colab (T4):

!git clone https://github.com/leejet/stable-diffusion.cpp
%cd stable-diffusion.cpp
!git checkout sdxl
!git submodule update --init
!cmake -B build -DSD_CUBLAS=ON && cmake --build build --config Release
!mkdir output
!mkdir models
!wget https://huggingface.co/madebyollin/sdxl-vae-fp16-fix/resolve/main/sdxl_vae.safetensors?download=true
!mv 'sdxl_vae.safetensors?download=true' models/sdxl_vae-fp16-fix.safetensors
!wget https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors?download=true
!mv sd_xl_base_1.0.safetensors?download=true models/sd_xl_base_1.0.safetensors
!wget https://civitai.com/api/download/models/247528?type=Model&format=SafeTensor
!mv '247528?type=Model' models/SDXLFrosted.safetensors

!./build/bin/sd -m models/sd_xl_base_1.0.safetensors \
--vae models/sdxl_vae-fp16-fix.safetensors \
--lora-model-dir models/ \
-p "A Photograph, 8k uhd, dslr, soft lighting, high quality, film grain, Fujifilm XT3 <lora:SDXLFrosted:0.9>" \
-s 1619903001 --sampling-method euler_a --cfg-scale 7 --steps 35 -W 512 -H 1024 -o output/sdxl_09.png

h3ndrik mentioned this pull request Dec 15, 2023
FSSRepo (Contributor) commented Dec 15, 2023

@slaren

 ALIBI(type=f32,ne=[10,10,10,10],n_past=512,n_head=10,bias_max=0.500000): OK
  IM2COL(type_input=f32,type_kernel=f16,ne_input=[10,10,3,1],ne_kernel=[3,3,3,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f16,ne_input=[1024,1024,256,1],ne_kernel=[3,3,256,256],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):
D:\proyectos\ggml\build\bin\Release>

I can't even run the test-backend-ops with cpu backend on my computer since I only have 16 GB of RAM, let alone on my 4GB graphics card. I think the problem must be an overload in the CUDA registers.

slaren commented Dec 15, 2023

You can use any types in a CUDA kernel, like int64_t or size_t. I don't have time to look more into this right now, but I will check it some time next week if it is still not fixed.

leejet (Owner, Author) commented Dec 18, 2023

I've implemented the following fixes, and now we can generate large 1024x1024 images without any issues.

diff --git a/src/ggml-cuda.cu b/src/ggml-cuda.cu
index 019648b..2e07bc6 100644
--- a/src/ggml-cuda.cu
+++ b/src/ggml-cuda.cu
@@ -5259,17 +5259,17 @@ static  __global__ void im2col_f32_f16(
     const int ky = (i - kd) / OW;
     const int ix = i % OW;

-    const int iiw = ix * s0 + kx * d0 - p0;
-    const int iih = blockIdx.y * s1 + ky * d1 - p1;
+    const int64_t iiw = ix * s0 + kx * d0 - p0;
+    const int64_t iih = blockIdx.y * s1 + ky * d1 - p1;

-    const int offset_dst =
+    const int64_t offset_dst =
         (blockIdx.y * OW + ix) * CHW +
         (blockIdx.z * (KW * KH) + ky * KW + kx);

     if (iih < 0 || iih >= IH || iiw < 0 || iiw >= IW) {
         dst[offset_dst] = __float2half(0.0f);
     } else {
-        const int offset_src = blockIdx.z * offset_delta;
+        const int64_t offset_src = blockIdx.z * offset_delta;
         dst[offset_dst] = __float2half(x[offset_src + iih * IW + iiw]);
     }
 }

sdxl txt2img

.\bin\Release\sd.exe -m ..\..\stable-diffusion-webui\models\Stable-diffusion\sd_xl_base_1.0.safetensors --vae ..\..\stable-diffusion-webui\models\VAE\sdxl_vae-fp16-fix.safetensors -p "a lovely cat" -v   -H 1024 -W 1024

output

leejet (Owner, Author) commented Dec 18, 2023

> Ty, now another problem is that output is identical with or without Lora, no effect. here is log
>
> log.txt

It seems the current code isn't fully compatible with certain SDXL LoRA names. I'll make time to fix this.

Green-Sky (Contributor) commented:

This is great. I tried to push the envelope of how large an image I can fit into my 8 GiB of VRAM (minus ~700 MB for the OS). Since VAE tiling is not available yet, I used TAESD(xl).
The largest I could generate was -W 1024 -H 1600, which uses ~6.95 GiB of VRAM.
output

(model is talmendoxlSDXL_v11Beta.safetensors)

Green-Sky (Contributor) commented:

The model produced the following warning btw:

[WARN]  stable-diffusion.cpp:4557 - unknown tensor 'cond_stage_model.1.model.text_projection' in model file

Green-Sky (Contributor) commented:

I tried using the lcm lora (for sdxl) and it does not fit into vram. Even when I set width and height to 64.

leejet mentioned this pull request Dec 21, 2023
diimdeep commented:

Nice, T4 colab could sample -W 1600 -H 2304 showing 13.6GB utilization, but failed later at decoding using taesdxl

ggml_new_object: not enough space in the context's memory pool (needed 99299472, available 98959360)

-W 1600 -H 2112 at 12.3GB worked

leejet (Owner, Author) commented Dec 23, 2023

> I tried using the lcm lora (for sdxl) and it does not fit into vram. Even when I set width and height to 64.

@Green-Sky Where did you download SDXL LCM LoRA from? I'm using SDXL LCM LoRA, and it works fine.

.\bin\Release\sd.exe -m ..\..\stable-diffusion-webui\models\Stable-diffusion\sd_xl_base_1.0.safetensors --vae ..\..\stable-diffusion-webui\models\VAE\sdxl_vae-fp16-fix.safetensors --lora-model-dir ..\..\stable-diffusion-webui\models\Lora\ -p "a lovely cat<lora:lcm-lora-xl:1>" -v   -H 1024 -W 1024 --cfg-scale 1 --steps 4

output

leejet (Owner, Author) commented Dec 23, 2023

> Nice, T4 colab could sample -W 1600 -H 2304 showing 13.6GB utilization, but failed later at decoding using taesdxl
>
> ggml_new_object: not enough space in the context's memory pool (needed 99299472, available 98959360)
>
> -W 1600 -H 2112 at 12.3GB worked

I didn't replicate this issue using similar parameters. By the way, generating images with excessively large dimensions can result in strange outputs.

.\bin\Release\sd.exe -m ..\..\stable-diffusion-webui\models\Stable-diffusion\sd_xl_base_1.0.safetensors --vae ..\..\stable-diffusion-webui\models\VAE\sdxl_vae-fp16-fix.safetensors --lora-model-dir ..\..\stable-diffusion-webui\models\Lora\ -p "a lovely cat<lora:lcm-lora-xl:1>" -v   -H 2304 -W 1600 --cfg-scale 1 --steps 4 --taesd ..\models\taesdxl.safetensors

output

Green-Sky (Contributor) commented:

@leejet

  1. the file I got from here seemed to be corrupted. (amateur mistake) So I redownloaded it :)
  2. even with the new file, the behavior is the same (cuda OOM in apply lora)
$ result/bin/sd -m ../stable-diffusion-webui/models/Stable-diffusion/talmendoxlSDXL_v11Beta.safetensors --vae ../stable-diffusion-webui/models/VAE-approx/sdxl_vae.safetensors --lora-model-dir ../stable-diffusion-webui/models/Lora/ -p "<lora:lcm_sdxl:1>a lovely cat" -H 1024 -W 768
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5
[INFO]  stable-diffusion.cpp:4604 - loading model from '../stable-diffusion-webui/models/Stable-diffusion/talmendoxlSDXL_v11Beta.safetensors'
[INFO]  model.cpp:633  - load ../stable-diffusion-webui/models/Stable-diffusion/talmendoxlSDXL_v11Beta.safetensors using safetensors format
[INFO]  stable-diffusion.cpp:4613 - loading vae from '../stable-diffusion-webui/models/VAE-approx/sdxl_vae.safetensors'
[INFO]  model.cpp:633  - load ../stable-diffusion-webui/models/VAE-approx/sdxl_vae.safetensors using safetensors format
[INFO]  stable-diffusion.cpp:4629 - Stable Diffusion XL
[INFO]  stable-diffusion.cpp:4635 - Stable Diffusion weight type: f16
[WARN]  stable-diffusion.cpp:4719 - unknown tensor 'cond_stage_model.1.model.text_projection' in model file
[INFO]  stable-diffusion.cpp:4789 - total memory buffer size = 6564.31MB (clip 1559.41MB, unet 4909.43MB, vae 95.47MB)
[INFO]  stable-diffusion.cpp:4795 - loading model from '../stable-diffusion-webui/models/Stable-diffusion/talmendoxlSDXL_v11Beta.safetensors' completed, taking 2.06s
[INFO]  stable-diffusion.cpp:4809 - running in eps-prediction mode
[INFO]  stable-diffusion.cpp:4213 - loading LoRA from '../stable-diffusion-webui/models/Lora/lcm_sdxl.safetensors'
[INFO]  model.cpp:633  - load ../stable-diffusion-webui/models/Lora/lcm_sdxl.safetensors using safetensors format

CUDA error 2 at /build/27vlpayv89jdf3gdy19kaj1x02fhr2y4-source/ggml/src/ggml-cuda.cu:9464: out of memory
current device: 0
GGML_ASSERT: /build/27vlpayv89jdf3gdy19kaj1x02fhr2y4-source/ggml/src/ggml-cuda.cu:9464: !"CUDA error"
[New LWP 1639948]
[New LWP 1639949]
[New LWP 1639950]
warning: File "/nix/store/9fy9zzhf613xp0c3jsjxbjq6yp8afrsv-gcc-12.3.0-lib/lib/libstdc++.so.6.0.30-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/nix/store/gi26p79iq8jrw51irq5x82c2cqlgicxi-gcc-12.3.0-lib".
To enable execution of this file add
	add-auto-load-safe-path /nix/store/9fy9zzhf613xp0c3jsjxbjq6yp8afrsv-gcc-12.3.0-lib/lib/libstdc++.so.6.0.30-gdb.py
line to your configuration file "/home/green/.config/gdb/gdbinit".
To completely disable this security protection add
	set auto-load safe-path /
line to your configuration file "/home/green/.config/gdb/gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
	info "(gdb)Auto-loading safe path"
warning: File "/nix/store/gqghjch4p1s69sv4mcjksb2kb65rwqjy-glibc-2.38-23/lib/libthread_db.so.1" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/nix/store/gi26p79iq8jrw51irq5x82c2cqlgicxi-gcc-12.3.0-lib".
warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
0x00007f604a50f0d7 in wait4 () from /nix/store/gqghjch4p1s69sv4mcjksb2kb65rwqjy-glibc-2.38-23/lib/libc.so.6
#0  0x00007f604a50f0d7 in wait4 () from /nix/store/gqghjch4p1s69sv4mcjksb2kb65rwqjy-glibc-2.38-23/lib/libc.so.6
#1  0x00000000004adb1b in ggml_print_backtrace ()
#2  0x000000000050e4ac in ggml_backend_cuda_buffer_type_alloc_buffer(ggml_backend_buffer_type*, unsigned long) ()
#3  0x000000000045eb61 in StableDiffusionGGML::apply_lora(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, float) ()
#4  0x000000000045f731 in StableDiffusionGGML::apply_loras(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > > const&) ()
#5  0x000000000043c0f7 in StableDiffusion::txt2img(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, int, int, SampleMethod, int, long, int) ()
#6  0x000000000040ae08 in main ()
[Inferior 1 (process 1639944) detached]
Aborted (core dumped)

Amin456789 commented:

Question: how much RAM will it take for a 512x512 image with SDXL Turbo? Does it work with the DreamShaper SDXL Turbo version too? Also, will it work with taesdxl?

FSSRepo (Contributor) commented Dec 23, 2023

> Question: how much RAM will it take for a 512x512 image with SDXL Turbo? Does it work with the DreamShaper SDXL Turbo version too? Also, will it work with taesdxl?

On my PC, it consumes 8 GB of RAM.

Amin456789 commented:

Thanks for the answer!

leejet (Owner, Author) commented Dec 24, 2023

> @leejet
>
> 1. the file I got from here seemed to be corrupted. (amateur mistake) So I redownloaded it :)
> 2. even with the new file, the behavior is the same (cuda OOM in apply lora)

I suspect it might be due to your limited VRAM, as the 2070 GPU only has 8GB of memory. I'll think about how to optimize the VRAM usage.

Green-Sky (Contributor) commented:

> I suspect it might be due to your limited VRAM, as the 2070 GPU only has 8GB of memory. I'll think about how to optimize the VRAM usage.

Yeah, probably. Maybe add an option to merge the LoRA in on CPU/RAM. In any case, it's not a blocker. 🎉
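For what it's worth, merging a LoRA on the CPU is conceptually just applying the low-rank delta W' = W + strength * (alpha / rank) * (lora_up @ lora_down) to the host-side weights before they are uploaded to the backend. A minimal sketch (my own, assuming row-major f32 layouts, not this repository's apply_lora):

#include <cstddef>
#include <vector>

void apply_lora_cpu(std::vector<float>& W,               // [out_dim * in_dim]
                    const std::vector<float>& lora_up,   // [out_dim * rank]
                    const std::vector<float>& lora_down, // [rank * in_dim]
                    size_t out_dim, size_t in_dim, size_t rank,
                    float alpha, float strength) {
    const float scale = strength * (alpha / float(rank));
    for (size_t o = 0; o < out_dim; ++o) {
        for (size_t i = 0; i < in_dim; ++i) {
            float delta = 0.0f;
            for (size_t r = 0; r < rank; ++r) {
                delta += lora_up[o * rank + r] * lora_down[r * in_dim + i];
            }
            W[o * in_dim + i] += scale * delta;  // merged on host, so no extra VRAM is needed
        }
    }
}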

leejet (Owner, Author) commented Dec 26, 2023

@Green-Sky Does the latest code from this PR result in invalid images on your device? If it doesn't, would you be able to help by trying to switch ggml to the upstream 99f3f152b2e5627c25e6f36d4b334aa44e91ff78 and check if it causes any issues with invalid images?

cd ggml
git remote add upstream https://github.com/ggerganov/ggml.git
git fetch upstream
git checkout 99f3f152b2e5627c25e6f36d4b334aa44e91ff78

Green-Sky (Contributor) commented:

@leejet I guess this PR slipped through, since I basically only tested SDXL and that works fine. But yes, both this PR and the version with the upstreamed commit checked out fail the same way on my GPU.

output
output_2
output_3

(42,43,44)

Green-Sky (Contributor) commented:

@leejet I just discovered that this issue seems to be rare (but it happens) when not batching. Not sure if that helps. Also keep in mind that the glitching looks different on every run, so there has to be some kind of inter-kernel sync issue.

leejet (Owner, Author) commented Dec 28, 2023

At present, all major tasks have been completed. Support for the refiner and certain SDXL LoRA name accommodations will be addressed in separate PRs. This branch is now ready for merging.

leejet merged commit 78ad76f into master on Dec 28, 2023 (7 checks passed)
leejet deleted the sdxl branch on December 28, 2023 at 16:16
FSSRepo (Contributor) commented Dec 28, 2023

@leejet please test vae-tiling with SDXL

leejet (Owner, Author) commented Dec 28, 2023

It works great.

.\bin\Release\sd.exe -m ..\..\stable-diffusion-webui\models\Stable-diffusion\sd_xl_base_1.0.safetensors --vae ..\..\stable-diffusion-webui\models\VAE\sdxl_vae-fp16-fix.safetensors -p "a lovely cat" -v   -H 1024 -W 1024 --vae-tiling

output

Review thread on the diff:

@@ -4,6 +4,7 @@
#include <memory>
#include <string>
#include <vector>
#include "ggml/ggml.h"

Contributor: duped include

leejet (Owner, Author): I'll delete it.

piallai mentioned this pull request Mar 16, 2024