feat: add SDXL support #117
Conversation
Amazing job!!! This model probably won't run on a GPU with 4GB of VRAM, and it may not even be loadable. I'll see what I can do to optimize VRAM usage by adding dynamic buffers and split-attention.
I think the inability to generate a 1024x1024 image might be due to issues in the ggml CUDA backend. I have a 24GB graphics card but still can't generate a 1024x1024 image, and at the point of failure it isn't even using the full 24GB of VRAM.
The issue arises at this step:
ggml_backend_tensor_get(gf->nodes[gf->n_nodes - 1], work_output->data, 0, ggml_nbytes(work_output));
I've modified the code like this, but it's still throwing the same error.
It's very strange.
Could you send me the output of the program? I cannot test it myself due to my low VRAM.
Something I don't understand is how little memory the UNet computation graph uses. It will be very difficult for me to debug this since I don't have the hardware for those tests. Anyway, I think I'll try to set up a Colab.
This makes sense. For the UNet, the parameters occupy a significant portion of memory, while the runtime tensor memory isn't extensive. The VAE works the other way around: its parameter memory consumption isn't large, but its runtime tensor memory usage is quite high. The UNet takes a 128x128 feature map as input and downsamples it further, whereas the VAE starts with a 128x128 feature map and gradually upsamples it to 1024x1024, with a large number of channels along the way.
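Some back-of-envelope arithmetic (assuming fp32 activations and SD-VAE-like channel counts; the exact shapes are an assumption for illustration) shows why the decoder's runtime memory blows up:

```cpp
// A single 1024x1024 feature map with 128 channels, stored as fp32,
// already costs 512 MiB before counting any temporary buffers.
#include <cstdio>

int main() {
    const long long w = 1024, h = 1024, channels = 128, bytes_per_elem = 4;
    const long long bytes = w * h * channels * bytes_per_elem;
    printf("one activation tensor: %lld MiB\n", bytes / (1024 * 1024)); // 512 MiB
    return 0;
}
```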
@leejet Perhaps processing the VAE in tiles could fix the issue of not being able to generate 1024x1024 images, but it would need to be tested.
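A minimal sketch of what tiled VAE decoding could look like (the names, tile size, and overlap here are all assumptions for illustration, not this PR's implementation): decode small overlapping latent tiles one at a time so the compute buffer only ever holds one tile's activations.

```cpp
#include <algorithm>
#include <functional>

struct Latent { int w, h; };
struct Image  { int w, h; };

// Decode a large latent in overlapping tiles; blending the overlaps
// hides the seams between neighboring tiles.
void decode_tiled(const Latent & latent, Image & /*out*/,
                  const std::function<void(int, int, int, int)> & decode_tile) {
    const int tile    = 32; // latent-space tile size (assumed; 32 latent px -> 256 image px)
    const int overlap = 4;  // latent-space overlap between tiles (assumed)
    for (int y = 0; y < latent.h; y += tile - overlap) {
        for (int x = 0; x < latent.w; x += tile - overlap) {
            const int tw = std::min(tile, latent.w - x);
            const int th = std::min(tile, latent.h - y);
            decode_tile(x, y, tw, th); // run the VAE decode graph on this tile only
            // blending the overlap region into the output image omitted for brevity
        }
    }
}
```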
Running it with
Does CUDA support unsigned ints? Perhaps changing some of the int offsets to unsigned int could solve the problem. Could you print the tensor shape and run a test in test-backend-ops?
@FSSRepo I found these im2cols with 1024x1024:
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {1024,1024,256,1}, {1,1,256,128}, 1, 1, 0, 0, 1, 1, true));
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {128,128,4,1}, {1,1,4,4}, 1, 1, 0, 0, 1, 1, true));
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {512,512,512,1}, {1,1,512,256}, 1, 1, 0, 0, 1, 1, true));
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {128,128,512,1}, {1,1,512,512}, 1, 1, 0, 0, 1, 1, true));
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {1024,1024,128,1}, {3,3,128,128}, 1, 1, 1, 1, 1, 1, true));
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {1024,1024,128,1}, {3,3,128,3}, 1, 1, 1, 1, 1, 1, true));
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {1024,1024,256,1}, {3,3,256,128}, 1, 1, 1, 1, 1, 1, true));
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {1024,1024,256,1}, {3,3,256,256}, 1, 1, 1, 1, 1, 1, true));
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {512,512,256,1}, {3,3,256,256}, 1, 1, 1, 1, 1, 1, true));
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {128,128,4,1}, {3,3,4,512}, 1, 1, 1, 1, 1, 1, true));
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {512,512,512,1}, {3,3,512,256}, 1, 1, 1, 1, 1, 1, true));
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {128,128,512,1}, {3,3,512,512}, 1, 1, 1, 1, 1, 1, true));
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {256,256,512,1}, {3,3,512,512}, 1, 1, 1, 1, 1, 1, true));
test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, {512,512,512,1}, {3,3,512,512}, 1, 1, 1, 1, 1, 1, true));
Some offset integer is overflowing; I will run tests on my computer.
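A quick sanity check supports that (assuming the im2col destination is indexed over OH*OW x C*KH*KW, which matches the shapes in the test cases above):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // worst case from the list above: 1024x1024x256 input, 3x3 kernel, stride 1, pad 1
    const int64_t OW = 1024, OH = 1024, C = 256, KW = 3, KH = 3;
    const int64_t max_offset = OW * OH * C * KW * KH; // 2,415,919,104
    printf("max dst offset = %lld, INT32_MAX = %d\n",
           (long long) max_offset, INT32_MAX);
    // 2,415,919,104 > 2,147,483,647, so a 32-bit int offset wraps negative.
    return 0;
}
```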
@leejet Using Winograd with 512x512: at the moment the performance is very poor, but that's because the kernel transform is processed in a single thread (it can be multi-threaded). This Winograd operation must be carried out in two stages on the CPU, and it only works with 3x3 kernels, stride 1, and dilation 1. It reduces memory consumption by 46%. In the UNet we could continue using im2col, and in the VAE use Winograd to avoid memory overload. (A minimal sketch of the underlying transform follows the log below.)

[DEBUG] stable-diffusion.cpp:4986 - Using CPU backend
[INFO] stable-diffusion.cpp:4996 - loading model from 'models/kotosmix_v10-f16.gguf'
[INFO] model.cpp:624 - load models/kotosmix_v10-f16.gguf using gguf format
[DEBUG] model.cpp:641 - init from 'models/kotosmix_v10-f16.gguf'
[INFO] stable-diffusion.cpp:5019 - Stable Diffusion 1.x
[INFO] stable-diffusion.cpp:5025 - Stable Diffusion weight type: f16
[DEBUG] stable-diffusion.cpp:5027 - loading vocab
[DEBUG] stable-diffusion.cpp:5038 - ggml tensor size = 448 bytes
[DEBUG] stable-diffusion.cpp:1059 - clip params backend buffer size = 236.18 MB (449 tensors)
[DEBUG] stable-diffusion.cpp:2153 - unet params backend buffer size = 1641.16 MB (706 tensors)
[DEBUG] stable-diffusion.cpp:3176 - vae params backend buffer size = 95.47 MB (164 tensors)
[DEBUG] stable-diffusion.cpp:5050 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:5081 - loading weights
[DEBUG] model.cpp:1219 - loading tensors from models/kotosmix_v10-f16.gguf
[DEBUG] stable-diffusion.cpp:5169 - model size = 1969.67MB
[INFO] stable-diffusion.cpp:5179 - total memory buffer size = 1972.80MB (clip 236.18MB, unet 1641.16MB, vae 95.47MB)
[INFO] stable-diffusion.cpp:5181 - loading model from 'models/kotosmix_v10-f16.gguf' completed, taking 1.35s
[INFO] stable-diffusion.cpp:5195 - running in eps-prediction mode
[DEBUG] stable-diffusion.cpp:5222 - finished loaded file
[DEBUG] stable-diffusion.cpp:6029 - prompt after extract and remove lora: "beautiful anime girl, white hair, blue eyes, realistic, masterpiece, azur lane, 4k, high quality"
[INFO] stable-diffusion.cpp:6034 - apply_loras completed, taking 0.00s
[DEBUG] stable-diffusion.cpp:1292 - parse 'beautiful anime girl, white hair, blue eyes, realistic, masterpiece, azur lane, 4k, high quality' to [['beautiful anime girl, white hair, blue eyes, realistic, masterpiece, azur lane, 4k, high quality', 1], ]
[DEBUG] stable-diffusion.cpp:709 - split prompt "beautiful anime girl, white hair, blue eyes, realistic, masterpiece, azur lane, 4k, high quality" to tokens ["beautiful</w>", "anime</w>", "girl</w>", ",</w>", "white</w>", "hair</w>", ",</w>", "blue</w>", "eyes</w>", ",</w>", "realistic</w>", ",</w>", "masterpiece</w>", ",</w>", "<|endoftext|>", "lane</w>", ",</w>", "4</w>", "k</w>", ",</w>", "high</w>", "quality</w>", ]
[DEBUG] stable-diffusion.cpp:1218 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:5339 - computing condition graph completed, taking 127 ms
[DEBUG] stable-diffusion.cpp:1292 - parse 'bad quality, ugly, face malformed, bad anatomy' to [['bad quality, ugly, face malformed, bad anatomy', 1], ]
[DEBUG] stable-diffusion.cpp:709 - split prompt "bad quality, ugly, face malformed, bad anatomy" to tokens ["bad</w>", "quality</w>", ",</w>", "ugly</w>", ",</w>", "face</w>", "<|endoftext|>", ",</w>", "bad</w>", "anatomy</w>", ]
[DEBUG] stable-diffusion.cpp:1218 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:5339 - computing condition graph completed, taking 124 ms
[INFO] stable-diffusion.cpp:6063 - get_learned_condition completed, taking 253 ms
[INFO] stable-diffusion.cpp:6073 - sampling using DPM++ (2M) method
[INFO] stable-diffusion.cpp:6077 - generating image: 1/1 - seed 424354
[DEBUG] stable-diffusion.cpp:2491 - diffusion compute buffer size: 559.43 MB
|==================================================| 20/20 - 43.70s/it
[INFO] stable-diffusion.cpp:6089 - sampling completed, taking 1127.33s
[INFO] stable-diffusion.cpp:6097 - generating 1 latent images completed, taking 1127.39s
[INFO] stable-diffusion.cpp:6099 - decoding 1 latents
[DEBUG] stable-diffusion.cpp:3305 - vae compute buffer size: 770.00 MB
[DEBUG] stable-diffusion.cpp:5925 - computing vae [mode: DECODE] graph completed, taking 137.05s
[INFO] stable-diffusion.cpp:6108 - latent 1 decoded, taking 137.05s
[INFO] stable-diffusion.cpp:6112 - decode_first_stage completed, taking 137.05s
[INFO] stable-diffusion.cpp:6129 - txt2img completed in 1264.70s
[INFO] main.cpp:534 - save result image to 'output.png'
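For reference, here is a textbook sketch of the 1D Winograd F(2,3) transform that the 2D F(2x2, 3x3) scheme mentioned above is built from (this is a generic illustration, not the code in this branch): it produces two outputs of a 3-tap convolution using 4 multiplications instead of 6, and the memory saving relative to im2col comes from never materializing the large im2col buffer.

```cpp
#include <cstdio>

// 1D Winograd F(2,3): two outputs of a 3-tap convolution with 4 multiplies.
void winograd_f23(const float d[4], const float g[3], float y[2]) {
    const float m1 = (d[0] - d[2]) * g[0];
    const float m2 = (d[1] + d[2]) * 0.5f * (g[0] + g[1] + g[2]);
    const float m3 = (d[2] - d[1]) * 0.5f * (g[0] - g[1] + g[2]);
    const float m4 = (d[1] - d[3]) * g[2];
    y[0] = m1 + m2 + m3; // == d0*g0 + d1*g1 + d2*g2
    y[1] = m2 - m3 - m4; // == d1*g0 + d2*g1 + d3*g2
}

int main() {
    const float d[4] = {1, 2, 3, 4}, g[3] = {0.5f, -1, 2};
    float y[2];
    winograd_f23(d, g, y);
    printf("%f %f\n", y[0], y[1]); // expect 4.5 and 6.0
    return 0;
}
```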
Tried it on Colab; it works, but it crashes with LoRAs.
Great! Looking forward to your progress!
@diimdeep Pull the latest code and try it again. It should be fixed now. |
Ty. Now another problem: the output is identical with or without the LoRA, so it has no effect. Here is the code I use to run it on a Colab T4:
ALIBI(type=f32,ne=[10,10,10,10],n_past=512,n_head=10,bias_max=0.500000): OK
IM2COL(type_input=f32,type_kernel=f16,ne_input=[10,10,3,1],ne_kernel=[3,3,3,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
IM2COL(type_input=f32,type_kernel=f16,ne_input=[1024,1024,256,1],ne_kernel=[3,3,256,256],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):
D:\proyectos\ggml\build\bin\Release>
I can't even run test-backend-ops with the CPU backend on my computer, since I only have 16 GB of RAM, let alone on my 4GB graphics card. I think the problem must be an overload in the CUDA registers.
You can use any type in a CUDA kernel, like
I've implemented the following fixes, and now we can generate large 1024x1024 images without any issues.

diff --git a/src/ggml-cuda.cu b/src/ggml-cuda.cu
index 019648b..2e07bc6 100644
--- a/src/ggml-cuda.cu
+++ b/src/ggml-cuda.cu
@@ -5259,17 +5259,17 @@ static __global__ void im2col_f32_f16(
const int ky = (i - kd) / OW;
const int ix = i % OW;
- const int iiw = ix * s0 + kx * d0 - p0;
- const int iih = blockIdx.y * s1 + ky * d1 - p1;
+ const int64_t iiw = ix * s0 + kx * d0 - p0;
+ const int64_t iih = blockIdx.y * s1 + ky * d1 - p1;
- const int offset_dst =
+ const int64_t offset_dst =
(blockIdx.y * OW + ix) * CHW +
(blockIdx.z * (KW * KH) + ky * KW + kx);
if (iih < 0 || iih >= IH || iiw < 0 || iiw >= IW) {
dst[offset_dst] = __float2half(0.0f);
} else {
- const int offset_src = blockIdx.z * offset_delta;
+ const int64_t offset_src = blockIdx.z * offset_delta;
dst[offset_dst] = __float2half(x[offset_src + iih * IW + iiw]);
}
}

sdxl txt2img
It seems the current code isn't fully compatible with certain SDXL LoRA names. I'll make time to fix this.
The model produced the following warning, btw:
I tried using the LCM LoRA (for SDXL) and it does not fit into VRAM, even when I set the width and height to 64.
Nice. A T4 Colab could sample at -W 1600 -H 2304 with 13.6GB utilization, but failed later at decoding using taesdxl.
-W 1600 -H 2112 at 12.3GB worked.
@Green-Sky Where did you download the SDXL LCM LoRA from? I'm using the SDXL LCM LoRA, and it works fine.
I couldn't reproduce this issue using similar parameters. By the way, generating images with excessively large dimensions can result in strange outputs.
Question: how much RAM will it take for a 512x512 image with SDXL Turbo? Does it work with the DreamShaper SDXL Turbo version too? Also, will it work with taesdxl?
On my PC, it consumes 8 GB of RAM.
Thanks for the answer!
I suspect it might be due to your limited VRAM, as the 2070 GPU only has 8GB of memory. I'll think about how to optimize the VRAM usage.
Yeah, probably. Maybe add an option to merge the LoRA in on CPU/RAM. In any case, it's not a blocker. 🎉
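For what it's worth, merging the LoRA on the CPU would just mean folding the low-rank delta into the base weights once, up front, so no extra VRAM is needed at sampling time: W' = W + (alpha / rank) * (up x down). A minimal sketch, where the shapes and names are assumptions rather than this repo's code:

```cpp
#include <vector>

// Fold a LoRA delta into a base weight matrix on the CPU.
void merge_lora(std::vector<float> & W,          // base weight, out_dim x in_dim (row-major)
                const std::vector<float> & up,   // out_dim x rank
                const std::vector<float> & down, // rank x in_dim
                int out_dim, int in_dim, int rank, float alpha) {
    const float scale = alpha / rank;
    for (int o = 0; o < out_dim; ++o) {
        for (int i = 0; i < in_dim; ++i) {
            float delta = 0.0f;
            for (int r = 0; r < rank; ++r) {
                delta += up[o * rank + r] * down[r * in_dim + i];
            }
            W[o * in_dim + i] += scale * delta;
        }
    }
}
```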
@Green-Sky Does the latest code from this PR result in invalid images on your device? If it doesn't, would you be able to help by trying to switch ggml to the upstream
@leejet I guess this PR slipped through, since I basically only tested SDXL, and that works fine. But yeah, this PR and the upstreamed commit, both checked out, fail the same way on my GPU. (42, 43, 44)
@leejet I just discovered this issue seems to be rare (but it happens) when not batching. Not sure if that helps. Also keep in mind that the glitching looks different on every run, so there has to be some kind of inter-kernel sync issue.
At present, all major tasks have been completed. Support for the refiner and certain SDXL LoRA name accommodations will be addressed in separate PRs. This branch is now ready for merging. |
@leejet please test vae-tiling with SDXL |
@@ -4,6 +4,7 @@
#include <memory>
#include <string>
#include <vector>
#include "ggml/ggml.h" |
duped include
I'll delete it.
This PR still needs some work, such as handling text projection (although it has minimal impact on the result) and addressing issues with generating images >= 1024x1024. The VAE in SDXL encounters NaN issues under FP16, but unfortunately ggml_conv_2d only operates under FP16, so a parameter is needed to specify a VAE that has the FP16 NaN issue fixed. You can find it here: SDXL VAE FP16 Fix.
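A small illustration of the FP16 NaN mechanism (illustrative only; the assumption here is that some VAE activations exceed the fp16 range, which is what the fp16-fix VAE was reportedly finetuned to avoid):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const float FP16_MAX = 65504.0f; // largest finite fp16 value
    float act = 70000.0f;            // an activation that is fine in fp32
    // what storing it in fp16 effectively does:
    float as_fp16 = (act > FP16_MAX) ? INFINITY : act;
    // any later inf - inf (or 0 * inf) in the graph then produces NaN:
    printf("%.1f -> %f, inf - inf = %f\n", act, as_fp16, as_fp16 - as_fp16);
    return 0;
}
```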
SDXL base 1.0
SDXL-Turbo