First impressions info dump #1
Thanks for the feedback.
Very nice.
I see.
Oh, I overlooked that one run with verbose.
The tokenizer really looks like it needs some work; I'm really surprised the image came out that good.
Good to hear.
Can't wait 😄
I am using
I'm using
Ah yes, a fellow Ubuntu 20.04 user stuck on LTS 🤣
Cool stuff! Here is a sample run on M2 Ultra:
$ ./sd -m ../models/sd-v1-4-ggml-model-f16.bin -p "a lovely cat" -t 12
[INFO] stable-diffusion.cpp:2191 - loading model from '../models/sd-v1-4-ggml-model-f16.bin'
[INFO] stable-diffusion.cpp:2216 - ftype: f16
[INFO] stable-diffusion.cpp:2261 - params ctx size = 1970.08 MB
[INFO] stable-diffusion.cpp:2401 - loading model from '../models/sd-v1-4-ggml-model-f16.bin' completed, taking 0.72s
[INFO] stable-diffusion.cpp:2482 - condition graph use 13.11MB of memory: static 10.17MB, dynamic = 2.93MB
[INFO] stable-diffusion.cpp:2824 - get_learned_condition completed, taking 0.12s
[INFO] stable-diffusion.cpp:2832 - start sampling
[INFO] stable-diffusion.cpp:2676 - step 1 sampling completed, taking 5.42s
[INFO] stable-diffusion.cpp:2676 - step 2 sampling completed, taking 5.35s
[INFO] stable-diffusion.cpp:2676 - step 3 sampling completed, taking 5.34s
[INFO] stable-diffusion.cpp:2676 - step 4 sampling completed, taking 5.35s
[INFO] stable-diffusion.cpp:2676 - step 5 sampling completed, taking 5.30s
[INFO] stable-diffusion.cpp:2676 - step 6 sampling completed, taking 5.34s
[INFO] stable-diffusion.cpp:2676 - step 7 sampling completed, taking 5.36s
[INFO] stable-diffusion.cpp:2676 - step 8 sampling completed, taking 5.47s
[INFO] stable-diffusion.cpp:2676 - step 9 sampling completed, taking 5.34s
[INFO] stable-diffusion.cpp:2676 - step 10 sampling completed, taking 5.37s
[INFO] stable-diffusion.cpp:2676 - step 11 sampling completed, taking 5.33s
[INFO] stable-diffusion.cpp:2676 - step 12 sampling completed, taking 5.34s
[INFO] stable-diffusion.cpp:2676 - step 13 sampling completed, taking 5.33s
[INFO] stable-diffusion.cpp:2676 - step 14 sampling completed, taking 5.34s
[INFO] stable-diffusion.cpp:2676 - step 15 sampling completed, taking 5.34s
[INFO] stable-diffusion.cpp:2676 - step 16 sampling completed, taking 5.33s
[INFO] stable-diffusion.cpp:2676 - step 17 sampling completed, taking 5.39s
[INFO] stable-diffusion.cpp:2676 - step 18 sampling completed, taking 5.36s
[INFO] stable-diffusion.cpp:2676 - step 19 sampling completed, taking 5.34s
[INFO] stable-diffusion.cpp:2676 - step 20 sampling completed, taking 5.38s
[INFO] stable-diffusion.cpp:2691 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[INFO] stable-diffusion.cpp:2837 - sampling completed, taking 107.12s
[INFO] stable-diffusion.cpp:2771 - vae graph use 2177.12MB of memory: static 1153.12MB, dynamic = 1024.00MB
[INFO] stable-diffusion.cpp:2844 - decode_first_stage completed, taking 17.86s
[INFO] stable-diffusion.cpp:2850 - txt2img completed in 125.10s, with a runtime memory usage of 2177.12MB and parameter memory usage of 1969.94MB
save result image to 'output.png'
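As a quick sanity check on the numbers in that log: the 1970.08 MB reported for the f16 parameter context implies roughly a billion weights at 2 bytes each, which is the right ballpark for SD 1.x (UNet + VAE + CLIP text encoder combined). A minimal sketch of the arithmetic:

```python
# Back-of-the-envelope check on the logged parameter memory.
# At 2 bytes per f16 weight, ~1970 MB implies ~1.03 billion parameters.
PARAM_CTX_MB = 1970.08   # "params ctx size" from the log above
BYTES_PER_F16 = 2

n_params = PARAM_CTX_MB * 1024 ** 2 / BYTES_PER_F16
print(f"approx. parameters: {n_params / 1e9:.2f}B")  # ~1.03B
```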
Looks like
Thank you for the feedback, and thank you for creating the amazing ggml.
OK, I will sort out the code of the new operators and upstream it later. I'm also considering whether to upstream the "dynamic mode".
I've tried it before, but it seems that combining
Any plans for SDXL?
I'm willing to implement SDXL once I've improved the support for SD 1.x and added support for SD 2.x.
Took a stab at a larger resolution, 768x768.
Unsurprisingly, it takes way (way) longer:
Wow, this is so cool. Easy to convert existing models, quantization... very nice. https://github.com/bes-dev/stable_diffusion.openvino <- this is way faster though, probably because it uses OpenVINO.
I've implemented a memory optimization, and now when using txt2img with fp16 precision to generate a 512x512 image, it only requires 2.3GB.
Oh, yeah. Now I'm working hard to make it run faster.
Is this already on master? Because I reran my diffusion above with similar timings and memory usage (though the memory reporting seems to have changed).
Since you are generating 768x768 images, the runtime memory will grow; there is still room for optimization.
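The growth is easy to estimate: activation and latent memory scale with pixel count, so 768x768 needs about 2.25x the dynamic memory of 512x512. A rough sketch, assuming the standard SD 1.x latent layout (4 channels at 1/8 the image resolution); the exact runtime numbers depend on the compute graph:

```python
# Rough estimate of how latent/activation memory scales with resolution.
# SD 1.x operates on a latent that is 1/8 the image size with 4 channels.
def latent_elems(width, height, channels=4, downscale=8):
    return (width // downscale) * (height // downscale) * channels

scale = latent_elems(768, 768) / latent_elems(512, 512)
print(f"768x768 needs about {scale:.2f}x the latent memory of 512x512")  # 2.25x
```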
@leejet I don't think that is how that label is supposed to be used 😄
You're right, I made a mistake. I accidentally clicked on it while browsing; it wasn't my intention.
Any chance we could get OpenVINO support? It would help a lot!
Hey, finally stable diffusion for ggml 😄
Did a test run
Pain point: the extra Python libs for conversion. I got a pip install error because I already have an incompatible version of something installed; convert.py worked anyway, though. :)
Timings: I used the q8_0 quantization and ran with different thread counts:
I have a 12-core (24-thread) CPU.
I took the timing of a sampling step.
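For context on what q8_0 buys you size-wise: in ggml's q8_0 layout (an assumption about this repo's format, but it matches upstream ggml), each block of 32 weights is stored as 32 int8 quants plus one f16 scale, i.e. 8.5 bits per weight versus 16 for f16. A sketch of the expected ratio; actual files also contain some unquantized tensors:

```python
# Estimated bits per weight for ggml's q8_0 vs plain f16.
# Assumed q8_0 layout: blocks of 32 int8 quants + one f16 scale per block.
Q8_0_BLOCK = 32
q8_0_bits = (Q8_0_BLOCK * 8 + 16) / Q8_0_BLOCK   # 8.5 bits/weight
f16_bits = 16.0

print(f"q8_0 is about {q8_0_bits / f16_bits:.0%} the size of f16")  # ~53%
```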
Additional questions:
(cinematic:1.3)
edit: added f16 timings
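Regarding the pip pain point above: one way around clashing versions is to run convert.py from an isolated virtual environment instead of the system Python. A minimal sketch using the standard library; the exact packages to install are whatever the repo's conversion script requires (not listed here):

```python
# Create an isolated environment so convert.py's dependencies don't
# clash with whatever is already installed system-wide.
import venv

# Pass with_pip=True to also bootstrap pip inside the environment.
venv.create("sd-convert-env", with_pip=False)

# Then, in a shell (hypothetical package list):
#   ./sd-convert-env/bin/pip install <conversion dependencies>
#   ./sd-convert-env/bin/python convert.py ...
print("created sd-convert-env")
```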