cuda : speed-up by using CUBLAS_COMPUTE_32F instead of CUBLAS_COMPUTE_16F #3816
Conversation
@slaren Have you noticed this as well? Do you think there is any reason not to switch to F32 compute?
I tested this again on a 3090 Ti, and for me master is faster:

Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6
build: c8d6a1f (1431) (master)

Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6
build: 3b9ea65 (1432) (PR)
Might be a dumb question, but would these changes affect quantized models at all?
@slaren Indeed, on RTX 3090 I also don't observe a benefit from 32F mode, so it's not a universal thing. @KerfuffleV2 Yes, because for some of the operations we dequantize to F16 and use cuBLAS. There are some numbers in my post above for quantized models.
I restored the
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
build: 0f2498f (1433)

LLAMA_CUBLAS=1 make -j batched-bench && ./batched-bench ./models/llama-2-13b.Q4_K_M.gguf 4096 1 99 1 512,3200 128,800 1
### master
main: n_kv_max = 4096, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 512 | 128 | 1 | 640 | 0.248 | 2060.39 | 1.985 | 64.48 | 2.234 | 286.52 |
| 512 | 800 | 1 | 1312 | 0.240 | 2131.52 | 12.681 | 63.08 | 12.922 | 101.54 |
| 3200 | 128 | 1 | 3328 | 1.946 | 1644.78 | 2.416 | 52.98 | 4.362 | 763.01 |
| 3200 | 800 | 1 | 4000 | 1.929 | 1658.93 | 15.530 | 51.51 | 17.459 | 229.11 |
### PR
main: n_kv_max = 4096, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 512 | 128 | 1 | 640 | 0.243 | 2110.64 | 1.898 | 67.43 | 2.141 | 298.96 |
| 512 | 800 | 1 | 1312 | 0.233 | 2197.75 | 12.323 | 64.92 | 12.556 | 104.49 |
| 3200 | 128 | 1 | 3328 | 1.893 | 1690.68 | 2.452 | 52.21 | 4.344 | 766.03 |
| 3200 | 800 | 1 | 4000 | 1.872 | 1709.50 | 15.783 | 50.69 | 17.655 | 226.57 |

Device 0: NVIDIA RTX A6000, compute capability 8.6
build: 0f2498f (1433)
The performance is the same now for me too, maybe a little bit better than master, but within the margin of error.

Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6
build: 0f2498f (1433)
After 0f2498f the performance seems about the same as master. Tested with a Q5_K_M Mistral model, using:

master
main: n_kv_max = 4608, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1

PR pre-0f2498f
main: n_kv_max = 4608, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1

PR
main: n_kv_max = 4608, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1

PR +
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 512 | 128 | 1 | 640 | 0.983 | 521.02 | 4.016 | 31.87 | 4.999 | 128.04 |
| 512 | 128 | 2 | 768 | 0.982 | 521.25 | 17.280 | 14.81 | 18.263 | 42.05 |
| 512 | 128 | 3 | 896 | 1.016 | 503.79 | 17.448 | 22.01 | 18.465 | 48.53 |
| 512 | 128 | 4 | 1024 | 0.987 | 519.00 | 17.664 | 28.99 | 18.650 | 54.91 |
| 512 | 128 | 5 | 1152 | 1.053 | 486.17 | 17.780 | 36.00 | 18.833 | 61.17 |
| 512 | 128 | 6 | 1280 | 0.988 | 518.00 | 17.942 | 42.80 | 18.930 | 67.62 |
| 512 | 128 | 7 | 1408 | 0.988 | 518.47 | 18.091 | 49.53 | 19.078 | 73.80 |
| 512 | 128 | 8 | 1536 | 0.989 | 517.95 | 18.360 | 55.77 | 19.348 | 79.39 |
| 512 | 128 | 16 | 2560 | 0.989 | 517.65 | 20.503 | 99.89 | 21.492 | 119.12 |
| 512 | 128 | 32 | 4608 | 0.989 | 517.81 | 26.741 | 153.17 | 27.730 | 166.17 |
@KerfuffleV2 Thanks for the results. Likely this PR is not getting merged as the numbers are not convincing.
I had tested FP8 cuBLAS in ggllm; that's for 40-series NVIDIA cards. The kernels for FP8 conversion and the cuBLAS wrapper are here: https://github.com/cmp-nct/ggllm.cpp/blob/ggfalcon_dev/ggml-cuda.cu
For dual P40s on 70B, I started having 107-second replies during prompt processing of about 3k tokens. With this PR, those replies have come down to 25 seconds, which is reasonable. Generation speed itself only went from 8.8 t/s to 8.95 t/s. The model is Q4_K_M on compute 6.1. Testing dual 3090s, there was some performance hit, but it was negligible for me during one-shot generations or chat. I mostly see it in prompt processing, and on such fast GPUs it's fractions of a second and .XX tokens.
Prompt processing results on my Tesla P40:
It was much worse for me without 0f2498f (not that anyone else probably really cares about performance on el-cheapo AMD GPUs).
As expected, Pascal benefits greatly from FP32, the 3090 can go either way, and AMD favors FP16.
Can you explain why exactly that is expected? You can also point me to reading material if that would be easier than explaining.
Pascal (at least the P40) is 1/3 speed in FP16 ops. Nvidia made it this way and released the P100 for accelerated FP16 (but missing 8-bit, I think). They told users to pick one or the other based on the application. It's why it doesn't work well for back-ends like exllama. It doesn't even have tensor cores. 3090 speed for FP16/FP32 is pretty much similar; again, it's how they optimized it for what people were doing at the time. More and more workloads are using lower precision, so Nvidia keeps giving you smaller and smaller tensor datatypes. Hence the 4xxx cards have stuff like FP8. People used to care about double precision at one point and now not a peep. AMD, I think, just came out at a time when people were using FP16, so they accelerated that. In short, every card is optimized for what was popular and demanded by customers at the time. They tend to tout FLOPS at a given precision in the specs too. https://www.techpowerup.com/gpu-specs/tesla-p40.c2878
I saw this and was excited for better performance on my cheapo P40 + 1080 Ti setup (both are cards with a 1:64 FP16:FP32 ratio, so it should be waaayyy faster, I guess?).
What is the explanation for the poor improvement? (Q3 70B is the most I can fit in VRAM)
A seeming 4x speedup on prompt processing is nothing to sneeze at. Go try 3k context now. I think token generation is already FP32 due to the nature of offloading to the CPU, unless you forced that FP16 compile flag.
I guess it's faster in HBM only...
Most of the gains of this seem to have been replicated by #3882.
edit: cannot reproduce these numbers for master.
Force-pushed from 2648647 to c830a05
I don't know what I did wrong in my previous testing, but I'm still seeing a 2.76x prompt processing speedup with this PR for 7B LLaMA on my P40.
The PR seems to have been obsoleted, so I can't try it along with the new changes. I didn't get any speedup right after the fix was made, but now the code is all different.
How come? I merged
When I try to merge it, it has conflicts. edit: I pulled the repo again and re-merged. It worked. I think it was a matter of 99 vs 103 for PP, from 10.x ms per token down to 9.x ms per token.
Hello, does the compilation for old graphics cards (no tensor cores) change THAT MUCH? Have I made a mistake somewhere? I'm shocked :) GTX 1080 Ti power :)
Today's llama.cpp build, Windows 10 + AMD 2990WX + GTX 1080 Ti + 64GB.

Test 1 (llama.cpp, AMD 2990WX + GTX 1080 Ti): -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_FORCE_MMQ=ON
- Linux Ubuntu 22, cuBLAS + MMQ

Test 2 (llama.cpp, AMD 2990WX + GTX 1080 Ti):
- Windows 10, cuBLAS

llama-bench, Test 1 (cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_FORCE_MMQ=ON):
- Windows 10, cuBLAS + MMQ
- Windows 10, cuBLAS
- Linux Ubuntu 22, cuBLAS + MMQ
- Ubuntu Linux, cuBLAS
Yeah, MMQ changes a lot. For newer cards it wasn't helpful for single batches either.
For new cards we'll have to wait for cuBLAS FP8 support; that's several times faster than FP32 and the precision is still awesome.
FP8? Sure.. let me just fire up my H100 :P Already beat exllama on Ampere, minus the prompt processing speed. I'm more hopeful for an 8-bit KV cache than FP8. On older GPUs I'm not sure what else can be done.
Any 40-series card and the upcoming Super and 50-series support it.
TIL, 4xxx supports it. In textgen I did testing on exllamav2 with the 8-bit vs 16-bit cache; it didn't appear to make a difference for the same models and wikitext. Hopefully that holds true here. For most "good" models, sadly 24 GB is now not enough.
I had implemented it for Falcon inference in ggllm and it worked very well on my 4090: a significant speed boost when using cuBLAS compared to FP16 or FP32.
True, but that only benefits bleeding-edge cards. I'd rather have a reasonable 103B than an instant 7B. Quality over quantity.
Force-pushed from e75889a to a40f611
There are conflicts since #4606 was merged.
As noted above, FP32 is much faster than FP16 on the Tesla P40, but it's still a capable card otherwise with its 24GB VRAM. Can we have an option to specify the computation floating-point type (and upcast float16 to float32 when necessary)? Besides choosing
People using cutting-edge GPUs can also benefit from this option when they encounter Inf or NaN. See this comment for more details.
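For what it's worth, a minimal sketch of what such an opt-in could look like at the graph level, assuming the `ggml_mul_mat_set_prec()` / `GGML_PREC_F32` mechanism referenced further down in this thread (per-op precision rather than a command-line flag):

```cpp
// Sketch only: opting a single matmul into F32 compute via ggml_mul_mat_set_prec.
#include "ggml.h"

static struct ggml_tensor * mul_mat_f32_prec(
        struct ggml_context * ctx,
        struct ggml_tensor  * w,     // e.g. F16 (or quantized) weight
        struct ggml_tensor  * x) {   // activations
    struct ggml_tensor * y = ggml_mul_mat(ctx, w, x);
    // Ask backends that honor it (e.g. CUDA) to accumulate in F32,
    // i.e. take the CUBLAS_COMPUTE_32F path instead of CUBLAS_COMPUTE_16F.
    ggml_mul_mat_set_prec(y, GGML_PREC_F32);
    return y;
}
```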
Force-pushed from 41f0f44 to 4011f09
Unironically I get no benefit on the 3090 from using FP16 besides somewhat lower memory use. Through SD I have found that the weights can even be loaded as FP16 for these Pascal cards as long as the calculations happen at the correct precision; xformers does this automatically on that end.
This could be related: https://twitter.com/main_horse/status/1742013125090795531
Force-pushed from 4011f09 to 4cc78d3
CUBLAS_COMPUTE_32F,
CUBLAS_GEMM_DEFAULT_TENSOR_OP));
} break;
}
We might want to merge this particular change of ggml_cuda_op_mul_mat_cublas, since it uses less memory than cublasSgemm and still performs the compute in F32, which is needed for models like Phi-2.
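For context, a rough sketch of the call pattern being discussed (names, shapes and leading dimensions are illustrative, not the exact llama.cpp code): the F16 inputs go straight into cublasGemmEx with an F32 output and F32 accumulation, so no F32 copies of src0/src1 are needed (as cublasSgemm would require) and no temporary F16 dst has to be converted afterwards.

```cpp
// Illustrative sketch, not the actual ggml_cuda_op_mul_mat_cublas code.
#include <cublas_v2.h>
#include <cuda_fp16.h>

// C (m x n, F32) = A^T (m x k, F16) * B (k x n, F16), accumulated in F32.
cublasStatus_t gemm_f16_in_f32_out(cublasHandle_t handle,
                                   const half * src0_f16, const half * src1_f16,
                                   float * dst_f32, int m, int n, int k) {
    const float alpha = 1.0f;   // with CUBLAS_COMPUTE_32F the scalars are float
    const float beta  = 0.0f;
    return cublasGemmEx(handle,
                        CUBLAS_OP_T, CUBLAS_OP_N,
                        m, n, k,
                        &alpha,
                        src0_f16, CUDA_R_16F, k,   // F16 input, no F32 copy needed
                        src1_f16, CUDA_R_16F, k,
                        &beta,
                        dst_f32,  CUDA_R_32F, m,   // F32 result written directly
                        CUBLAS_COMPUTE_32F,
                        CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```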
I am not sure if I am following all the logic, but I would be concerned about down-converting F32 src0/src1 to F16 despite the user requesting GGML_PREC_32. In the long run, I think it would be better to always respect the user types and do all the type conversions in the graph (ggerganov/ggml#455), since it would give users more control and it would simplify the code in the backends. It would also move the temporary buffer from the pool to the compute buffer, which would result in a more accurate estimation of the VRAM needed to run a model. It should also help with the issue of to_fp32 and to_fp16 in the CUDA backend being unable to deal with non-contiguous tensors, since the conversion would be done in a ggml_cpy instead.
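As a rough illustration of that direction (sketch only; the helper name is hypothetical, not part of ggml), the F32-to-F16 conversion could be expressed as a node in the graph via ggml_cpy, so the backend only ever sees an ordinary copy between tensors of different types:

```cpp
// Hypothetical sketch of "do the type conversion in the graph" via ggml_cpy.
#include "ggml.h"

static struct ggml_tensor * cast_f32_to_f16_in_graph(
        struct ggml_context * ctx,
        struct ggml_tensor  * src) {   // F32, possibly non-contiguous
    // Destination with the same shape but F16 type; the ggml_cpy node performs
    // the conversion (and produces contiguous data), so the CUDA backend does
    // not need an ad-hoc to_fp16() for non-contiguous tensors, and the buffer
    // is accounted for in the compute buffer instead of the backend pool.
    struct ggml_tensor * dst = ggml_new_tensor(ctx, GGML_TYPE_F16,
                                               GGML_MAX_DIMS, src->ne);
    return ggml_cpy(ctx, src, dst);
}
```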
Yet every time I tried Triton kernels on a 3090 they were overall slower.
I noticed significant speed differences from such small changes when using cuBLAS while developing ggllm (the Falcon fork). In general, we should also have a look at the EXL2 implementation. Even without FP8 that kernel is 2 times faster than llama.cpp on modern hardware; it delivers up to 14000 tokens/second prompt processing on a single 4090, while on llama.cpp I top out at 5500.
Heh.. so I have discovered a bad DIMM in my server that was causing memory bandwidth to drop from 60 GB/s down to 10 GB/s. Having fixed that issue, I was able to properly test performance regressions again. Since I merged this PR into main@5a7d312 I have gone from 18.6 t/s down to 15.5 t/s on dual 3090s, using the same kernel settings and also going back to splitting by row. I had merged this PR for the P40s, but I don't think its contents are what returned the performance; I just happened to grab a backup at the right time. So now I have a date and a commit to test against, to find the change that ate my 3 tokens/s. I have other backups from Dec 27th and those also have the regression. edit:
On master I'm getting about 910 t/s with pp512 on a Q4_0 7B LLaMA with or without this change, so it doesn't seem to be necessary for Tesla P40s anymore.
Curious observation: using CUBLAS_COMPUTE_32F is faster than CUBLAS_COMPUTE_16F. Tested on V100 and A6000. It seems to improve TG, PP and batched decoding speed, and we avoid allocating and copying the F16 dst data.

Edit: It leads to improvements on some NVIDIA cards, but not all. For example, on the 3090 the performance is degraded when using CUBLAS_COMPUTE_32F. AMD cards can suffer too. Leaving this PR as a demonstration that people can try for their specific case to see if it helps.
Device 0: Tesla V100-PCIE-16GB, compute capability 7.0
build: c8d6a1f (1431) (master)
Device 0: Tesla V100-PCIE-16GB, compute capability 7.0
build: 3b9ea65 (1432) (PR)
Device 0: NVIDIA RTX A6000, compute capability 8.6
build: c8d6a1f (1431)
Device 0: NVIDIA RTX A6000, compute capability 8.6
build: 3b9ea65 (1432)
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
build: c8d6a1f (1431)
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
build: 3b9ea65 (1432)
Latest benches after the GGML_PREC_F32 addition:

LLAMA_CUBLAS=1 make -j llama-bench && ./llama-bench -m ./models/openllama-7b-v2/ggml-model-f16.gguf -m ./models/openllama-7b-v2/ggml-model-q4_k.gguf -ngl 99
Device 0: Tesla V100-PCIE-16GB, compute capability 7.0
build: a40f611 (1662)