
cuda : speed-up by using CUBLAS_COMPUTE_32F instead of CUBLAS_COMPUTE_16F #3816

Closed
wants to merge 1 commit

Conversation

@ggerganov (Owner) commented Oct 27, 2023

Curious observation: using CUBLAS_COMPUTE_32F is faster than CUBLAS_COMPUTE_16F. Tested on V100 and A6000.

It seems to improve TG, PP, and batched decoding speed, and we avoid allocating and copying the F16 dst data.

Edit: It leads to improvements on some NVIDIA cards, but not all. For example, on a 3090 the performance is degraded when using CUBLAS_COMPUTE_32F, and AMD cards can suffer too.

Leaving this PR as a demonstration that people can try for their specific case to see if it helps.
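
For reference, a minimal sketch (an illustration, not the PR's actual diff) of the kind of cuBLAS call this toggles: the F16 matrices stay as inputs, but the GEMM accumulates in F32 and writes an F32 destination, so the separate F16 dst buffer and its conversion go away.

```cpp
// Illustrative sketch only (not the llama.cpp code itself): an F16 x F16 GEMM that
// accumulates in F32 and writes an F32 result, which is what this PR switches to.
#include <cublas_v2.h>
#include <cuda_fp16.h>

cublasStatus_t gemm_f16_in_f32_out(cublasHandle_t handle, int m, int n, int k,
                                   const half * A,  // k x m weights (transposed below)
                                   const half * B,  // k x n activations
                                   float      * C)  // m x n result, written directly in F32
{
    const float alpha = 1.0f; // with CUBLAS_COMPUTE_32F, alpha/beta are float
    const float beta  = 0.0f; // (with CUBLAS_COMPUTE_16F they would have to be half)
    return cublasGemmEx(handle,
                        CUBLAS_OP_T, CUBLAS_OP_N,
                        m, n, k,
                        &alpha,
                        A, CUDA_R_16F, k,
                        B, CUDA_R_16F, k,
                        &beta,
                        C, CUDA_R_32F, m,          // master writes a CUDA_R_16F dst here instead
                        CUBLAS_COMPUTE_32F,        // master uses CUBLAS_COMPUTE_16F
                        CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```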


  • V100 tests
LLAMA_CUBLAS=1 make -j batched batched-bench && ./batched-bench ./models/openllama-7b-v2/ggml-model-f16.gguf 4608 1 99 1 512 128 1,2,3,4,5,6,7,8,16,32

### master

main: n_kv_max = 4608, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    0.174 |  2949.89 |    2.569 |    49.83 |    2.742 |   233.39 |
|   512 |    128 |    2 |    768 |    0.164 |  3120.24 |    3.021 |    84.73 |    3.185 |   241.10 |
|   512 |    128 |    3 |    896 |    0.159 |  3212.35 |    3.089 |   124.30 |    3.249 |   275.81 |
|   512 |    128 |    4 |   1024 |    0.160 |  3206.65 |    3.141 |   163.02 |    3.300 |   310.26 |
|   512 |    128 |    5 |   1152 |    0.160 |  3198.24 |    3.233 |   197.98 |    3.393 |   339.55 |
|   512 |    128 |    6 |   1280 |    0.162 |  3163.07 |    3.345 |   229.62 |    3.507 |   365.03 |
|   512 |    128 |    7 |   1408 |    0.160 |  3203.58 |    3.437 |   260.67 |    3.597 |   391.42 |
|   512 |    128 |    8 |   1536 |    0.170 |  3020.19 |    3.470 |   295.10 |    3.640 |   422.03 |
|   512 |    128 |   16 |   2560 |    0.161 |  3181.31 |    4.346 |   471.25 |    4.507 |   568.03 |
|   512 |    128 |   32 |   4608 |    0.161 |  3186.16 |    5.147 |   795.78 |    5.308 |   868.14 |

### PR

main: n_kv_max = 4608, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1


|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    0.165 |  3107.66 |    2.500 |    51.20 |    2.665 |   240.18 |
|   512 |    128 |    2 |    768 |    0.149 |  3437.12 |    3.037 |    84.30 |    3.186 |   241.07 |
|   512 |    128 |    3 |    896 |    0.149 |  3435.30 |    3.111 |   123.44 |    3.260 |   274.85 |
|   512 |    128 |    4 |   1024 |    0.149 |  3434.40 |    3.128 |   163.66 |    3.277 |   312.44 |
|   512 |    128 |    5 |   1152 |    0.149 |  3435.00 |    3.205 |   199.71 |    3.354 |   343.51 |
|   512 |    128 |    6 |   1280 |    0.156 |  3291.33 |    3.192 |   240.61 |    3.347 |   382.38 |
|   512 |    128 |    7 |   1408 |    0.149 |  3431.13 |    3.311 |   270.64 |    3.460 |   406.95 |
|   512 |    128 |    8 |   1536 |    0.154 |  3325.30 |    3.367 |   304.09 |    3.521 |   436.19 |
|   512 |    128 |   16 |   2560 |    0.158 |  3244.86 |    3.587 |   570.94 |    3.745 |   683.61 |
|   512 |    128 |   32 |   4608 |    0.150 |  3403.44 |    4.949 |   827.70 |    5.099 |   903.69 |


LLAMA_CUBLAS=1 make -j batched batched-bench && ./batched-bench ./models/openllama-7b-v2/ggml-model-q4_k.gguf 4608 1 99 1 512 128 1,2,3,4,5,6,7,8,16,32

### master

main: n_kv_max = 4608, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    0.225 |  2274.96 |    1.410 |    90.79 |    1.635 |   391.47 |
|   512 |    128 |    2 |    768 |    0.216 |  2374.49 |    2.300 |   111.31 |    2.516 |   305.30 |
|   512 |    128 |    3 |    896 |    0.214 |  2392.08 |    2.375 |   161.70 |    2.589 |   346.10 |
|   512 |    128 |    4 |   1024 |    0.212 |  2420.06 |    2.407 |   212.73 |    2.618 |   391.09 |
|   512 |    128 |    5 |   1152 |    0.214 |  2396.26 |    3.082 |   207.67 |    3.296 |   349.57 |
|   512 |    128 |    6 |   1280 |    0.212 |  2412.97 |    3.115 |   246.58 |    3.327 |   384.75 |
|   512 |    128 |    7 |   1408 |    0.212 |  2415.08 |    3.210 |   279.15 |    3.422 |   411.49 |
|   512 |    128 |    8 |   1536 |    0.212 |  2412.95 |    3.256 |   314.46 |    3.469 |   442.84 |
|   512 |    128 |   16 |   2560 |    0.215 |  2379.14 |    5.760 |   355.57 |    5.975 |   428.45 |
|   512 |    128 |   32 |   4608 |    0.214 |  2392.13 |   10.047 |   407.68 |   10.261 |   449.07 |

### PR

main: n_kv_max = 4608, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    0.217 |  2364.08 |    1.326 |    96.52 |    1.543 |   414.84 |
|   512 |    128 |    2 |    768 |    0.203 |  2526.71 |    2.281 |   112.24 |    2.483 |   309.24 |
|   512 |    128 |    3 |    896 |    0.202 |  2538.35 |    2.366 |   162.30 |    2.568 |   348.94 |
|   512 |    128 |    4 |   1024 |    0.202 |  2540.78 |    2.386 |   214.62 |    2.587 |   395.81 |
|   512 |    128 |    5 |   1152 |    0.202 |  2539.73 |    3.037 |   210.71 |    3.239 |   355.67 |
|   512 |    128 |    6 |   1280 |    0.201 |  2541.99 |    3.075 |   249.78 |    3.276 |   390.71 |
|   512 |    128 |    7 |   1408 |    0.202 |  2539.76 |    3.173 |   282.39 |    3.375 |   417.24 |
|   512 |    128 |    8 |   1536 |    0.202 |  2529.92 |    3.196 |   320.35 |    3.399 |   451.92 |
|   512 |    128 |   16 |   2560 |    0.205 |  2502.50 |    5.201 |   393.75 |    5.406 |   473.57 |
|   512 |    128 |   32 |   4608 |    0.211 |  2427.09 |    9.887 |   414.30 |   10.098 |   456.35 |



LLAMA_CUBLAS=1 make -j batched-bench && ./batched-bench ./models/openllama-7b-v2/ggml-model-f16.gguf 4096 1 99 1 512,3200 128,128,800 1

### master

main: n_kv_max = 4096, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    0.173 |  2955.73 |    2.570 |    49.81 |    2.743 |   233.32 |
|   512 |    128 |    1 |    640 |    0.163 |  3138.45 |    2.554 |    50.12 |    2.717 |   235.54 |
|   512 |    800 |    1 |   1312 |    0.162 |  3166.96 |   16.552 |    48.33 |   16.714 |    78.50 |
|  3200 |    128 |    1 |   3328 |    1.397 |  2290.24 |    3.208 |    39.89 |    4.606 |   722.58 |
|  3200 |    128 |    1 |   3328 |    1.404 |  2278.80 |    3.206 |    39.93 |    4.610 |   721.92 |
|  3200 |    800 |    1 |   4000 |    1.399 |  2286.73 |   20.490 |    39.04 |   21.889 |   182.74 |

### PR

main: n_kv_max = 4096, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    0.163 |  3134.09 |    2.483 |    51.54 |    2.647 |   241.81 |
|   512 |    128 |    1 |    640 |    0.149 |  3442.34 |    2.453 |    52.19 |    2.601 |   246.02 |
|   512 |    800 |    1 |   1312 |    0.149 |  3435.07 |   15.608 |    51.26 |   15.757 |    83.26 |
|  3200 |    128 |    1 |   3328 |    1.283 |  2494.81 |    2.754 |    46.47 |    4.037 |   824.40 |
|  3200 |    128 |    1 |   3328 |    1.278 |  2502.93 |    2.758 |    46.42 |    4.036 |   824.56 |
|  3200 |    800 |    1 |   4000 |    1.283 |  2494.49 |   17.413 |    45.94 |   18.695 |   213.96 |


LLAMA_CUBLAS=1 make -j batched-bench && ./batched-bench ./models/openllama-7b-v2/ggml-model-q4_k.gguf 4096 1 99 1 512,3200 128,128,800 1

### master

main: n_kv_max = 4096, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    0.225 |  2272.56 |    1.392 |    91.97 |    1.617 |   395.77 |
|   512 |    128 |    1 |    640 |    0.212 |  2419.60 |    1.382 |    92.60 |    1.594 |   401.54 |
|   512 |    800 |    1 |   1312 |    0.211 |  2424.63 |    9.203 |    86.93 |    9.414 |   139.37 |
|  3200 |    128 |    1 |   3328 |    1.760 |  1818.62 |    2.031 |    63.03 |    3.790 |   878.00 |
|  3200 |    128 |    1 |   3328 |    1.765 |  1813.22 |    2.033 |    62.95 |    3.798 |   876.18 |
|  3200 |    800 |    1 |   4000 |    1.764 |  1813.91 |   13.143 |    60.87 |   14.907 |   268.33 |

### PR

main: n_kv_max = 4096, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    0.216 |  2372.26 |    1.331 |    96.14 |    1.547 |   413.66 |
|   512 |    128 |    1 |    640 |    0.201 |  2551.45 |    1.288 |    99.39 |    1.489 |   429.94 |
|   512 |    800 |    1 |   1312 |    0.201 |  2552.14 |    8.284 |    96.57 |    8.485 |   154.63 |
|  3200 |    128 |    1 |   3328 |    1.651 |  1937.73 |    1.583 |    80.87 |    3.234 |  1029.03 |
|  3200 |    128 |    1 |   3328 |    1.658 |  1929.82 |    1.586 |    80.71 |    3.244 |  1025.86 |
|  3200 |    800 |    1 |   4000 |    1.648 |  1941.65 |   10.096 |    79.24 |   11.744 |   340.59 |
llama-bench

Device 0: Tesla V100-PCIE-16GB, compute capability 7.0

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | --- | ---: |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 512 | 3359.77 ± 73.09 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | tg 128 | 52.61 ± 0.02 |
| llama 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | pp 512 | 2525.13 ± 41.08 |
| llama 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | tg 128 | 101.71 ± 0.33 |

build: c8d6a1f (1431) (master)

Device 0: Tesla V100-PCIE-16GB, compute capability 7.0

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | --- | ---: |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 512 | 3648.52 ± 55.04 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | tg 128 | 53.53 ± 0.16 |
| llama 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | pp 512 | 2668.02 ± 33.54 |
| llama 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | tg 128 | 105.22 ± 0.76 |

build: 3b9ea65 (1432) (PR)


Device 0: NVIDIA RTX A6000, compute capability 8.6

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | --- | ---: |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 512 | 4098.15 ± 105.84 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | tg 128 | 44.84 ± 0.20 |
| llama 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | pp 512 | 3450.93 ± 72.39 |
| llama 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | tg 128 | 105.11 ± 0.57 |

build: c8d6a1f (1431)

Device 0: NVIDIA RTX A6000, compute capability 8.6

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | --- | ---: |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 512 | 4306.38 ± 43.09 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | tg 128 | 45.20 ± 0.09 |
| llama 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | pp 512 | 3473.03 ± 118.79 |
| llama 7B mostly Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | tg 128 | 106.17 ± 0.97 |

build: 3b9ea65 (1432)


Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | --- | ---: |
| llama 13B mostly Q4_K - Medium | 7.33 GiB | 13.02 B | CUDA | 99 | pp 512 | 2241.65 ± 20.20 |
| llama 13B mostly Q4_K - Medium | 7.33 GiB | 13.02 B | CUDA | 99 | tg 128 | 67.65 ± 0.23 |

build: c8d6a1f (1431)

Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | --- | ---: |
| llama 13B mostly Q4_K - Medium | 7.33 GiB | 13.02 B | CUDA | 99 | pp 512 | 1760.98 ± 5.52 |
| llama 13B mostly Q4_K - Medium | 7.33 GiB | 13.02 B | CUDA | 99 | tg 128 | 70.63 ± 0.34 |

build: 3b9ea65 (1432)


Latest benches after GGML_PREC_F32 addition:

  • V100 tests
LLAMA_CUBLAS=1 make -j batched batched-bench && ./batched-bench ./models/openllama-7b-v2/ggml-model-f16.gguf 4608 1 99 1 512 128 1,2,3,4,5,6,7,8,16,32

### PR

main: n_kv_max = 4608, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1, n_threads = 3, n_threads_batch = 3

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    0.167 |  3071.75 |    2.503 |    51.14 |    2.670 |   239.72 |
|   512 |    128 |    2 |    768 |    0.149 |  3430.51 |    3.029 |    84.53 |    3.178 |   241.67 |
|   512 |    128 |    3 |    896 |    0.149 |  3429.80 |    3.087 |   124.39 |    3.236 |   276.86 |
|   512 |    128 |    4 |   1024 |    0.149 |  3432.35 |    3.111 |   164.55 |    3.261 |   314.05 |
|   512 |    128 |    5 |   1152 |    0.149 |  3435.34 |    3.176 |   201.49 |    3.325 |   346.43 |
|   512 |    128 |    6 |   1280 |    0.149 |  3430.67 |    3.221 |   238.46 |    3.370 |   379.84 |
|   512 |    128 |    7 |   1408 |    0.149 |  3436.45 |    3.299 |   271.56 |    3.448 |   408.30 |
|   512 |    128 |    8 |   1536 |    0.150 |  3423.31 |    3.349 |   305.79 |    3.498 |   439.08 |
|   512 |    128 |   16 |   2560 |    0.150 |  3418.48 |    3.584 |   571.38 |    3.734 |   685.57 |
|   512 |    128 |   32 |   4608 |    0.151 |  3396.19 |    4.495 |   911.33 |    4.645 |   991.97 |

LLAMA_CUBLAS=1 make -j batched batched-bench && ./batched-bench ./models/openllama-7b-v2/ggml-model-q4_k.gguf 4608 1 99 1 512 128 1,2,3,4,5,6,7,8,16,32

### PR

main: n_kv_max = 4608, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1, n_threads = 3, n_threads_batch = 3

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    0.227 |  2253.90 |    1.324 |    96.65 |    1.552 |   412.50 |
|   512 |    128 |    2 |    768 |    0.201 |  2546.87 |    2.273 |   112.63 |    2.474 |   310.44 |
|   512 |    128 |    3 |    896 |    0.201 |  2545.44 |    2.323 |   165.29 |    2.524 |   354.95 |
|   512 |    128 |    4 |   1024 |    0.201 |  2542.89 |    2.378 |   215.28 |    2.580 |   396.95 |
|   512 |    128 |    5 |   1152 |    0.201 |  2543.87 |    3.002 |   213.19 |    3.203 |   359.62 |
|   512 |    128 |    6 |   1280 |    0.201 |  2542.08 |    3.056 |   251.34 |    3.257 |   393.00 |
|   512 |    128 |    7 |   1408 |    0.201 |  2543.49 |    3.120 |   287.19 |    3.321 |   423.94 |
|   512 |    128 |    8 |   1536 |    0.202 |  2538.86 |    3.174 |   322.58 |    3.376 |   454.96 |
|   512 |    128 |   16 |   2560 |    0.202 |  2532.77 |    5.130 |   399.19 |    5.333 |   480.07 |
|   512 |    128 |   32 |   4608 |    0.203 |  2520.90 |    9.382 |   436.57 |    9.585 |   480.73 |


LLAMA_CUBLAS=1 make -j batched-bench && ./batched-bench ./models/openllama-7b-v2/ggml-model-f16.gguf 4096 1 99 1 512,3200 128,128,800 1

### PR

main: n_kv_max = 4096, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1, n_threads = 3, n_threads_batch = 3

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    0.166 |  3080.77 |    2.502 |    51.16 |    2.668 |   239.89 |
|   512 |    128 |    1 |    640 |    0.149 |  3440.05 |    2.496 |    51.28 |    2.645 |   241.96 |
|   512 |    800 |    1 |   1312 |    0.149 |  3444.80 |   15.706 |    50.94 |   15.854 |    82.75 |
|  3200 |    128 |    1 |   3328 |    1.154 |  2773.27 |    2.743 |    46.66 |    3.897 |   853.96 |
|  3200 |    128 |    1 |   3328 |    1.092 |  2930.01 |    2.737 |    46.76 |    3.830 |   869.03 |
|  3200 |    800 |    1 |   4000 |    1.093 |  2928.98 |   17.208 |    46.49 |   18.301 |   218.57 |

LLAMA_CUBLAS=1 make -j batched-bench && ./batched-bench ./models/openllama-7b-v2/ggml-model-q4_k.gguf 4096 1 99 1 512,3200 128,128,800 1

### PR

main: n_kv_max = 4096, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1, n_threads = 3, n_threads_batch = 3

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    0.219 |  2340.69 |    1.323 |    96.78 |    1.541 |   415.23 |
|   512 |    128 |    1 |    640 |    0.200 |  2554.71 |    1.310 |    97.69 |    1.511 |   423.66 |
|   512 |    800 |    1 |   1312 |    0.202 |  2539.87 |    8.343 |    95.89 |    8.545 |   153.55 |
|  3200 |    128 |    1 |   3328 |    1.515 |  2112.70 |    1.564 |    81.83 |    3.079 |  1080.94 |
|  3200 |    128 |    1 |   3328 |    1.470 |  2177.61 |    1.542 |    83.02 |    3.011 |  1105.17 |
|  3200 |    800 |    1 |   4000 |    1.457 |  2197.03 |    9.854 |    81.19 |   11.310 |   353.66 |
LLAMA_CUBLAS=1 make -j llama-bench && ./llama-bench -m ./models/openllama-7b-v2/ggml-model-f16.gguf -m ./models/openllama-7b-v2/ggml-model-q4_k.gguf -ngl 99

Device 0: Tesla V100-PCIE-16GB, compute capability 7.0

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | --- | ---: |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 512 | 3722.34 ± 127.35 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | tg 128 | 52.80 ± 0.24 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | pp 512 | 2713.46 ± 37.55 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CUDA | 99 | tg 128 | 101.10 ± 0.99 |

build: a40f611 (1662)

@ggerganov (Owner, Author) commented Oct 27, 2023

@slaren Have you noticed this as well? Do you think there is any reason not to switch to F32 compute?

@slaren (Collaborator) commented Oct 27, 2023

I tested this again on a 3090 Ti, and for me master is faster:

Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | --- | ---: |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 512 | 4672.50 ± 290.50 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | tg 128 | 54.62 ± 0.18 |

build: c8d6a1f (1431) (master)

Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | --- | ---: |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 512 | 3285.46 ± 11.15 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | tg 128 | 55.63 ± 0.19 |

build: 3b9ea65 (1432) (PR)

@KerfuffleV2 (Collaborator) commented:

Might be a dumb question, but would these changes affect quantized models at all?

@ggerganov (Owner, Author) commented:

@slaren Indeed, on the RTX 3090 I also don't observe a benefit from 32F mode, so it's not a universal thing.

@KerfuffleV2 Yes, because for some of the operations we dequantize to F16 and use cuBLAS. There are some numbers in my post above for quantized models (Q4_K).

@ggerganov (Owner, Author) commented:

I restored CUBLAS_COMPUTE_16F in ggml_cuda_op_mul_mat_cublas() (used for 2D tensors) and kept CUBLAS_COMPUTE_32F in ggml_cuda_mul_mat_mat_batched_cublas() (used for the 3D KV cache tensors). Now the PP performance is maintained on the 3090 and maybe slightly better on the others, and the slight TG improvement is also maintained:

Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | --- | ---: |
| llama 13B mostly Q4_K - Medium | 7.33 GiB | 13.02 B | CUDA | 99 | pp 512 | 2303.96 ± 25.72 |
| llama 13B mostly Q4_K - Medium | 7.33 GiB | 13.02 B | CUDA | 99 | tg 128 | 69.43 ± 0.24 |

build: 0f2498f (1433)

LLAMA_CUBLAS=1 make -j batched-bench && ./batched-bench ./models/llama-2-13b.Q4_K_M.gguf 4096 1 99 1 512,3200 128,800 1

### master

main: n_kv_max = 4096, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    0.248 |  2060.39 |    1.985 |    64.48 |    2.234 |   286.52 |
|   512 |    800 |    1 |   1312 |    0.240 |  2131.52 |   12.681 |    63.08 |   12.922 |   101.54 |
|  3200 |    128 |    1 |   3328 |    1.946 |  1644.78 |    2.416 |    52.98 |    4.362 |   763.01 |
|  3200 |    800 |    1 |   4000 |    1.929 |  1658.93 |   15.530 |    51.51 |   17.459 |   229.11 |

### PR

main: n_kv_max = 4096, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    0.243 |  2110.64 |    1.898 |    67.43 |    2.141 |   298.96 |
|   512 |    800 |    1 |   1312 |    0.233 |  2197.75 |   12.323 |    64.92 |   12.556 |   104.49 |
|  3200 |    128 |    1 |   3328 |    1.893 |  1690.68 |    2.452 |    52.21 |    4.344 |   766.03 |
|  3200 |    800 |    1 |   4000 |    1.872 |  1709.50 |   15.783 |    50.69 |   17.655 |   226.57 |

Device 0: NVIDIA RTX A6000, compute capability 8.6

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | --- | ---: |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 512 | 4182.22 ± 93.96 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | tg 128 | 45.23 ± 0.12 |

build: 0f2498f (1433)

@slaren (Collaborator) commented Oct 27, 2023

The performance is the same now for me too, maybe a little bit better than master, but within the margin of error.

Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | --- | ---: |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | pp 512 | 4851.38 ± 70.18 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | CUDA | 99 | tg 128 | 55.83 ± 0.12 |

build: 0f2498f (1433)

@KerfuffleV2 (Collaborator) commented Oct 27, 2023

After 0f2498f the performance seems about the same as master, before that it was worse on my GPU.

Tested with a Q5_K_M Mistral model, using 4608 1 99 1 512 128 1,2,3,4,5,6,7,8,16,32 on

Device 0: AMD Radeon RX 6600, compute capability 10.3

master

main: n_kv_max = 4608, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|
| 512 | 128 | 1 | 640 | 1.026 | 499.12 | 4.339 | 29.50 | 5.364 | 119.30 |
| 512 | 128 | 2 | 768 | 1.023 | 500.40 | 17.023 | 15.04 | 18.047 | 42.56 |
| 512 | 128 | 3 | 896 | 1.027 | 498.35 | 17.187 | 22.34 | 18.214 | 49.19 |
| 512 | 128 | 4 | 1024 | 1.029 | 497.35 | 17.307 | 29.58 | 18.337 | 55.84 |
| 512 | 128 | 5 | 1152 | 1.033 | 495.78 | 17.449 | 36.68 | 18.482 | 62.33 |
| 512 | 128 | 6 | 1280 | 1.032 | 496.27 | 17.572 | 43.71 | 18.604 | 68.80 |
| 512 | 128 | 7 | 1408 | 1.029 | 497.45 | 17.718 | 50.57 | 18.747 | 75.10 |
| 512 | 128 | 8 | 1536 | 1.033 | 495.80 | 17.901 | 57.20 | 18.933 | 81.13 |
| 512 | 128 | 16 | 2560 | 1.027 | 498.49 | 19.685 | 104.04 | 20.712 | 123.60 |
| 512 | 128 | 32 | 4608 | 1.030 | 496.96 | 26.245 | 156.07 | 27.275 | 168.94 |

PR pre-0f2498f

main: n_kv_max = 4608, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|
| 512 | 128 | 1 | 640 | 3.095 | 165.44 | 4.565 | 28.04 | 7.660 | 83.55 |
| 512 | 128 | 2 | 768 | 3.096 | 165.36 | 17.221 | 14.87 | 20.317 | 37.80 |
| 512 | 128 | 3 | 896 | 3.104 | 164.97 | 17.415 | 22.05 | 20.519 | 43.67 |
| 512 | 128 | 4 | 1024 | 3.107 | 164.78 | 17.569 | 29.14 | 20.676 | 49.53 |
| 512 | 128 | 5 | 1152 | 3.109 | 164.71 | 17.749 | 36.06 | 20.858 | 55.23 |
| 512 | 128 | 6 | 1280 | 3.110 | 164.64 | 17.917 | 42.86 | 21.027 | 60.88 |
| 512 | 128 | 7 | 1408 | 3.112 | 164.51 | 18.091 | 49.53 | 21.204 | 66.40 |
| 512 | 128 | 8 | 1536 | 3.111 | 164.57 | 18.314 | 55.91 | 21.425 | 71.69 |
| 512 | 128 | 16 | 2560 | 3.113 | 164.47 | 20.491 | 99.95 | 23.604 | 108.46 |
| 512 | 128 | 32 | 4608 | 3.114 | 164.44 | 26.734 | 153.21 | 29.847 | 154.39 |

PR

main: n_kv_max = 4608, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|
| 512 | 128 | 1 | 640 | 1.012 | 505.92 | 4.542 | 28.18 | 5.554 | 115.22 |
| 512 | 128 | 2 | 768 | 1.016 | 504.09 | 17.205 | 14.88 | 18.221 | 42.15 |
| 512 | 128 | 3 | 896 | 1.016 | 503.85 | 17.595 | 21.82 | 18.612 | 48.14 |
| 512 | 128 | 4 | 1024 | 1.018 | 503.16 | 17.586 | 29.11 | 18.603 | 55.04 |
| 512 | 128 | 5 | 1152 | 1.018 | 502.79 | 17.761 | 36.03 | 18.780 | 61.34 |
| 512 | 128 | 6 | 1280 | 1.017 | 503.47 | 17.930 | 42.83 | 18.947 | 67.56 |
| 512 | 128 | 7 | 1408 | 1.017 | 503.53 | 18.111 | 49.47 | 19.128 | 73.61 |
| 512 | 128 | 8 | 1536 | 1.020 | 501.78 | 18.348 | 55.81 | 19.368 | 79.31 |
| 512 | 128 | 16 | 2560 | 1.017 | 503.34 | 20.541 | 99.70 | 21.558 | 118.75 |
| 512 | 128 | 32 | 4608 | 1.019 | 502.52 | 26.819 | 152.73 | 27.838 | 165.53 |

PR + FORCE_MMQ

main: n_kv_max = 4608, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|
| 512 | 128 | 1 | 640 | 0.983 | 521.02 | 4.016 | 31.87 | 4.999 | 128.04 |
| 512 | 128 | 2 | 768 | 0.982 | 521.25 | 17.280 | 14.81 | 18.263 | 42.05 |
| 512 | 128 | 3 | 896 | 1.016 | 503.79 | 17.448 | 22.01 | 18.465 | 48.53 |
| 512 | 128 | 4 | 1024 | 0.987 | 519.00 | 17.664 | 28.99 | 18.650 | 54.91 |
| 512 | 128 | 5 | 1152 | 1.053 | 486.17 | 17.780 | 36.00 | 18.833 | 61.17 |
| 512 | 128 | 6 | 1280 | 0.988 | 518.00 | 17.942 | 42.80 | 18.930 | 67.62 |
| 512 | 128 | 7 | 1408 | 0.988 | 518.47 | 18.091 | 49.53 | 19.078 | 73.80 |
| 512 | 128 | 8 | 1536 | 0.989 | 517.95 | 18.360 | 55.77 | 19.348 | 79.39 |
| 512 | 128 | 16 | 2560 | 0.989 | 517.65 | 20.503 | 99.89 | 21.492 | 119.12 |
| 512 | 128 | 32 | 4608 | 0.989 | 517.81 | 26.741 | 153.17 | 27.730 | 166.17 |

@ggerganov (Owner, Author) commented:

@KerfuffleV2 Thanks for the results. Likely this PR is not getting merged as the numbers are not convincing.

@cmp-nct (Contributor) commented Oct 27, 2023

I had tested FP8 cuBLAS in ggllm, which requires 40-series NVIDIA cards.
FP16 was a small increase over FP32, but FP8 was a big one; the performance was almost doubled. I didn't have time to continue with ggllm, but maybe some of the code is useful as a reference.
That won't work on any older cards, so I used code to select the variant based on compute capability.

The kernels for fp8 conversion and the cublas wrapper are here: https://github.com/cmp-nct/ggllm.cpp/blob/ggfalcon_dev/ggml-cuda.cu
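
For illustration, a rough sketch of that kind of compute-capability dispatch (the function name and thresholds are assumptions, not ggllm's actual code):

```cpp
// Hedged sketch: pick a cuBLAS compute mode from the device's compute capability.
// FP8 GEMMs actually go through the cublasLt API on Ada/Hopper (cc >= 8.9), so a real
// implementation would branch into a separate cublasLt code path for those devices.
#include <cuda_runtime.h>
#include <cublas_v2.h>

cublasComputeType_t pick_gemm_compute(int device) {
    int major = 0, minor = 0;
    cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
    cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, device);
    const int cc = 10 * major + minor;
    if (cc >= 70) {
        return CUBLAS_COMPUTE_16F; // tensor-core FP16 is fast on Volta and newer
    }
    return CUBLAS_COMPUTE_32F;     // Pascal and older: FP16 math is heavily rate-limited
}
```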

@ggerganov ggerganov added demo Demonstrate some concept or idea, not intended to be merged Nvidia GPU Issues specific to Nvidia GPUs labels Oct 28, 2023
@Ph0rk0z commented Oct 28, 2023

For dual P40s on a 70B model, I started having 107-second replies during prompt processing of about 3k tokens. With this PR, those replies have come down to 25 seconds, which is reasonable. Generation speed itself only went from 8.8 t/s to 8.95 t/s. The model is Q4_K_M on compute 6.1.

Testing dual 3090s, there was some performance hit, but it was negligible for me during one-shot generations or chat. I mostly see it in prompt processing, and on such fast GPUs it's fractions of a second and .XX tokens.

@cebtenzzre (Collaborator) commented Oct 28, 2023

Prompt processing results on my Tesla P40:

| GPU | Model | Test | t/s master | t/s PR | Speedup |
| --- | --- | --- | ---: | ---: | ---: |
| P40 | 7b q4_0 | pp512 | 338.89 | 934.11 | 2.76 |
| P40 | 13b q4_k_s | pp512 | 204.26 | 472.83 | 2.31 |

@KerfuffleV2 (Collaborator) commented:

It was much worse for me without 0f2498f (not that anyone else probably really cares about performance on el-cheapo AMD GPUs)

@Ph0rk0z commented Oct 29, 2023

As expected, Pascal benefits greatly from FP32, the 3090 can go either way, and AMD favors FP16.

@YavorGIvanov commented:

> As expected, Pascal benefits greatly from FP32, the 3090 can go either way, and AMD favors FP16.

Can you explain why exactly that is expected? You can also point me to some reading if that's easier to explain with.

@Ph0rk0z commented Oct 29, 2023

Pascal (at least the P40) runs FP16 ops at a fraction of its FP32 speed. Nvidia made it this way and released the P100 for accelerated FP16 (but missing 8-bit, I think). They told users to pick one or the other based on the application. That's why it doesn't work well for back-ends like exllama. It doesn't even have tensor cores.

The 3090's FP16 and FP32 speeds are pretty much similar. Again, it's how they optimized it for what people were doing at the time. More and more workloads are using lower precision, so Nvidia keeps giving you smaller and smaller tensor types. Hence the 4xxx cards have stuff like FP8. People used to care about double precision at one point, and now not a peep.

AMD, I think, just came out at a time when people were using FP16, so they accelerated that.

In short, every card is optimized for what was popular and demanded by customers at the time. They tend to tout FLOPS at a given precision in the specs too. https://www.techpowerup.com/gpu-specs/tesla-p40.c2878

@julianullrich99 commented:

I saw this and was excited for better performance on my cheapo P40 + 1080 Ti setup. (Both cards have a 1:64 FP16:FP32 ratio, so this should be waaayyy faster, I guess?)
However, I only saw the following improvements:

| GPUs | model | test | t/s (master) | t/s (PR) | speedup |
| --- | --- | --- | ---: | ---: | ---: |
| P40+1080ti | llama 13b mostly Q5_K medium | tg128 | 20.3 | 32.0 | 1.6 |
| P40+1080ti | llama 70b mostly Q3_K medium | tg128 | 4.3 | 5.7 | 1.3 |

What is the explanation for this poor improvement?

(q3 70b is the most I can fit in VRAM)

@Ph0rk0z commented Oct 30, 2023

A seeming 4x speedup in prompt processing is nothing to sneeze at. Go try 3k context now. I think token generation is already FP32 due to the nature of offloading to the CPU, unless you forced that FP16 compile flag.

@FNsi (Contributor) commented Nov 1, 2023

I guess it's faster on HBM only...

@Ph0rk0z commented Nov 5, 2023

Most of the gains from this seem to have been replicated by #3882.

@cebtenzzre (Collaborator) commented Nov 5, 2023

Updated performance figures (with -DLLAMA_CUDA_FORCE_MMQ=ON):

| GPU | Model | Test | t/s master | t/s PR | Speedup |
| --- | --- | --- | ---: | ---: | ---: |
| P40 | 7b q4_0 | pp512 | 889.54 | 934.11 | 1.05 |
| P40 | 13b q4_k_s | pp512 | 455.17 | 472.83 | 1.04 |

It would still be cool if we could make this a command-line flag like mmq used to be.

edit: cannot reproduce these numbers for master.

cebtenzzre added a commit to cebtenzzre/llama.cpp that referenced this pull request Nov 27, 2023
@cebtenzzre (Collaborator) commented Nov 27, 2023

I don't know what I did wrong in my previous testing, but I'm still seeing a 2.76x prompt processing speedup with this PR for 7B LLaMA on my P40.

@Ph0rk0z commented Nov 28, 2023

The PR seems to have been obsoleted, so I can't try it along with the new changes. I didn't get any speedup right after the fix was made, but now the code is all different.

@ggerganov (Owner, Author) commented:

> The PR seems to have been obsoleted

How come? I merged master yesterday. It's still giving some extra performance on certain GPUs.

@Ph0rk0z commented Nov 29, 2023

When I try to merge it, it has conflicts.

edit: I pulled the repo again and re-merged. It worked. I think it was a matter of 99 vs 103 for PP, from 10.x ms per token to 9.x ms per token.

@jacooooooooool commented Dec 4, 2023

Hello, does the compilation for old graphics cards (NO TENSOR) change THAT MUCH? Have I made a mistake somewhere? I'm shocked :) GTX1080Ti power :)
(flag -DLLAMA_CUDA_FORCE_MMQ=ON)

Today's llama.cpp build, Windows10 + AMD2990WX + GTX1080Ti + 64GB
Build environment MSYS_NT-10.0-19045

Test1: -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_FORCE_MMQ=ON
Test2: -DLLAMA_CUBLAS=ON

Test1 llama.cpp Amd2990WX + GTX1080TI -
Windows 10 = MSYS_NT-10.0-19045 AMD 3.3.6.x86_64 2023-02-22 08:29 UTC x86_64 Msys

--> ./batched-bench C:\models\ggml-model-f16.gguf 4096 1 99 1 512,3200 128,128,800 1
 ...........................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 512.00 MiB
llama_new_context_with_model: kv self size  =  512.00 MiB
llama_build_graph: non-view tensors processed: 644/644
llama_new_context_with_model: compute buffer total size = 215.06 MiB
llama_new_context_with_model: VRAM scratch buffer: 212.00 MiB
llama_new_context_with_model: total VRAM used: 7060.76 MiB (model: 6336.76 MiB, context: 724.00 MiB)

=========================== Windows 10 cuBLAS+MMQ ============================

main: n_kv_max = 4096, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1, n_threads = 32, n_threads_batch = 32

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    0.539 |   949.45 |    3.270 |    39.15 |    3.809 |   168.03 |
|   512 |    128 |    1 |    640 |    0.515 |   994.42 |    3.279 |    39.03 |    3.794 |   168.68 |
|   512 |    800 |    1 |   1312 |    0.515 |   994.69 |   21.146 |    37.83 |   21.661 |    60.57 |
|  3200 |    128 |    1 |   3328 |    4.257 |   751.67 |    4.134 |    30.96 |    8.391 |   396.61 |
|  3200 |    128 |    1 |   3328 |    4.229 |   756.67 |    4.132 |    30.98 |    8.361 |   398.02 |
|  3200 |    800 |    1 |   4000 |    4.226 |   757.22 |   26.518 |    30.17 |   30.744 |   130.11 |

llama_print_timings:        load time =    5098.30 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =   14491.21 ms / 11152 tokens (    1.30 ms per token,   769.57 tokens per second)
llama_print_timings:        eval time =   62476.74 ms /  2112 runs   (   29.58 ms per token,    33.80 tokens per second)
llama_print_timings:       total time =   81862.30 ms

============================ Linux Ubuntu 22 cuBLAS+MMQ ==========================

main: n_kv_max = 4096, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1, n_threads = 32, n_threads_batch = 32

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    0.555 |   923.31 |    2.998 |    42.70 |    3.552 |   180.17 |
|   512 |    128 |    1 |    640 |    0.547 |   936.24 |    2.998 |    42.70 |    3.545 |   180.55 |
|   512 |    800 |    1 |   1312 |    0.548 |   934.53 |   19.725 |    40.56 |   20.273 |    64.72 |
|  3200 |    128 |    1 |   3328 |    4.416 |   724.60 |    3.949 |    32.42 |    8.365 |   397.85 |
|  3200 |    128 |    1 |   3328 |    4.309 |   742.68 |    3.839 |    33.34 |    8.148 |   408.43 |
|  3200 |    800 |    1 |   4000 |    4.286 |   746.54 |   25.247 |    31.69 |   29.534 |   135.44 |

llama_print_timings:        load time =    1931.91 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =   14798.92 ms / 11152 tokens (    1.33 ms per token,   753.57 tokens per second)
llama_print_timings:        eval time =   58755.08 ms /  2112 runs   (   27.82 ms per token,    35.95 tokens per second)
llama_print_timings:       total time =   75349.54 ms

============================================================================

Test2 llama.cpp Amd2990WX + GTX1080TI -
( compilation "cmake .. -DLLAMA_CUBLAS=ON" )

---> ./batched-bench C:\models\ggml-model-f16.gguf 4096 1 99 1 512,3200 128,128,800 1

...........................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 512.00 MiB
llama_new_context_with_model: kv self size  =  512.00 MiB
llama_build_graph: non-view tensors processed: 644/644
llama_new_context_with_model: compute buffer total size = 215.06 MiB
llama_new_context_with_model: VRAM scratch buffer: 212.00 MiB
llama_new_context_with_model: total VRAM used: 7060.76 MiB (model: 6336.76 MiB, context: 724.00 MiB)

=========================== Windows 10 cuBLAS ============================

main: n_kv_max = 4096, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1, n_threads = 32, n_threads_batch = 32

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    1.050 |   487.77 |    6.238 |    20.52 |    7.288 |    87.82 |
|   512 |    128 |    1 |    640 |    1.035 |   494.87 |    6.254 |    20.47 |    7.289 |    87.80 |
|   512 |    800 |    1 |   1312 |    1.034 |   495.02 |   48.825 |    16.39 |   49.859 |    26.31 |
|  3200 |    128 |    1 |   3328 |   15.880 |   201.51 |   18.666 |     6.86 |   34.547 |    96.33 |
|  3200 |    128 |    1 |   3328 |   15.854 |   201.84 |   18.665 |     6.86 |   34.519 |    96.41 |
|  3200 |    800 |    1 |   4000 |   15.856 |   201.81 |  126.285 |     6.33 |  142.141 |    28.14 |

llama_print_timings:        load time =    4900.65 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =   50937.57 ms / 11152 tokens (    4.57 ms per token,   218.93 tokens per second)
llama_print_timings:        eval time =  224928.09 ms /  2112 runs   (  106.50 ms per token,     9.39 tokens per second)
llama_print_timings:       total time =  280545.48 ms

================================================================================

++++++++++++++++++++ llama-bench +++++++++++++++++++

Test 1 - llama-bench (cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_FORCE_MMQ=ON)
Test 2 - llama-bench (cmake .. -DLLAMA_CUBLAS=ON)

========================= Windows 10 cuBLAS+MMQ ============================

./llama-bench.exe -m C:\models\llama-2-13b.Q4_0.gguf -p 0 -n 128,256,512
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 13B mostly Q4_0          |   6.86 GiB |    13.02 B | CUDA       |  99 | tg 128     |     31.41 ± 0.01 |
| llama 13B mostly Q4_0          |   6.86 GiB |    13.02 B | CUDA       |  99 | tg 256     |     32.17 ± 0.00 |
| llama 13B mostly Q4_0          |   6.86 GiB |    13.02 B | CUDA       |  99 | tg 512     |     31.90 ± 0.03 |

build: 23b5e12 (1610)

========================== Windows 10 cuBLAS ================================

 ./llama-bench.exe -m C:\models\llama-2-13b.Q4_0.gguf -p 0 -n 128,256,512
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 13B mostly Q4_0          |   6.86 GiB |    13.02 B | CUDA       |  99 | tg 128     |     21.93 ± 0.02 |
| llama 13B mostly Q4_0          |   6.86 GiB |    13.02 B | CUDA       |  99 | tg 256     |     19.83 ± 0.02 |
| llama 13B mostly Q4_0          |   6.86 GiB |    13.02 B | CUDA       |  99 | tg 512     |     16.62 ± 0.00 |

build: 23b5e12 (1610)

========================= Linux Ubuntu 22 cuBLAS+MMQ ==========================

./llama-bench -m '/models/llama-2-13b.Q4_0.gguf' -p 0 -n 128,256,512
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 13B mostly Q4_0          |   6.86 GiB |    13.02 B | CUDA       |  99 | tg 128     |     37.85 ± 0.03 |
| llama 13B mostly Q4_0          |   6.86 GiB |    13.02 B | CUDA       |  99 | tg 256     |     37.83 ± 0.04 |
| llama 13B mostly Q4_0          |   6.86 GiB |    13.02 B | CUDA       |  99 | tg 512     |     36.73 ± 0.04 |

build: 23b5e12 (1610)

============================ Ubuntu Linux cuBLAS =============================

./llama-bench -m '/media/models/llama-2-13b.Q4_0.gguf' -p 0 -n 128,256,512
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 13B mostly Q4_0          |   6.86 GiB |    13.02 B | CUDA       |  99 | tg 128     |     24.05 ± 0.01 |
| llama 13B mostly Q4_0          |   6.86 GiB |    13.02 B | CUDA       |  99 | tg 256     |     21.36 ± 0.29 |
| llama 13B mostly Q4_0          |   6.86 GiB |    13.02 B | CUDA       |  99 | tg 512     |     17.91 ± 0.01 |

build: 23b5e12 (1610)

@Ph0rk0z commented Dec 5, 2023

Yea, mmq changes a lot. For newer cards it wasn't helpful either for single batches.

@cmp-nct (Contributor) commented Dec 5, 2023

> Yea, mmq changes a lot. For newer cards it wasn't helpful either for single batches.

For new cards we'll have to wait for cuBLAS FP8 support; that's several times faster than FP32 and the precision is still awesome.
That has the potential to beat EXLv2.

@Ph0rk0z commented Dec 6, 2023

FP8? Sure.. let me just fire up my H100 :P

We already beat exllama on Ampere, minus the prompt processing speed. I'm more hopeful for an 8-bit KV cache than FP8. On older GPUs I'm not sure what else can be done.

@cmp-nct (Contributor) commented Dec 6, 2023

> FP8? Sure.. let me just fire up my H100 :P
>
> We already beat exllama on Ampere, minus the prompt processing speed. I'm more hopeful for an 8-bit KV cache than FP8. On older GPUs I'm not sure what else can be done.

Any 40-series card and the upcoming Super and 50-series support it.
An H100 is only useful if you want to run inference or training above 24 GB of VRAM; in terms of speed, a simple 4090 just about matches it.
Though the 8-bit KV cache is also a great invention; it will be very interesting to see the final perplexity.

@Ph0rk0z commented Dec 6, 2023

TIL 4xxx supports it. In textgen, when I tested exllamav2 with an 8-bit vs 16-bit cache, it didn't appear to make a difference for the same models on wikitext. Hopefully that holds true here. For most "good" models, sadly 24 GB is now not enough.

@cmp-nct (Contributor) commented Dec 6, 2023

> TIL 4xxx supports it. In textgen, when I tested exllamav2 with an 8-bit vs 16-bit cache, it didn't appear to make a difference for the same models on wikitext. Hopefully that holds true here. For most "good" models, sadly 24 GB is now not enough.

I had implemented it for Falcon inference in ggllm and it worked very well on my 4090: a significant speed boost when using cuBLAS, compared to FP16 or FP32.
One advantage is that the tensor cores for FP8 are significantly faster, and the second is that you only have to deal with 25% or 50% of the memory size. All those memory transfers for temporary storage (such as dequantizing quant N to 16- or 32-bit for the cuBLAS computation) cost significant time.

@Ph0rk0z commented Dec 7, 2023

True, but that only benefits bleeding-edge cards. I'd rather have a reasonable 103B than an instant 7B. Quality over quantity.

@cebtenzzre (Collaborator) commented:

There are conflicts since #4606 was merged.

@nalzok commented Dec 27, 2023

As noted above, FP32 is much faster than FP16 on the Tesla P40, which is otherwise still a capable card with its 24 GB of VRAM. Can we have the option to specify the computation floating-point type (and upcast float16 to float32 when necessary)? Besides choosing between CUBLAS_COMPUTE_32F and CUBLAS_COMPUTE_16F, it should also affect the FP type used in custom kernels like MMQ and DMMV.

People using cutting-edge GPUs can also benefit from this option when they encounter Inf or NaN. See this comment for more details.
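
For illustration, the "upcast float16 to float32 when necessary" part could look like the trivial kernel below (a hypothetical sketch; llama.cpp already has its own to_fp32 conversion kernels):

```cuda
#include <cuda_fp16.h>

// Widen an F16 buffer to F32 before handing it to an F32-only code path.
__global__ void upcast_f16_to_f32(const __half * src, float * dst, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = __half2float(src[i]);
    }
}

// launch example: upcast_f16_to_f32<<<(n + 255) / 256, 256, 0, stream>>>(src, dst, n);
```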

@ggerganov ggerganov force-pushed the cuda-cublas-opts branch 2 times, most recently from 41f0f44 to 4011f09 Compare December 27, 2023 08:40
@Ph0rk0z commented Dec 27, 2023

Unironically, I get no benefit on the 3090 from using FP16 besides some lower memory use. Through SD I have found that the weights can even be loaded as FP16 for these Pascal cards as long as the calculations happen at the correct precision; xformers does this automatically on that end.

@ggerganov (Owner, Author) commented:

This could be related: https://twitter.com/main_horse/status/1742013125090795531

The following inline review comments refer to these lines of the diff:

CUBLAS_COMPUTE_32F,
CUBLAS_GEMM_DEFAULT_TENSOR_OP));
} break;
}
@ggerganov (Owner, Author) commented on these lines, Jan 2, 2024:

We might want to merge this particular change to ggml_cuda_op_mul_mat_cublas, since it uses less memory than cublasSgemm and still performs the compute in F32, which is needed for models like Phi-2.

@slaren (Collaborator) commented on these lines, Jan 2, 2024:

I am not sure if I am following all the logic, but I would be concerned about down-converting F32 src0/src1 to F16 despite the user requesting GGML_PREC_F32. In the long run, I think it would be better to always respect the user types and do all the type conversions in the graph (ggerganov/ggml#455), since it would give users more control and simplify the code in the backends. It would also move the temporary buffer from the pool to the compute buffer, which would result in a more accurate estimate of the VRAM needed to run a model. It should also help with the issue of to_fp32 and to_fp16 in the CUDA backend being unable to deal with non-contiguous tensors, since the conversion would be done in a ggml_cpy instead.
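
For illustration, a tiny sketch of what doing the conversion in the graph could look like from the ggml API side (the helper name and shapes are assumptions, not the proposed implementation):

```c
#include "ggml.h"

// Sketch: express the F32 -> F16 cast as an explicit graph op so the backend
// receives consistent types and no pool-allocated temporary is needed.
static struct ggml_tensor * mul_mat_with_explicit_cast(
        struct ggml_context * ctx,
        struct ggml_tensor  * w_f16,   // weights, already F16
        struct ggml_tensor  * x_f32) { // activations, F32
    struct ggml_tensor * x_f16 =
        ggml_new_tensor_2d(ctx, GGML_TYPE_F16, x_f32->ne[0], x_f32->ne[1]);
    x_f16 = ggml_cpy(ctx, x_f32, x_f16);     // conversion is now a node in the graph
    return ggml_mul_mat(ctx, w_f16, x_f16);  // lives in the compute buffer, not the pool
}
```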

@Ph0rk0z commented Jan 2, 2024

Yet every time I tried Triton kernels on 3090 they were overall slower.

@cmp-nct (Contributor) commented Jan 2, 2024

I noticed significant speed differences based on such small changes when using cuBLAS while developing ggllm (the Falcon fork).
FP8 was the biggest boost I've seen, but small things like the accumulator mattered too.
It is different on a 3090 than on a 4090 as well.

In general, we should also have a look at the EXL2 implementation. Even without FP8, that kernel is two times faster than llama.cpp on modern hardware. It delivers up to 14000 tokens/second of prompt processing on a single 4090, while on llama.cpp I top out at 5500.

@Ph0rk0z commented Feb 4, 2024

Heh.. so I have discovered a bad DIMM in my server that was causing memory bandwidth to drop from 60 GB/s down to 10 GB/s.

Having fixed that issue, I was able to properly test performance regressions again. Since I merged this PR into main@5a7d312 I have gone from 18.6 t/s down to 15.5 t/s on dual 3090s, using the same kernel settings and also going back to splitting by row.

I had merged this PR for the P40s, but I don't think its contents are what returned the performance; I just happened to grab a backup at the right time. So now I have a date and a commit to test against, to find the change that ate my 3 tokens/s. I have other backups from Dec 27th and those also have the regression.

edit:

The commit is between 7082d24 and f679349.

@cebtenzzre (Collaborator) commented:

On master I'm getting about 910 t/s with pp512 on a Q4_0 7b llama with or without this change, so it doesn't seem to be necessary for Tesla P40s anymore.

@ggerganov ggerganov closed this Mar 18, 2024