Windows - CUDA GPU - Performance Difference - 1429 vs 1430+ #3884

Closed
young-developer opened this issue Nov 1, 2023 · 11 comments · Fixed by #3891

Comments

young-developer (Contributor) commented Nov 1, 2023

CUDA GPU inference is slower in the latest version (1449) than in 1336:

1449

llama_print_timings:        load time =    3111.53 ms
llama_print_timings:      sample time =      99.26 ms /   617 runs   (    0.16 ms per token,  6215.81 tokens per second)
llama_print_timings: prompt eval time =      73.73 ms /    22 tokens (    3.35 ms per token,   298.37 tokens per second)
llama_print_timings:        eval time =   11428.62 ms /   616 runs   (   18.55 ms per token,    53.90 tokens per second)
llama_print_timings:       total time =   11679.26 ms

1336

llama_print_timings:        load time =    3150.73 ms
llama_print_timings:      sample time =     149.46 ms /   623 runs   (    0.24 ms per token,  4168.31 tokens per second)
llama_print_timings: prompt eval time =     115.86 ms /    23 tokens (    5.04 ms per token,   198.52 tokens per second)
llama_print_timings:        eval time =    9558.68 ms /   622 runs   (   15.37 ms per token,    65.07 tokens per second)
llama_print_timings:       total time =   10518.21 ms

Logs:

logs-fast-1336.txt
logs-slow-1449.txt

LostRuins (Collaborator)

Related to #3869 ?

young-developer (Contributor, Author)

> Related to #3869 ?

Yes, it looks related.

young-developer (Contributor, Author)

Fixed in: #3882

ggerganov (Owner) commented Nov 1, 2023

@young-developer I don't see how #3882 affects your test (even if it fixes it, as you state). Are you sure there was an issue in the first place? I tested specifically on a 3090 before merging #3776 and didn't observe a regression in TG speed.

Edit: actually, #3776 did reduce TG performance for short sequences and improve it for long sequences / big contexts. Still, I don't think #3882 would affect TG speed compared to master.

young-developer (Contributor, Author) commented Nov 1, 2023

@ggerganov Yep, you are right. The performance is still 52+ t/s, compared to 64+ t/s previously.

young-developer (Contributor, Author)

@ggerganov I retested and narrowed it down to these builds:

1429:

llama_print_timings:        load time =    3160.30 ms
llama_print_timings:      sample time =     162.78 ms /  1024 runs   (    0.16 ms per token,  6290.54 tokens per second)
llama_print_timings: prompt eval time =      67.68 ms /    22 tokens (    3.08 ms per token,   325.04 tokens per second)
llama_print_timings:        eval time =   15935.03 ms /  1023 runs   (   15.58 ms per token,    64.20 tokens per second)
llama_print_timings:       total time =   16295.92 ms

1430:

llama_print_timings:        load time =    3126.86 ms
llama_print_timings:      sample time =      96.83 ms /   617 runs   (    0.16 ms per token,  6371.99 tokens per second)
llama_print_timings: prompt eval time =      75.29 ms /    22 tokens (    3.42 ms per token,   292.20 tokens per second)
llama_print_timings:        eval time =   11545.69 ms /   616 runs   (   18.74 ms per token,    53.35 tokens per second)
llama_print_timings:       total time =   11794.79 ms

logs-1429.txt
logs-1430.txt

@young-developer young-developer changed the title Windows - CUDA GPU - Performance Difference - 1336 vs 1449 Windows - CUDA GPU - Performance Difference - 1429 vs 1430+ Nov 1, 2023
slaren (Collaborator) commented Nov 1, 2023

I can reproduce this as well; it looks like there is a significant regression with models that use GQA. This is most likely due to the cudaMalloc and cudaMemcpy in this branch; moving all of this to a kernel that can be executed asynchronously would probably fix it.

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | pp 512 | 3432.61 ± 248.71 |
| llama 7B mostly Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | tg 512 | 83.32 ± 0.51 |
| mistral 7B mostly Q8_0 | 7.17 GiB | 7.24 B | CUDA | 99 | pp 512 | 3516.32 ± 18.31 |
| mistral 7B mostly Q8_0 | 7.17 GiB | 7.24 B | CUDA | 99 | tg 512 | 59.43 ± 0.16 |

build: a2758d0 (1455)

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | pp 512 | 2602.74 ± 5.88 |
| llama 7B mostly Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | tg 512 | 84.11 ± 0.09 |
| mistral 7B mostly Q8_0 | 7.17 GiB | 7.24 B | CUDA | 99 | pp 512 | 2492.05 ± 19.55 |
| mistral 7B mostly Q8_0 | 7.17 GiB | 7.24 B | CUDA | 99 | tg 512 | 81.29 ± 0.03 |

build: 34b2a5e (1429)
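A minimal sketch of the asynchronous-kernel approach slaren suggests above, assuming the slow path allocates and uploads the pointer arrays for cublasGemmBatchedEx with a blocking cudaMalloc + cudaMemcpy on every call; the kernel name, parameter names, and strides here are illustrative, not the actual patch:

```cuda
// Sketch: compute the batched-GEMM pointer arrays on the device instead of
// building them on the host and uploading them with a blocking cudaMemcpy.
// The kernel is enqueued on the same stream as the GEMM, so it runs
// asynchronously and no host<->device synchronization is required.
#include <cuda_runtime.h>

__global__ void k_compute_batched_ptrs(
        const char * src0, const char * src1, char * dst,
        const void ** ptrs_src0, const void ** ptrs_src1, void ** ptrs_dst,
        size_t nb02, size_t nb12, size_t nb2, int n_mats) {
    const int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i >= n_mats) {
        return;
    }
    // each thread fills in the pointers for one matrix of the batch;
    // nb02/nb12/nb2 are the per-matrix byte strides of the two sources and dst
    ptrs_src0[i] = src0 + (size_t) i*nb02;
    ptrs_src1[i] = src1 + (size_t) i*nb12;
    ptrs_dst [i] = dst  + (size_t) i*nb2;
}

// usage (illustrative): enqueue before the batched GEMM on the same stream
//   k_compute_batched_ptrs<<<(n_mats + 255)/256, 256, 0, stream>>>(
//       src0, src1, dst, ptrs_src0, ptrs_src1, ptrs_dst, nb02, nb12, nb2, n_mats);
//   cublasGemmBatchedEx(handle, ..., ptrs_src0, ..., ptrs_src1, ..., ptrs_dst, ..., n_mats, ...);
```

Since the pointer setup is just another operation on the stream, the per-call cudaMalloc/cudaMemcpy (and any implicit synchronization they trigger) drops out of the token-generation loop entirely.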

ggerganov (Owner)

Hm, very confusing numbers. Here are mine from today on an RTX 3090 in the cloud:

  • model: zephyr-7b-alpha.Q8_0.gguf (GQA=4)
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | 128 | 1 | 640 | 0.163 | 3144.65 | 1.699 | 75.35 | 1.862 | 343.79 |
| 512 | 800 | 1 | 1312 | 0.148 | 3450.53 | 10.796 | 74.10 | 10.945 | 119.88 |
| 3200 | 128 | 1 | 3328 | 1.261 | 2537.64 | 1.963 | 65.22 | 3.224 | 1032.34 |
| 3200 | 800 | 1 | 4000 | 1.257 | 2546.18 | 12.396 | 64.54 | 13.653 | 292.98 |

build = 1453 (9a3b4f6)


main: n_kv_max = 4096, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | 128 | 1 | 640 | 0.224 | 2287.12 | 1.607 | 79.67 | 1.830 | 349.63 |
| 512 | 800 | 1 | 1312 | 0.213 | 2407.50 | 10.557 | 75.78 | 10.770 | 121.83 |
| 3200 | 128 | 1 | 3328 | 1.646 | 1943.74 | 2.270 | 56.39 | 3.916 | 849.81 |
| 3200 | 800 | 1 | 4000 | 1.640 | 1951.50 | 14.721 | 54.35 | 16.360 | 244.49 |

build = 1429 (34b2a5e)

Log files:

These numbers are consistent with all my tests in #3776, although I tested mostly with GQA=1 models then. The PP speed is significantly improved, and TG is slightly degraded for short sequences but much better for longer ones.

@slaren The culprit is likely indeed that extra malloc. I'll look into implementing the kernel you suggest.

ggerganov (Owner)

Could this be OS-related? I see @young-developer runs on Windows, and @slaren, I assume you are also on Windows. I can only run Linux. Maybe the malloc has a larger negative impact on Windows? Sounds weird.

Maybe try to temporarily pre-allocate a static buffer for the pointers, to confirm that the malloc is causing the regression?
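A hedged sketch of that diagnostic (the buffer and helper names below are hypothetical, not llama.cpp code): allocate the device pointer buffer once, grow it only when needed, and reuse it, so the per-call cudaMalloc disappears from the hot path:

```cuda
// Sketch of the suggested diagnostic: replace the per-call cudaMalloc with a
// grow-only static buffer, allocated once and reused, to check whether the
// allocation itself is what regresses token generation on Windows.
#include <cuda_runtime.h>

static void ** g_ptrs_buf  = nullptr; // hypothetical names, illustration only
static size_t  g_ptrs_size = 0;

static void ** get_ptrs_buffer(size_t n_ptrs) {
    if (n_ptrs > g_ptrs_size) {
        cudaFree(g_ptrs_buf);                                     // no-op on nullptr
        cudaMalloc((void **) &g_ptrs_buf, n_ptrs*sizeof(void *)); // pay the cost once
        g_ptrs_size = n_ptrs;
    }
    return g_ptrs_buf; // reused on subsequent calls: no malloc in the hot path
}
```

If token generation recovers its old speed with this change, the regression is in the allocation overhead rather than the copy, which would also fit a Windows-specific driver cost.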

slaren (Collaborator) commented Nov 1, 2023

This is likely a Windows-only issue; some calls have higher overhead under Windows. I am trying some solutions and will open a PR soon.
