Windows - CUDA GPU - Performance Difference - 1429 vs 1430+ #3884

Closed
young-developer opened this issue Nov 1, 2023 · 11 comments · Fixed by #3891

Comments

young-developer (Contributor) commented Nov 1, 2023

CUDA GPU inference is slower in the latest version (1449) than in 1336:

1449

llama_print_timings:        load time =    3111.53 ms
llama_print_timings:      sample time =      99.26 ms /   617 runs   (    0.16 ms per token,  6215.81 tokens per second)
llama_print_timings: prompt eval time =      73.73 ms /    22 tokens (    3.35 ms per token,   298.37 tokens per second)
llama_print_timings:        eval time =   11428.62 ms /   616 runs   (   18.55 ms per token,    53.90 tokens per second)
llama_print_timings:       total time =   11679.26 ms

1336

llama_print_timings:        load time =    3150.73 ms
llama_print_timings:      sample time =     149.46 ms /   623 runs   (    0.24 ms per token,  4168.31 tokens per second)
llama_print_timings: prompt eval time =     115.86 ms /    23 tokens (    5.04 ms per token,   198.52 tokens per second)
llama_print_timings:        eval time =    9558.68 ms /   622 runs   (   15.37 ms per token,    65.07 tokens per second)
llama_print_timings:       total time =   10518.21 ms

Logs:

logs-fast-1336.txt
logs-slow-1449.txt

LostRuins (Collaborator)

Related to #3869 ?

young-developer (Contributor, Author)

> Related to #3869 ?

Yes, it looks related.

young-developer (Contributor, Author)

Fixed in: #3882

ggerganov (Owner) commented Nov 1, 2023

@young-developer I don't see how #3882 affects your test (even if it fixes it, as you state). Are you sure there was an issue in the first place? I tested specifically on a 3090 before merging #3776 and didn't observe a regression in TG speed.

Edit: actually, #3776 did reduce TG performance for short sequences and improve it for long sequences / big contexts. Still, I don't think #3882 would affect TG speed compared to master.

young-developer (Contributor, Author) commented Nov 1, 2023

@ggerganov Yep, you are right. The performance is still 52+ t/s, compared to 64+ t/s previously.

young-developer (Contributor, Author)

@ggerganov I retested and narrowed it down to these builds:

1429:

llama_print_timings:        load time =    3160.30 ms
llama_print_timings:      sample time =     162.78 ms /  1024 runs   (    0.16 ms per token,  6290.54 tokens per second)
llama_print_timings: prompt eval time =      67.68 ms /    22 tokens (    3.08 ms per token,   325.04 tokens per second)
llama_print_timings:        eval time =   15935.03 ms /  1023 runs   (   15.58 ms per token,    64.20 tokens per second)
llama_print_timings:       total time =   16295.92 ms

1430:

llama_print_timings:        load time =    3126.86 ms
llama_print_timings:      sample time =      96.83 ms /   617 runs   (    0.16 ms per token,  6371.99 tokens per second)
llama_print_timings: prompt eval time =      75.29 ms /    22 tokens (    3.42 ms per token,   292.20 tokens per second)
llama_print_timings:        eval time =   11545.69 ms /   616 runs   (   18.74 ms per token,    53.35 tokens per second)
llama_print_timings:       total time =   11794.79 ms

logs-1429.txt
logs-1430.txt

@young-developer young-developer changed the title Windows - CUDA GPU - Performance Difference - 1336 vs 1449 Windows - CUDA GPU - Performance Difference - 1429 vs 1430+ Nov 1, 2023
slaren (Collaborator) commented Nov 1, 2023

I can reproduce this as well; it looks like there is a significant regression with models that use GQA. This is most likely due to the cudaMalloc and cudaMemcpy in this branch; moving all of this to a kernel that can be executed asynchronously would probably fix it.

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | pp 512 | 3432.61 ± 248.71 |
| llama 7B mostly Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | tg 512 | 83.32 ± 0.51 |
| mistral 7B mostly Q8_0 | 7.17 GiB | 7.24 B | CUDA | 99 | pp 512 | 3516.32 ± 18.31 |
| mistral 7B mostly Q8_0 | 7.17 GiB | 7.24 B | CUDA | 99 | tg 512 | 59.43 ± 0.16 |

build: a2758d0 (1455)

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | pp 512 | 2602.74 ± 5.88 |
| llama 7B mostly Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | tg 512 | 84.11 ± 0.09 |
| mistral 7B mostly Q8_0 | 7.17 GiB | 7.24 B | CUDA | 99 | pp 512 | 2492.05 ± 19.55 |
| mistral 7B mostly Q8_0 | 7.17 GiB | 7.24 B | CUDA | 99 | tg 512 | 81.29 ± 0.03 |

build: 34b2a5e (1429)
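A minimal sketch of the asynchronous-kernel approach slaren suggests above, assuming the slow path allocates and uploads the pointer arrays for cublasGemmBatchedEx with a blocking cudaMalloc + cudaMemcpy on every call; the kernel name, parameter names, and strides here are illustrative, not the actual patch:

```cuda
// Sketch: compute the batched-GEMM pointer arrays on the device instead of
// building them on the host and uploading them with a blocking cudaMemcpy.
// The kernel is enqueued on the same stream as the GEMM, so it runs
// asynchronously and no host<->device synchronization is required.
#include <cuda_runtime.h>

__global__ void k_compute_batched_ptrs(
        const char * src0, const char * src1, char * dst,
        const void ** ptrs_src0, const void ** ptrs_src1, void ** ptrs_dst,
        size_t nb02, size_t nb12, size_t nb2, int n_mats) {
    const int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i >= n_mats) {
        return;
    }
    // each thread fills in the pointers for one matrix of the batch;
    // nb02/nb12/nb2 are the per-matrix byte strides of the two sources and dst
    ptrs_src0[i] = src0 + (size_t) i*nb02;
    ptrs_src1[i] = src1 + (size_t) i*nb12;
    ptrs_dst [i] = dst  + (size_t) i*nb2;
}

// usage (illustrative): enqueue before the batched GEMM on the same stream
//   k_compute_batched_ptrs<<<(n_mats + 255)/256, 256, 0, stream>>>(
//       src0, src1, dst, ptrs_src0, ptrs_src1, ptrs_dst, nb02, nb12, nb2, n_mats);
//   cublasGemmBatchedEx(handle, ..., ptrs_src0, ..., ptrs_src1, ..., ptrs_dst, ..., n_mats, ...);
```

Since the pointer setup is just another operation on the stream, the per-call cudaMalloc/cudaMemcpy (and any implicit synchronization they trigger) drops out of the token-generation loop entirely.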

ggerganov (Owner)

Hm, very confusing numbers. Here are mine from today on an RTX 3090 in the cloud:

  • model: zephyr-7b-alpha.Q8_0.gguf (GQA=4)
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | 128 | 1 | 640 | 0.163 | 3144.65 | 1.699 | 75.35 | 1.862 | 343.79 |
| 512 | 800 | 1 | 1312 | 0.148 | 3450.53 | 10.796 | 74.10 | 10.945 | 119.88 |
| 3200 | 128 | 1 | 3328 | 1.261 | 2537.64 | 1.963 | 65.22 | 3.224 | 1032.34 |
| 3200 | 800 | 1 | 4000 | 1.257 | 2546.18 | 12.396 | 64.54 | 13.653 | 292.98 |

build = 1453 (9a3b4f6)


main: n_kv_max = 4096, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | 128 | 1 | 640 | 0.224 | 2287.12 | 1.607 | 79.67 | 1.830 | 349.63 |
| 512 | 800 | 1 | 1312 | 0.213 | 2407.50 | 10.557 | 75.78 | 10.770 | 121.83 |
| 3200 | 128 | 1 | 3328 | 1.646 | 1943.74 | 2.270 | 56.39 | 3.916 | 849.81 |
| 3200 | 800 | 1 | 4000 | 1.640 | 1951.50 | 14.721 | 54.35 | 16.360 | 244.49 |

build = 1429 (34b2a5e)

Log files:

These numbers are consistent with all my tests in #3776, although I tested mostly with GQA=1 models then. The PP speed is significantly improved, and TG is slightly degraded for short sequences but much better for longer ones.

@slaren The culprit is likely indeed that extra malloc. I'll look into implementing the kernel you suggest.

ggerganov (Owner)

Could this be OS-related? I see @young-developer runs on Windows, and @slaren, I assume you are also on Windows. I can only run Linux. Maybe the malloc has a larger negative impact on Windows? Sounds weird.

Maybe try to temporarily pre-allocate a static buffer for the pointers, to confirm that the malloc is causing the regression?
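A hedged sketch of that diagnostic (the buffer and helper names below are hypothetical, not llama.cpp code): allocate the device pointer buffer once, grow it only when needed, and reuse it, so the per-call cudaMalloc disappears from the hot path:

```cuda
// Sketch of the suggested diagnostic: replace the per-call cudaMalloc with a
// grow-only static buffer, allocated once and reused, to check whether the
// allocation itself is what regresses token generation on Windows.
#include <cuda_runtime.h>

static void ** g_ptrs_buf  = nullptr; // hypothetical names, illustration only
static size_t  g_ptrs_size = 0;

static void ** get_ptrs_buffer(size_t n_ptrs) {
    if (n_ptrs > g_ptrs_size) {
        cudaFree(g_ptrs_buf);                                     // no-op on nullptr
        cudaMalloc((void **) &g_ptrs_buf, n_ptrs*sizeof(void *)); // pay the cost once
        g_ptrs_size = n_ptrs;
    }
    return g_ptrs_buf; // reused on subsequent calls: no malloc in the hot path
}
```

If token generation recovers its old speed with this change, the regression is in the allocation overhead rather than the copy, which would also fit a Windows-specific driver cost.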

slaren (Collaborator) commented Nov 1, 2023

This is likely a Windows-only issue; some calls have higher overhead under Windows. I am trying some solutions and will open a PR soon.
