Windows - CUDA GPU - Performance Difference - 1429 vs 1430+ #3884
Related to #3869?
Regression started from https://github.com/ggerganov/llama.cpp/releases/tag/b1430
Yes, looks related.
Fixed in: #3882
@young-developer I don't see how #3882 affects your test (even if it fixes it, as you state). Are you sure there was an issue in the first place? I tested specifically on a 3090 before merging #3776 and didn't observe a regression in TG speed. Edit: Actually, #3776 did reduce the TG performance for short sequences and improve it for long sequences / big contexts. But still, I don't think #3882 would affect the TG speed compared to
@ggerganov Yep, you are right. Still the same performance: 52+ t/s compared to the previous 64+ t/s.
@ggerganov I retested and narrowed it down to these versions:
1429:
1430:
I can reproduce this as well; it looks like there is a significant regression with models that use GQA. This is most likely due to the
build: a2758d0 (1455)
build: 34b2a5e (1429)
Hm, very confusing numbers. Here are mine from today on an RTX 3090 in the cloud:
build = 1453 (9a3b4f6) main: n_kv_max = 4096, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1
build = 1429 (34b2a5e)
Log files:
These numbers are consistent with all my tests in #3776, although I tested mostly with GQA = 1 models then. @slaren
Could this be OS-related? I see @young-developer runs on Windows, and @slaren I assume you are also on Windows. I can only run Linux. Maybe the malloc has a more negative effect when running on Windows? Sounds weird. Maybe try to temporarily pre-allocate a static buffer for the pointers to confirm that the malloc is causing the regression?
This is likely a Windows-only issue; some calls have higher overhead under Windows. I am trying some solutions and will open a PR soon.
CUDA GPU inference is slower on the latest version (1449) compared to 1336:
1449
1336
Logs:
logs-fast-1336.txt
logs-slow-1449.txt