CTX Processing regression for Pascal - Commit 2b4ea35 #3869
Comments
Sounds like the same issue as mine (#3780)
Try a
I think the problem started after that commit
Switching to the 'cuda-cublas-opts' branch fixed it for me.
@askmyteapot Can you check if branch try-fix-3869 fixes the issue with LLAMA_CUDA_FORCE_MMQ=1 set?
@ggerganov
Built with -DLLAMA_CUDA_FORCE_MMQ=ON and cuBLAS on.
That has corrected the issue. But is there any way the force-MMQ compile-time flag could become a runtime flag?
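For reference, a build along these lines should reproduce that configuration; the branch name and option names come from the comments above, while paths and generator choices are only illustrative:

```sh
# Check out the proposed fix branch and build with cuBLAS plus forced MMQ.
# LLAMA_CUBLAS / LLAMA_CUDA_FORCE_MMQ are the options mentioned in this thread;
# adjust the generator/toolchain for your platform (MSVC 2022 per this issue).
git fetch origin try-fix-3869
git checkout try-fix-3869
cmake -B build -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_FORCE_MMQ=ON
cmake --build build --config Release
```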
I agree that it would be nice if we had a runtime flag that could enable quantized mat*mat, even on modern GPUs. It does use less VRAM.
You can, but unless you're willing to build multiple sets of CUDA kernels and swap between them based on batch size, you may lose the small-batch MMQ optimizations, since those are determined at compile time.
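Short of an actual runtime flag, one crude user-level approximation of that idea is keeping two separate builds side by side and picking one per run (this loses the automatic per-batch-size switching described above); a minimal sketch, with the build directory names purely illustrative:

```sh
# One build with forced MMQ (lower VRAM, small-batch kernels fixed at compile
# time) and one with the default cuBLAS path; switch binaries per run.
cmake -B build-mmq -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_FORCE_MMQ=ON
cmake --build build-mmq --config Release

cmake -B build-cublas -DLLAMA_CUBLAS=ON
cmake --build build-cublas --config Release
```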
Or maybe some intermediate value between these two options could provide a good compromise? I personally don't benefit much from the small-batch optimization. I am on an RTX 2060 6GB, which should be CC 7.5, but maybe my hardware is kinda crappy either way and cuBLAS wasn't that special for me. So I am back to using MMQ with the original (large-batch) values.
@LostRuins I think you mentioned earlier that for full offload, the new version on the RTX 2060 is faster compared to MMQ, and that you observe a regression for models that are not fully offloaded due to having 1-2 fewer GPU layers. How big is the latter regression? Is it a regression both for short and long (>1024) contexts?
@ggerganov Sure, let me try to do a bit more methodical testing with the bencher instead for my RTX 2060 6GB. NEW = running from llama-b1468-bin-win-cublas-cu11.7.1-x64 (latest CI build).

7B, ngl=99 NEW
7B, ngl=99 OLD

13B results: trying the max layers I can offload before going OOM, counting downwards... which is 23 for the benchmark tool:

13B, ngl=23 NEW
13B, ngl=25 OLD

So across the board, cuBLAS helps with PP, no doubt. But TG takes a hit in the new versions.
One more run with b1420 at ngl=23, just to compare:

13B, ngl=23 OLD

So with layer parity, TG speeds are comparable between b1420 and b1468.
@Dampfinchen try comparing your setup with the exact same models and layer counts, using the two builds b1420 and b1468 from the llama.cpp releases page, and see your results. b1468 might have solved your issue after slaren's fix.
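If the bencher mentioned above is llama-bench from those release archives, a side-by-side run could look like the sketch below; the model path and extraction directories are placeholders, and the -ngl value mirrors the numbers in this thread:

```sh
# Same model and layer count, two release builds (b1420 vs b1468).
# -p/-n are llama-bench's prompt-processing and text-generation test sizes.
./b1420/llama-bench -m models/13b.Q4_K_S.gguf -ngl 23 -p 512 -n 128
./b1468/llama-bench -m models/13b.Q4_K_S.gguf -ngl 23 -p 512 -n 128
```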
Thank you, these are overall in line with my expectations.
I think you might be reading this wrong. From what I see, the new build, at ~39 t/s, is a bit faster than the old one even for TG when the model is fully offloaded. This is nice to see, although it deviates from my expectation of a slight regression in the short TG 128 tests. In any case, I believe you will see even bigger gains with the new build when the context is large. The TG regression (-8.8%) with partial offloading is expected due to the fewer layers, but at least PP got a non-negligible improvement (+19.0%). In the future, I think we will compensate for this as I explained in an earlier comment.
Agreed, prompt processing using cuBLAS is indeed faster for my card. I know Pascal users did experience major PP slowdowns, though I think that has been resolved already.
Edit: for reference
Alright, here's my result. Text generation speed is 4x slower compared to LostRuins' result with the same hardware.

```
Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5
build: 1efae9b (1469)
```

I don't think a further test with older builds is needed; I get the same result as LostRuins of around 4.3 token/s TG in those. Older builds from before the huge CUDA changes worked flawlessly with my current system configuration. But if you need proof, I can provide that. Just ask.

Edit: I tested various builds of llama.cpp and they exhibit the same problem. With "older builds" I meant koboldcpp, which works somehow, even though it's based on llama.cpp.
We have the same card, but I am using an older driver, which will OOM at 25 layers. I suspect your newer driver is doing something funky, as 25 layers will not fit. @Dampfinchen try ngl 23
I do have the latest driver (546.01); however, I disabled the system memory fallback policy so it OOMs when there's not enough VRAM, just like with the old drivers. The new driver has better memory management, allowing me to do more layers. Also, I'm using Q4_K_S while you're using Q4_K_M, so I can naturally offload more layers. Testing with ngl 23 yielded no improvement in my test. If swapping to RAM were the issue, prompt processing would slow down too, but that is not the case.
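One way to sanity-check that assumption is to watch VRAM usage while generating; nvidia-smi supports a polling query like the one below (the interval is arbitrary):

```sh
# Poll GPU memory once per second during a run; memory.used pinned at the
# 6GB limit would hint that the driver is falling back to system RAM.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
```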
Alright, I've tried everything. I did try various commits, including from before tensor core support and batched CUDA processing, but it was always the same slow result. I also tried using MMQ only. Same thing. IDK what's going on here. Koboldcpp version 1.47.2, which builds on llama.cpp, doesn't have this issue with the same amount of layers, model, system configuration, and a similarly sized prompt. Since the only major difference between LostRuins' system and mine is the driver, I suspect it could have something to do with the system memory policy Nvidia introduced with the latest driver (https://nvidia.custhelp.com/app/answers/detail/a_id/5490). However, I already checked its function and it works great: when the VRAM is full, it crashes like before when set to "prefer no system memory fallback". Hmm. It's worth mentioning that 7B with partial offloading (28 layers) is super slow as well, but performs as expected when using full GPU offloading.
Alright, thanks to slaren I was able to fix the problem. The issue was that I was not compiling with AVX2 support, as I had assumed that's just enabled by default, which it isn't anymore. Performance is great as expected with AVX2. Case closed!
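For anyone hitting the same thing, a configure roughly like the following makes the CPU flags explicit; LLAMA_AVX2 and LLAMA_NATIVE are the CMake options llama.cpp exposed around these builds, so double-check the names against your checkout:

```sh
# Explicitly enable AVX2 (or full native optimization) instead of relying on
# defaults, which can differ between generators/toolchains.
cmake -B build -DLLAMA_CUBLAS=ON -DLLAMA_AVX2=ON
# or, when building on the machine the binary will run on:
cmake -B build -DLLAMA_CUBLAS=ON -DLLAMA_NATIVE=ON
cmake --build build --config Release
```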
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
There is a regression in context processing introduced in commit 2b4ea35.
This is specifically for Pascal (6.1), which has 1/64th FP16 performance. The problem gets worse with longer context, reaching up to 6x slower by 8k CTX.
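A repro along these lines should make the slowdown visible as context grows; the model and prompt paths are placeholders, and -c 8192 matches the context length mentioned above:

```sh
# Process a long prompt at 8k context on the P40 (placeholder paths);
# compare timings for builds before and after commit 2b4ea35.
./main -m models/model.Q4_K_M.gguf -ngl 99 -c 8192 -f long_prompt.txt
```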
Current Behavior
Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
Ryzen 5800X + 64GB DDR4-3733
RTX 3060 Ti (8GB) + Tesla P40 (24GB)
Operating System: Windows 11
SDK version: MSVC 2022
@LostRuins