CTX Processing regression for Pascal - Commit 2b4ea35 #3869

Closed
askmyteapot opened this issue Oct 31, 2023 · 18 comments · Fixed by #3882
Labels
performance Speed related topics

Comments

@askmyteapot

askmyteapot commented Oct 31, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

There is a regression in context processing introduced in commit 2b4ea35.

This is specific to Pascal (compute capability 6.1), which has 1/64th-rate FP16 performance. The problem gets worse with longer context, reaching up to 6x slower by 8k context.

  Device 0: Tesla P40, compute capability 6.1
| model                          |       size |     params | backend    | ngl |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
| llama 13B mostly Q8_0          |  12.88 GiB |    13.02 B | CUDA       |  99 |          1 | pp 512     |    485.03 ± 0.34 |
| llama 13B mostly Q8_0          |  12.88 GiB |    13.02 B | CUDA       |  99 |          1 | tg 128     |     18.30 ± 0.00 |

build: daab3d7 (1421)

Current Behavior

ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1
| model                          |       size |     params | backend    | ngl |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
| llama 13B mostly Q8_0          |  12.88 GiB |    13.02 B | CUDA       |  99 |          1 | pp 512     |    207.34 ± 0.28 |
| llama 13B mostly Q8_0          |  12.88 GiB |    13.02 B | CUDA       |  99 |          1 | tg 128     |     18.28 ± 0.01 |

build: 2b4ea35 (1422)
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1
| model                          |       size |     params | backend    | ngl |    threads |   main_gpu | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------: | ---------- | ---------------: |
warning: cannot set main_device=1 because there are only 1 devices. Using device 0 instead.
| llama 13B mostly Q8_0          |  12.88 GiB |    13.02 B | CUDA       |  99 |          1 |          1 | pp 512     |    208.54 ± 0.58 |
| llama 13B mostly Q8_0          |  12.88 GiB |    13.02 B | CUDA       |  99 |          1 |          1 | tg 128     |     18.29 ± 0.00 |

build: 207b519 (1446)

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

  • 5800X + 64GB DDR 3733

  • 3060ti (8GB) + TESLA P40 (24GB)

  • Operating System, e.g. for Linux: Windows 11

  • SDK version: MSVC 2022

  • Python 3.10.11

  • CMake 3.27.4

@LostRuins

askmyteapot changed the title from "CTX Processing regression for Pascal - Commit" to "CTX Processing regression for Pascal - Commit 2b4ea35" on Oct 31, 2023
@quarterturn

Sounds like the same issue as mine (#3780)

@quarterturn

quarterturn commented Oct 31, 2023

Try a hard reset and rebuild:

git reset --hard b1421
git pull
make clean && make -j LLAMA_CUBLAS=1

I think the problem started right after that commit.

@quarterturn

Switching to 'cuda-cublas-opts' branch fixed it for me.

@ggerganov
Owner

@askmyteapot Can you check if the try-fix-3869 branch fixes the issue with LLAMA_CUDA_FORCE_MMQ=1 set?
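
For anyone else wanting to reproduce this test, a possible sequence, assuming a CMake build and the LLAMA_CUBLAS / LLAMA_CUDA_FORCE_MMQ options exposed by builds from this period (adjust to your own setup):

git fetch origin
git checkout try-fix-3869
mkdir build && cd build
cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_FORCE_MMQ=ON
cmake --build . --config Release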

@askmyteapot
Author

askmyteapot commented Oct 31, 2023

@ggerganov
Built with only the cuBLAS flag:

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1
| model                          |       size |     params | backend    | ngl |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
| llama 13B mostly Q8_0          |  12.88 GiB |    13.02 B | CUDA       |  99 |          1 | pp 512     |    209.00 ± 1.04 |
| llama 13B mostly Q8_0          |  12.88 GiB |    13.02 B | CUDA       |  99 |          1 | tg 128     |     13.81 ± 0.00 |

build: 22cc9be (1447)

Built with -DLLAMA_CUDA_FORCE_MMQ=ON and cuBLAS on:

D:\llama.cpp\build\bin\Release>llama-bench.exe -m D:\text-generation-webui\models\MythoMax-L2-13b-gguf-q8_0.gguf -t 1
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1
| model                          |       size |     params | backend    | ngl |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
| llama 13B mostly Q8_0          |  12.88 GiB |    13.02 B | CUDA       |  99 |          1 | pp 512     |    485.27 ± 0.34 |
| llama 13B mostly Q8_0          |  12.88 GiB |    13.02 B | CUDA       |  99 |          1 | tg 128     |     18.32 ± 0.01 |

build: 22cc9be (1447)

That has corrected the issue.

But is there any way the force-MMQ compile-time flag could become a runtime flag?

@cebtenzzre
Collaborator

I agree that it would be nice if we had a runtime flag that could enable quantized mat*mat, even on modern GPUs. It does use less VRAM.

@LostRuins
Collaborator

You can, but unless you're willing to build multiple sets of CUDA kernels and swap between them based on batch size, you may lose the small-batch MMQ optimizations, since those are determined at compile time.

#if defined(CUDA_USE_TENSOR_CORES)
#define  MMQ_X_Q4_0_AMPERE 4
#define  MMQ_Y_Q4_0_AMPERE 32
#define NWARPS_Q4_0_AMPERE 4
#else
#define  MMQ_X_Q4_0_AMPERE 64
#define  MMQ_Y_Q4_0_AMPERE 128
#define NWARPS_Q4_0_AMPERE 4
#endif

Or maybe some intermediate value between these two options could provide a good compromise?

I personally don't benefit much from the small-batch optimization. I am on an RTX 2060 6GB, which should be CC 7.5, but maybe my hardware is kinda crappy either way and cuBLAS wasn't that special for me. So I am back to using MMQ with the original (large-batch) values.
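
For illustration, a minimal hypothetical sketch of the "compile several kernel configurations and select one at runtime" idea (made-up names such as SmallBatchCfg and launch_mul_mat_q; the kernel body is a placeholder, not the actual ggml-cuda MMQ code):

// Hypothetical sketch: instantiate the same kernel with two tile
// configurations (mirroring the MMQ_X/MMQ_Y/NWARPS constants above) and
// choose between them at runtime based on batch size, instead of fixing
// the choice with a compile-time #define. Not the real ggml-cuda code.
#include <cstdio>
#include <cuda_runtime.h>

struct SmallBatchCfg { static const int mmq_x = 4;  static const int mmq_y = 32;  static const int nwarps = 4; };
struct LargeBatchCfg { static const int mmq_x = 64; static const int mmq_y = 128; static const int nwarps = 4; };

template <typename Cfg>
__global__ void mul_mat_q_stub(const float * x, const float * y, float * dst, int n) {
    // Placeholder body: a real MMQ kernel would use Cfg::mmq_x / Cfg::mmq_y /
    // Cfg::nwarps to size its shared-memory tiles and warp layout. Here we
    // only do an element-wise multiply so the example compiles and runs.
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = x[i] * y[i];
    }
}

// Runtime dispatch: both instantiations are compiled into the binary and the
// batch size decides which one to launch (the threshold here is arbitrary).
static void launch_mul_mat_q(const float * x, const float * y, float * dst, int n, int n_batch) {
    const int block = 256;
    const int grid  = (n + block - 1) / block;
    if (n_batch <= 8) {
        mul_mat_q_stub<SmallBatchCfg><<<grid, block>>>(x, y, dst, n);
    } else {
        mul_mat_q_stub<LargeBatchCfg><<<grid, block>>>(x, y, dst, n);
    }
}

int main() {
    const int n = 1024;
    float *x, *y, *dst;
    cudaMallocManaged(&x,   n * sizeof(float));
    cudaMallocManaged(&y,   n * sizeof(float));
    cudaMallocManaged(&dst, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    launch_mul_mat_q(x, y, dst, n, /*n_batch=*/1);   // small-batch configuration
    launch_mul_mat_q(x, y, dst, n, /*n_batch=*/512); // large-batch configuration
    cudaDeviceSynchronize();
    printf("dst[0] = %.1f\n", dst[0]);
    cudaFree(x); cudaFree(y); cudaFree(dst);
    return 0;
}

The cost is longer compile times and a larger binary, since both instantiations end up in the build, but the small-batch vs. large-batch choice would no longer need to be baked in at compile time.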

@ggerganov
Owner

@LostRuins I think you mentioned earlier that for full offload, the new version on the RTX 2060 is faster compared to MMQ, and that you observe a regression for not-fully-offloaded models due to offloading 1-2 fewer GPU layers.

How big is the latter regression? Is it a regression both for short and long (>1024) contexts?
If you can post some numbers for PP and TG, that would help to get a sense of the impact of the change.

@LostRuins
Collaborator

@ggerganov Sure, let me try to do a bit more methodical testing with llama-bench instead for my RTX 2060 6GB.
Let me test the newest CI build first and compare it with the CI build from 24 Oct, before all these changes. All builds were taken directly from this repo and run with the commands listed here.

NEW = llama-b1468-bin-win-cublas-cu11.7.1-x64, the latest CI build
OLD = llama-b1420-bin-win-cublas-cu11.7.1-x64, a rollback to before all the new changes, running the same process

7B, ngl=99 NEW

PS E:\LLaMA\llamacpp> .\llama-bench.exe -m e:\LLaMA\models\airoboros-mistral2.2-7b.Q4_K_S.gguf
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 7B mostly Q4_K - Small   |   3.86 GiB |     7.24 B | CUDA       |  99 | pp 512     |  998.41 ± 114.47 |
| llama 7B mostly Q4_K - Small   |   3.86 GiB |     7.24 B | CUDA       |  99 | tg 128     |     39.39 ± 0.16 |

build: b12fa0d (1468)

7B, ngl=99 OLD

PS E:\LLaMA\llamacpp> .\llama-bench.exe -m e:\LLaMA\models\airoboros-mistral2.2-7b.Q4_K_S.gguf
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 7B mostly Q4_K - Small   |   3.86 GiB |     7.24 B | CUDA       |  99 | pp 512     |    445.26 ± 7.08 |
| llama 7B mostly Q4_K - Small   |   3.86 GiB |     7.24 B | CUDA       |  99 | tg 128     |     38.04 ± 0.12 |

build: 469c9ad (1420)

13B results:

Trying the max layers I can offload before going OOM, counting downwards... which is 23 for the benchmark tool:

13B, ngl=23 NEW

PS E:\LLaMA\llamacpp> .\llama-bench.exe -m e:\LLaMA\models\mythomax-l2-13b.Q4_K_M.gguf -ngl 24
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |

CUDA error 2 at D:\a\llama.cpp\llama.cpp\ggml-cuda.cu:5770: out of memory
current device: 0
PS E:\LLaMA\llamacpp> .\llama-bench.exe -m e:\LLaMA\models\mythomax-l2-13b.Q4_K_M.gguf -ngl 23
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 13B mostly Q4_K - Medium |   7.33 GiB |    13.02 B | CUDA       |  23 | pp 512     |    153.94 ± 5.69 |
| llama 13B mostly Q4_K - Medium |   7.33 GiB |    13.02 B | CUDA       |  23 | tg 128     |      4.29 ± 0.05 |

build: b12fa0d (1468)

13B, ngl=25 OLD

PS E:\LLaMA\llamacpp> .\llama-bench.exe -m e:\LLaMA\models\mythomax-l2-13b.Q4_K_M.gguf -ngl 25
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 13B mostly Q4_K - Medium |   7.33 GiB |    13.02 B | CUDA       |  25 | pp 512     |    129.11 ± 2.64 |
| llama 13B mostly Q4_K - Medium |   7.33 GiB |    13.02 B | CUDA       |  25 | tg 128     |      4.69 ± 0.15 |

build: 469c9ad (1420)

So across the board, cuBLAS helps with PP, no doubt. But TG takes a hit in the new versions.
@ggerganov strangely, even with full offload on 7B, my TG speed seems to have taken a small hit too (39 to 38 t/s), though I think the difference is minor enough.

@LostRuins
Collaborator

One more run with b1420 at ngl=23 just to compare

13B, ngl=23 OLD

ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 13B mostly Q4_K - Medium |   7.33 GiB |    13.02 B | CUDA       |  23 | pp 512     |    129.32 ± 0.93 |
| llama 13B mostly Q4_K - Medium |   7.33 GiB |    13.02 B | CUDA       |  23 | tg 128     |      4.28 ± 0.10 |

build: 469c9ad (1420)

So with layer parity, TG speeds are comparable between b1420 and b1468.

@LostRuins
Collaborator

LostRuins commented Nov 2, 2023

@Dampfinchen try comparing your setup with the exact same models and layer counts, using the two builds b1420 and b1468 from the llama.cpp releases page, and see your results. b1468 might have solved your issue after slaren's fix.

@ggerganov
Owner

Thank you, these are overall in line with my expectations.

@ggerganov strangely, even with full offload on 7B, my TG speed seems to have taken a small hit too (39 to 38 t/s), though I think the difference is minor enough.

I think you might be reading this wrong. From what I see, the new build (~39 t/s) is a bit faster than the old one even for TG when the model is fully offloaded. This is nice to see, although it deviates from my expectation of a slight regression on short TG 128 tests. In any case, I believe you will see even bigger gains with the new build when the context is large.

The TG regression (-8.8%) with partial offloading is expected due to the fewer layers, but at least PP got a non-negligible improvement (+19.0%). In the future, I think we will compensate for this as I explained in an earlier comment.

@LostRuins
Collaborator

LostRuins commented Nov 2, 2023

Agreed, prompt processing using cuBLAS is indeed faster for my card. I know Pascal users did experience major PP slowdowns, but I think that has been resolved already.

Edit: for reference:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 531.18                 Driver Version: 531.18       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                      TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2060       WDDM | 00000000:01:00.0 Off |                  N/A |
| N/A   47C    P8               14W /  N/A|      0MiB /  6144MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

@Dampfinchen

Dampfinchen commented Nov 2, 2023

Alright, here's my result. Text generation speed is 4x slower compared to LostRuins's result with the same hardware.

  Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 13B mostly Q4_K - Small  |   6.90 GiB |    13.02 B | CUDA       |  25 | pp 512     |    149.15 ± 2.24 |
| llama 13B mostly Q4_K - Small  |   6.90 GiB |    13.02 B | CUDA       |  25 | tg 128     |      1.08 ± 0.07 |

build: 1efae9b (1469)

I don't think further tests on older builds are needed; on those I get the same result as LostRuins, around 4.3 tokens/s TG. Older builds from before the huge CUDA changes worked flawlessly with my current system configuration. But if you need proof, I can provide it. Just ask.

Edit: I tested various builds of llama.cpp and they all exhibit the same problem. By "older builds" I meant koboldcpp, which somehow works, even though it's based on llama.cpp.

@LostRuins
Collaborator

LostRuins commented Nov 2, 2023

We have the same card, but I am using an older driver, which will OOM at 25 layers. I suspect your newer driver is doing something funky, as 25 layers will not fit.

@Dampfinchen try ngl 23.

@Dampfinchen

Dampfinchen commented Nov 2, 2023

We have the same card, but I am using an older driver, which will OOM at 25 layers. I suspect your newer driver is doing something funky, as 25 layers will not fit.

I do have the latest driver (546.01); however, I disabled the system memory fallback policy, so it OOMs when there's not enough VRAM, just like with the old drivers. The new driver has better memory management, allowing me to offload more layers. Also, I'm using Q4_K_S while you're using Q4_K_M, so I can naturally offload more layers.

Testing with ngl 23 yielded no improvement.

If swapping to RAM were the issue, prompt processing would slow down too, but that is not the case.

@Dampfinchen

Dampfinchen commented Nov 2, 2023

Alright, I've tried everything. I tried various commits, including ones from before tensor core support and batched CUDA processing, but it was always the same slow result. I also tried using MMQ only. Same thing there.

I don't know what's going on here. Koboldcpp version 1.47.2, which builds on llama.cpp, doesn't have this issue with the same number of layers, the same model and system configuration, and a similarly sized prompt.

Since the only major difference between LostRuins's system and mine is the driver, I suspect it could have something to do with the system memory fallback policy Nvidia introduced with the latest driver (https://nvidia.custhelp.com/app/answers/detail/a_id/5490). However, I already checked that feature and it works great: when VRAM is full and the policy is set to prefer no system memory fallback, it crashes like before.

Hmm.

It's worth mentioning that 7B with partial offloading (28 layers) is super slow as well, but it performs as expected when using full GPU offloading.

@Dampfinchen

Alright, thanks to slaren I was able to fix the problem. The issue was that I was not compiling with AVX2 support, as I had assumed it's just enabled by default, which it isn't anymore.

Performance is great as expected with AVX2. Case closed!
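
For anyone hitting the same symptom, a hedged example of making those options explicit at configure time (this assumes the LLAMA_AVX2 and LLAMA_CUBLAS CMake options present in llama.cpp around these builds; check your CMakeLists.txt for the exact names):

cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_AVX2=ON
cmake --build . --config Release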
