CTX Processing regression for Pascal - Commit 2b4ea35 #3869
Comments
Sounds like the same issue as mine (#3780)
Try a
I think the problem started after that commit
Switching to the 'cuda-cublas-opts' branch fixed it for me.
@askmyteapot Can you check if branch try-fix-3869 fixes the issue with LLAMA_CUDA_FORCE_MMQ=1 set?
@ggerganov
Built with -DLLAMA_CUDA_FORCE_MMQ=ON and cuBLAS on.
That has corrected the issue. But is there any way the force-MMQ compile-time flag could become a runtime flag?
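For reference, a build along these lines should reproduce that configuration; the branch name and option names come from the comments above, while paths and generator choices are only illustrative:

```sh
# Check out the proposed fix branch and build with cuBLAS plus forced MMQ.
# LLAMA_CUBLAS / LLAMA_CUDA_FORCE_MMQ are the options mentioned in this thread;
# adjust the generator/toolchain for your platform (MSVC 2022 per this issue).
git fetch origin try-fix-3869
git checkout try-fix-3869
cmake -B build -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_FORCE_MMQ=ON
cmake --build build --config Release
```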
I agree that it would be nice if we had a runtime flag that could enable quantized mat*mat, even on modern GPUs. It does use less VRAM.
You can, but unless you're willing to build multiple sets of CUDA kernels and swap between them based on batch size, you may lose the small-batch MMQ optimizations, since those are determined at compile time.
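Short of an actual runtime flag, one crude user-level approximation of that idea is keeping two separate builds side by side and picking one per run (this loses the automatic per-batch-size switching described above); a minimal sketch, with the build directory names purely illustrative:

```sh
# One build with forced MMQ (lower VRAM, small-batch kernels fixed at compile
# time) and one with the default cuBLAS path; switch binaries per run.
cmake -B build-mmq -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_FORCE_MMQ=ON
cmake --build build-mmq --config Release

cmake -B build-cublas -DLLAMA_CUBLAS=ON
cmake --build build-cublas --config Release
```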
Or maybe some intermediate value between these two options could provide a good compromise? I personally don't benefit much from the small-batch optimization. I am on an RTX 2060 6GB, which should be CC 7.5, but maybe my hardware is kinda crappy either way and cuBLAS wasn't that special for me. So I am back to using MMQ with the original (large-batch) values.
@LostRuins I think you mentioned earlier that for full offload, the new version on the RTX 2060 is faster compared to MMQ, and that you observe a regression for models that are not fully offloaded due to having 1-2 fewer GPU layers. How big is the latter regression? Is it a regression both for short and long (>1024) contexts?
@ggerganov Sure, let me try to do a bit more methodical testing with the bencher instead for my RTX 2060 6GB. NEW = running from llama-b1468-bin-win-cublas-cu11.7.1-x64 (latest CI build).

7B, ngl=99 NEW
7B, ngl=99 OLD

13B results: trying the max layers I can offload before going OOM, counting downwards... which is 23 for the benchmark tool:

13B, ngl=23 NEW
13B, ngl=25 OLD

So across the board, cuBLAS helps with PP, no doubt. But TG takes a hit in the new versions.
One more run with b1420 at ngl=23, just to compare:

13B, ngl=23 OLD

So with layer parity, TG speeds are comparable between b1420 and b1468.
@Dampfinchen try comparing your setup with the exact same models and layer counts, using the two builds b1420 and b1468 from the llama.cpp releases page, and see your results. b1468 might have solved your issue after slaren's fix.
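If the bencher mentioned above is llama-bench from those release archives, a side-by-side run could look like the sketch below; the model path and extraction directories are placeholders, and the -ngl value mirrors the numbers in this thread:

```sh
# Same model and layer count, two release builds (b1420 vs b1468).
# -p/-n are llama-bench's prompt-processing and text-generation test sizes.
./b1420/llama-bench -m models/13b.Q4_K_S.gguf -ngl 23 -p 512 -n 128
./b1468/llama-bench -m models/13b.Q4_K_S.gguf -ngl 23 -p 512 -n 128
```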
Thank you, these are overall in line with my expectations.
I think you might be reading this wrong. From what I see, the new build, at ~39 t/s, is a bit faster than the old one even for TG when the model is fully offloaded. This is nice to see, although it deviates from my expectation of a slight regression in the short TG 128 tests. In any case, I believe you will see even bigger gains with the new build when the context is large. The TG regression (-8.8%) with partial offloading is expected due to the fewer layers, but at least PP got a non-negligible improvement (+19.0%). In the future, I think we will compensate for this as I explained in an earlier comment.
Agreed, prompt processing using cuBLAS is indeed faster for my card. I know Pascal users did experience major PP slowdowns, though I think that has been resolved already.
Edit: for reference
Alright, here's my result. Text generation speed is 4x slower compared to LostRuins' result with the same hardware.

```
Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5
build: 1efae9b (1469)
```

I don't think a further test with older builds is needed; I get the same result as LostRuins of around 4.3 token/s TG in those. Older builds from before the huge CUDA changes worked flawlessly with my current system configuration. But if you need proof, I can provide that. Just ask.

Edit: I tested various builds of llama.cpp and they exhibit the same problem. With "older builds" I meant koboldcpp, which works somehow, even though it's based on llama.cpp.
We have the same card, but I am using an older driver, which will OOM at 25 layers. I suspect your newer driver is doing something funky, as 25 layers will not fit. @Dampfinchen try ngl 23
I do have the latest driver (546.01); however, I disabled the system memory fallback policy so it OOMs when there's not enough VRAM, just like with the old drivers. The new driver has better memory management, allowing me to do more layers. Also, I'm using Q4_K_S while you're using Q4_K_M, so I can naturally offload more layers. Testing with ngl 23 yielded no improvement in my test. If swapping to RAM were the issue, prompt processing would slow down too, but that is not the case.
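One way to sanity-check that assumption is to watch VRAM usage while generating; nvidia-smi supports a polling query like the one below (the interval is arbitrary):

```sh
# Poll GPU memory once per second during a run; memory.used pinned at the
# 6GB limit would hint that the driver is falling back to system RAM.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
```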
Alright, I've tried everything. I did try various commits, including from before tensor core support and batched CUDA processing, but it was always the same slow result. I also tried using MMQ only. Same thing. IDK what's going on here. Koboldcpp version 1.47.2, which builds on llama.cpp, doesn't have this issue with the same amount of layers, model, system configuration, and a similarly sized prompt. Since the only major difference between LostRuins' system and mine is the driver, I suspect it could have something to do with the system memory policy Nvidia introduced with the latest driver (https://nvidia.custhelp.com/app/answers/detail/a_id/5490). However, I already checked its function and it works great: when the VRAM is full, it crashes like before when set to "prefer no system memory fallback". Hmm. It's worth mentioning that 7B with partial offloading (28 layers) is super slow as well, but performs as expected when using full GPU offloading.
Alright, thanks to slaren I was able to fix the problem. The issue was that I was not compiling with AVX2 support, as I had assumed that's just enabled by default, which it isn't anymore. Performance is great as expected with AVX2. Case closed!
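For anyone hitting the same thing, a configure roughly like the following makes the CPU flags explicit; LLAMA_AVX2 and LLAMA_NATIVE are the CMake options llama.cpp exposed around these builds, so double-check the names against your checkout:

```sh
# Explicitly enable AVX2 (or full native optimization) instead of relying on
# defaults, which can differ between generators/toolchains.
cmake -B build -DLLAMA_CUBLAS=ON -DLLAMA_AVX2=ON
# or, when building on the machine the binary will run on:
cmake -B build -DLLAMA_CUBLAS=ON -DLLAMA_NATIVE=ON
cmake --build build --config Release
```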
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
There is a regression in context processing introduced in commit 2b4ea35.
This is specifically for Pascal (6.1), which has 1/64th FP16 performance. The problem gets worse with longer context, reaching up to 6x slower by 8k CTX.
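A repro along these lines should make the slowdown visible as context grows; the model and prompt paths are placeholders, and -c 8192 matches the context length mentioned above:

```sh
# Process a long prompt at 8k context on the P40 (placeholder paths);
# compare timings for builds before and after commit 2b4ea35.
./main -m models/model.Q4_K_M.gguf -ngl 99 -c 8192 -f long_prompt.txt
```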
Current Behavior
Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
Ryzen 5800X + 64GB DDR4-3733
RTX 3060 Ti (8GB) + Tesla P40 (24GB)
Operating System: Windows 11
SDK version: MSVC 2022
@LostRuins