Force MMQ to YES and TensorCores to NO #1162

Closed
neowisard opened this issue Feb 6, 2024 · 3 comments

neowisard commented Feb 6, 2024

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

After starting the server with

python3 -m llama_cpp.server --model /ai/models/functionary-7b-v1.Q5_K.gguf --n_gpu_layers 99 --main_gpu 1 --tensor_split 0.45 0.55 --n_ctx 4096 --host 192.168.0.55 --port 5000 --api_key toofoo

I expect the compiled llama.cpp parameters to be reported like this:

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no

as in ggerganov/llama.cpp#3869 (comment)
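
For reference, a standalone llama.cpp build that produces this output can be configured roughly as follows. This is only a sketch, assuming the LLAMA_CUBLAS and LLAMA_CUDA_FORCE_MMQ CMake options that llama.cpp exposes around this build; forcing MMQ is what turns the tensor-core path off:

cmake -B build -DLLAMA_CUBLAS=on -DLLAMA_CUDA_FORCE_MMQ=on
cmake --build build --config Release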

Current Behavior

(cpppython) root@oc:/ai/llama-cpp-python# LLAMA_CUDA_FORCE_MMQ=1 python3 -m llama_cpp.server --model /ai/models/functionary-7b-v1.Q5_K.gguf --n_gpu_layers 99 --main_gpu 1 --tensor_split 0.45 0.55 --n_ctx 4096 --host 192.168.0.55 --port 5000 --api_key toofoo
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes

When I start llama.cpp or LocalAI directly, everything works as expected:

(cpppython) root@oc:/ai/llama-cpp-python/vendor/llama.cpp/build/bin# ./benchmark
main: build = 2074 (098f6d73)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Starting Test
Allocating Memory of size 800194560 bytes, 763 MB
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 2 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: yes
Device 1: Tesla P40, compute capability 6.1, VMM: yes
Creating new tensors

Environment and Context

Using the latest llama-cpp-python, built from source:

(cpppython) root@oc:/ai/llama-cpp-python# git log
commit 7467f12 (HEAD -> main, origin/main, origin/HEAD)
Author: Andrei [email protected]
Date: Fri Feb 2 12:18:55 2024 -0500

Revert "Fix: fileno error google colab (#729) (#1156)" (#1157)

llama.cpp is compiled and linked as a symlink at /ai/llama-cpp-python/vendor/llama.cpp.

  • Virtual hardware (Linux guest):

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
CPU family: 6
Model: 45
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 2
Stepping: 7
BogoMIPS: 5799.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 cx16 pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid tsc_adjust xsaveopt arat umip md_clear arch_capabilities
Virtualization features:
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full

$ python3 --version
Python 3.10.13

$ make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.

$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

Failure Information (for bugs)

This doesn't block anything; it only affects performance. Is there any way I can force these settings?
GGML_CUDA_FORCE_MMQ to YES
CUDA_USE_TENSOR_CORES to NO
Example environment info:

llama-cpp-python$ pip list | egrep "uvicorn|fastapi|sse-starlette|numpy"
fastapi           0.109.1
numpy             1.26.3
sse-starlette     2.0.0
uvicorn           0.27.0.post1



neowisard changed the title from "Force MMQ ot YES and TensorCores to NO in LLAMA.CPP" to "Force MMQ to YES and TensorCores to NO" on Feb 6, 2024
abetlen (Owner) commented Feb 6, 2024

Hey @neowisard, I think LLAMA_CUDA_FORCE_MMQ is a build flag, so you would need to set it when you (re-)install llama-cpp-python.
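
For example, a reinstall along these lines should bake the flag in. This is only a sketch, assuming the CMAKE_ARGS mechanism described in the llama-cpp-python README plus the LLAMA_CUBLAS / LLAMA_CUDA_FORCE_MMQ CMake options in the vendored llama.cpp; adjust for your CUDA setup:

CMAKE_ARGS="-DLLAMA_CUBLAS=on -DLLAMA_CUDA_FORCE_MMQ=on" pip install --upgrade --force-reinstall --no-cache-dir llama-cpp-python

Judging by the log lines above, CUDA_USE_TENSOR_CORES appears to be derived from GGML_CUDA_FORCE_MMQ at compile time, so there is no separate switch for it.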

neowisard (Author) commented Feb 7, 2024

> Hey @neowisard, I think LLAMA_CUDA_FORCE_MMQ is a build flag, so you would need to set it when you (re-)install llama-cpp-python.

Yep, it is set, and other apps work with it.
I also checked a lot of tickets and discussions and didn't see this parameter exposed anywhere else.

Can I somehow override these parameters using llama-cpp-python?

neowisard (Author) commented

My mistake, I just needed to reinstall from pip.
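
For anyone who hits the same thing: after reinstalling with the build flag, the compiled settings can be spot-checked by loading a model and grepping the startup log. A sketch, reusing the model path from this issue (any small GGUF works):

python3 -c "from llama_cpp import Llama; Llama(model_path='/ai/models/functionary-7b-v1.Q5_K.gguf', n_gpu_layers=1)" 2>&1 | grep -E 'FORCE_MMQ|TENSOR_CORES'

This should print GGML_CUDA_FORCE_MMQ: yes and CUDA_USE_TENSOR_CORES: no, matching the standalone llama.cpp benchmark output above.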
