Force MMQ to YES and TensorCores to NO #1162

Closed
neowisard opened this issue Feb 6, 2024 · 3 comments

neowisard commented Feb 6, 2024

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

After starting the server with

python3 -m llama_cpp.server --model /ai/models/functionary-7b-v1.Q5_K.gguf --n_gpu_layers 99 --main_gpu 1 --tensor_split 0.45 0.55 --n_ctx 4096 --host 192.168.0.55 --port 5000 --api_key toofoo

I expect the compiled llama.cpp parameters to be reported like this:

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no

as in ggerganov/llama.cpp#3869 (comment)
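
For reference, a standalone llama.cpp build that produces this output can be configured roughly as follows. This is only a sketch, assuming the LLAMA_CUBLAS and LLAMA_CUDA_FORCE_MMQ CMake options that llama.cpp exposes around this build; forcing MMQ is what turns the tensor-core path off:

cmake -B build -DLLAMA_CUBLAS=on -DLLAMA_CUDA_FORCE_MMQ=on
cmake --build build --config Release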

Current Behavior

(cpppython) root@oc:/ai/llama-cpp-python# LLAMA_CUDA_FORCE_MMQ=1 python3 -m llama_cpp.server --model /ai/models/functionary-7b-v1.Q5_K.gguf --n_gpu_layers 99 --main_gpu 1 --tensor_split 0.45 0.55 --n_ctx 4096 --host 192.168.0.55 --port 5000 --api_key toofoo
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes

When I start llama.cpp or LocalAI directly, everything works as expected:

(cpppython) root@oc:/ai/llama-cpp-python/vendor/llama.cpp/build/bin# ./benchmark
main: build = 2074 (098f6d73)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Starting Test
Allocating Memory of size 800194560 bytes, 763 MB
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 2 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: yes
Device 1: Tesla P40, compute capability 6.1, VMM: yes
Creating new tensors

Environment and Context

Using the latest llama-cpp-python, built from source:

(cpppython) root@oc:/ai/llama-cpp-python# git log
commit 7467f12 (HEAD -> main, origin/main, origin/HEAD)
Author: Andrei [email protected]
Date: Fri Feb 2 12:18:55 2024 -0500

Revert "Fix: fileno error google colab (#729) (#1156)" (#1157)

llama.cpp is compiled and linked as a symlink at /ai/llama-cpp-python/vendor/llama.cpp.

  • Virtual hardware (Linux guest):

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
CPU family: 6
Model: 45
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 2
Stepping: 7
BogoMIPS: 5799.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 cx16 pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid tsc_adjust xsaveopt arat umip md_clear arch_capabilities
Virtualization features:
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full

$ python3 --version
Python 3.10.13

$ make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.

$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

Failure Information (for bugs)

This doesn't block anything; it only affects performance. Is there any way I can force these settings?
GGML_CUDA_FORCE_MMQ to YES
CUDA_USE_TENSOR_CORES to NO
Example environment info:

llama-cpp-python$ pip list | egrep "uvicorn|fastapi|sse-starlette|numpy"
fastapi           0.109.1
numpy             1.26.3
sse-starlette     2.0.0
uvicorn           0.27.0.post1



neowisard changed the title from "Force MMQ ot YES and TensorCores to NO in LLAMA.CPP" to "Force MMQ to YES and TensorCores to NO" on Feb 6, 2024
abetlen (Owner) commented Feb 6, 2024

Hey @neowisard, I think LLAMA_CUDA_FORCE_MMQ is a build flag, so you would need to set it when you (re-)install llama-cpp-python.
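
For example, a reinstall along these lines should bake the flag in. This is only a sketch, assuming the CMAKE_ARGS mechanism described in the llama-cpp-python README plus the LLAMA_CUBLAS / LLAMA_CUDA_FORCE_MMQ CMake options in the vendored llama.cpp; adjust for your CUDA setup:

CMAKE_ARGS="-DLLAMA_CUBLAS=on -DLLAMA_CUDA_FORCE_MMQ=on" pip install --upgrade --force-reinstall --no-cache-dir llama-cpp-python

Judging by the log lines above, CUDA_USE_TENSOR_CORES appears to be derived from GGML_CUDA_FORCE_MMQ at compile time, so there is no separate switch for it.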

neowisard (Author) commented Feb 7, 2024

> Hey @neowisard, I think LLAMA_CUDA_FORCE_MMQ is a build flag, so you would need to set it when you (re-)install llama-cpp-python.

Yep, it is set, and other apps work with it.
I also checked a lot of tickets and discussions and didn't see this parameter exposed anywhere else.

Can I somehow override these parameters using llama-cpp-python?

neowisard (Author) commented

My mistake, I just needed to reinstall from pip.
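
For anyone who hits the same thing: after reinstalling with the build flag, the compiled settings can be spot-checked by loading a model and grepping the startup log. A sketch, reusing the model path from this issue (any small GGUF works):

python3 -c "from llama_cpp import Llama; Llama(model_path='/ai/models/functionary-7b-v1.Q5_K.gguf', n_gpu_layers=1)" 2>&1 | grep -E 'FORCE_MMQ|TENSOR_CORES'

This should print GGML_CUDA_FORCE_MMQ: yes and CUDA_USE_TENSOR_CORES: no, matching the standalone llama.cpp benchmark output above.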
