CUDA support for QMatMul #655

ggml contains f32 x quantized-block matmul CUDA kernels, which could be used to speed up inference on GPU-accelerated machines significantly. Currently the QMatMul operation only holds the quantized data on the CPU, so it needs to be extended to allow offloading that data to the GPU, which should optimally happen only once, during the loading process. The matmul CUDA kernel could then be called with references to the f32 tensor and the quantized blocks.
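To make the proposal concrete, here is a minimal Rust sketch of what a device-aware QMatMul could look like. Every name below (QStorage, CudaDevice, htod_copy, the kernel stubs) is a hypothetical stand-in rather than the project's actual API, and the CUDA pieces are stubbed so the sketch is self-contained.

```rust
// Hypothetical sketch only: all types below are stand-ins, not a real
// API. The idea: upload the quantized blocks once at load time, then
// dispatch every matmul against the device-resident copy.

type Result<T> = std::result::Result<T, String>;

struct CudaDevice;         // stand-in for a real CUDA device handle
struct CudaSlice(Vec<u8>); // stand-in for device-resident memory

impl CudaDevice {
    fn htod_copy(&self, host: &[u8]) -> Result<CudaSlice> {
        Ok(CudaSlice(host.to_vec())) // real code: cudaMemcpy host-to-device
    }
}

enum QStorage {
    Cpu(Vec<u8>),    // raw quantized blocks in host memory
    Cuda(CudaSlice), // the same bytes, resident on the GPU
}

struct QMatMul {
    storage: QStorage,
}

impl QMatMul {
    /// Offload the quantized blocks once, during model loading, so no
    /// host-to-device transfer happens on the hot path.
    fn offload(&mut self, dev: &CudaDevice) -> Result<()> {
        let host = match &self.storage {
            QStorage::Cpu(bytes) => bytes.clone(),
            QStorage::Cuda(_) => return Ok(()), // already on the GPU
        };
        self.storage = QStorage::Cuda(dev.htod_copy(&host)?);
        Ok(())
    }

    /// Dispatch: the CUDA path would call ggml's f32 x quantized-block
    /// kernel with references to the activations and the blocks.
    fn forward(&self, xs: &[f32]) -> Result<Vec<f32>> {
        match &self.storage {
            QStorage::Cuda(blocks) => cuda_qmatmul(xs, blocks),
            QStorage::Cpu(blocks) => cpu_qmatmul(xs, blocks),
        }
    }
}

// Kernel stubs so the sketch compiles; real versions would bind to
// ggml's CUDA kernel and to the existing CPU implementation.
fn cuda_qmatmul(_xs: &[f32], _blocks: &CudaSlice) -> Result<Vec<f32>> {
    unimplemented!("bind ggml's f32 x quantized-block CUDA kernel here")
}
fn cpu_qmatmul(_xs: &[f32], _blocks: &[u8]) -> Result<Vec<f32>> {
    unimplemented!("existing CPU matmul path")
}
```

The key design point is that offload runs once at load time, so the per-call forward path never pays for a host-to-device transfer of the weights.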
Comments

Have you ever seen comparisons with GPTQ? I was under the impression the quality loss was much lower than with GGML (also, everything would run in f16, with only the weights in 4 bits). GPTQ kernels already exist.
Personally, I find it very hard to compare GGML with GPTQ. Some benchmarks claim GGML performs better, some claim GPTQ does, and the same goes for inference speed. In my opinion, I wouldn't compare GGML and GPTQ at all, as they work fundamentally differently: GGML's quantization is a simple numerical process, while GPTQ is an actual post-training quantization strategy that has to be calibrated against a real dataset. Ideally, support for both techniques would be great. Both also need a way of storing their quantized data on the GPU, which means some mechanism for enabling GPU storage for custom data structures would be valuable.
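As an aside, the "simple numerical process" can be made concrete. Below is a rough, self-contained Rust sketch in the spirit of ggml's Q4_0 scheme (32 weights per block, one shared scale). The layout is simplified relative to the real format, which packs two 4-bit values per byte and stores the scale as f16; the point is that no calibration dataset is involved, in contrast to GPTQ.

```rust
// Rough sketch of ggml-style Q4_0 block quantization: a purely
// numerical rounding step, no calibration data involved. The real
// format packs two 4-bit values per byte; one byte per value here
// keeps the sketch readable.

const BLOCK: usize = 32;

struct BlockQ4 {
    d: f32,          // per-block scale (f16 in the real format)
    qs: [u8; BLOCK], // 4-bit values, stored one per byte for clarity
}

fn quantize_block(x: &[f32; BLOCK]) -> BlockQ4 {
    // Track the value with the largest magnitude, keeping its sign.
    let max = x
        .iter()
        .copied()
        .fold(0.0f32, |m, v| if v.abs() > m.abs() { v } else { m });
    let d = max / -8.0;
    let id = if d != 0.0 { 1.0 / d } else { 0.0 };
    let mut qs = [0u8; BLOCK];
    for (q, &v) in qs.iter_mut().zip(x.iter()) {
        // Round into [0, 15] around an implicit zero-point of 8.
        *q = ((v * id + 8.5) as i32).clamp(0, 15) as u8;
    }
    BlockQ4 { d, qs }
}

/// Dequantize: x ≈ (q - 8) * d. Pure arithmetic, no dataset needed.
fn dequantize_block(b: &BlockQ4) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for (o, &q) in out.iter_mut().zip(b.qs.iter()) {
        *o = (q as i32 - 8) as f32 * b.d;
    }
    out
}
```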
GPTQ doesn't need anything custom, actually. It uses float32 for the int4 values, and all the other tensors are ordinary tensors. I've always wondered about the performance of putting the scales and zeros close to their quantized counterparts. It's good to know there are GPU kernels for GGML quantization; then that performance could actually be measured.
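To illustrate why nothing custom is needed on the storage side, here is a small Rust sketch of the usual GPTQ-style reconstruction, w = scale * (q - zero), with the scales and zero-points kept as separate per-group tensors. The group size of 128 and the flat column layout are assumptions made for this example, not fixed parts of the format.

```rust
// Illustrative sketch of GPTQ-style dequantization. The int4 values,
// scales, and zero-points all live in ordinary tensors; group size 128
// and the flat column layout here are assumptions for the example.

const GROUP_SIZE: usize = 128;

/// Reconstruct one weight column: w[i] = scales[g] * (q[i] - zeros[g]),
/// where g = i / GROUP_SIZE selects the per-group scale and zero-point.
fn dequant_column(q: &[u8], scales: &[f32], zeros: &[f32]) -> Vec<f32> {
    assert_eq!(q.len(), scales.len() * GROUP_SIZE);
    assert_eq!(scales.len(), zeros.len());
    q.iter()
        .enumerate()
        .map(|(i, &qi)| {
            let g = i / GROUP_SIZE;
            scales[g] * (f32::from(qi) - zeros[g])
        })
        .collect()
}
```

Interleaving each scale and zero-point with its block of quantized values, as ggml's block formats do, improves memory locality; keeping them as separate tensors, as sketched here, is what lets GPTQ get by with ordinary tensor types.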