CUDA support for QMatMul #655

ggml contains f32 x quantized-block matmul CUDA kernels, which could be used to speed up inference on GPU-accelerated machines significantly. Currently the QMatMul operation only holds the quantized data on the CPU, so it needs to be extended to allow offloading that data to the GPU, which should optimally happen only once, during the loading process. The matmul CUDA kernel could then be called with references to the f32 tensor and the quantized blocks.
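To make the proposal concrete, here is a minimal Rust sketch of what a device-aware QMatMul could look like. Every name below (QStorage, CudaDevice, htod_copy, the kernel stubs) is a hypothetical stand-in rather than the project's actual API, and the CUDA pieces are stubbed so the sketch is self-contained.

```rust
// Hypothetical sketch only: all types below are stand-ins, not a real
// API. The idea: upload the quantized blocks once at load time, then
// dispatch every matmul against the device-resident copy.

type Result<T> = std::result::Result<T, String>;

struct CudaDevice;         // stand-in for a real CUDA device handle
struct CudaSlice(Vec<u8>); // stand-in for device-resident memory

impl CudaDevice {
    fn htod_copy(&self, host: &[u8]) -> Result<CudaSlice> {
        Ok(CudaSlice(host.to_vec())) // real code: cudaMemcpy host-to-device
    }
}

enum QStorage {
    Cpu(Vec<u8>),    // raw quantized blocks in host memory
    Cuda(CudaSlice), // the same bytes, resident on the GPU
}

struct QMatMul {
    storage: QStorage,
}

impl QMatMul {
    /// Offload the quantized blocks once, during model loading, so no
    /// host-to-device transfer happens on the hot path.
    fn offload(&mut self, dev: &CudaDevice) -> Result<()> {
        let host = match &self.storage {
            QStorage::Cpu(bytes) => bytes.clone(),
            QStorage::Cuda(_) => return Ok(()), // already on the GPU
        };
        self.storage = QStorage::Cuda(dev.htod_copy(&host)?);
        Ok(())
    }

    /// Dispatch: the CUDA path would call ggml's f32 x quantized-block
    /// kernel with references to the activations and the blocks.
    fn forward(&self, xs: &[f32]) -> Result<Vec<f32>> {
        match &self.storage {
            QStorage::Cuda(blocks) => cuda_qmatmul(xs, blocks),
            QStorage::Cpu(blocks) => cpu_qmatmul(xs, blocks),
        }
    }
}

// Kernel stubs so the sketch compiles; real versions would bind to
// ggml's CUDA kernel and to the existing CPU implementation.
fn cuda_qmatmul(_xs: &[f32], _blocks: &CudaSlice) -> Result<Vec<f32>> {
    unimplemented!("bind ggml's f32 x quantized-block CUDA kernel here")
}
fn cpu_qmatmul(_xs: &[f32], _blocks: &[u8]) -> Result<Vec<f32>> {
    unimplemented!("existing CPU matmul path")
}
```

The key design point is that offload runs once at load time, so the per-call forward path never pays for a host-to-device transfer of the weights.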
Comments

Have you ever seen comparisons with GPTQ? I was under the impression the quality loss was much lower than with GGML (also, everything would run in f16, with only the weights in 4 bits). GPTQ kernels already exist.
Personally, I find it very hard to compare GGML with GPTQ. Some benchmarks claim GGML performs better, some claim GPTQ does, and the same goes for inference speed. In my opinion, I wouldn't compare GGML and GPTQ at all, as they work fundamentally differently: GGML's quantization is a simple numerical process, while GPTQ is an actual post-training quantization strategy that has to be calibrated against a real dataset. Ideally, support for both techniques would be great. Both also need a way of storing their quantized data on the GPU, which means some mechanism for enabling GPU storage for custom data structures would be valuable.
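As an aside, the "simple numerical process" can be made concrete. Below is a rough, self-contained Rust sketch in the spirit of ggml's Q4_0 scheme (32 weights per block, one shared scale). The layout is simplified relative to the real format, which packs two 4-bit values per byte and stores the scale as f16; the point is that no calibration dataset is involved, in contrast to GPTQ.

```rust
// Rough sketch of ggml-style Q4_0 block quantization: a purely
// numerical rounding step, no calibration data involved. The real
// format packs two 4-bit values per byte; one byte per value here
// keeps the sketch readable.

const BLOCK: usize = 32;

struct BlockQ4 {
    d: f32,          // per-block scale (f16 in the real format)
    qs: [u8; BLOCK], // 4-bit values, stored one per byte for clarity
}

fn quantize_block(x: &[f32; BLOCK]) -> BlockQ4 {
    // Track the value with the largest magnitude, keeping its sign.
    let max = x
        .iter()
        .copied()
        .fold(0.0f32, |m, v| if v.abs() > m.abs() { v } else { m });
    let d = max / -8.0;
    let id = if d != 0.0 { 1.0 / d } else { 0.0 };
    let mut qs = [0u8; BLOCK];
    for (q, &v) in qs.iter_mut().zip(x.iter()) {
        // Round into [0, 15] around an implicit zero-point of 8.
        *q = ((v * id + 8.5) as i32).clamp(0, 15) as u8;
    }
    BlockQ4 { d, qs }
}

/// Dequantize: x ≈ (q - 8) * d. Pure arithmetic, no dataset needed.
fn dequantize_block(b: &BlockQ4) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for (o, &q) in out.iter_mut().zip(b.qs.iter()) {
        *o = (q as i32 - 8) as f32 * b.d;
    }
    out
}
```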
GPTQ doesn't need anything custom, actually. It uses float32 for the int4 values, and all the other tensors are ordinary tensors. I've always wondered about the performance of putting the scales and zeros close to their quantized counterparts. It's good to know there are GPU kernels for GGML quantization; then that performance could actually be measured.
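To illustrate why nothing custom is needed on the storage side, here is a small Rust sketch of the usual GPTQ-style reconstruction, w = scale * (q - zero), with the scales and zero-points kept as separate per-group tensors. The group size of 128 and the flat column layout are assumptions made for this example, not fixed parts of the format.

```rust
// Illustrative sketch of GPTQ-style dequantization. The int4 values,
// scales, and zero-points all live in ordinary tensors; group size 128
// and the flat column layout here are assumptions for the example.

const GROUP_SIZE: usize = 128;

/// Reconstruct one weight column: w[i] = scales[g] * (q[i] - zeros[g]),
/// where g = i / GROUP_SIZE selects the per-group scale and zero-point.
fn dequant_column(q: &[u8], scales: &[f32], zeros: &[f32]) -> Vec<f32> {
    assert_eq!(q.len(), scales.len() * GROUP_SIZE);
    assert_eq!(scales.len(), zeros.len());
    q.iter()
        .enumerate()
        .map(|(i, &qi)| {
            let g = i / GROUP_SIZE;
            scales[g] * (f32::from(qi) - zeros[g])
        })
        .collect()
}
```

Interleaving each scale and zero-point with its block of quantized values, as ggml's block formats do, improves memory locality; keeping them as separate tensors, as sketched here, is what lets GPTQ get by with ordinary tensor types.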