Cuda acceleration for quantized model. #1754

LaurentMazare · 2024-02-24T17:08:33Z

This is a first set of changes to enable quantized tensor handling with cuda.
On my Ryzen 2600X with an RTX 2800, inference runs at ~5 tokens/s without cuda and ~20 tokens/s with cuda, using the default quantized example setup (which uses Q4_0).

danielclough · 2024-02-25T21:26:39Z

On RTX 4070Ti:

Running a Mistral model that gives good results with .safetensors.
I have several models with all the quantized formats in my hf account if you want to use any for testing.
They were built with the Candle tensor-tools example, ofc.

I got a garbled mess for q2k, a nice error for q8k (not supported yet), and a good response for q4_0.
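For readers unfamiliar with the Q4_0 format mentioned here, the following is a minimal sketch of how one block is quantized and dequantized, following the GGML reference scheme as I understand it: 32 weights share one scale, and each weight is stored as a 4-bit value. The names `Q4Block`, `quantize_block`, and `dequantize_block` are hypothetical and are not candle's API. The real format stores the scale as f16, while this sketch uses f32 to stay within the standard library. It is not checked against the kernels added in this PR.

```rust
/// Number of weights per Q4_0 block in the GGML/GGUF layout.
const BLOCK_SIZE: usize = 32;

/// One Q4_0-style block: a scale plus 32 four-bit values packed two per byte.
/// Assumption: the on-disk format stores the scale as f16; f32 keeps this sketch std-only.
struct Q4Block {
    d: f32,
    qs: [u8; BLOCK_SIZE / 2],
}

fn quantize_block(xs: &[f32; BLOCK_SIZE]) -> Q4Block {
    // Pick the value with the largest magnitude, keeping its sign.
    let mut max = 0.0f32;
    for &x in xs.iter() {
        if x.abs() > max.abs() {
            max = x;
        }
    }
    // Map that value to -8 so every scaled weight lands in [-8, 8].
    let d = max / -8.0;
    let inv_d = if d != 0.0 { 1.0 / d } else { 0.0 };
    let mut qs = [0u8; BLOCK_SIZE / 2];
    for j in 0..BLOCK_SIZE / 2 {
        // Shift to [0.5, 16.5], truncate, and clamp to the 4-bit range 0..=15.
        let q0 = ((xs[j] * inv_d + 8.5) as i32).clamp(0, 15) as u8;
        let q1 = ((xs[j + BLOCK_SIZE / 2] * inv_d + 8.5) as i32).clamp(0, 15) as u8;
        // Low nibble holds the first half of the block, high nibble the second half.
        qs[j] = q0 | (q1 << 4);
    }
    Q4Block { d, qs }
}

fn dequantize_block(block: &Q4Block) -> [f32; BLOCK_SIZE] {
    let mut out = [0.0f32; BLOCK_SIZE];
    for j in 0..BLOCK_SIZE / 2 {
        // Undo the +8 offset, then rescale.
        let lo = (block.qs[j] & 0x0F) as i32 - 8;
        let hi = (block.qs[j] >> 4) as i32 - 8;
        out[j] = lo as f32 * block.d;
        out[j + BLOCK_SIZE / 2] = hi as f32 * block.d;
    }
    out
}

fn main() {
    // Example input: evenly spaced weights from -4.0 to 3.75.
    let mut xs = [0.0f32; BLOCK_SIZE];
    for (i, x) in xs.iter_mut().enumerate() {
        *x = (i as f32 - 16.0) * 0.25;
    }
    let block = quantize_block(&xs);
    let restored = dequantize_block(&block);
    let max_err = xs
        .iter()
        .zip(restored.iter())
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max);
    println!("scale = {}, max abs error = {}", block.d, max_err);
}
```

With an f16 scale, one block takes 18 bytes for 32 weights, or 4.5 bits per weight. A kernel that unpacks the nibbles in the wrong order or applies the wrong offset would still produce output, only garbled, so comparing against a CPU round-trip like this one is a cheap sanity check.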

Awkwardly, the q4_0 gave me <|im_end|> at the beginning of the response, so it confuses my app, but I don't think that is a Candle problem. 🤗

I can try some more tests later.

Quantized responses are quick. ⚡

akhildevelops · 2024-02-27T05:28:07Z

I tried this hands-on; below are the results from cuda (GPU) and mkl (CPU) running a 4-bit quantized 7b Mistral model. The prompt was: Who created you ?

Cuda:

cargo run --example quantized --release --features cuda -- --which 7b-mistral-instruct-v0.2 --prompt "<s>[INST]Who created you ?[/INST]" --seed 87654
    Finished release [optimized] target(s) in 0.21s
     Running `target/release/examples/quantized --which 7b-mistral-instruct-v0.2 --prompt '<s>[INST]Who created you ?[/INST]' --seed 87654`
avx: true, neon: false, simd128: false, f16c: true
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
loaded 291 tensors (4.14GB) in 0.15s
model built
<s>[INST]Who created you ?[/INST]Additional Information:

**Company Name:** Acurify Solutions Corp.

**Address:** 123 Main Street, Suite 400, Anytown, USA

**Email:** [email protected]

**Phone:** 555-555-5555

**Year Founded:** 2018

**Website:** www.acurifysol.com

Acurify Solutions Corp. is a technology consulting firm that was founded in 2018 by a team of experienced professionals with a passion for delivering exceptional results to their clients. Our mission is to help businesses improve their operations and increase their efficiency through the innovative use of technology solutions. We specialize in cloud computing, data analytics, and software development services. Our team of experts have extensive experience working with leading technologies such as Microsoft Azure, Amazon Web Services (AWS), and Google Cloud Platform (GCP).

At Acurify Solutions Corp., we believe that every business is unique, which is why we offer customized solutions tailored to our clients' specific needs. We take a collaborative approach to working with our clients, ensuring that we fully understand their goals and challenges before providing recommendations and implementing solutions. Our team of experts will work closely with your team to ensure a seamless transition and maximum value from your technology investment.

Contact us today to learn more about how Acurify Solutions Corp. can help your business achieve its full potential through innovative technology solutions.

**Services:**

- Cloud Computing Consulting & Migration
- Data Analytics & Business Intelligence
- Software Development
  - Web Applications
  - Mobile Applications
  - Custom Applications
- Technical Support & Managed Services

**Technologies:**

- Microsoft Azure: A comprehensive set of cloud services that allows businesses to build, deploy, and manage applications through the Microsoft Azure platform.
- Amazon Web Services (AWS): A secure cloud services platform offering computing power, storage, databases, and various functionalities required for modern application development.
- Google Cloud Platform (GCP): A suite of cloud computing services that allows businesses to create, deploy, and manage applications, websites, and services in a flexible infrastructure.

  13 prompt tokens processed: 19.33 token/s
 482 tokens generated: 20.24 token/s

CPU:

cargo run --example quantized --release --features mkl -- --which 7b-mistral-instruct-v0.2 --prompt "<s>[INST]Who created you ?[/INST]" --seed 87654    
    Finished release [optimized] target(s) in 0.17s
     Running `target/release/examples/quantized --which 7b-mistral-instruct-v0.2 --prompt '<s>[INST]Who created you ?[/INST]' --seed 87654`
avx: true, neon: false, simd128: false, f16c: true
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
Running on CPU, to run on GPU, build this example with `--features cuda`
loaded 291 tensors (4.14GB) in 0.04s
model built
<s>[INST]Who created you ?[/INST]I was created by Mistral AI. I'm an artificial intelligence language model designed to generate human-like text based on the data I was trained on. I don't have a physical form or the ability to create other entities. I exist purely as a program running on computer servers.

  13 prompt tokens processed: 7.76 token/s
  59 tokens generated: 5.83 token/s

The CPU response was good, but the Cuda run generated a poor, off-topic response.
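To put the two logs side by side, here is a small sketch that computes the throughput ratio from the token/s figures printed above. The helper `speedup` is a hypothetical name introduced only for this illustration. It measures speed alone and says nothing about output quality, which is the actual problem in this report.

```rust
/// Ratio of GPU throughput to CPU throughput, both in tokens per second.
fn speedup(gpu_tok_s: f64, cpu_tok_s: f64) -> f64 {
    gpu_tok_s / cpu_tok_s
}

fn main() {
    // Figures copied from the two runs in this comment.
    let prompt = speedup(19.33, 7.76);
    let generation = speedup(20.24, 5.83);
    println!("prompt processing speedup: {prompt:.2}x");
    println!("generation speedup: {generation:.2}x");
}
```

The Cuda run is roughly 2.5x faster on the prompt and 3.5x faster on generation, but it produced 482 tokens of unrelated text where the CPU run produced 59 relevant ones.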

LaurentMazare · 2024-02-27T05:51:15Z

Thanks for reporting this. I've created a separate issue to track it: #1765

LaurentMazare added 18 commits February 24, 2024 18:08

Boilerplate for the quantized cuda support.

312251c

More basic cuda support.

e53266b

More cuda quantization (quantize on cpu for now).

c5ec1e2

Add the dequantization bit.

5172087

Start adding some dedicated cuda kernels from llama.cpp.

9199f2a

Move the kernel code.

f530d0c

Start interfacing with the kernel.

b4b8a6d

Tweak the kernel launch params.

c33913e

Bugfix for quantized metal.

e1b15cb

Fix some clippy lints.

3539038

Tweak the launch parameters.

ae66428

Tweak cuda basics to perform a quantized matmul.

c9e62ae

Perform the dequantization on the cpu + use cublas for matmul.

808be2a

Add the dequantization kernel.

c42128a

Test the qmatmul.

17e57a0

More kernels.

6e59016

Matmul-vec kernel.

ff93134

Add a couple kernels.

1a0c75d

LaurentMazare changed the title ~~Boilerplate for the quantized cuda support.~~ Cuda acceleration for quantized model. Feb 25, 2024

LaurentMazare mentioned this pull request Feb 25, 2024

Quantized models on Cuda #1250

Open

More dequantization kernels.

dc59051

LaurentMazare merged commit 2f22afd into main Feb 25, 2024
10 checks passed

LaurentMazare deleted the quantized-cuda branch February 25, 2024 17:11

This was referenced Feb 25, 2024

Support for quantisation #359

Closed

Error: no cuda implementation for qmatmul #696

Closed

CUDA support for QMatMul #655

Closed

from_gguf only supports CPU? #1486

Closed

andychenbruce mentioned this pull request Feb 27, 2024

Possibility a good idea to replace llama.cpp with candle to run quantized models? sobelio/llm-chain#276

Open

LaurentMazare mentioned this pull request Feb 27, 2024

Poor generation when using quantised models on cuda #1765

Closed
