Cuda acceleration for quantized model. #1754

Merged
merged 19 commits into from
Feb 25, 2024

Conversation

LaurentMazare
Collaborator

@LaurentMazare LaurentMazare commented Feb 24, 2024

This is a first set of changes to enable quantized tensor handling with cuda.
On my Ryzen 2600X with an RTX 2800, inference runs at ~5 tokens/s without cuda and ~20 tokens/s with cuda, using the default quantized example setup (this uses Q4_0).
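
For readers unfamiliar with Q4_0, here is a minimal CPU-side sketch of the block layout as used by ggml/GGUF: 32 weights share one scale, and each weight is stored as a 4-bit value offset by 8. This is not the code from this PR or Candle's API; the names `BlockQ4_0`, `quantize_block` and `dequantize_block` are hypothetical, and the scale is kept as `f32` (the real format stores it as f16) so the sketch needs no extra crates.

```rust
/// Number of weights per Q4_0 block in the ggml/GGUF layout.
const QK4_0: usize = 32;

/// Simplified Q4_0 block (hypothetical type for illustration).
/// Assumption: the scale is f32 here; the on-disk format uses f16.
struct BlockQ4_0 {
    d: f32,               // per-block scale
    qs: [u8; QK4_0 / 2],  // two 4-bit values per byte
}

/// Quantize 32 floats into one block.
fn quantize_block(xs: &[f32; QK4_0]) -> BlockQ4_0 {
    // Find the value with the largest magnitude, keeping its sign.
    let mut amax = 0f32;
    let mut max = 0f32;
    for &x in xs.iter() {
        if x.abs() > amax {
            amax = x.abs();
            max = x;
        }
    }
    // The largest-magnitude value maps to the 4-bit code 0 (i.e. -8 after the offset).
    let d = max / -8.0;
    let id = if d != 0.0 { 1.0 / d } else { 0.0 };
    let mut qs = [0u8; QK4_0 / 2];
    for j in 0..QK4_0 / 2 {
        // Low nibble holds element j, high nibble holds element j + 16.
        let q0 = ((xs[j] * id + 8.5) as u8).min(15);
        let q1 = ((xs[j + QK4_0 / 2] * id + 8.5) as u8).min(15);
        qs[j] = q0 | (q1 << 4);
    }
    BlockQ4_0 { d, qs }
}

/// Expand one block back into 32 floats.
fn dequantize_block(b: &BlockQ4_0) -> [f32; QK4_0] {
    let mut out = [0f32; QK4_0];
    for j in 0..QK4_0 / 2 {
        let lo = (b.qs[j] & 0x0F) as i32 - 8;
        let hi = (b.qs[j] >> 4) as i32 - 8;
        out[j] = lo as f32 * b.d;
        out[j + QK4_0 / 2] = hi as f32 * b.d;
    }
    out
}
```

A GPU kernel would typically do the per-block expansion on the device, often fused with the matrix multiply, so the weights stay in their compact 4-bit form in memory.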

@LaurentMazare LaurentMazare changed the title Boilerplate for the quantized cuda support. Cuda acceleration for quantized model. Feb 25, 2024
@LaurentMazare LaurentMazare merged commit 2f22afd into main Feb 25, 2024
10 checks passed
@LaurentMazare LaurentMazare deleted the quantized-cuda branch February 25, 2024 17:11
@danielclough
Contributor

On RTX 4070Ti:

Running a Mistral model that gives good results with .safetensors.
I have several models with all the quantized formats in my hf account if you want to use any for testing.
They were built with the Candle tensor-tools example, ofc.

I got a garbled mess for q2k, a nice error for q8k (not supported yet), and a good response for q4_0.

Awkwardly, the q4_0 gave me <|im_end|> at the beginning of the response, so it confuses my app, but I don't think that is a Candle problem. 🤗

I can try some more tests later.

Quantized responses are quick. ⚡

@akhildevelops

I tried this hands-on; below are the results from cuda (gpu) and mkl (cpu) running a 4-bit quantized 7b mistral model. The prompt was: Who created you ?

Cuda:

cargo run --example quantized --release --features cuda -- --which 7b-mistral-instruct-v0.2 --prompt "<s>[INST]Who created you ?[/INST]" --seed 87654
    Finished release [optimized] target(s) in 0.21s
     Running `target/release/examples/quantized --which 7b-mistral-instruct-v0.2 --prompt '<s>[INST]Who created you ?[/INST]' --seed 87654`
avx: true, neon: false, simd128: false, f16c: true
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
loaded 291 tensors (4.14GB) in 0.15s
model built
<s>[INST]Who created you ?[/INST]Additional Information:

**Company Name:** Acurify Solutions Corp.

**Address:** 123 Main Street, Suite 400, Anytown, USA

**Email:** [email protected]

**Phone:** 555-555-5555

**Year Founded:** 2018

**Website:** www.acurifysol.com

Acurify Solutions Corp. is a technology consulting firm that was founded in 2018 by a team of experienced professionals with a passion for delivering exceptional results to their clients. Our mission is to help businesses improve their operations and increase their efficiency through the innovative use of technology solutions. We specialize in cloud computing, data analytics, and software development services. Our team of experts have extensive experience working with leading technologies such as Microsoft Azure, Amazon Web Services (AWS), and Google Cloud Platform (GCP).

At Acurify Solutions Corp., we believe that every business is unique, which is why we offer customized solutions tailored to our clients' specific needs. We take a collaborative approach to working with our clients, ensuring that we fully understand their goals and challenges before providing recommendations and implementing solutions. Our team of experts will work closely with your team to ensure a seamless transition and maximum value from your technology investment.

Contact us today to learn more about how Acurify Solutions Corp. can help your business achieve its full potential through innovative technology solutions.

**Services:**

- Cloud Computing Consulting & Migration
- Data Analytics & Business Intelligence
- Software Development
  - Web Applications
  - Mobile Applications
  - Custom Applications
- Technical Support & Managed Services

**Technologies:**

- Microsoft Azure: A comprehensive set of cloud services that allows businesses to build, deploy, and manage applications through the Microsoft Azure platform.
- Amazon Web Services (AWS): A secure cloud services platform offering computing power, storage, databases, and various functionalities required for modern application development.
- Google Cloud Platform (GCP): A suite of cloud computing services that allows businesses to create, deploy, and manage applications, websites, and services in a flexible infrastructure.

  13 prompt tokens processed: 19.33 token/s
 482 tokens generated: 20.24 token/s

CPU:

cargo run --example quantized --release --features mkl -- --which 7b-mistral-instruct-v0.2 --prompt "<s>[INST]Who created you ?[/INST]" --seed 87654    
    Finished release [optimized] target(s) in 0.17s
     Running `target/release/examples/quantized --which 7b-mistral-instruct-v0.2 --prompt '<s>[INST]Who created you ?[/INST]' --seed 87654`
avx: true, neon: false, simd128: false, f16c: true
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
Running on CPU, to run on GPU, build this example with `--features cuda`
loaded 291 tensors (4.14GB) in 0.04s
model built
<s>[INST]Who created you ?[/INST]I was created by Mistral AI. I'm an artificial intelligence language model designed to generate human-like text based on the data I was trained on. I don't have a physical form or the ability to create other entities. I exist purely as a program running on computer servers.

  13 prompt tokens processed: 7.76 token/s
  59 tokens generated: 5.83 token/s

The CPU response was good, but the Cuda response was poor.
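
To put numbers on the two runs above, here is a small sketch that parses the throughput lines printed by the example and computes the generation speedup. The helper names `parse_throughput` and `speedup` are made up for illustration and are not part of the example binary. Note that this only compares speed: the two runs produced different text (482 vs 59 tokens) despite the same seed, so it says nothing about output quality.

```rust
/// Parses a line such as " 482 tokens generated: 20.24 token/s" and
/// returns (token count, tokens per second), or None if the line has
/// a different shape.
fn parse_throughput(line: &str) -> Option<(u32, f64)> {
    // Split "482 tokens generated" from " 20.24 token/s".
    let (left, right) = line.trim().split_once(':')?;
    // The count is the first whitespace-separated word on the left.
    let count = left.split_whitespace().next()?.parse().ok()?;
    // The rate is the number before the "token/s" suffix on the right.
    let rate = right.trim().strip_suffix("token/s")?.trim().parse().ok()?;
    Some((count, rate))
}

/// Ratio of two throughputs, e.g. cuda tokens/s over cpu tokens/s.
fn speedup(fast: f64, slow: f64) -> f64 {
    fast / slow
}
```

With the generation rates reported above (20.24 token/s on cuda, 5.83 token/s on cpu), `speedup` gives roughly 3.5x.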

@LaurentMazare
Collaborator Author

Thanks for reporting this, I've created a separate issue to track it. #1765
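
For anyone wanting to narrow down this kind of discrepancy, one common approach is to run the same operation on both backends, copy the results back to the host, and compare them numerically. Below is a minimal sketch of the comparison step only, in plain Rust; `max_abs_diff` is a hypothetical helper, not a Candle function, and how the two buffers are obtained is left out.

```rust
/// Largest absolute element-wise difference between two buffers,
/// or None if their lengths differ.
fn max_abs_diff(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    Some(
        a.iter()
            .zip(b)
            .map(|(x, y)| (x - y).abs())
            .fold(0.0, f32::max),
    )
}
```

Small differences are expected from floating-point reordering; a large value for a given op or quantization format would point at the kernel to inspect.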

3 participants