Cuda acceleration for quantized model. #1754
On an RTX 4070 Ti, running a Mistral model that gives good results with .safetensors: I got a garbled mess for q2k, a nice error for q8k (not supported yet), and a good response for q4_0. Awkwardly, the q4_0 gave me I can try some more tests later. Quantized responses are quick. ⚡
I tried this hands-on. Below are the results from cuda (GPU) and mkl (CPU), running a 4-bit quantized 7B Mistral model with the same prompt and seed in both cases.
Cuda:
cargo run --example quantized --release --features cuda -- --which 7b-mistral-instruct-v0.2 --prompt "<s>[INST]Who created you ?[/INST]" --seed 87654
Finished release [optimized] target(s) in 0.21s
Running `target/release/examples/quantized --which 7b-mistral-instruct-v0.2 --prompt '<s>[INST]Who created you ?[/INST]' --seed 87654`
avx: true, neon: false, simd128: false, f16c: true
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
loaded 291 tensors (4.14GB) in 0.15s
model built
<s>[INST]Who created you ?[/INST]Additional Information:
**Company Name:** Acurify Solutions Corp.
**Address:** 123 Main Street, Suite 400, Anytown, USA
**Email:** [email protected]
**Phone:** 555-555-5555
**Year Founded:** 2018
**Website:** www.acurifysol.com
Acurify Solutions Corp. is a technology consulting firm that was founded in 2018 by a team of experienced professionals with a passion for delivering exceptional results to their clients. Our mission is to help businesses improve their operations and increase their efficiency through the innovative use of technology solutions. We specialize in cloud computing, data analytics, and software development services. Our team of experts have extensive experience working with leading technologies such as Microsoft Azure, Amazon Web Services (AWS), and Google Cloud Platform (GCP).
At Acurify Solutions Corp., we believe that every business is unique, which is why we offer customized solutions tailored to our clients' specific needs. We take a collaborative approach to working with our clients, ensuring that we fully understand their goals and challenges before providing recommendations and implementing solutions. Our team of experts will work closely with your team to ensure a seamless transition and maximum value from your technology investment.
Contact us today to learn more about how Acurify Solutions Corp. can help your business achieve its full potential through innovative technology solutions.
**Services:**
- Cloud Computing Consulting & Migration
- Data Analytics & Business Intelligence
- Software Development
- Web Applications
- Mobile Applications
- Custom Applications
- Technical Support & Managed Services
**Technologies:**
- Microsoft Azure: A comprehensive set of cloud services that allows businesses to build, deploy, and manage applications through the Microsoft Azure platform.
- Amazon Web Services (AWS): A secure cloud services platform offering computing power, storage, databases, and various functionalities required for modern application development.
- Google Cloud Platform (GCP): A suite of cloud computing services that allows businesses to create, deploy, and manage applications, websites, and services in a flexible infrastructure.
13 prompt tokens processed: 19.33 token/s
482 tokens generated: 20.24 token/s
CPU:
cargo run --example quantized --release --features mkl -- --which 7b-mistral-instruct-v0.2 --prompt "<s>[INST]Who created you ?[/INST]" --seed 87654
Finished release [optimized] target(s) in 0.17s
Running `target/release/examples/quantized --which 7b-mistral-instruct-v0.2 --prompt '<s>[INST]Who created you ?[/INST]' --seed 87654`
avx: true, neon: false, simd128: false, f16c: true
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
Running on CPU, to run on GPU, build this example with `--features cuda`
loaded 291 tensors (4.14GB) in 0.04s
model built
<s>[INST]Who created you ?[/INST]I was created by Mistral AI. I'm an artificial intelligence language model designed to generate human-like text based on the data I was trained on. I don't have a physical form or the ability to create other entities. I exist purely as a program running on computer servers.
13 prompt tokens processed: 7.76 token/s
59 tokens generated: 5.83 token/s
The CPU response was good, but the Cuda response was poor.
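To put the speed side of the two runs in one place, here is a small sketch that turns the reported throughput into a speedup ratio. The helper name `speedup` is hypothetical and not part of candle; the figures are copied from the two logs above.

```rust
/// Ratio between two throughputs, both in tokens per second.
/// `speedup` is a hypothetical helper for illustration only.
fn speedup(fast_tok_per_s: f64, slow_tok_per_s: f64) -> f64 {
    fast_tok_per_s / slow_tok_per_s
}

/// Throughput figures copied from the logs above: (prompt, generation).
const CUDA_TOK_PER_S: (f64, f64) = (19.33, 20.24);
const CPU_TOK_PER_S: (f64, f64) = (7.76, 5.83);

/// Returns (prompt speedup, generation speedup) of the cuda run over the mkl run.
fn cuda_over_cpu() -> (f64, f64) {
    (
        speedup(CUDA_TOK_PER_S.0, CPU_TOK_PER_S.0),
        speedup(CUDA_TOK_PER_S.1, CPU_TOK_PER_S.1),
    )
}
```

With these numbers, `cuda_over_cpu()` comes out at roughly 2.49x for prompt processing and 3.47x for generation. This only measures speed; it says nothing about the quality difference between the two responses.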
Thanks for reporting this. I've created a separate issue to track it: #1765
This is a first set of changes to enable quantized tensor handling with cuda.
On my Ryzen 2600X with an RTX 2800, the inference speed is ~5 tokens/s without cuda and ~20 tokens/s with cuda, using the default quantized example setup (this uses Q4_0).
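For readers unfamiliar with Q4_0, here is a minimal CPU-side sketch of how one block is dequantized under the GGML layout: 32 weights share one scale, each weight is a 4-bit value stored with an offset of 8, the low nibbles hold the first 16 weights and the high nibbles hold the last 16. This is not the cuda kernel from this PR. The names `Q4Block` and `dequantize_block` are hypothetical, and the scale is held as `f32` here (GGML stores it as f16) to keep the example free of external crates.

```rust
/// Number of weights in one Q4_0 block (GGML layout).
const QK4_0: usize = 32;

/// Simplified Q4_0 block. Hypothetical type for illustration:
/// GGML stores the scale `d` as f16, this sketch assumes f32.
struct Q4Block {
    /// Per-block scale factor.
    d: f32,
    /// 32 four-bit quants packed two per byte.
    qs: [u8; QK4_0 / 2],
}

/// Expands one block back to 32 floats: value = (nibble - 8) * d.
fn dequantize_block(block: &Q4Block) -> [f32; QK4_0] {
    let mut out = [0.0f32; QK4_0];
    for (j, &byte) in block.qs.iter().enumerate() {
        // Low nibble holds weight j, high nibble holds weight j + 16.
        let lo = (byte & 0x0F) as i32 - 8;
        let hi = (byte >> 4) as i32 - 8;
        out[j] = lo as f32 * block.d;
        out[j + QK4_0 / 2] = hi as f32 * block.d;
    }
    out
}
```

Each weight can only take one of 16 levels per block, which is why the quantized file is much smaller than the .safetensors weights and why small kernel bugs can show up as degraded text rather than as an error.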