Thanks for your amazing work.
I quantized 'mistralai/Mixtral-8x7B-Instruct-v0.1' using #352, and the quality of the generated text is good, but I get the same tokens/s as with the "casperhansen/mixtral-instruct-awq" model. I was expecting a performance improvement.
For reference (test setup: 4x RTX 3090 Ti, single request, close to 32k tokens, summarization):
casperhansen/mixtral-instruct-awq: ~33 t/s
mixtral-instruct-awq-fused: ~33 t/s
This is the quantization script I'm using; am I missing something?
import time

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

start = time.time()

model_path = 'mistralai/Mixtral-8x7B-Instruct-v0.1'
quant_path = 'mixtral-instruct-awq'
modules_to_not_convert = ["gate"]
quant_config = {
    "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM",
    "modules_to_not_convert": modules_to_not_convert
}

# Load model
# NOTE: pass safetensors=True to load safetensors
model = AutoAWQForCausalLM.from_pretrained(
    model_path, safetensors=True, **{"low_cpu_mem_usage": True}
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')
print("Time to quantize: {} s".format(time.time() - start))
@casper-hansen Thanks for the quick response.
I'm deploying with vLLM; do you know if the performance improvement from the fused modules should also be reflected in vLLM? (My vLLM launch is sketched below, after the environment info.)
Env:
CUDA: 12.2
PyTorch: 2.2.0
transformers: 4.38.0
accelerate: 0.27.2
autoawq: 0.2.2
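For context, this is roughly how I'm launching the model in vLLM (a minimal sketch; the local model path, max_model_len, tensor_parallel_size, and sampling settings mirror my 4x RTX 3090 Ti summarization test and are placeholders):

from vllm import LLM, SamplingParams

# Load the AWQ checkpoint produced above and shard it across the 4 GPUs
llm = LLM(
    model="mixtral-instruct-awq",  # local path to the quantized model
    quantization="awq",
    tensor_parallel_size=4,
    max_model_len=32768,
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=512)

# Single-request summarization benchmark (prompt is a placeholder)
outputs = llm.generate(["Summarize the following document: ..."], sampling_params)
print(outputs[0].outputs[0].text)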