Thanks for your amazing work.
I quantized 'mistralai/Mixtral-8x7B-Instruct-v0.1' using #352, and the quality of the generated text is good, but I get the same tokens/s as with the "casperhansen/mixtral-instruct-awq" model. I was expecting a performance improvement.
For reference (test setup: 4x RTX 3090 Ti, single request, close to 32k tokens, summarization):
casperhansen/mixtral-instruct-awq: ~33 t/s
mixtral-instruct-awq-fused: ~33 t/s
This is the quantization script I'm using; am I missing something?
import time

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

start = time.time()

model_path = 'mistralai/Mixtral-8x7B-Instruct-v0.1'
quant_path = 'mixtral-instruct-awq'
modules_to_not_convert = ["gate"]
quant_config = {
    "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM",
    "modules_to_not_convert": modules_to_not_convert
}

# Load model
# NOTE: pass safetensors=True to load safetensors
model = AutoAWQForCausalLM.from_pretrained(
    model_path, safetensors=True, **{"low_cpu_mem_usage": True}
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')
print("Time to quantize: {} s".format(time.time() - start))
@casper-hansen Thanks for the quick response.
I'm deploying with vLLM; do you know if the performance improvement from the fused modules should also be reflected in vLLM? (My vLLM launch is sketched below, after the environment info.)
Env:
CUDA: 12.2
PyTorch: 2.2.0
transformers: 4.38.0
accelerate: 0.27.2
autoawq: 0.2.2
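For context, this is roughly how I'm launching the model in vLLM (a minimal sketch; the local model path, max_model_len, tensor_parallel_size, and sampling settings mirror my 4x RTX 3090 Ti summarization test and are placeholders):

from vllm import LLM, SamplingParams

# Load the AWQ checkpoint produced above and shard it across the 4 GPUs
llm = LLM(
    model="mixtral-instruct-awq",  # local path to the quantized model
    quantization="awq",
    tensor_parallel_size=4,
    max_model_len=32768,
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=512)

# Single-request summarization benchmark (prompt is a placeholder)
outputs = llm.generate(["Summarize the following document: ..."], sampling_params)
print(outputs[0].outputs[0].text)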