add onnxslim integration #811
Conversation
Perfect! Can you also add onnxslim to scripts/requirements.txt (the version with support for subgraphs + weight tying)? |
Hi @xenova, the onnxslim version has been fixed. |
Hi @xenova, do you have test scripts for model accuracy and speed? I have a GPU server available for tests. |
Great! I'm testing out the PR now, with a bunch of mobilenet models. Looks like it's working great!
Huge improvements!
I have a bunch of unorganized Colab notebooks spread around, but nothing official or release-ready. I would absolutely love to consolidate everything into a single evaluation script, so it can be used for evaluating different quantization settings too. Is this something you (or another community member) would be interested in developing? |
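As a starting point for such an evaluation script, here is a minimal latency-benchmark sketch using onnxruntime; it is only an illustration, and the model path, input names, and shapes are placeholders that would have to be filled in per model.

import time
import numpy as np
import onnxruntime as ort

def benchmark(model_path, feeds, warmup=5, runs=50):
    # Average the wall-clock time of repeated session.run() calls on fixed inputs.
    sess = ort.InferenceSession(model_path)
    for _ in range(warmup):
        sess.run(None, feeds)
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, feeds)
    return (time.perf_counter() - start) / runs

# Hypothetical usage for an image model with a 224x224 input:
# feeds = {'pixel_values': np.zeros((1, 3, 224, 224), dtype=np.float32)}
# print(benchmark('model.onnx', feeds), 'seconds per run')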
Can I work with you? |
To improve the conversion script and evaluation? I would love that! I'm currently working on some other things (like Florence2 support), but feel free to submit a PR and I can review it 😎 |
Merged the PR! Thanks so much! 🔥 |
I'm now writing scripts to test all the models on Hugging Face under the xenova namespace. |
By the way, version 0.1.29.1 is unstable because of the bug fixed in inisis/OnnxSlim#10; I would recommend the latest version, 0.1.31. |
I've updated the version to 0.1.31 🤗👍 |
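For reference, slimming can also be invoked from Python inside a conversion script rather than via the CLI. A minimal sketch, assuming onnxslim's `slim` helper returns the optimized model (the file names here are placeholders):

import onnx
import onnxslim

# Load and slim the model, then save the optimized graph back to disk.
slimmed = onnxslim.slim('decoder_model.onnx')
onnx.save(slimmed, 'decoder_model_slimmed.onnx')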
@inisis I'm running into a few issues when quantizing models produced by onnxslim. Here's an example model:

Quantization code:

import onnx
from onnxruntime.quantization import quantize_dynamic, QuantType

model_fp32 = './decoder_with_past_model.onnx'
model_quant = './decoder_with_past_model_quantized.onnx'

quantized_model = quantize_dynamic(
    model_input=model_fp32,
    model_output=model_quant,
    weight_type=QuantType.QInt8,
    extra_options={'EnableSubgraph': True},
    per_channel=False,
    reduce_range=False,
)

This works, but if you run onnxslim on the model, followed by:

import onnx
from onnxruntime.quantization import quantize_dynamic, QuantType

model_fp32 = './onnx/decoder_with_past_model_slimmed.onnx'
model_quant = './onnx/decoder_with_past_model_slimmed_quantized.onnx'

quantized_model = quantize_dynamic(
    model_input=model_fp32,
    model_output=model_quant,
    weight_type=QuantType.QInt8,
    extra_options={'EnableSubgraph': True},
    per_channel=False,
    reduce_range=False,
)

it produces the following error:
Can you look into this? 🙏 |
@xenova sorry for the bug, you can try
By the way, can I add you on LinkedIn? |
Thanks for the quick fix! I can confirm it works. 👍 I will update to the latest version in requirements.txt when you release it 👌
Sure, feel free to send a request! :) I also ran into an issue with https://github.com/snakers4/silero-vad/blob/master/files/silero_vad.onnx
|
@xenova bug fixed. I swear I have never seen an onnx model with this many subgraphs; thanks for reporting this! |
Thanks! The model does export correctly, but now it produces:

Example code:

import onnxruntime as ort
import numpy as np

# Dummy inputs matching the model's input signature.
batch_size = 2
input = np.zeros((batch_size, 256), dtype=np.float32)
sr = np.array(16000)
state = np.zeros((2, batch_size, 128), dtype=np.float32)

ort_sess = ort.InferenceSession('model.onnx')
outputs = ort_sess.run(None, {'input': input, 'sr': sr, 'state': state})

# Print result
outputs |
Sorry for the bug, I have fixed it; thanks for your patience and help. |
Thanks so much and no worries! 🤗 I've noticed another issue with quantization: quantizing the model after slimming barely shrinks it, whereas quantizing the original model gives a much smaller file. Steps to reproduce:
wget https://huggingface.co/onnx-community/whisper-tiny.en_timestamped/resolve/d4469fcf29fc2898f0d57632d811fa0ed21de5cc/onnx/decoder_model_merged.onnx
onnxslim decoder_model_merged.onnx decoder_model_merged_slimmed.onnx
import onnx
from onnxruntime.quantization import quantize_dynamic, QuantType

model_fp32 = './decoder_model_merged_slimmed.onnx'
model_quant = './decoder_model_merged_slimmed_quantized.onnx'

quantized_model = quantize_dynamic(
    model_input=model_fp32,
    model_output=model_quant,
    weight_type=QuantType.QInt8,
    extra_options={'EnableSubgraph': True},
    per_channel=False,
    reduce_range=False,
)
$ ls -l
-rw-r--r-- 1 root root 118606545 Jul 2 14:01 decoder_model_merged.onnx
-rw-r--r-- 1 root root 118662672 Jul 2 14:13 decoder_model_merged_slimmed.onnx
-rw-r--r-- 1 root root 110356751 Jul 2 14:14 decoder_model_merged_slimmed_quantized.onnx

However, if you were to run quantization on the original model, you'd get a much smaller output:

import onnx
from onnxruntime.quantization import quantize_dynamic, QuantType

model_fp32 = './decoder_model_merged.onnx'
model_quant = './decoder_model_merged_quantized.onnx'

quantized_model = quantize_dynamic(
    model_input=model_fp32,
    model_output=model_quant,
    weight_type=QuantType.QInt8,
    extra_options={'EnableSubgraph': True},
    per_channel=False,
    reduce_range=False,
)
Any idea what's going wrong? Thanks! |
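One way to see where the extra size comes from, and to test the tied-weight explanation given a few messages below, is to compare how much initializer data each graph stores and how much of it is byte-for-byte duplicated. A rough sketch using only the standard onnx Python API (top-level graph only; file names taken from the steps above):

import onnx
from collections import Counter

def initializer_stats(path):
    # Total raw tensor bytes in the top-level graph, plus how many of those
    # bytes are exact duplicates of another initializer (untied weights).
    model = onnx.load(path)
    inits = list(model.graph.initializer)
    total = sum(len(t.raw_data) for t in inits)
    counts = Counter(t.raw_data for t in inits if t.raw_data)
    duplicated = sum(len(data) * (n - 1) for data, n in counts.items() if n > 1)
    return len(inits), total, duplicated

for path in ['decoder_model_merged.onnx', 'decoder_model_merged_slimmed.onnx']:
    n, total, dup = initializer_stats(path)
    print(f'{path}: {n} initializers, {total / 1e6:.1f} MB total, {dup / 1e6:.1f} MB duplicated')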
And another weird one: this model becomes empty?

wget https://huggingface.co/onnx-community/whisper-tiny_timestamped/resolve/ae48508b4bc9b594a3a84d21f4a365a29d8d66ad/onnx/decoder_model_merged_fp16.onnx

followed by:

onnxslim decoder_model_merged_fp16.onnx decoder_model_merged_fp16_slimmed.onnx

produces:
|
It seems this model is invalid; you can check it here
|
This seems a bit complicated. Can you help check the output correctness of the raw float model and the slimmed model?
I also think the reason the quantized slimmed model gets larger is that it has tied weights that are no longer tied after slimming. |
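For the correctness check, a generic comparison harness along these lines could work. It is only a sketch: dynamic axes are filled with a small placeholder size, inputs are random, and special inputs (such as the merged decoder's boolean cache-branch flag) may need to be set by hand.

import numpy as np
import onnxruntime as ort

DTYPES = {'tensor(float)': np.float32, 'tensor(float16)': np.float16,
          'tensor(int64)': np.int64, 'tensor(int32)': np.int32,
          'tensor(bool)': np.bool_}

def random_feeds(sess, dyn=2, seed=0):
    # Build random inputs from the session's input metadata,
    # replacing symbolic/dynamic dimensions with a small fixed size.
    rng = np.random.default_rng(seed)
    feeds = {}
    for inp in sess.get_inputs():
        shape = [d if isinstance(d, int) else dyn for d in inp.shape]
        dtype = DTYPES[inp.type]
        if dtype == np.bool_:
            feeds[inp.name] = np.zeros(shape, dtype=dtype)
        elif np.issubdtype(dtype, np.integer):
            feeds[inp.name] = rng.integers(0, 10, size=shape, dtype=dtype)
        else:
            feeds[inp.name] = rng.standard_normal(shape).astype(dtype)
    return feeds

ref = ort.InferenceSession('decoder_model_merged.onnx')
slim = ort.InferenceSession('decoder_model_merged_slimmed.onnx')
feeds = random_feeds(ref)
for a, b in zip(ref.run(None, feeds), slim.run(None, feeds)):
    print('max abs diff:', np.abs(a.astype(np.float32) - b.astype(np.float32)).max())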
I have raised an issue here: microsoft/onnxruntime#21277. They suggest running the quantization pre-processing step to resolve this issue. |
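If it helps, the pre-processing they refer to is exposed by onnxruntime as a helper that runs graph optimization and shape inference before quantization. A minimal sketch with placeholder file names; the exact keyword arguments vary a little between onnxruntime versions, so treat this as an outline rather than a drop-in recipe:

from onnxruntime.quantization.shape_inference import quant_pre_process
from onnxruntime.quantization import quantize_dynamic, QuantType

# Optimize and shape-infer the slimmed model before dynamic quantization.
quant_pre_process('decoder_model_merged_slimmed.onnx',
                  'decoder_model_merged_slimmed_preprocessed.onnx')

quantize_dynamic(
    model_input='decoder_model_merged_slimmed_preprocessed.onnx',
    model_output='decoder_model_merged_slimmed_quantized.onnx',
    weight_type=QuantType.QInt8,
)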
Thanks! 👍 I ran into another issue with this model:

produces an invalid model:

Running:

import onnx

onnx_model = onnx.load("slimmed.onnx")
onnx.checker.check_model(onnx_model)

produces
Any idea what the problem is? Thanks! |
@xenova you can try pip install git+https://github.com/inisis/OnnxSlim@main, I have tested it, and I will release a new version tonight |
Thanks - I did try that, but it's still the same issue 👀 The last commit was ~1 week ago, correct? |
|
Hi @xenova, I have created a repo called OnnxLLM, specializing in onnxruntime LLM inference. Currently supported models are llama3, qwen2, and chatglm3. I see that chatglm is not currently supported in your repo; can we work together to integrate it? Thanks! |
Hi @xenova, I have made a PR in huggingface/optimum#1744; can you help back me up? I think it's a very useful tool. |
Hi, as discussed in #797, onnxslim can reduce the number of operators, and some performance tests can be seen in huggingface/optimum#1744.
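To quantify that operator reduction for a particular model, a small sketch like the following can be used (file names are placeholders; subgraphs of If/Loop nodes are not traversed in this simple version):

from collections import Counter
import onnx

def count_ops(path):
    # Count node types in the top-level graph.
    model = onnx.load(path)
    return Counter(node.op_type for node in model.graph.node)

before = count_ops('decoder_model_merged.onnx')
after = count_ops('decoder_model_merged_slimmed.onnx')
print('total ops before/after:', sum(before.values()), sum(after.values()))
for op in sorted(set(before) | set(after)):
    if before[op] != after[op]:
        print(f'{op}: {before[op]} -> {after[op]}')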