Support directly loading a quantized model. #1727
Comments
@Maratyszcza what's the progress of ONNX support?
No ONNX quantization spec yet.
We have now started on this by supporting a quantized ResNet50: https://github.com/caffe2/models/tree/master/resnet50_quantized
The main difference between the above ops and the regular ops is that two args are added:
2) When a node is created with a quantized value, we need to pass the scale and offset explicitly.
Now one thing needs to be discussed here:
Thanks @beicy for going through the required op support.
Looking at some other backends, it seems that the majority (except Glow) rely on a 32-bit representation for the bias (see also one of Google's papers: https://arxiv.org/pdf/1712.05877.pdf). Int8 bias was introduced originally to simplify handling of the quantization procedure. While it's beneficial to unify the behavior and quantize the bias in 32 bits, I think it'll be an easier transition if we add support for conv with an int32 bias, keep int8 bias support in the interpreter/CPU backends, and then completely eliminate the int8 bias.
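For reference, the convention in the paper linked above (and in gemmlowp/TFLite) is to quantize the bias to int32 with a zero point of 0 and a scale equal to the product of the input scale and the weight scale. A rough sketch of that conversion, for illustration only (not Glow code):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

// Quantize a float bias vector to int32 using the gemmlowp/TFLite convention:
// bias_scale = input_scale * weight_scale, zero point fixed at 0.
// Illustrative sketch only.
std::vector<int32_t> quantizeBiasInt32(const std::vector<float> &bias,
                                       float inputScale, float weightScale) {
  const double biasScale = static_cast<double>(inputScale) * weightScale;
  std::vector<int32_t> out;
  out.reserve(bias.size());
  for (float b : bias) {
    double q = std::round(b / biasScale);
    q = std::min<double>(q, std::numeric_limits<int32_t>::max());
    q = std::max<double>(q, std::numeric_limits<int32_t>::min());
    out.push_back(static_cast<int32_t>(q));
  }
  return out;
}
```

Because the bias scale equals the product of the input and weight scales, the quantized bias can be added directly into the int32 accumulator of the 8-bit matrix multiply, which is what makes the 32-bit representation convenient.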
Thank you for the analysis!
I don't think that scale and offset need to be stored directly inside the node. Please look at what we do for
@artemrakhov Yes, actually I used the approach you mentioned to carry the scale and offset :) Sorry I didn't explain it clearly :P
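For illustration, a minimal sketch of the idea being discussed here: the quantization parameters travel with the tensor's type rather than being stored as separate fields on the node. The names below are hypothetical and not Glow's actual classes.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical quantized tensor type. The quantization parameters live in
// the type, so real_value = scale * (quantized_value - offset).
struct QuantizedType {
  std::vector<int64_t> dims;
  float scale;
  int32_t offset;
};

// A node created from a quantized operator only needs to reference its
// result type; the scale/offset read from the operator's arguments are
// baked into that type instead of being node members.
struct QuantizedNodeSketch {
  QuantizedType resultType;
};

QuantizedNodeSketch makeQuantizedNode(std::vector<int64_t> dims, float scale,
                                      int32_t offset) {
  return QuantizedNodeSketch{QuantizedType{std::move(dims), scale, offset}};
}
```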
The tflite app includes several pretrained quantized models. They use uint8 for the kernel weights and volume data, and signed int32 for the bias. It would probably be good to try to translate some of those models. They do use zero-offset and scale values: the zero offset is a uint8 and the scale is a float, with separate float scales for the input, output, and kernel. During execution, the downscale can be applied as a float, or it can be converted to a pair of integer multiply and right-shift values. Keep in mind that embedded targets may not have floating point, or may want to avoid it for low-power applications, so it might be useful to precalculate the integer multiply and right-shift values and store them in whatever the quantized persistence form ends up being. Also, although tflite/gemmlowp use this uint8 form with a zero offset, it requires some extra processing for the multiplies. The Intel format uses uint8 following activations, where the offset would be zero anyway. That may be a more efficient implementation, or at least a reasonable one versus dealing with the zero offsets everywhere. So it may be a good idea to consider supporting it as well.
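A hedged sketch of the precalculation mentioned above: turning a float downscale factor into an integer multiplier plus a right shift, similar in spirit to what gemmlowp does. This is a simplified illustration, not the gemmlowp implementation (which adds rounding and saturation handling).

```cpp
#include <cmath>
#include <cstdint>
#include <iostream>

// Convert a float downscale factor (assumed to be in (0, 1)) into a Q31
// fixed-point multiplier and a right-shift amount, so that approximately:
//   value * scale  ~=  (value * multiplier) >> (31 + rightShift)
void quantizeMultiplier(double scale, int32_t *multiplier, int *rightShift) {
  int exponent = 0;
  // frexp: scale = mantissa * 2^exponent, with mantissa in [0.5, 1).
  double mantissa = std::frexp(scale, &exponent);
  // Represent the mantissa as a Q31 fixed-point number.
  int64_t q = std::llround(mantissa * (1LL << 31));
  if (q == (1LL << 31)) { // mantissa rounded up to 1.0
    q /= 2;
    ++exponent;
  }
  *multiplier = static_cast<int32_t>(q);
  *rightShift = -exponent; // scale < 1 implies exponent <= 0
}

// Apply the downscale without floating point: multiply into 64 bits, then
// shift right by 31 (for the Q31 mantissa) plus the extra right shift.
int32_t downscale(int32_t acc, int32_t multiplier, int rightShift) {
  int64_t prod = static_cast<int64_t>(acc) * multiplier;
  return static_cast<int32_t>(prod >> (31 + rightShift));
}

int main() {
  int32_t multiplier = 0;
  int rightShift = 0;
  quantizeMultiplier(0.0123, &multiplier, &rightShift);
  // 100000 * 0.0123 ~= 1230
  std::cout << downscale(100000, multiplier, rightShift) << "\n";
}
```

On a target without floating point, the multiplier/shift pair could be precomputed offline and stored alongside the uint8 weights in the serialized model, so the runtime never needs a float path.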
Currently, the only way to run a quantized model is to do the profiling and then reload the model. We would like to support directly loading a quantized model. Since there is no quantized ONNX op support yet, we will start with the Caffe2 loader. More details will be updated here later.
@qcolombet