Support directly loading a quantized model. #1727

Closed
beicy opened this issue Sep 27, 2018 · 7 comments

@beicy
Contributor

beicy commented Sep 27, 2018

Currently, the only way to run a quantized model is to do the profiling and then reload the model. We would like to support directly loading a quantized model. Since there is no quantized ONNX op support yet, we will start with the Caffe2 loader. More details will be added here later.

@qcolombet

@beicy beicy self-assigned this Sep 27, 2018
@yinghai
Contributor

yinghai commented Sep 27, 2018

@Maratyszcza what's the progress on ONNX quantization support?

@Maratyszcza

no ONNX quantization spec yet

@beicy
Contributor Author

beicy commented Oct 9, 2018

We have now started by supporting the quantized ResNet50 model: https://github.com/caffe2/models/tree/master/resnet50_quantized
To support this model:
1) The following ops need to be added:

1. Int8Quantize
2. Int8Dequantize
3. Int8Conv
4. Int8ConvRelu
5. Int8MaxPool
6. Int8AveragePool
7. Int8FC
8. Int8SumRelu
9. Int8GivenIntTensorFill  -- This is int32 quantized data.
10. Int8GivenTensorFill     -- This is int8 quantized data.

The main difference between these ops and the regular ops is that two args are added (see the sketch after this list of changes):

arg {
  name: "Y_scale"
  f: 0.00044428
}
arg {
  name: "Y_zero_point"   # this one is the offset
  i: 128
}

2) When a node is created with a quantized value, we need to pass the scale and offset explicitly. To support the above new ops, createConstant, createMaxPool, createAvgPool, and createFullyConnected need to be extended to support passing a scale and offset.
3) The Node and Instr verification functions need to be updated to handle both quantized and non-quantized types.
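
For reference, here is a minimal sketch of what these two args mean, i.e. the affine mapping between quantized and real values. The uint8 storage type and the [0, 255] clamp range are assumptions that merely match the example values above (f: 0.00044428, i: 128); this is illustrative only, not loader code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Affine quantization: real = Y_scale * (q - Y_zero_point).
// uint8 storage with a zero point of 128 is an assumption based on the
// example args above; the actual storage type is defined by each op.
float dequantize(uint8_t q, float scale, int32_t zeroPoint) {
  return scale * (static_cast<int32_t>(q) - zeroPoint);
}

uint8_t quantize(float x, float scale, int32_t zeroPoint) {
  int32_t q = static_cast<int32_t>(std::lround(x / scale)) + zeroPoint;
  return static_cast<uint8_t>(std::min(255, std::max(0, q)));
}
```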

### One thing that needs to be discussed here:
So far, we quantize weights and biases to int8, and both the interpreter and the CPU backend support int8 bias. However, this quantized ResNet50 model quantizes weights to int8 but biases to int32. According to Haixin, this is because the partial sums of the matrix-matrix multiplication are accumulated in int32, so an int32 bias can be added directly to the int32 partial sum for better accuracy (an int8 bias caused an accuracy drop).
So, for our backends, do we need to support both int8 and int32 quantized bias, or only int32 quantized bias? Or any other ideas? Thanks!
@rdzhabarov @artemrakhov @opti-mix @nadavrot
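
To make the int32 accumulation argument above concrete, here is an illustrative sketch of the inner loop (zero-point handling and the final rescale are omitted; this is not actual backend code):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// int8 * int8 products are accumulated into an int32 partial sum, so an int32
// bias can be added to the accumulator directly with no extra precision loss;
// an int8 bias would first have to be requantized, which is where the
// accuracy drop comes from.
int32_t quantizedDotWithBias(const std::vector<int8_t> &x,
                             const std::vector<int8_t> &w, int32_t bias) {
  int32_t acc = bias;
  for (size_t i = 0; i < x.size(); ++i) {
    acc += static_cast<int32_t>(x[i]) * static_cast<int32_t>(w[i]);
  }
  return acc; // later rescaled to the output's scale/offset
}
```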

@rdzhabarov
Contributor

Thanks @beicy for going through the required op support.

> So, for our backends, do we need to support both int8 and int32 quantized bias, or only int32 quantized bias? Or any other ideas?

Looking at some other backends, it seems that the majority (except Glow) rely on a 32-bit representation for bias (see also one of Google's papers: https://arxiv.org/pdf/1712.05877.pdf).

Int8 bias was originally introduced to simplify the quantization procedure, but it would be beneficial to unify behavior and quantize bias to 32 bits. I think the easiest transition is to add support for conv with an int32 bias, keep int8 bias support in the interpreter/CPU backends for now, and then completely eliminate int8 bias.

@artemrakhov-glow
Contributor

Thank you for the analysis!

> When a node is created with a quantized value, we need to pass the scale and offset explicitly.

I don't think that scale and offset need to be stored directly inside the node. Please look at what we do for createFullyConnected(name, input, W, B, outTy): outTy can represent a quantized type, which stores not only the shape but also the scale and offset. I think it is good design that we have scale and offset as part of the Type class. We always know the output Type, which stores the quantization params. We need to introduce more builder methods for other nodes that accept an outTy, like the one I mentioned.
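
A rough sketch of that pattern as I understand it (treat the exact Glow signatures as approximate; mod, F, input, weights, bias, and the dims/scale/offset values are placeholders assumed to be in scope):

```cpp
// Sketch only: the quantization params ride in the output Type, not the node.
// ElemKind::Int8QTy, Module::uniqueType(), and createFullyConnected() are
// assumed to match the current Glow API; the numbers are placeholders.
TypeRef outTy = mod.uniqueType(ElemKind::Int8QTy, {batchSize, outputWidth},
                               /* scale */ 0.05f, /* offset */ 0);
auto *fc = F->createFullyConnected("fc", input, weights, bias, outTy);
```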

@beicy
Contributor Author

beicy commented Oct 10, 2018

@artemrakhov Yes, actually I used the approach you mentioned to carry the scale and offset :) Sorry I didn't describe it clearly :P

@jnorwood

The tflite app includes several pretrained quantized models. They use uint8 for the kernel weights and volume data, and signed int32 for the bias. It would probably be good to try to translate some of those models. They do use the zero offset and scale values: the zero offset is a uint8 and the scale is a float, and there are float scale values for the input, output, and kernel. During execution, the downscale can be applied as a float, or it can be converted to a pair of integer multiply and right-shift values. Keep in mind that embedded targets may not have floating point, or may want to avoid it for low-power apps, so it might be useful to precalculate the integer multiply and right-shift values and store those in whatever the quantized persistence form is.

Also, though tflite/gemmlowp use this uint8 form with a zero offset, it requires some extra processing for the multiplies. The Intel format uses uint8 following activations, where the offset would be zero anyway. That may be a more efficient implementation, or at least reasonable compared with handling the zero offsets everywhere. So it might be a good idea to consider supporting it as well.
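
For the integer multiply and right-shift form of the downscale mentioned above, a gemmlowp-style conversion looks roughly like this (a sketch under the assumption that the real multiplier is in (0, 1); not tied to any particular backend):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Approximate a real downscale multiplier in (0, 1) by an int32 fixed-point
// multiplier M and a right shift s, so that:
//   realMultiplier ~= M * 2^-31 * 2^-s
// An int32 accumulator can then be rescaled without floating point, roughly:
//   result ~= (int64_t(acc) * M) >> (31 + s)   (plus rounding in practice)
void quantizeMultiplier(float realMultiplier, int32_t *multiplier, int *rightShift) {
  assert(realMultiplier > 0.0f && realMultiplier < 1.0f);
  int exponent;
  // frexp: realMultiplier = significand * 2^exponent, significand in [0.5, 1).
  float significand = std::frexp(realMultiplier, &exponent);
  *rightShift = -exponent;
  int64_t q = std::llround(static_cast<double>(significand) * (1ll << 31));
  if (q == (1ll << 31)) { // rounding pushed it up to 2^31; renormalize
    q /= 2;
    --*rightShift;
  }
  *multiplier = static_cast<int32_t>(q);
}
```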

@beicy beicy closed this as completed Nov 30, 2018