Support directly loading a quantized model. #1727

Closed
beicy opened this issue Sep 27, 2018 · 7 comments

@beicy
Contributor

beicy commented Sep 27, 2018

Currently, the only way to run a quantized model is to do the profiling and then reload the model. We would like to support directly loading a quantized model. Since there is no quantized ONNX op support yet, we will start with the Caffe2 loader. More details will be added here later.

@qcolombet

@beicy beicy self-assigned this Sep 27, 2018
@yinghai
Contributor

yinghai commented Sep 27, 2018

@Maratyszcza what's the progress on ONNX quantization support?

@Maratyszcza

no ONNX quantization spec yet

@beicy
Contributor Author

beicy commented Oct 9, 2018

We have now started by supporting the quantized ResNet50 model: https://github.com/caffe2/models/tree/master/resnet50_quantized
To support this model:
1) The following ops need to be added:

1. Int8Quantize
2. Int8Dequantize
3. Int8Conv
4. Int8ConvRelu
5. Int8MaxPool
6. Int8AveragePool
7. Int8FC
8. Int8SumRelu
9. Int8GivenIntTensorFill  -- This is int32 quantized data.
10. Int8GivenTensorFill     -- This is int8 quantized data.

The main difference between these ops and the regular ops is that two args are added (see the sketch after this list of changes):

arg {
  name: "Y_scale"
  f: 0.00044428
}
arg {
  name: "Y_zero_point"   # this one is the offset
  i: 128
}

2) When a node is created with a quantized value, we need to pass the scale and offset explicitly. To support the above new ops, createConstant, createMaxPool, createAvgPool, and createFullyConnected need to be extended to support passing a scale and offset.
3) The Node and Instr verification functions need to be updated to handle both quantized and non-quantized types.
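
For reference, here is a minimal sketch of what these two args mean, i.e. the affine mapping between quantized and real values. The uint8 storage type and the [0, 255] clamp range are assumptions that merely match the example values above (f: 0.00044428, i: 128); this is illustrative only, not loader code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Affine quantization: real = Y_scale * (q - Y_zero_point).
// uint8 storage with a zero point of 128 is an assumption based on the
// example args above; the actual storage type is defined by each op.
float dequantize(uint8_t q, float scale, int32_t zeroPoint) {
  return scale * (static_cast<int32_t>(q) - zeroPoint);
}

uint8_t quantize(float x, float scale, int32_t zeroPoint) {
  int32_t q = static_cast<int32_t>(std::lround(x / scale)) + zeroPoint;
  return static_cast<uint8_t>(std::min(255, std::max(0, q)));
}
```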

### One thing that needs to be discussed here:
So far, we quantize weights and biases to int8, and both the interpreter and the CPU backend support int8 bias. However, this quantized ResNet50 model quantizes weights to int8 but biases to int32. According to Haixin, this is because the partial sums of the matrix-matrix multiplication are accumulated in int32, so an int32 bias can be added directly to the int32 partial sum for better accuracy (an int8 bias caused an accuracy drop).
So, for our backends, do we need to support both int8 and int32 quantized bias, or only int32 quantized bias? Or any other ideas? Thanks!
@rdzhabarov @artemrakhov @opti-mix @nadavrot
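
To make the int32 accumulation argument above concrete, here is an illustrative sketch of the inner loop (zero-point handling and the final rescale are omitted; this is not actual backend code):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// int8 * int8 products are accumulated into an int32 partial sum, so an int32
// bias can be added to the accumulator directly with no extra precision loss;
// an int8 bias would first have to be requantized, which is where the
// accuracy drop comes from.
int32_t quantizedDotWithBias(const std::vector<int8_t> &x,
                             const std::vector<int8_t> &w, int32_t bias) {
  int32_t acc = bias;
  for (size_t i = 0; i < x.size(); ++i) {
    acc += static_cast<int32_t>(x[i]) * static_cast<int32_t>(w[i]);
  }
  return acc; // later rescaled to the output's scale/offset
}
```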

@rdzhabarov
Contributor

Thanks @beicy for going through the required op support.

> So, for our backends, do we need to support both int8 and int32 quantized bias, or only int32 quantized bias? Or any other ideas?

Looking at some other backends, it seems that the majority (except Glow) rely on a 32-bit representation for bias (see also one of Google's papers: https://arxiv.org/pdf/1712.05877.pdf).

Int8 bias was originally introduced to simplify the quantization procedure, but it would be beneficial to unify behavior and quantize bias to 32 bits. I think the easiest transition is to add support for conv with an int32 bias, keep int8 bias support in the interpreter/CPU backends for now, and then completely eliminate int8 bias.

@artemrakhov-glow
Contributor

Thank you for the analysis!

> When a node is created with a quantized value, we need to pass the scale and offset explicitly.

I don't think that scale and offset need to be stored directly inside the node. Please look at what we do for createFullyConnected(name, input, W, B, outTy): outTy can represent a quantized type, which stores not only the shape but also the scale and offset. I think it is good design that we have scale and offset as part of the Type class. We always know the output Type, which stores the quantization params. We need to introduce more builder methods for other nodes that accept an outTy, like the one I mentioned.
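
A rough sketch of that pattern as I understand it (treat the exact Glow signatures as approximate; mod, F, input, weights, bias, and the dims/scale/offset values are placeholders assumed to be in scope):

```cpp
// Sketch only: the quantization params ride in the output Type, not the node.
// ElemKind::Int8QTy, Module::uniqueType(), and createFullyConnected() are
// assumed to match the current Glow API; the numbers are placeholders.
TypeRef outTy = mod.uniqueType(ElemKind::Int8QTy, {batchSize, outputWidth},
                               /* scale */ 0.05f, /* offset */ 0);
auto *fc = F->createFullyConnected("fc", input, weights, bias, outTy);
```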

@beicy
Contributor Author

beicy commented Oct 10, 2018

@artemrakhov Yes, actually I used the approach you mentioned to carry the scale and offset :) Sorry I didn't describe it clearly :P

@jnorwood

The tflite app includes several pretrained quantized models. They use uint8 for the kernel weights and volume data, and signed int32 for the bias. It would probably be good to try to translate some of those models. They do use the zero offset and scale values: the zero offset is a uint8 and the scale is a float, and there are float scale values for the input, output, and kernel. During execution, the downscale can be applied as a float, or it can be converted to a pair of integer multiply and right-shift values. Keep in mind that embedded targets may not have floating point, or may want to avoid it for low-power apps, so it might be useful to precalculate the integer multiply and right-shift values and store those in whatever the quantized persistence form is.

Also, though tflite/gemmlowp use this uint8 form with a zero offset, it requires some extra processing for the multiplies. The Intel format uses uint8 following activations, where the offset would be zero anyway. That may be a more efficient implementation, or at least reasonable compared with handling the zero offsets everywhere. So it might be a good idea to consider supporting it as well.
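
For the integer multiply and right-shift form of the downscale mentioned above, a gemmlowp-style conversion looks roughly like this (a sketch under the assumption that the real multiplier is in (0, 1); not tied to any particular backend):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Approximate a real downscale multiplier in (0, 1) by an int32 fixed-point
// multiplier M and a right shift s, so that:
//   realMultiplier ~= M * 2^-31 * 2^-s
// An int32 accumulator can then be rescaled without floating point, roughly:
//   result ~= (int64_t(acc) * M) >> (31 + s)   (plus rounding in practice)
void quantizeMultiplier(float realMultiplier, int32_t *multiplier, int *rightShift) {
  assert(realMultiplier > 0.0f && realMultiplier < 1.0f);
  int exponent;
  // frexp: realMultiplier = significand * 2^exponent, significand in [0.5, 1).
  float significand = std::frexp(realMultiplier, &exponent);
  *rightShift = -exponent;
  int64_t q = std::llround(static_cast<double>(significand) * (1ll << 31));
  if (q == (1ll << 31)) { // rounding pushed it up to 2^31; renormalize
    q /= 2;
    --*rightShift;
  }
  *multiplier = static_cast<int32_t>(q);
}
```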

@beicy beicy closed this as completed Nov 30, 2018