Q: In the example, we set the pruning rate of the whole network. How to adjust the pruning rate of a specific layer?
# example config
sparsity: 0.25
metrics: l2_norm # The available metrics are listed in `tinynn/graph/modifier.py`
A: After calling pruner.prune(), a new configuration file with the sparsity for each operator will be generated in place. You can use this file as the configuration for the pruner, or generate a new configuration file based on it (e.g. line 42 in examples/oneshot/oneshot_prune.py).
# new yaml generated
sparsity:
  default: 0.25
  model_0_0: 0.25
  model_1_3: 0.25
  model_2_3: 0.25
  model_3_3: 0.25
  model_4_3: 0.25
  model_5_3: 0.25
  model_6_3: 0.25
  model_7_3: 0.25
  model_8_3: 0.25
  model_9_3: 0.25
  model_10_3: 0.25
  model_11_3: 0.25
  model_12_3: 0.25
  model_13_3: 0.25
metrics: l2_norm # Other supported values: random, l1_norm, l2_norm, fpgm
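To adjust the pruning rate of specific layers, you only need to change the corresponding entries under sparsity. The sketch below shows one way to do this programmatically, assuming the generated file has already been loaded into a plain dictionary (for example with a YAML parser). The helper override_sparsity and the accepted value range are assumptions made for illustration; they are not part of TinyNeuralNetwork.

```python
import copy


def override_sparsity(config, overrides):
    """Return a copy of a pruner config with per-operator sparsity overrides.

    Hypothetical helper for illustration; not a TinyNeuralNetwork API.
    """
    new_config = copy.deepcopy(config)
    for name, value in overrides.items():
        # Only operators listed in the generated file can be overridden.
        if name not in new_config["sparsity"]:
            raise KeyError(f"unknown operator name: {name}")
        # Assumed valid range for a pruning rate.
        if not 0.0 <= value < 1.0:
            raise ValueError(f"sparsity for {name} must be in [0, 1)")
        new_config["sparsity"][name] = value
    return new_config


# A shortened version of the generated configuration shown above.
generated = {
    "sparsity": {"default": 0.25, "model_0_0": 0.25, "model_1_3": 0.25},
    "metrics": "l2_norm",
}

# Prune the first layer less and another layer more aggressively.
custom = override_sparsity(generated, {"model_0_0": 0.1, "model_1_3": 0.5})
```

The result can then be written back to a YAML file and passed to the pruner as its configuration.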
The training in TinyNeuralNetwork is based on PyTorch. Usually the bottleneck is in data processing, which you can try to accelerate with LMDB or other in-memory databases.
Q: Some operators such as max_pool2d_with_indices will fail when quantizing
A: The quantization-aware training of TinyNeuralNetwork is built on that of PyTorch; it only reduces the complexity related to operator fusion and computational graph rewriting.
TinyNeuralNetwork does not support operators that are not natively supported by PyTorch, such as LeakyReLU. Please wrap those modules with torch.quantization.QuantWrapper.
(More operators are supported in newer versions of PyTorch, so if you encounter a failure, please consult us first or try a newer version.)
Q: How to quantize only part of a quantized graph when the default is to perform quantization on the whole graph?
# Quantization with the whole graph
with model_tracer():
    quantizer = QATQuantizer(model, dummy_input, work_dir='out')
    qat_model = quantizer.quantize()
A: First, perform quantization for the whole graph. Then, manually modify the positions of QuantStub and DeQuantStub in the generated model code. After that, use the code below to load the model.
# Reload the model with modification
with model_tracer():
    quantizer = QATQuantizer(model, dummy_input, work_dir='out', config={'force_overwrite': False})
    qat_model = quantizer.quantize()
Q: Models may have some extra logic in the training phase that is not needed in inference, such as the model below (a common scenario in real-world OCR and face recognition). As a result, the quantized model code generated by codegen during training cannot be used for inference.
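import torch.nn as nn

class FloatModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Channel counts and kernel sizes are placeholders
        self.conv = nn.Conv2d(3, 16, 3)
        self.conv1 = nn.Conv2d(16, 16, 3)

    def forward(self, x):
        x = self.conv(x)
        # Extra branch that only runs in training mode
        if self.training:
            x = self.conv1(x)
        return x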
A: There are generally two ways to tackle this problem.
- Use the code generator in TinyNeuralNetwork to create qat_train_model.py and qat_eval_model.py under model.train() and model.eval(), respectively. Use qat_train_model.py for training, and when inference is needed, use qat_eval_model.py to load the weights trained by the former. (Since there is no self.conv1 in qat_eval_model.py, you need to set strict=False when calling load_state_dict.)
- As in the first approach, generate two copies of the model, one in training mode and one in evaluation mode. Then make a copy of qat_train_model.py and manually replace its forward function with the one in qat_eval_model.py. Finally, use the modified script for evaluation mode.
How to fuse normalization in preprocessing and the Quantize OP, so that the raw image data is used as input?
Assuming normalization is done in preprocessing using normalized = (image - mean) / std, you can pass in the parameter 'quantized_input_stats': [(mean, std)] when constructing Quantizer, and construct Converter with fuse_quant_dequant=True. The image data (image in the formula) can then be passed in using the uint8 data format.
For example, the following preprocessing process is often used for images in torchvision.
transforms = transforms.Compose(
    [
        Resize(img_size),
        ToTensor(),
        Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ]
)
Apart from the Resize step, ToTensor converts the data to floating point and divides it by 255, then Normalize performs normalization according to mean=(0.4914, 0.4822, 0.4465) and std=(0.2023, 0.1994, 0.2010). In this case, it is similar to normalizing the raw pixel values with mean=114.3884 and std=58.3021, which of course leads to some accuracy loss. If you want higher accuracy, you can try the following.
- Unify the normalization parameters of the channels before training the floating-point model or before QAT training
- Make sure mean is set to an integer, because the corresponding quantization parameter zero_point can only be an integer.
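To make the relationship between the two forms of normalization concrete, the sketch below checks that ToTensor followed by Normalize is equivalent to normalized = (image - mean) / std on raw pixel values once the per-channel statistics are multiplied by 255. The function names are illustrative only. The approximation mentioned above comes from replacing these three per-channel pairs with a single shared (mean, std) pair.

```python
# Per-channel statistics used by Normalize (applied to data in [0, 1]).
MEAN_01 = (0.4914, 0.4822, 0.4465)
STD_01 = (0.2023, 0.1994, 0.2010)


def to_pixel_scale(values):
    """Rescale statistics defined on [0, 1] data to the raw 0..255 range."""
    return tuple(v * 255.0 for v in values)


def normalize_after_totensor(pixel, mean01, std01):
    """What ToTensor + Normalize compute for one raw pixel value."""
    return (pixel / 255.0 - mean01) / std01


def normalize_raw(pixel, mean, std):
    """The fused form, (image - mean) / std, applied to raw pixel values."""
    return (pixel - mean) / std


MEAN_255 = to_pixel_scale(MEAN_01)
STD_255 = to_pixel_scale(STD_01)

# Largest difference between the two forms over every channel and pixel value.
max_diff = max(
    abs(normalize_after_totensor(p, m01, s01) - normalize_raw(p, m, s))
    for m01, s01, m, s in zip(MEAN_01, STD_01, MEAN_255, STD_255)
    for p in range(256)
)
```

Here max_diff is only floating-point rounding noise, so the two forms agree per channel.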
P.S. For inputs of the int8 type, you may need to perform the uint8 to int8 conversion yourself before feeding the data to the model (subtract 128 manually).
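As a minimal sketch of the uint8 to int8 conversion, the snippet below subtracts 128 from every value using only the Python standard library. The helper name uint8_to_int8 is made up for this example; with an array library you would do the same subtraction on the whole tensor and cast to int8.

```python
from array import array


def uint8_to_int8(pixels):
    """Shift uint8 values (0..255) into the int8 range (-128..127)."""
    # 'b' is the typecode for signed 8-bit integers.
    return array('b', (p - 128 for p in pixels))


raw = bytes([0, 127, 128, 255])  # raw uint8 image data
shifted = uint8_to_int8(raw)     # values to feed to an int8 input
```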
There are a large number of operators in PyTorch. We cannot cover all of them, only most of the commonly used ones. Therefore, if you have unsupported operators in your model, you have the following options:
- Submit a new issue
- Implement it yourself. Operator translation in model conversion is essentially a mapping between the corresponding TorchScript OP and TFLite OP.
The locations of the relevant code:
- TFLite
  - OP schema (without I/O tensors): generated_ops.py
  - Full schema: https://www.tensorflow.org/mlir/tfl_ops
- TorchScript
  - ATen schema: aten_schema.py
  - Quantized schema: quantized_schema.py
  - Torchvision schema: torchvision_schema.py
- Translation logic
  - ATen OPs: aten.py
  - Quantized OPs: quantized.py
- Registration of OP translation logic
  - Registration table: __init__.py
Implementation steps:
- Read through the schema of both TorchScript and TFLite, and select the appropriate OP(s) on both sides
- Add an entry in the OP translation registration table
- Add a new parser class to the translation logic. This class needs to inherit the corresponding TorchScript schema class.
- Implement the parse function of the aforementioned class
For details, please refer to the implementation of SiLU: https://github.com/alibaba/TinyNeuralNetwork/commit/ebd30325761a103c5469cf6dd4be730d93725356
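The implementation steps follow a common registry pattern, which the toy sketch below illustrates for SiLU. Everything in it (OPERATOR_TABLE, register, ToySiluSchema, ToySiluParser, translate) is hypothetical and is not TinyNeuralNetwork's real code. It only shows the shape of the work: register an entry, subclass a schema class, and implement parse. The decomposition SiLU(x) = x * sigmoid(x) into the TFLite operators LOGISTIC and MUL is one plausible mapping; please check the commit above for the actual one.

```python
# Toy illustration only: none of these names are TinyNeuralNetwork's real ones.
OPERATOR_TABLE = {}  # stand-in for the OP translation registration table


def register(torch_op_name):
    """Decorator that adds a parser class to the toy registration table."""
    def wrap(cls):
        OPERATOR_TABLE[torch_op_name] = cls
        return cls
    return wrap


class ToySiluSchema:
    """Stand-in for a TorchScript schema class that a parser inherits from."""

    def parse(self, input_name, output_name):
        raise NotImplementedError


@register("aten::silu")
class ToySiluParser(ToySiluSchema):
    def parse(self, input_name, output_name):
        # SiLU(x) = x * sigmoid(x): emit LOGISTIC, then MUL.
        sigmoid_name = output_name + "_sigmoid"
        return [
            ("LOGISTIC", [input_name], [sigmoid_name]),
            ("MUL", [input_name, sigmoid_name], [output_name]),
        ]


def translate(torch_op_name, input_name, output_name):
    """Look up the parser for a TorchScript OP and emit the target OPs."""
    parser_cls = OPERATOR_TABLE[torch_op_name]
    return parser_cls().parse(input_name, output_name)


ops = translate("aten::silu", "x", "y")
```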
Model conversion fails for unknown reasons. How to provide the model to the developers for debugging purposes?
You can use export_converter_files to export your model together with the related configuration files. For details, see the code below.
import torch

from tinynn.util.converter_util import export_converter_files

# Model is a placeholder for your own nn.Module subclass
model = Model()
model.cpu()
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)
export_dir = 'out'
export_name = 'test_model'

export_converter_files(model, dummy_input, export_dir, export_name)
After executing this code, you'll get two files in the specified directory: the TorchScript model (.pt) and the input and output description file (.json). These two files can be shared with developers for debugging.
Generally, for a vision model, the memory layout of the input data used by PyTorch is NCHW, while on the embedded device side the supported layout for image data is usually NHWC. Therefore, 4-dimensional inputs and outputs are transposed by default. If you do not need this behaviour, you can add the parameter nchw_transpose=False (or input_transpose=False and output_transpose=False) when defining TFLiteConverter.
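To make the layout change concrete, here is a plain-Python sketch that reorders a flat NCHW buffer into NHWC. The helper nchw_to_nhwc is written for this example only and is not part of TinyNeuralNetwork; with NumPy, the same reordering is np.transpose(x, (0, 2, 3, 1)).

```python
def nchw_to_nhwc(flat, n, c, h, w):
    """Reorder a flat NCHW buffer into a flat NHWC buffer."""
    out = [None] * len(flat)
    for ni in range(n):
        for ci in range(c):
            for hi in range(h):
                for wi in range(w):
                    # Offset of element (ni, ci, hi, wi) in each layout.
                    src = ((ni * c + ci) * h + hi) * w + wi
                    dst = ((ni * h + hi) * w + wi) * c + ci
                    out[dst] = flat[src]
    return out


# One image, two channels, one row, three columns.
# Channel 0 holds 0, 1, 2 and channel 1 holds 10, 11, 12.
nchw = [0, 1, 2, 10, 11, 12]
nhwc = nchw_to_nhwc(nchw, n=1, c=2, h=1, w=3)
```

In NHWC, the channel values of each pixel sit next to each other, so if you keep the default transposition you need to feed data in that order on the device side.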
Since TFLite does not officially support grouped (de)convolution, we have extended the implementation of grouped (de)convolution internally based on the CONV_2D and the TRANSPOSE_CONV operators. To generate a standard TFLite model, you can add the parameter group_conv_rewrite=True when defining TFLiteConverter.
You may add the parameter map_bilstm_to_lstm=True when defining TFLiteConverter.
Since the target format of our conversion is TFLite, we need to understand how LSTM works in PyTorch and Tensorflow respectively.
When using TF2.X to export the LSTM model to Tensorflow Lite, it will be translated into the UnidirectionalLSTM operator, and the state tensors in it will be saved as a Variable (a.k.a. persistent memory). The state of each mini-batch will be accumulated automatically. These state tensors are not included in the inputs and outputs of the model.
In PyTorch, LSTM contains an optional state input and state output. When the state is not passed in, the initial hidden layer state always remains all 0 for each mini-batch inference, which is different from Tensorflow.
Therefore, in order to simulate the behavior on the Tensorflow side, when exporting the LSTM model on the PyTorch side, be sure to delete the LSTM state inputs and outputs from the model inputs and outputs.
Next, for streaming and non-streaming scenarios, how should we use the exported LSTM model?
In the non-streaming scenario, we just need to set the state inputs to 0. Fortunately, Tensorflow Lite's Interpreter provides a convenient interface, reset_all_variables, so we only need to call reset_all_variables before each call to invoke.
In the streaming scenario, things are somewhat more complicated because we need to read and write the state variables. We can use Netron to open the generated model, locate all LSTM nodes, and view the inputs whose names contain state. For example, for the states in a unidirectional LSTM node, the attributes are named output_state_in and cell_state_in; you can expand them and see that their kind is Variable. Record their locations (i.e. the location property).
When using Tensorflow Lite's Interpreter, you only need to read or write these state variables at those locations, using methods like get_tensor and set_tensor. For details, see here.
Note: These state variables are all two-dimensional with the shape [batch_size, hidden_size or input_size]. So in the streaming scenario, you only need to split these variables along the first dimension.
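The difference between the two scenarios can be illustrated with a toy stateful cell in plain Python. ToyStatefulCell is a made-up stand-in for an operator with persistent state; it is neither an LSTM nor the Tensorflow Lite Interpreter API. Its reset method plays the role that reset_all_variables plays for the Interpreter.

```python
class ToyStatefulCell:
    """Toy recurrent cell whose state persists between calls."""

    def __init__(self, decay=0.5):
        self.decay = decay
        self.state = 0.0  # plays the role of a persistent state variable

    def reset(self):
        """Clear the persistent state."""
        self.state = 0.0

    def step(self, chunk):
        """Process a chunk of inputs, updating the state after each one."""
        outputs = []
        for x in chunk:
            self.state = self.decay * self.state + x
            outputs.append(self.state)
        return outputs


cell = ToyStatefulCell()

# Reference: the whole sequence processed in one call.
full = cell.step([1.0, 2.0, 3.0, 4.0])

# Streaming: the same sequence in two chunks, keeping the state in between.
cell.reset()
streamed = cell.step([1.0, 2.0]) + cell.step([3.0, 4.0])

# Non-streaming: independent sequences, so the state is cleared before each.
cell.reset()
first = cell.step([1.0, 2.0])
cell.reset()
second = cell.step([3.0, 4.0])
```

Keeping the state between chunks reproduces the result of the full sequence, while clearing it makes every call start from zero, as PyTorch does when no state is passed in.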
Usually, when the hidden size is large enough (128+), the LSTM OP will be time-consuming in the TFLite backend. In this case, consider using dynamic range quantization to optimize its performance; see dynamic.py.