Simple fixes for ONNX export of MobileBERT.
Before the fix, a MatMul in the embedding section of MobileBERT was not being converted to MatMulInteger, even though its inputs are quantized.
In short, the DequantizeLinear node that is part of the embedding quantization must be propagated down through a few Slice and Concat nodes so that it sits directly next to the MatMul node. This allows the pattern matcher to convert that MatMul to MatMulInteger. A sketch of why this propagation is legal follows below.
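To illustrate why the propagation is safe, here is a minimal NumPy sketch (not the actual exporter code): with per-tensor quantization, DequantizeLinear commutes with data-movement ops such as Slice (and Concat), which is what allows the dequant node to be pushed down next to the MatMul.

```python
import numpy as np

# Per-tensor quantized embedding weights (int8, with scale and zero-point)
q = np.random.randint(-128, 128, size=(8, 4)).astype(np.int8)
scale = np.float32(0.05)
zero_point = np.int8(3)

def dequantize(x):
    # Same math as an ONNX DequantizeLinear node with per-tensor parameters
    return (x.astype(np.int32) - int(zero_point)).astype(np.float32) * scale

# DequantizeLinear followed by Slice ...
a = dequantize(q)[2:5, :]
# ... equals Slice followed by DequantizeLinear, so the dequant node can be
# propagated below Slice (and Concat) until it sits next to the MatMul,
# where the MatMul -> MatMulInteger pattern match can fire
b = dequantize(q[2:5, :])
assert np.allclose(a, b)
```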
This propagation was already present in the ONNX export logic, but a data-type check on the embedding weights expected uint8, while the weights are defined as int8 (the conversion to uint8 happens at a later step). This PR adds logic to support int8 weights and also accounts for a non-zero zero-point.
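A minimal sketch of the dtype handling (the helper name is hypothetical, not the SparseML API): int8 quantized weights can be mapped to the uint8 representation the existing check expects by shifting both the values and the zero-point by 128, which leaves the effective integer values, and hence the dequantized weights, unchanged.

```python
import numpy as np

def to_uint8_quant(weight: np.ndarray, zero_point: np.ndarray):
    """Hypothetical helper: accept int8 or uint8 quantized weights and
    return the uint8 equivalent.

    Shifting int8 values and the zero-point by +128 preserves
    (weight - zero_point), so dequantized values are unchanged.
    """
    if weight.dtype == np.uint8:
        return weight, zero_point
    if weight.dtype == np.int8:
        return (
            (weight.astype(np.int16) + 128).astype(np.uint8),
            (zero_point.astype(np.int16) + 128).astype(np.uint8),
        )
    raise ValueError(f"unsupported quantized dtype {weight.dtype}")

# (w - zp) is identical before and after the shift, so MatMulInteger sees
# the same effective integer values
w = np.array([[-128, -1, 0, 127]], dtype=np.int8)
zp = np.array(-3, dtype=np.int8)
w_u8, zp_u8 = to_uint8_quant(w, zp)
assert np.array_equal(
    w.astype(np.int32) - int(zp), w_u8.astype(np.int32) - int(zp_u8)
)
```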
Testing plan:
Ran the ONNX export for the model below and checked that the MatMul is converted to MatMulInteger. Verified that deepsparse now supports 99.35% of ops. Accuracy matches the value reported in the zoo.
zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/14layer_pruned50_quant-none-vnni
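One way to spot-check the exported graph (a sketch using the onnx package; the model path is a placeholder for wherever the exported file lands):

```python
from collections import Counter

import onnx

# Placeholder path; point this at the exported ONNX file
model = onnx.load("model.onnx")
op_counts = Counter(node.op_type for node in model.graph.node)

# The embedding MatMul should now show up as MatMulInteger
print(op_counts.get("MatMul", 0), "MatMul /",
      op_counts.get("MatMulInteger", 0), "MatMulInteger")
assert op_counts.get("MatMulInteger", 0) > 0
```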