Unable to perform dynamic quantization on MultiheadAttention #1416
Comments
Hi @nagbhat25, Thanks for raising this issue.
Hi @Kaihui-intel, thanks for taking a look at this. The model I used is defined in the class GeoLayoutLMVIEModel: Link to model. I am just trying to run dynamic quantization (which I think can run without any calibration data). Here is the sample snippet I use:
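Roughly the following (a minimal sketch, not my exact code: a small TransformerEncoder stands in for GeoLayoutLMVIEModel, and it assumes neural-compressor 2.x's PostTrainingQuantConfig API):

```python
import torch
from neural_compressor import PostTrainingQuantConfig, quantization

# Stand-in for GeoLayoutLMVIEModel; a TransformerEncoder also wraps
# torch.nn.MultiheadAttention internally, which is where the problem shows up.
model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=768, nhead=12), num_layers=2
).eval()

conf = PostTrainingQuantConfig(approach="dynamic")  # dynamic quantization, no calibration data
q_model = quantization.fit(model, conf)
q_model.save("./quantized_model")  # writes the quantized model into this directory
```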
The quantization goes through and I get the below log:
However, when I try to run inference by calling the model, I get the below error (possibly from a torch module). It looks like there isn't support for the NonDynamicallyQuantizableLinear layer.
If there is a way to avoid quantizing this layer altogether, that might solve this.
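For example, something like the following override might keep that projection in fp32 (just a sketch: it assumes PostTrainingQuantConfig's op_name_dict matches module names by pattern, and the ".*out_proj.*" pattern is a guess at how the layer is named inside the model):

```python
from neural_compressor import PostTrainingQuantConfig, quantization

conf = PostTrainingQuantConfig(
    approach="dynamic",
    op_name_dict={
        # out_proj is the NonDynamicallyQuantizableLinear inside MultiheadAttention;
        # the name pattern below is an assumption and must match the real module names.
        ".*out_proj.*": {
            "weight": {"dtype": ["fp32"]},
            "activation": {"dtype": ["fp32"]},
        }
    },
)
q_model = quantization.fit(model, conf)  # model: the fp32 module from the snippet above
```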
I have verified it's reproducible, and I am working on it.
Hi @Kaihui-intel, I did try this already but no luck. The quantization succeeds and leaves the configured layer's weights untouched, but inference still fails. Here are the error reports:
Dynamic quantization report:
Error during inference with the op_type_dict change:
Thanks for looking into this!
Hello @nagbhat25,
Thanks @Kaihui-intel. I tested the fix using your branch and inference works perfectly fine now. The model size reduces by 60% or so. However, one thing I notice is that despite converting to int8 there is no gain in inference time. When I compare the inference time numbers to those of the fp32 model, they are almost the same (in some cases the quantized model is even slightly slower). Is there any reason why this would happen, or any flags that I can tweak? Here is the current quantization code:
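It is essentially the same dynamic quantization call as before plus a simple timing loop; a rough sketch of that comparison (stand-in module instead of GeoLayoutLM, and it assumes q_model.model exposes the underlying torch module of the returned wrapper):

```python
import time

import torch
from neural_compressor import PostTrainingQuantConfig, quantization

model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=768, nhead=12), num_layers=4
).eval()
q_model = quantization.fit(model, PostTrainingQuantConfig(approach="dynamic"))

x = torch.randn(128, 1, 768)  # (seq_len, batch, embed_dim)

def bench(module, runs=20):
    # Average wall-clock seconds per forward pass.
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(runs):
            module(x)
    return (time.perf_counter() - start) / runs

print("fp32 s/iter:", bench(model))
print("int8 s/iter:", bench(q_model.model))  # .model: underlying torch module (assumption)
```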
Thanks a lot for looking into this.
Thank you for your feedback.
Profile output of fp32 model:
From the above, it can be seen that there is not much difference in inference time between the two operators. There is a known issue here about torch's quantized::linear_dynamic.
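A per-operator comparison like this can be reproduced with torch.profiler; a minimal sketch, using a plain Linear as a stand-in rather than the full model:

```python
import torch
from torch.profiler import ProfilerActivity, profile

fp32 = torch.nn.Sequential(torch.nn.Linear(768, 768), torch.nn.ReLU()).eval()
int8 = torch.ao.quantization.quantize_dynamic(fp32, {torch.nn.Linear}, dtype=torch.qint8)
x = torch.randn(32, 768)

for name, module in [("fp32", fp32), ("int8", int8)]:
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        with torch.no_grad():
            for _ in range(100):
                module(x)
    print(name)
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

In the int8 table the matmul time shows up under quantized::linear_dynamic, which is the operator the known issue above is about.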
Thanks @Kaihui-intel. I did some profiling across different quantization approaches and this seems to be the main issue. Hopefully the community provides a solution for this in future releases.
Hello
I am trying to perform dynamic quantization on the GeoLayoutLM model, which internally uses the torch.nn.MultiheadAttention layer. When I try to quantize this model using dynamic quantization, I get an error in torch/nn/modules/activation.py. I think it is mostly because it uses NonDynamicallyQuantizableLinear internally. I would like to know if there is any way to get around this, or whether it is simply not supported. Is there a way to skip a layer during quantization? (My knowledge is very limited in this area.)
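A quick check with stock torch seems to confirm this; the attention output projection is exactly that class:

```python
import torch

mha = torch.nn.MultiheadAttention(embed_dim=768, num_heads=12)
print(type(mha.out_proj))
# <class 'torch.nn.modules.linear.NonDynamicallyQuantizableLinear'>
```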
model link - geolayoutlm
Any help would be appreciated. Thanks