Low accuracy of TF-Lite model for Mobilenet (Quantization aware training) #368
Comments
I skimmed through your Colab. Could you try one thing I didn't see? If you take your "Keras model without quantization aware training" (0.99 accuracy), convert it to TFLite, and then evaluate it in the same way you got the 0.20% number, what do you get? |
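For reference, a minimal sketch of that experiment, assuming a float `keras_model` and a batched `tf.data` dataset `validation` as in the Colab (both names are assumptions, not verified against the notebook):

```python
import numpy as np
import tensorflow as tf

# Convert the float (non-QAT) Keras model to TFLite with no quantization.
converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
tflite_model = converter.convert()

# Evaluate the TFLite model one example at a time.
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
input_index = interpreter.get_input_details()[0]["index"]
output_index = interpreter.get_output_details()[0]["index"]

correct, total = 0, 0
for image, label in validation.unbatch():
    interpreter.set_tensor(input_index, image.numpy()[np.newaxis, ...])
    interpreter.invoke()
    predicted = int(np.argmax(interpreter.get_tensor(output_index)[0]))
    correct += int(predicted == int(label.numpy()))
    total += 1
print(f"TFLite accuracy: {correct / total:.4f}")
```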
I updated the Colab notebook.
|
Can you try training your q_aware model for much longer? E.g.:

```python
q_aware_history = q_aware_model.fit(train.repeat(),
                                    initial_epoch=10,
                                    epochs=200,
                                    steps_per_epoch=500,
                                    validation_data=validation.repeat(),
                                    validation_steps=validation_steps)
```

There are running exponential averages in the quantized layers which may need to converge. |
You can take a look at this issue: #309 |
@kmkolasinski I tried two patterns (training with QAT).
I need quite a long time and a large number of epochs. Also, it is not possible to see the gap between the Keras model and the TF-Lite model from the accuracy and loss metrics. How can I tell when the gap disappears during training? And how can I estimate how many epochs to set? |
@NobuoTsukamoto, @krzys-ostrowski: this is good feedback. Just from the analysis, there are some things we could possibly do:
and then with regards to how long it takes
|
Indeed, having a native callback for EMA monitoring would be a nice feature. Additionally, since the EMA decay in the moving-average quantizer is set to beta=0.999, we need approximately 1000 steps to 'forget' the initial state. Here is a table which shows how many steps you need to 'forget' the initial state of the quantizer min/max values: [table image not preserved]. Probably, setting the default EMA decay to 0.995 would be a better choice for users with simpler problems. One can also monitor the gap between the Keras model and the TFLite model during training via a custom callback. For example, I use model output statistics as a proxy for measuring the gap. Here is how it looks in my case (source):
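A minimal sketch of such a monitoring callback, assuming the QAT model keeps float inputs and outputs; the class name `TFLiteGapCallback` and the `sample_batch` argument are illustrative, not taken from the linked source:

```python
import numpy as np
import tensorflow as tf

class TFLiteGapCallback(tf.keras.callbacks.Callback):
    """Tracks how far TFLite outputs drift from the Keras model's outputs."""

    def __init__(self, sample_batch, every_n_epochs=1):
        super().__init__()
        self.sample_batch = sample_batch      # representative float32 inputs
        self.every_n_epochs = every_n_epochs

    def on_epoch_end(self, epoch, logs=None):
        if (epoch + 1) % self.every_n_epochs:
            return
        # Convert the current QAT model to a quantized TFLite model.
        converter = tf.lite.TFLiteConverter.from_keras_model(self.model)
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
        interpreter = tf.lite.Interpreter(model_content=converter.convert())
        interpreter.allocate_tensors()
        input_index = interpreter.get_input_details()[0]["index"]
        output_index = interpreter.get_output_details()[0]["index"]

        # Compare per-example outputs between Keras and TFLite.
        tflite_out = []
        for x in self.sample_batch:
            interpreter.set_tensor(input_index, x[np.newaxis].astype(np.float32))
            interpreter.invoke()
            tflite_out.append(interpreter.get_tensor(output_index)[0])
        keras_out = self.model.predict(self.sample_batch, verbose=0)
        gap = float(np.mean(np.abs(keras_out - np.stack(tflite_out))))
        print(f"epoch {epoch + 1}: mean |Keras - TFLite| output gap = {gap:.5f}")
```

Here `sample_batch` would typically be a small, fixed set of validation examples so that successive measurements are comparable.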
The problem with this approach is that predictions through the TFLite model can be very slow on non-ARM architectures, and this type of test should be run in the background in order not to block the training loop. |
It would be nice if the convergence could be seen in the TensorBoard log. |
I think there is likely some confusion here. The Exponential Moving Average is used during QAT to calculate the ranges of dynamic tensors. Since the initial cold start is [-6, 6], it can lead to a huge accuracy drop at the beginning of QAT. Say a tensor only has values in [-0.1, 0.1]; then most of the range is wasted, which can lead to huge losses. As training goes on, this range slowly converges to the actual range (as @kmkolasinski mentioned, roughly 1000 steps), and the QAT accuracy goes up.

However, when converting to TFLite, the same ranges that were used during QAT are reused. So the TF and TFLite accuracy and values should be very close. QAT tries to emulate TFLite as closely as possible, and there shouldn't be such divergences. We don't see it in our local tests either. For example, if you run

There can be some subtle differences. We don't place FakeQuants after Softmax, for instance, since it hinders convergence. There's a possibility that's happening, but I can't be sure of it. I'm trying to recreate the issue. |
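To make the range-convergence behavior above concrete, here is a small back-of-the-envelope simulation (a sketch only, assuming the standard EMA update `m = decay * m + (1 - decay) * x` and the 0.999 decay mentioned earlier):

```python
# Illustrative only: how an EMA-tracked range bound converges from the
# [-6, 6] cold start toward a tensor whose true maximum is 0.1.
decay = 0.999    # EMA decay discussed in this thread
ema_max = 6.0    # cold-start upper bound
true_max = 0.1   # assumed actual per-batch maximum
for step in range(1, 5001):
    ema_max = decay * ema_max + (1 - decay) * true_max
    if step in (100, 1000, 3000, 5000):
        print(f"step {step:5d}: ema_max = {ema_max:.3f}")
# step   100: ema_max = 5.438
# step  1000: ema_max = 2.269
# step  3000: ema_max = 0.393
# step  5000: ema_max = 0.140
```

Note that with a decay of 0.999 the initial range still dominates after 1000 steps, which is consistent with the observation that long training runs (or a smaller decay such as the suggested 0.995) help close the gap.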
There is a chance that I'm doing something wrong; however, it seems that I'm not the only one with this issue. You can check a much bigger model than the one used in the |
We've found the issue. One of the quantized kernel activation ranges had a problem, but it was getting hidden once the range had converged. We'll have a fix out soon; tf-nightly should have it. |
Thanks a lot for reporting and helping reproduce this issue. It would've been really hard to narrow down without the reproduction code. |
@nutsiepully could you mention if there's any specific version of TensorFlow that would have the fix? Or should |
Cool, thanks for the feedback @nutsiepully! I will check it today. Out of curiosity, was it a general issue or something related to MobileNet models, a specific layer, etc.? @sayakpaul Yes, you can also use |
@sayakpaul - @kmkolasinski - I'll point out the commit here once it's in so you can see it. It was a general issue with the DepthConv kernel implementation, which got triggered when ranges hadn't converged. |
Thanks, it makes sense to me. A few weeks ago I switched to a custom ResNet model, which does not have DepthConvs, and I got better results. |
Thanks for letting me know. I will check and report back. |
@nutsiepully I can definitely see the improvement, and this Colab Gist reproduces it. Additionally, I worked on this report for folks to make the onboarding process for quantization a bit easier. It incorporates many of your suggestions as well. Happy to address any feedback. Thank you so much for all your help :) |
Thanks a lot @sayakpaul. Really appreciate the feedback and the effort. Thanks @kmkolasinski and @NobuoTsukamoto for the detailed bug reports and feedback. I'm closing the bug. Please reopen if you face any further issues. @sayakpaul, the report is awesome! Great work, this explains the value of the tooling really well. |
Hi. I'm facing the same issue with MobileNetV3, where I see a large drop in accuracy in the TFLite model compared to the QAT Keras model. I'm using TensorFlow version 2.15.0 and TensorFlow Model Optimization version 0.7.5. I had to refactor MobileNetV3 a little to make it compatible with QAT, using an OnlyOutputQuantizeConfig for the Multiply layers (with a MovingAverageQuantizer) and replacing the Add operations in Hard Sigmoid with Rescaling, but I don't think that should be the cause of this issue? Would appreciate any help. Thanks! |
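For context, a hypothetical reconstruction of such an output-only config; the class name `OnlyOutputQuantizeConfig` comes from the comment above, while the body is an assumption based on the public `tfmot.quantization.keras.QuantizeConfig` interface:

```python
import tensorflow_model_optimization as tfmot

class OnlyOutputQuantizeConfig(tfmot.quantization.keras.QuantizeConfig):
    """Quantizes only a layer's output (e.g. for Multiply, which has no weights)."""

    def get_weights_and_quantizers(self, layer):
        return []  # no weights to quantize

    def get_activations_and_quantizers(self, layer):
        return []  # no activations to quantize

    def set_quantize_weights(self, layer, quantize_weights):
        pass

    def set_quantize_activations(self, layer, quantize_activations):
        pass

    def get_output_quantizers(self, layer):
        # Track the output range with a moving-average quantizer.
        return [tfmot.quantization.keras.quantizers.MovingAverageQuantizer(
            num_bits=8, per_axis=False, symmetric=False, narrow_range=False)]

    def get_config(self):
        return {}
```

Such a config would typically be attached to the Multiply layers via `tfmot.quantization.keras.quantize_annotate_layer(layer, quantize_config=OnlyOutputQuantizeConfig())` before calling `quantize_apply`.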
Hello @tarushbansal, |
Describe the bug
The accuracy of the TF-Lite model becomes extremely low after quantization-aware training of tf.keras.applications.mobilenet (v1/v2).
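For orientation, a compact sketch of the flow being reported (model arguments and hyperparameters are illustrative, not the notebook's exact values):

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Wrap a MobileNet variant for quantization-aware training.
base = tf.keras.applications.MobileNetV2(input_shape=(96, 96, 3),
                                         alpha=0.35, weights=None, classes=10)
q_aware_model = tfmot.quantization.keras.quantize_model(base)
q_aware_model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
# ... train q_aware_model, then convert ...

converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()  # this is the model whose accuracy collapses
```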
System information
TensorFlow installed from (source or binary): binary
TensorFlow version: tf-nightly-gpu (2.2.0.dev20200420)
TensorFlow Model Optimization version: 0.3.0
Python version: 3.6.9
Describe the expected behavior
The accuracies of the Keras model (with quantization-aware training) and the TF-Lite model are almost the same.
Image classification with tools
Describe the current behavior
Accuracy is extremely low: 0.20%
If the model is defined as follows, the accuracy of the Keras model and the TF-Lite model will be almost the same.
Code to reproduce the issue
(Google Colab notebook)
https://gist.github.com/NobuoTsukamoto/b42128104531a7612e5c85e246cb2dac
Screenshots
Additional context