The hypothesis we also discussed in #161 is that distillation might not work as expected with on-the-fly augmentation, since the student is supposed to be trained on the exact outputs of the teacher. The gap between the teacher and student models is way too large. See also #231. I'll first try disabling on-the-fly augmentation for the student model to see how much it helps.
If that's the case, the proper fix would be to augment the corpus before it is decoded by the teachers, as sketched below.
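A minimal sketch of that order of operations, assuming hypothetical `augment`, `teacher.translate`, and `student.train` interfaces rather than the actual pipeline API: augmentation happens once up front, the teacher decodes the augmented source, and the student then sees exactly those (source, teacher output) pairs with no further on-the-fly modification.

```python
import random

def augment(sentence: str) -> str:
    """Placeholder augmentation: occasionally upper-case a sentence.
    A real pipeline would apply its configured noise/casing modifiers."""
    return sentence.upper() if random.random() < 0.1 else sentence

def distill_with_pre_augmentation(source_corpus, teacher, student):
    """Augment the corpus BEFORE teacher decoding, so the student is
    trained on the teacher's exact outputs for the augmented inputs."""
    # 1. Augment the raw source corpus once, up front.
    augmented_source = [augment(s) for s in source_corpus]

    # 2. Teacher decodes the augmented source; its translations become
    #    the student's training targets.
    teacher_targets = [teacher.translate(s) for s in augmented_source]

    # 3. Student trains on aligned (augmented source, teacher output)
    #    pairs with no further on-the-fly augmentation.
    student.train(list(zip(augmented_source, teacher_targets)))
```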