Hard to reproduce the results of GLUE benchmark #5

Thanks for your excellent work.
I tried a grid search over the settings described in your paper and code, but it is still hard for me to reproduce the GLUE benchmark results. My experiment results on both the dev set and the test set are about 3% lower than yours.
I would be very grateful if you could offer the exact experiment settings for each dataset, or code that can reproduce the results on the GLUE benchmark.
Looking forward to your reply, thank you!
Same reproduction problem here: I cannot get results as good as those published in the paper. We would really appreciate it if you could share the experiment settings and clean code.
Hi @Harry-zzh, thanks for your interest in our work! Just to confirm: 3% lower than reported means in absolute terms, right? That would be lower than all the baselines in Table 1, even vanilla KD. @MichaelZhouwang and I will take a closer look, and it would be great if you could share the exact command you used.
Hi @Harry-zzh. First, could you please share which dataset you conducted your experiments on, along with the exact command for your best result on that task? You may also want to check the following points. Note that if it is a small dataset, a 3% variation may indeed come from different random seeds; a quick sanity check is sketched below.
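On the random-seed point, a minimal sanity check in plain Python (the scores below are made-up placeholders, not numbers from this repo): re-run one configuration with a few seeds and compare the spread to the 3% gap.

```python
# Hypothetical dev-set scores from re-running the same configuration
# with different random seeds; replace with your own measurements.
from statistics import mean, stdev

scores = {13: 0.861, 42: 0.874, 87: 0.848, 100: 0.869}
vals = list(scores.values())
print(f"mean={mean(vals):.3f}, stdev={stdev(vals):.3f}")
# If the stdev is comparable to the gap (~0.03), seeds alone may explain it.
```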
Thank you for your reply. @JetRunner @MichaelZhouwang
As you can see, almost all of my results on the test set are 3% lower than your reported results. I can reproduce the KD results listed in [1], but your numbers are significantly higher than theirs, and those I cannot reproduce.
And my teacher achieves even better performance than the results you reported.
References:
@Harry-zzh Thanks for the info. Is this on the test set (i.e., the GLUE server) or the validation set? If it is the test set, could you please also provide the results on the development set? @MichaelZhouwang could you give it a look?
By the way, in the NLP experiments, the students in our implementation of both KD and our approach are initialized with pretrained BERT (well-read student) rather than with the fine-tuned teacher. That is probably why the vanilla KD results we report are significantly higher. (See the caption under Table 1.)
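For concreteness, here is a minimal sketch of the two initializations being contrasted, using the Hugging Face transformers API; the checkpoint names are illustrative, not this repo's actual configuration.

```python
from transformers import AutoModelForSequenceClassification

# Initialization described above: the student starts from a compact
# pretrained BERT ("well-read student"); this checkpoint is one example.
student = AutoModelForSequenceClassification.from_pretrained(
    "google/bert_uncased_L-6_H-768_A-12",
    num_labels=3,  # e.g., MNLI has three labels
)

# The alternative being ruled out: initializing from a fine-tuned teacher
# checkpoint (hypothetical path; for a smaller student this would also
# require selecting a subset of the teacher's layers).
student_alt = AutoModelForSequenceClassification.from_pretrained(
    "path/to/fine_tuned_teacher",
    num_labels=3,
)
```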
@Harry-zzh Can you share the exact command for your best result on the task? Also, can you share the results on the dev set of the GLUE benchmark? You could focus on reproducing the dev-set results first.
Thanks for your reply. I have shown the results on the test set above, and the results on the dev set are as follows:
I did a grid search over the sets of hyper-parameters as I described before, and chose the best checkpoint on the dev set to make predictions on the test set. An example of my command on the MNLI dataset is:
Also, @JetRunner said your approach initializes the student with pretrained BERT (well-read student), while @MichaelZhouwang said it initializes with the fine-tuned teacher, so I am a bit confused. Looking forward to your reply; I would be grateful if you could offer the exact experiment settings for each dataset.
Hi, that was my mistake: the student is initialized from pretrained BERT (well-read student), but initializing it from a fine-tuned teacher should achieve similar performance. Some concrete suggestions (see the sketch below):
- First, change --num_held_batches from 0 to something like 1, 2, or 4; this introduces randomness into teacher training and also speeds up training.
- For MNLI, use larger warmup steps (1000/1500/2000), a larger alpha (0.6/0.7), a lower temperature (2/3), a smaller logging_rounds (e.g., 200), and a larger teacher learning rate (e.g., 5e-6). You may also need to add some regularization such as weight_decay. You should be able to get above 83.5 on MNLI without much difficulty.
- For smaller datasets such as MRPC, an (effective) batch size of 64 is certainly too large, and you should carefully tune the warmup steps together with num_train_epochs.
Unfortunately, we cannot offer the exact experiment settings for each dataset because we no longer have them; the notes above are the tips we can offer.
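Gathering those MNLI suggestions into one place, a minimal sketch. Only num_held_batches, logging_rounds, weight_decay, and num_train_epochs appear verbatim in the thread; the remaining key names, the weight_decay value, and the script name are assumptions, not the repo's verified CLI.

```python
# Hypothetical MNLI settings assembled from the tips above; the key names
# mirror the thread's wording and may not match the actual CLI flags.
mnli_config = {
    "num_held_batches": 2,         # 1/2/4 suggested; adds randomness to teacher training
    "warmup_steps": 1500,          # 1000/1500/2000 suggested for MNLI
    "alpha": 0.7,                  # 0.6/0.7 suggested (KD loss weight)
    "temperature": 2,              # 2/3 suggested
    "logging_rounds": 200,         # smaller value suggested
    "teacher_learning_rate": 5e-6, # larger teacher lr suggested
    "weight_decay": 0.01,          # "some regularization"; this value is a guess
}

# Render the settings as command-line flags (script name is hypothetical).
flags = " ".join(f"--{k} {v}" for k, v in mnli_config.items())
print(f"python run_glue_distillation.py --task_name mnli {flags}")
```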
For further questions, you can send me an email with your WeChat ID at [email protected], so that I can offer further guidance and help more promptly and conveniently.
Thanks, I will give it a try.
@Harry-zzh Hi, have you been able to reproduce the results reported in the paper?
@Harry-zzh @Hakeyi Hi! Were you able to reproduce the results? If yes, is it possible to share your findings? Thanks a lot!
Sorry for the late reply. Unfortunately, I was not able to reproduce the results.