Hard to reproduce the results of GLUE benchmark #5

Thanks for your excellent work.
I tried a grid search over the settings described in your paper and code, but it is still hard for me to reproduce the GLUE benchmark results. My experiment results on both the dev set and the test set are about 3% lower than yours.
I would be very grateful if you could offer the exact experiment settings for each dataset, or code that can reproduce the results on the GLUE benchmark.
Looking forward to your reply, thank you!
Same reproduction problem here: I cannot get results as good as those published in the paper. We would really appreciate it if you could share the experiment settings and clean code.
Hi @Harry-zzh, thanks for your interest in our work! Just to confirm: 3% lower than reported means in absolute terms, right? That would be lower than all the baselines in Table 1, even vanilla KD. @MichaelZhouwang and I will take a closer look, and it would be great if you could share the exact command you used.
Hi @Harry-zzh. First, could you please share which dataset you conducted your experiments on, along with the exact command for your best result on that task? You may also want to check the following points. Note that if it is a small dataset, a 3% variation may indeed come from different random seeds; a quick sanity check is sketched below.
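On the random-seed point, a minimal sanity check in plain Python (the scores below are made-up placeholders, not numbers from this repo): re-run one configuration with a few seeds and compare the spread to the 3% gap.

```python
# Hypothetical dev-set scores from re-running the same configuration
# with different random seeds; replace with your own measurements.
from statistics import mean, stdev

scores = {13: 0.861, 42: 0.874, 87: 0.848, 100: 0.869}
vals = list(scores.values())
print(f"mean={mean(vals):.3f}, stdev={stdev(vals):.3f}")
# If the stdev is comparable to the gap (~0.03), seeds alone may explain it.
```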
Thank you for your reply. @JetRunner @MichaelZhouwang
As you can see, almost all of my results on the test set are 3% lower than your reported results. I can reproduce the KD results listed in [1], but your numbers are significantly higher than theirs, and those I cannot reproduce.
And my teacher achieves even better performance than the results you reported.
References:
@Harry-zzh Thanks for the info. Is this on the test set (i.e., the GLUE server) or the validation set? If it is the test set, could you please also provide the results on the development set? @MichaelZhouwang could you give it a look?
By the way, in the NLP experiments, the students in our implementation of both KD and our approach are initialized with pretrained BERT (well-read student) rather than with the fine-tuned teacher. That is probably why the vanilla KD results we report are significantly higher. (See the caption under Table 1.)
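For concreteness, here is a minimal sketch of the two initializations being contrasted, using the Hugging Face transformers API; the checkpoint names are illustrative, not this repo's actual configuration.

```python
from transformers import AutoModelForSequenceClassification

# Initialization described above: the student starts from a compact
# pretrained BERT ("well-read student"); this checkpoint is one example.
student = AutoModelForSequenceClassification.from_pretrained(
    "google/bert_uncased_L-6_H-768_A-12",
    num_labels=3,  # e.g., MNLI has three labels
)

# The alternative being ruled out: initializing from a fine-tuned teacher
# checkpoint (hypothetical path; for a smaller student this would also
# require selecting a subset of the teacher's layers).
student_alt = AutoModelForSequenceClassification.from_pretrained(
    "path/to/fine_tuned_teacher",
    num_labels=3,
)
```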
@Harry-zzh Can you share the exact command for your best result on the task? Also, can you share the results on the dev set of the GLUE benchmark? You could focus on reproducing the dev-set results first.
Thanks for your reply. I have shown the results on the test set above, and the results on the dev set are as follows:
I did a grid search over the sets of hyper-parameters as I described before, and chose the best checkpoint on the dev set to make predictions on the test set. An example of my command on the MNLI dataset is:
Also, @JetRunner said your approach initializes the student with pretrained BERT (well-read student), while @MichaelZhouwang said it initializes with the fine-tuned teacher, so I am a bit confused. Looking forward to your reply; I would be grateful if you could offer the exact experiment settings for each dataset.
Hi, that was my mistake: the student is initialized from pretrained BERT (well-read student), but initializing it from a fine-tuned teacher should achieve similar performance. Some concrete suggestions (see the sketch below):
- First, change --num_held_batches from 0 to something like 1, 2, or 4; this introduces randomness into teacher training and also speeds up training.
- For MNLI, use larger warmup steps (1000/1500/2000), a larger alpha (0.6/0.7), a lower temperature (2/3), a smaller logging_rounds (e.g., 200), and a larger teacher learning rate (e.g., 5e-6). You may also need to add some regularization such as weight_decay. You should be able to get above 83.5 on MNLI without much difficulty.
- For smaller datasets such as MRPC, an (effective) batch size of 64 is certainly too large, and you should carefully tune the warmup steps together with num_train_epochs.
Unfortunately, we cannot offer the exact experiment settings for each dataset because we no longer have them; the notes above are the tips we can offer.
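Gathering those MNLI suggestions into one place, a minimal sketch. Only num_held_batches, logging_rounds, weight_decay, and num_train_epochs appear verbatim in the thread; the remaining key names, the weight_decay value, and the script name are assumptions, not the repo's verified CLI.

```python
# Hypothetical MNLI settings assembled from the tips above; the key names
# mirror the thread's wording and may not match the actual CLI flags.
mnli_config = {
    "num_held_batches": 2,         # 1/2/4 suggested; adds randomness to teacher training
    "warmup_steps": 1500,          # 1000/1500/2000 suggested for MNLI
    "alpha": 0.7,                  # 0.6/0.7 suggested (KD loss weight)
    "temperature": 2,              # 2/3 suggested
    "logging_rounds": 200,         # smaller value suggested
    "teacher_learning_rate": 5e-6, # larger teacher lr suggested
    "weight_decay": 0.01,          # "some regularization"; this value is a guess
}

# Render the settings as command-line flags (script name is hypothetical).
flags = " ".join(f"--{k} {v}" for k, v in mnli_config.items())
print(f"python run_glue_distillation.py --task_name mnli {flags}")
```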
For further questions, you can send me an email with your WeChat ID at [email protected], so that I can offer further guidance and help more promptly and conveniently.
Thanks, I will give it a try.
@Harry-zzh Hi, have you been able to reproduce the results reported in the paper?
@Harry-zzh @Hakeyi Hi! Were you able to reproduce the results? If yes, is it possible to share your findings? Thanks a lot!
Sorry for the late reply. Unfortunately, I was not able to reproduce the results.