--eval_run_autoregressive is slow and wrong #407
@lukaszkaiser @martinpopel
So it's true that with and without the flag you get different eval metrics, but that's expected: the model uses its own output at each timestep as the input to the next timestep, which will certainly be worse than the default behavior of using the ground truth as the inputs.
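A minimal sketch of the two evaluation modes being contrasted here (this is not T2T code; `model_step` is a hypothetical function that returns the next-token prediction given the source and a target prefix):

```python
def eval_teacher_forced(model_step, source, targets):
    """Default eval: the ground-truth prefix is fed at every timestep."""
    predictions = []
    for t in range(len(targets)):
        predictions.append(model_step(source, targets[:t]))
    return predictions


def eval_autoregressive(model_step, source, target_len):
    """--eval_run_autoregressive: the model's own outputs feed the next step,
    so early mistakes compound and metrics are expected to be lower."""
    predictions = []
    for t in range(target_len):
        predictions.append(model_step(source, predictions))
    return predictions
```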
@rsepassi we do understand that it is different. But we are saying that the BLEU reported in autoregressive mode is simply wrong: far too low to be plausible.
Ah, ok. Yeah, this may be a bug.
I found the same problem in my Transformer model.
approx_bleu (with or without eval_run_autoregressive) is not expected to give the same value as real BLEU, because approx_bleu is computed on subword ids, while for real BLEU we need to convert subwords into words, detokenize, and then tokenize using a BLEU-compatible tokenization and possibly lowercase; see #436.
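For illustration only (this is not the actual T2T evaluation code; `subword_encoder` and the use of sacrebleu are assumptions), the post-processing needed for a real BLEU score looks roughly like:

```python
import sacrebleu  # assumed here only to illustrate a BLEU-compatible tokenization

def real_bleu(subword_encoder, hypothesis_ids, reference_sentences):
    # approx_bleu is computed directly on the subword ids;
    # real BLEU first converts ids back to detokenized text ...
    hypotheses = [subword_encoder.decode(ids) for ids in hypothesis_ids]
    # ... and then scores with a standard, BLEU-compatible tokenization
    # (sacrebleu tokenizes internally; lowercasing is optional).
    return sacrebleu.corpus_bleu(hypotheses, [reference_sentences]).score
```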
Has there been any progress on fixing this?
A quick question on this one: if both the input and target modalities are REAL, does specifying --eval_run_autoregressive (during the train-and-evaluate schedule) actually have any effect, i.e. avoid teacher forcing during evaluation? It does not appear to have any effect for me, since the evaluation RMSE is as low as before and the evaluation runs as fast as before.
@rsepassi do you happen to have any idea about the --eval_run_autoregressive behaviour during evaluation for REAL modality datasets? Thanks!
When evaluating a fully trained model (translate_ende_wmt32k) in the internal evaluation (eval_steps=68, batch_size=1500), I get approx_bleu=0.41 and the evaluation takes 16 seconds.
When evaluating the same model with --eval_run_autoregressive, I get approx_bleu=0.10 and the evaluation takes 750 seconds. Decoding the dev set (3000 sentences, containing 68*1500 subwords) takes 260 seconds (including checkpoint loading and beam search of size 4, which is not included in the times above).
As suggested by @vince62s and @lukaszkaiser on Gitter, the slowdown is because last_position_only=false by default in eval_autoregressive, even with SymbolModality. It is unclear what causes the lower score.
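A rough back-of-the-envelope sketch of why that default is slow, assuming last_position_only means computing logits only for the newest position at each decode step (the function below is purely illustrative, not T2T code):

```python
def decode_positions_evaluated(target_len, last_position_only):
    """Count how many positions get scored over a full greedy decode."""
    total = 0
    for t in range(1, target_len + 1):
        # With last_position_only=False, step t re-scores all t positions,
        # so total work grows quadratically with target length.
        total += 1 if last_position_only else t
    return total

print(decode_positions_evaluated(50, True))   # 50   (linear)
print(decode_positions_evaluated(50, False))  # 1275 (quadratic)
```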