This repository has been archived by the owner on Jul 7, 2023. It is now read-only.

--eval_run_autoregressive is slow and wrong #407

Open
martinpopel opened this issue Nov 8, 2017 · 9 comments
@martinpopel (Contributor)

When evaluating a fully trained model (translate_ende_wmt32k) in the internal evaluation (eval_steps=68, batch_size=1500), I get approx_bleu=0.41 and the evaluation takes 16 seconds.
When evaluating the same model with --eval_run_autoregressive, I get approx_bleu=0.10 and the evaluation takes 750 seconds.
Decoding the dev set (3000 sentences, containing 68*1500 subwords) takes 260 seconds (this includes checkpoint loading and beam search of size 4, neither of which is included in the times above).

As suggested by @vince62s and @lukaszkaiser on Gitter, the slowdown occurs because last_position_only=false by default in eval_autoregressive, even with SymbolModality.
It is unclear what causes the lower score.
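
To make the cost concrete, here is a minimal sketch (not T2T code; `model` stands for a hypothetical function mapping a prefix of token ids to per-position logits) of a greedy autoregressive loop. Without last_position_only, every step computes logits for the entire prefix even though only the last position's logits are consumed, which is where the slowdown comes from:

```python
import numpy as np

def greedy_decode(model, start_id, eos_id, max_len=50):
    """Greedy autoregressive decoding with a hypothetical `model`."""
    prefix = [start_id]
    for _ in range(max_len):
        logits = model(np.array(prefix))    # recomputes logits for ALL positions
        next_id = int(logits[-1].argmax())  # ...but only the last row is used
        prefix.append(next_id)
        if next_id == eos_id:
            break
    return prefix
```

With last_position_only=true the model can return just the final position's logits (caching earlier states), turning each step from a full-prefix recomputation into a single-position update.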

@vince62s (Contributor)

@lukaszkaiser @martinpopel
after the 1.2.8 upgrade, eval_autoregressive is no longer slow, but it is still very wrong for approx_bleu.

@rsepassi (Contributor)

It's true that you get different eval metrics with and without the flag, but that's expected: with the flag, the output at each timestep is used as the input to the next timestep, which will certainly be worse than using the ground truth as the inputs, which is the default behavior.
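
To illustrate the distinction, a hedged sketch (again with a hypothetical `model` mapping a token-id prefix to per-position logits) of what each timestep sees in the two modes:

```python
import numpy as np

def teacher_forced_inputs(target_ids):
    # Default eval: the input at timestep t is the gold prefix target_ids[:t],
    # regardless of what the model predicted earlier.
    return [target_ids[:t] for t in range(1, len(target_ids))]

def autoregressive_inputs(model, start_id, steps):
    # --eval_run_autoregressive: the input at timestep t is the model's own
    # running prefix, so an early mistake degrades every later timestep.
    prefix, inputs = [start_id], []
    for _ in range(steps):
        inputs.append(list(prefix))
        logits = model(np.array(prefix))
        prefix.append(int(logits[-1].argmax()))
    return inputs
```

Some metric drop is therefore expected with the flag; the dispute below is about the size of the drop.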

@vince62s (Contributor)

@rsepassi we understand that it is different. We are simply saying that the BLEU reported in autoregressive mode is wrong: it is much too low to be plausible.

@rsepassi rsepassi added bug and removed question labels Nov 14, 2017
@rsepassi (Contributor)

Ah, ok. Yeah, this may be a bug.

@rsepassi rsepassi reopened this Nov 14, 2017
@yuimo commented Nov 30, 2017

I found the same problem in my Transformer model, but when I decode the test set, the BLEU score is as high as in the eval process without the --eval_run_autoregressive flag. Any ideas about this?

@martinpopel (Contributor, Author)

approx_bleu (with or without eval_run_autoregressive) is not expected to give the same value as real BLEU: approx_bleu is computed on subword ids, whereas for real BLEU we need to convert subwords into words, detokenize, and then tokenize with a BLEU-compatible tokenization (and possibly lowercase); see #436.
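
As a minimal sketch of the real-BLEU pipeline just described (assumptions: `subword_encoder` is any encoder with a `decode(ids) -> str` method, such as T2T's SubwordTextEncoder, and sacrebleu supplies the BLEU-compatible tokenization):

```python
import sacrebleu  # any standard BLEU scorer with its own tokenization would do

def real_bleu(subword_encoder, hyp_ids, ref_ids):
    # approx_bleu matches n-grams directly over the raw subword id sequences.
    # Real BLEU needs text: first decode the ids back to detokenized strings...
    hyp_text = subword_encoder.decode(hyp_ids)
    ref_text = subword_encoder.decode(ref_ids)
    # ...then score with a BLEU-compatible tokenization (and possibly
    # lowercasing); sacrebleu applies its tokenizer internally.
    return sacrebleu.corpus_bleu([hyp_text], [[ref_text]]).score
```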

@lnabergall

Has there been any progress on fixing this?

@chenwuperth

A quick question on this one: if both the input and target modalities are REAL, does specifying --eval_run_autoregressive (during the train-and-evaluate schedule) actually have any effect (i.e., avoid teacher forcing during evaluation)?

It does not appear to have any effect for me: the evaluation RMSE is as low as before, and evaluation is as fast as before.

@chenwuperth

@rsepassi do you happen to have any idea about the --eval_run_autoregressive behaviour during evaluation for REAL-modality datasets? Thanks!
