This repository has been archived by the owner on Jul 7, 2023. It is now read-only.

--eval_run_autoregressive is slow and wrong #407

Open
martinpopel opened this issue Nov 8, 2017 · 9 comments
@martinpopel (Contributor)

When evaluating a fully trained model (translate_ende_wmt32k) in the internal evaluation (eval_steps=68, batch_size=1500), I get approx_bleu=0.41 and the evaluation takes 16 seconds.
When evaluating the same model with --eval_run_autoregressive, I get approx_bleu=0.10 and the evaluation takes 750 seconds.
Decoding the dev set (3000 sentences, containing 68*1500 subwords) takes 260 seconds (this includes checkpoint loading and beam search of size 4, neither of which is included in the times above).

As suggested by @vince62s and @lukaszkaiser on Gitter, the slowdown occurs because last_position_only=false by default in eval_autoregressive, even with SymbolModality.
It is unclear what causes the lower score.
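
To make the cost concrete, here is a minimal sketch (not T2T code; `model` stands for a hypothetical function mapping a prefix of token ids to per-position logits) of a greedy autoregressive loop. Without last_position_only, every step computes logits for the entire prefix even though only the last position's logits are consumed, which is where the slowdown comes from:

```python
import numpy as np

def greedy_decode(model, start_id, eos_id, max_len=50):
    """Greedy autoregressive decoding with a hypothetical `model`."""
    prefix = [start_id]
    for _ in range(max_len):
        logits = model(np.array(prefix))    # recomputes logits for ALL positions
        next_id = int(logits[-1].argmax())  # ...but only the last row is used
        prefix.append(next_id)
        if next_id == eos_id:
            break
    return prefix
```

With last_position_only=true the model can return just the final position's logits (caching earlier states), turning each step from a full-prefix recomputation into a single-position update.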

@vince62s (Contributor)

@lukaszkaiser @martinpopel
after the 1.2.8 upgrade, eval_autoregressive is no longer slow, but it is still very wrong for approx_bleu.

@rsepassi (Contributor)

It's true that you get different eval metrics with and without the flag, but that's expected: with the flag, the output at each timestep is used as the input to the next timestep, which will certainly be worse than using the ground truth as the inputs, which is the default behavior.
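
To illustrate the distinction, a hedged sketch (again with a hypothetical `model` mapping a token-id prefix to per-position logits) of what each timestep sees in the two modes:

```python
import numpy as np

def teacher_forced_inputs(target_ids):
    # Default eval: the input at timestep t is the gold prefix target_ids[:t],
    # regardless of what the model predicted earlier.
    return [target_ids[:t] for t in range(1, len(target_ids))]

def autoregressive_inputs(model, start_id, steps):
    # --eval_run_autoregressive: the input at timestep t is the model's own
    # running prefix, so an early mistake degrades every later timestep.
    prefix, inputs = [start_id], []
    for _ in range(steps):
        inputs.append(list(prefix))
        logits = model(np.array(prefix))
        prefix.append(int(logits[-1].argmax()))
    return inputs
```

Some metric drop is therefore expected with the flag; the dispute below is about the size of the drop.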

@vince62s (Contributor)

@rsepassi we understand that it is different. We are simply saying that the BLEU reported in autoregressive mode is wrong: it is much too low to be plausible.

@rsepassi rsepassi added bug and removed question labels Nov 14, 2017
@rsepassi (Contributor)

Ah, ok. Yeah, this may be a bug.

@rsepassi rsepassi reopened this Nov 14, 2017
@yuimo commented Nov 30, 2017

I found the same problem in my Transformer model, but when I decode the test set, the BLEU score is as high as in the eval process without the --eval_run_autoregressive flag. Any ideas about this?

@martinpopel (Contributor, Author)

approx_bleu (with or without eval_run_autoregressive) is not expected to give the same value as real BLEU: approx_bleu is computed on subword ids, whereas for real BLEU we need to convert subwords into words, detokenize, and then tokenize with a BLEU-compatible tokenization (and possibly lowercase); see #436.
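
As a minimal sketch of the real-BLEU pipeline just described (assumptions: `subword_encoder` is any encoder with a `decode(ids) -> str` method, such as T2T's SubwordTextEncoder, and sacrebleu supplies the BLEU-compatible tokenization):

```python
import sacrebleu  # any standard BLEU scorer with its own tokenization would do

def real_bleu(subword_encoder, hyp_ids, ref_ids):
    # approx_bleu matches n-grams directly over the raw subword id sequences.
    # Real BLEU needs text: first decode the ids back to detokenized strings...
    hyp_text = subword_encoder.decode(hyp_ids)
    ref_text = subword_encoder.decode(ref_ids)
    # ...then score with a BLEU-compatible tokenization (and possibly
    # lowercasing); sacrebleu applies its tokenizer internally.
    return sacrebleu.corpus_bleu([hyp_text], [[ref_text]]).score
```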

@lnabergall

Has there been any progress on fixing this?

@chenwuperth

A quick question on this one: if both the input and target modalities are REAL, does specifying --eval_run_autoregressive (during the train-and-evaluate schedule) actually have any effect (i.e., avoid teacher forcing during evaluation)?

It does not appear to have any effect for me: the evaluation RMSE is as low as before, and evaluation is as fast as before.

@chenwuperth

@rsepassi do you happen to have any idea about the --eval_run_autoregressive behaviour during evaluation for REAL-modality datasets? Thanks!
