T2T 1.4.1 transformer beam search result different with 1.3.2 #525
I have also noticed a huge BLEU drop between T2T versions 1.2.9 and 1.4.2.
I can confirm this. I used v1.1.7 and got a BLEU of 47.66 on my ASPEC Chinese-Japanese task, whereas with v1.4 I get 36.87. And as @martinpopel says, the training diverges after a few thousand iterations. It's as if it only looks at a fraction of the data shards and overfits on them. AFAIK in the new version the default number of shards is 100, and I suspect that the current code only reads 10 of those shards and overfits on them. Has anyone else observed this problem?
I found out the bug was introduced in T2T 1.3.0. See the graph below, where the upper curve is v1.2.9 and the lower is v1.3.0; all hyperparams are exactly the same.
@martinpopel GG |
I realize the bug we are discussing now is a different one from the one in the title of this issue and the first post, which is about a v1.3.2 vs v1.4.1 discrepancy.
Yeah, not good that the beam search deteriorated. Not sure what the issue might be, though. Did you use the exact same checkpoint? If you retrained, then the issue that @martinpopel found may be the culprit. If not, then that's a bit mysterious; it would probably mean that some logic in the decode path changed.
I have trained a transformer translation model with t2t 1.3.2.
Now I want to return every beam search result and its score, so I updated my t2t version to 1.4.1. I used the same model, but got different results in some cases, and the overall BLEU decreases.
Can someone help me?
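For context on the "return every beam search result and score" part: recent T2T versions expose decode hyperparameters for this (flags along the lines of `return_beams` and `write_beam_scores`; check `decoding.py` in your installed version, since names may differ between releases). Independently of T2T's implementation, the idea of keeping all beams with their cumulative log-probability scores can be sketched like this. All names here are illustrative, and the toy model is made up for the example:

```python
# Minimal beam-search sketch (NOT the T2T implementation): keep every
# hypothesis together with its cumulative log-probability, and return
# all of them sorted by score instead of only the best one.
import math

def beam_search(next_log_probs, start, eos, beam_size, max_len):
    """next_log_probs(seq) -> {token: log_prob} for the next step."""
    beams = [([start], 0.0)]          # (sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            # Hypotheses that produced EOS are done; others keep expanding.
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    # Return ALL hypotheses with scores, best first.
    return sorted(finished + beams, key=lambda c: c[1], reverse=True)

# Toy model: after <s>, emit "a" (p=0.7) or "b" (p=0.3), then always </s>.
def toy(seq):
    if len(seq) == 1:
        return {"a": math.log(0.7), "b": math.log(0.3)}
    return {"</s>": 0.0}

results = beam_search(toy, "<s>", "</s>", beam_size=2, max_len=3)
for seq, score in results:
    print(seq, round(score, 3))
# prints:
# ['<s>', 'a', '</s>'] -0.357
# ['<s>', 'b', '</s>'] -1.204
```

If the regression you see is only in decoding (same checkpoint, different T2T version), comparing the per-beam scores between versions like this can show whether the search itself changed or only the final selection did.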