Does using ensemble models improve results? #473
I am trying the code with `trainer_utils_test.py` and will paste my conclusions later.
I am using transformer. When using avg_checkpoints, my results did not improve a lot. As @edunov mentioned, it gave about +1 BLEU, but my experiments only get a 0.1 improvement. I am using the same script (avg_checkpoints, with the last five checkpoints). Can @martinpopel give me some useful suggestions?
@liesun1994: Note that this issue is about ensembling (several models decoding in parallel, voting on each token), not about checkpoint averaging. As for checkpoint averaging, my experience is that it helps more in the early training stages, but even after weeks of training it still helps (about 0.3 BLEU on average). It depends a lot on how frequent your checkpoints are - I prefer 1-hour intervals, as I get better averaging results than with the default 10-minute intervals. I usually use the last 8 or 16 checkpoints. As you can see in the following graph, no averaging (orange curve) is almost always worse than 8-checkpoint (red) or 16-checkpoint (blue) averaging, although the size of the improvement changes because the no-avg curve fluctuates a lot (while the avg curves are more stable, as expected):
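(For reference, checkpoint averaging is conceptually simple: load the raw variable values from the last N checkpoints, average them element-wise, and save the result as a new checkpoint. The sketch below illustrates that idea in plain TensorFlow 1.x; the checkpoint paths are placeholders, and the actual avg_checkpoints utility shipped with t2t handles flags and edge cases such as the global step more carefully.)

```python
import numpy as np
import tensorflow as tf  # TF 1.x API, matching the t2t 1.2.x era

# Placeholder paths: the last N checkpoints written during training.
checkpoints = ["model.ckpt-190000", "model.ckpt-195000", "model.ckpt-200000"]

# Sum every variable across the checkpoints, remembering the original dtypes.
accum, dtypes = {}, {}
for ckpt in checkpoints:
    for name, _ in tf.train.list_variables(ckpt):
        value = tf.train.load_variable(ckpt, name)
        dtypes[name] = value.dtype
        accum[name] = accum.get(name, 0) + value.astype(np.float64)

# Rebuild the variables with averaged values and save them as a new checkpoint.
tf.reset_default_graph()
averaged = [
    tf.Variable((accum[name] / len(checkpoints)).astype(dtypes[name]), name=name)
    for name in accum
]
saver = tf.train.Saver(averaged)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, "averaged/model.ckpt")
```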
@martinpopel Wow, I did not know the difference between ensembling and checkpoint averaging until now 😆. The paper used the last five checkpoints and got 27.30 BLEU on newstest2014. I am using the code you mentioned in #458 (BPE, etc.), and my latest result is 26.47 on newstest2014 (single model). The gap is smaller, but it still cannot reach 27.30 BLEU. With the modified code, our baseline is lower than the one reported in the paper. Have you achieved 27.3 BLEU on newstest2014? Really, thanks!
@liesun1994 @martinpopel |
@weitaizhang: If you mean batch_size within training, then yes, it affects the results as discussed e.g. in #444 (comment) |
I think my code is v1.2.8, checked out on Nov. 13, and yes, I mean that with different decode_batch_size values the decoded results are not exactly the same.
hi, guys. |
Wow, it would be nice to send a PR when your machine works.
Hi @weitaizhang, great work! Is the
@weitaizhang Could you kindly share the code for ensembling models? Thanks!
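(For reference, the basic idea of ensemble decoding is that several independently trained models decode the same source in lockstep and their next-token probability distributions are averaged at every step. The sketch below only illustrates that idea and is not the tensor2tensor implementation; the `models` objects and their `next_token_probs` method are hypothetical placeholders, and a real system would use beam search rather than greedy decoding.)

```python
import numpy as np

def ensemble_greedy_decode(models, src_ids, bos_id, eos_id, max_len=200):
    """Greedy decoding that averages the models' next-token probabilities.

    `models` is a list of hypothetical model objects exposing
    next_token_probs(src_ids, prefix_ids) -> 1-D numpy array over the vocabulary.
    """
    prefix = [bos_id]
    for _ in range(max_len):
        # Each model scores the next token given the same decoded prefix.
        probs = [m.next_token_probs(src_ids, prefix) for m in models]
        avg = np.mean(probs, axis=0)    # the models "vote" by averaging
        next_id = int(np.argmax(avg))   # greedy pick; beam search in practice
        prefix.append(next_id)
        if next_id == eos_id:
            break
    return prefix[1:]  # drop the BOS token
```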
Hi guys,
did you try ensembling models in translation? Does it improve results or not, and by how much? I would appreciate it if someone could share their experiment results.