Does using ensemble models improve results? #473
I am trying the code with `trainer_utils_test.py` and will paste my conclusions later.
I am using transformer. When using avg_checkpoints, my results did not improve a lot. As @edunov mentioned, it gave about +1 BLEU, but my experiments only get a 0.1 improvement. I am using the same script (avg_checkpoints, with the last five checkpoints). Can @martinpopel give me some useful suggestions?
@liesun1994: Note that this issue is about ensembling (several models decoding in parallel, voting on each token), not about checkpoint averaging. As for checkpoint averaging, my experience is that it helps more in the early training stages, but even after weeks of training it still helps (about 0.3 BLEU on average). It depends a lot on how frequent your checkpoints are - I prefer 1-hour intervals, as I get better averaging results than with the default 10-minute intervals. I usually use the last 8 or 16 checkpoints. As you can see in the following graph, no averaging (orange curve) is almost always worse than 8-checkpoint (red) or 16-checkpoint (blue) averaging, although the size of the improvement changes because the no-avg curve fluctuates a lot (while the avg curves are more stable, as expected):
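(For reference, checkpoint averaging is conceptually simple: load the raw variable values from the last N checkpoints, average them element-wise, and save the result as a new checkpoint. The sketch below illustrates that idea in plain TensorFlow 1.x; the checkpoint paths are placeholders, and the actual avg_checkpoints utility shipped with t2t handles flags and edge cases such as the global step more carefully.)

```python
import numpy as np
import tensorflow as tf  # TF 1.x API, matching the t2t 1.2.x era

# Placeholder paths: the last N checkpoints written during training.
checkpoints = ["model.ckpt-190000", "model.ckpt-195000", "model.ckpt-200000"]

# Sum every variable across the checkpoints, remembering the original dtypes.
accum, dtypes = {}, {}
for ckpt in checkpoints:
    for name, _ in tf.train.list_variables(ckpt):
        value = tf.train.load_variable(ckpt, name)
        dtypes[name] = value.dtype
        accum[name] = accum.get(name, 0) + value.astype(np.float64)

# Rebuild the variables with averaged values and save them as a new checkpoint.
tf.reset_default_graph()
averaged = [
    tf.Variable((accum[name] / len(checkpoints)).astype(dtypes[name]), name=name)
    for name in accum
]
saver = tf.train.Saver(averaged)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, "averaged/model.ckpt")
```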
@martinpopel Wow, I did not know the difference between ensembling and checkpoint averaging until now 😆. The paper used the last five checkpoints and got 27.30 BLEU on newstest2014. I am using the code you mentioned in #458 (BPE, etc.), and my latest result is 26.47 on newstest2014 (single model). The gap is smaller, but it still cannot reach 27.30 BLEU. With the modified code, our baseline is lower than the one reported in the paper. Have you achieved 27.3 BLEU on newstest2014? Really, thanks!
@liesun1994 @martinpopel |
@weitaizhang: If you mean batch_size within training, then yes, it affects the results as discussed e.g. in #444 (comment) |
I think my code is v1.2.8, checked out on Nov. 13, and yes, I mean that with different decode_batch_size values the decoded results are not exactly the same.
hi, guys. |
Wow, it would be nice to send a PR when your machine works.
Hi @weitaizhang, great work! Is the
@weitaizhang Could you kindly share the code for ensembling models? Thanks!
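(For reference, the basic idea of ensemble decoding is that several independently trained models decode the same source in lockstep and their next-token probability distributions are averaged at every step. The sketch below only illustrates that idea and is not the tensor2tensor implementation; the `models` objects and their `next_token_probs` method are hypothetical placeholders, and a real system would use beam search rather than greedy decoding.)

```python
import numpy as np

def ensemble_greedy_decode(models, src_ids, bos_id, eos_id, max_len=200):
    """Greedy decoding that averages the models' next-token probabilities.

    `models` is a list of hypothetical model objects exposing
    next_token_probs(src_ids, prefix_ids) -> 1-D numpy array over the vocabulary.
    """
    prefix = [bos_id]
    for _ in range(max_len):
        # Each model scores the next token given the same decoded prefix.
        probs = [m.next_token_probs(src_ids, prefix) for m in models]
        avg = np.mean(probs, axis=0)    # the models "vote" by averaging
        next_id = int(np.argmax(avg))   # greedy pick; beam search in practice
        prefix.append(next_id)
        if next_id == eos_id:
            break
    return prefix[1:]  # drop the BOS token
```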
Hi guys,
did you try ensembling models in translation? Does it improve results or not, and by how much? I would appreciate it if someone could share their experiment results.