During continuous training, the program automatically saves a checkpoint immediately after it restores the latest one, resulting in two checkpoint models only one step apart, e.g., model.ckpt-1000 and model.ckpt-1001. I did find differences between these two models when testing, i.e., they produce different outputs. May I know why checkpoints are saved in this manner? Is there any special concern?
The issue with saving two checkpoints (e.g. 1000 and 1001) is a duplicate of #495 (comment).
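For reference, the pattern in question can be reproduced with a plain TF1-style training loop. This is just a minimal sketch of the resume-then-save behavior, not tensor2tensor's actual code; `ckpt_dir`, `train_op`, and `global_step` are placeholder names:

```python
# Minimal sketch of the resume-then-save pattern (hypothetical, TF1 API);
# ckpt_dir, train_op, and global_step are placeholder names, not t2t internals.
import os
import tensorflow as tf

def resume_and_save(ckpt_dir, train_op, global_step):
    saver = tf.train.Saver()
    with tf.Session() as sess:
        latest = tf.train.latest_checkpoint(ckpt_dir)  # e.g. model.ckpt-1000
        saver.restore(sess, latest)
        sess.run(train_op)  # one optimization step: global_step becomes 1001
        # Saving immediately after the first step writes model.ckpt-1001,
        # leaving two checkpoints only one step apart.
        saver.save(sess, os.path.join(ckpt_dir, "model.ckpt"),
                   global_step=global_step)
```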
The issue with two "neighboring" checkpoints leading to different results needs more details: is the BLEU difference (assuming you are doing MT) significant? Some difference should be expected even after a single step (otherwise the model would never learn anything), but in most cases it should not be big.
@martinpopel Yes, I trained an MT model. The BLEU difference is tiny for my test data (around 15k sentences), but the translations vary in terms of fluency for some sentences.
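To quantify how much two neighboring checkpoints actually differ, one option is to decode the test set with each checkpoint and score both sets of hypotheses against the references, e.g., with sacreBLEU. A rough sketch, where the file names are assumptions:

```python
# Rough sketch: compare BLEU of translations from two neighboring checkpoints.
# hyp_1000.txt / hyp_1001.txt / refs.txt are hypothetical file names.
import sacrebleu

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

refs = read_lines("refs.txt")
for hyp_file in ("hyp_1000.txt", "hyp_1001.txt"):
    hyps = read_lines(hyp_file)
    bleu = sacrebleu.corpus_bleu(hyps, [refs])
    print(hyp_file, bleu.score)
```

A small corpus-level BLEU gap can still hide noticeable sentence-level fluency differences, which matches the observation above.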