Let's say you train for 10 checkpoints and stop. The metrics file will have info for 10 checkpoints, and the latest params file will be params.00010.
Now suppose you continue training and stop after an additional 10 checkpoints. The metrics file will have info for only 19 checkpoints, even though the latest params file will be params.00020. The lost line is the one corresponding to params.00010, so every line from the 10th on is now off by one: for n ≥ 1, line (9 + n) of the metrics file actually contains information about checkpoint (9 + n + 1).
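To make the offset concrete, here is a small sketch of the line-to-checkpoint mapping described above. The function name and the `first_run_checkpoints` parameter are hypothetical, and it assumes 1-based line numbers and a single interruption after the first run.

```python
def checkpoint_for_line(line_number: int, first_run_checkpoints: int = 10) -> int:
    """Return the checkpoint that a 1-based metrics-file line really describes.

    Hypothetical helper: assumes training was interrupted exactly once, after
    `first_run_checkpoints` checkpoints, and that the line for that checkpoint
    was the one lost on resume.
    """
    if line_number < first_run_checkpoints:
        # Lines written before the lost entry still match their checkpoint.
        return line_number
    # From the lost entry onward, every line is shifted by one.
    return line_number + 1


# Lines 9, 10 and 19 of the 19-line metrics file from the example above.
print([checkpoint_for_line(n) for n in (9, 10, 19)])
```

With the numbers from the example, line 9 still describes checkpoint 9, line 10 describes checkpoint 11, and the last line (19) describes checkpoint 20.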
Hi @tuglat, thanks for reporting this issue! I was able to reproduce it, and it turns out we incorrectly save the training state before adding metrics to it. #860 should fix it.
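For anyone following along, the ordering problem can be reproduced with a toy model. This is only a sketch: `ToyTrainer` and its methods are made-up names for illustration, not the project's actual classes, and it assumes the metrics file is written from a metrics list stored in the saved training state.

```python
import copy


class ToyTrainer:
    """Hypothetical stand-in for a trainer that checkpoints its own state."""

    def __init__(self, save_before_metrics: bool):
        self.save_before_metrics = save_before_metrics
        self.checkpoint = 0
        self.metrics = []        # one entry per checkpoint, i.e. the metrics file
        self.saved_state = None  # what a resumed run would load

    def _save_state(self):
        # Snapshot the checkpoint counter and the metrics collected so far.
        self.saved_state = copy.deepcopy((self.checkpoint, self.metrics))

    def run(self, num_checkpoints: int):
        for _ in range(num_checkpoints):
            self.checkpoint += 1
            entry = {"checkpoint": self.checkpoint}
            if self.save_before_metrics:
                # Buggy order: the snapshot misses the current checkpoint's metrics.
                self._save_state()
                self.metrics.append(entry)
            else:
                # Fixed order: add the metrics first, then snapshot.
                self.metrics.append(entry)
                self._save_state()

    def resume(self) -> "ToyTrainer":
        # Continue from the last saved snapshot, as a restarted process would.
        resumed = ToyTrainer(self.save_before_metrics)
        resumed.checkpoint, resumed.metrics = copy.deepcopy(self.saved_state)
        return resumed


def train_stop_resume(save_before_metrics: bool) -> list:
    """Train 10 checkpoints, stop, resume, train 10 more; return recorded checkpoints."""
    first = ToyTrainer(save_before_metrics)
    first.run(10)
    second = first.resume()
    second.run(10)
    return [m["checkpoint"] for m in second.metrics]


buggy = train_stop_resume(save_before_metrics=True)
fixed = train_stop_resume(save_before_metrics=False)
print(len(buggy), buggy[8:11])
print(len(fixed), fixed[8:11])
```

In the buggy order the resumed run ends up with 19 entries and skips checkpoint 10, which matches the report; saving after the metrics are added keeps all 20.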