Release 2.1.16: metrics file loses a line every time training is stopped and then continued #859

Closed
tuglat opened this issue Aug 26, 2020 · 1 comment
@tuglat
Contributor

tuglat commented Aug 26, 2020

Let's say you train for 10 checkpoints and stop. The metrics file will have info for 10 checkpoints, and the latest params file will be params.00010.

Now, suppose you continue training and stop after an additional 10 checkpoints. The metrics file will have info for only 19 checkpoints, even though the latest params file will be params.00020. The lost line is the one corresponding to params.00010. All of the lines from the 10th on are now incorrect: for n >= 1, line (9 + n) of the metrics file actually contains information about checkpoint (9 + n + 1).

@fhieber
Contributor

fhieber commented Aug 27, 2020

Hi @tuglat, thanks for reporting this issue! I was able to reproduce it, and it turns out we incorrectly save the training state before adding metrics to it. #860 should fix it.
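For anyone reading later, here is a minimal sketch of how that ordering could produce the symptom described above. It is not the actual training code: the `state` dictionary, `run_checkpoints`, and `train_stop_resume` are hypothetical names made up for illustration, and the "saved" copy stands in for whatever is written to disk at each checkpoint.

```python
import copy


def run_checkpoints(state, num_checkpoints, save_before_append):
    """Simulate checkpoints; mutate `state` in place and return the last saved snapshot.

    `state` is a hypothetical stand-in for the training state:
    {"checkpoint": int, "metrics": list of per-checkpoint entries}.
    """
    saved = copy.deepcopy(state)
    for _ in range(num_checkpoints):
        state["checkpoint"] += 1
        entry = {"checkpoint": state["checkpoint"]}
        if save_before_append:
            # Assumed buggy order: the snapshot is taken before this
            # checkpoint's metrics are added, so it lacks the newest entry.
            saved = copy.deepcopy(state)
            state["metrics"].append(entry)
        else:
            # Assumed fixed order: add metrics first, then snapshot.
            state["metrics"].append(entry)
            saved = copy.deepcopy(state)
    return saved


def train_stop_resume(save_before_append):
    """Train 10 checkpoints, stop, resume from the snapshot, train 10 more."""
    state = {"checkpoint": 0, "metrics": []}
    saved = run_checkpoints(state, 10, save_before_append)
    resumed = copy.deepcopy(saved)  # continuing training reloads the snapshot
    run_checkpoints(resumed, 10, save_before_append)
    # The metrics file is assumed to be written from the in-memory list.
    return [m["checkpoint"] for m in resumed["metrics"]]


buggy = train_stop_resume(save_before_append=True)
fixed = train_stop_resume(save_before_append=False)
print(len(buggy), 10 in buggy)  # 19 False
print(len(fixed), 10 in fixed)  # 20 True
```

With the snapshot taken first, the resumed run starts from a metrics list that is missing checkpoint 10, so every later entry sits one line earlier than expected, which matches the 19-line file in the report.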
