Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix iterator when loading from checkpoint #5344

Merged
merged 1 commit into from
Oct 9, 2023
Merged

Conversation

000Justin000
Copy link
Contributor

Before submitting

  • [no] Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
  • [yes] Did you read the contributor guideline?
  • [yes] Did you make sure to update the docs?
  • [yes] Did you write any new necessary tests?

What does this PR do?

Assuming an epoch consist of N batches. When we load a checkpoint from mid epoch, there is a number indicate the number of iterations n has been passed in the epoch. When restarting the training job, after the batches are loaded, n batches are removed from the beginning of the iterator by FrozenBatchSampler. Then the CounteringIterator would add n iteration count in front of the batch, so that the batch count in the epoch is correct, i.e. N-n+n = N.
However, when skip_reminder_batch is set to True, the CounteringIterator would truncate, incorrectly, the batch count from N batches to N-n, causing the epoch to have less number of total batches. As a result, you would see a jump in the fblearner ui on the progress --- not because the iteration counter increase, but rather because the total number of batches decrease.
This diffs would fix the problem above.

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@000Justin000 000Justin000 merged commit c7c478b into main Oct 9, 2023
1 of 5 checks passed
zobadaniel pushed a commit to ZalozbaDev/fairseq that referenced this pull request Mar 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants