Permanent Skipping of Batches with accelerator.skip_first_batches
#1451
Comments
Indeed, the dataloader needs to be changed only at the first epoch. Would you like to suggest a PR with a fix?
Thank you for your response. Since I am not familiar with the accelerate data loading logic, the only solution I can suggest at the moment is for users to retain the original dataloader and use it for later epochs. This might require some changes to the function documentation and the example scripts.
No, the dataloader should be changed for the first epoch only and then go back to the initial dataloader for all subsequent epochs. That is the correct fix and how we use this internally in the Trainer.
If so, please forgive me for not knowing how to implement this.
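For readers landing on this issue, here is a minimal sketch of the pattern described by the maintainer above: use the shortened dataloader only for the resumed epoch, and go back to the original prepared dataloader afterwards. The toy model/data and the `starting_epoch`, `resume_step`, and `num_epochs` names are illustrative and not taken from the example script.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

# Toy model and data, just to make the loop runnable.
dataset = TensorDataset(torch.randn(64, 4), torch.randn(64, 1))
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, DataLoader(dataset, batch_size=8)
)

# Illustrative values: pretend we resumed inside epoch 1 after 3 batches.
starting_epoch, resume_step, num_epochs = 1, 3, 4

for epoch in range(starting_epoch, num_epochs):
    if epoch == starting_epoch and resume_step is not None:
        # Only the resumed epoch iterates over the shortened dataloader.
        active_dataloader = accelerator.skip_first_batches(train_dataloader, resume_step)
    else:
        # Every later epoch goes back to the full, original dataloader.
        active_dataloader = train_dataloader
    for inputs, targets in active_dataloader:
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```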
System Info
Information
Tasks
`no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)

Reproduction
https://github.com/huggingface/accelerate/blob/main/examples/complete_cv_example.py
Expected behavior
When using the `skip_first_batches` function, I discovered that it does not behave as expected. The function is intended to skip a specified number of batches only in the first resumed epoch, but it permanently skips that number of batches from the dataloader. The problem arises when the dataloader is not reinitialized, resulting in a permanently shorter dataloader for all subsequent epochs. The official script provided in the repository (https://github.com/huggingface/accelerate/blob/main/examples/complete_cv_example.py#L208) does not include any step to restore the original dataloader after the first epoch, which is misleading.