Trainer resume from checkpoint: the learning rate is not the same as when retraining; the learning rate is discontinuous #34053
Comments
I'm adding a "Good First Issue" tag; the team has limited bandwidth at the moment, so any PR to help solve this is welcome.
Hey @LBJ6666, could you please share a minimal reproducer so that we can quickly find the issue? Thanks!
@SunMarc, thanks for the response. I wrote a simple example.
Run the training and record the learning rates for 20 steps as saved in the checkpoints.
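As a reference for reading the recorded values, here is a small sketch that pulls the per-step learning rates out of the saved checkpoints. It assumes the Trainer's default checkpoint layout, where each `checkpoint-N` directory contains a `trainer_state.json` with the logged history; the `output` directory name is a placeholder.

```python
import json
from pathlib import Path

# Hypothetical output directory used by the reproducer script.
output_dir = Path("output")

# The last checkpoint's trainer_state.json accumulates the full log history,
# including the learning rate logged at each step (when logging_steps=1).
last_ckpt = max(output_dir.glob("checkpoint-*"), key=lambda p: int(p.name.split("-")[1]))
state = json.loads((last_ckpt / "trainer_state.json").read_text())
for entry in state["log_history"]:
    if "learning_rate" in entry:
        print("step", entry["step"], "lr", entry["learning_rate"])
```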
Please verify my steps of reproduction:
Thanks for testing this @Knight7561! Could you check, @LBJ6666?
Thanks! @Knight7561 @SunMarc
Is there anything that could have been added to make this easier to spot? Maybe some logs?
I think the issue lies with gradient overflow in FP16. After switching to BF16, the optimization process worked normally.
Logs from BF16=True (normal behavior):
Thus, when continuing training with FP16, if gradient overflow occurs, the optimizer step is skipped and so is the learning-rate scheduler step.
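For illustration, here is a minimal, self-contained sketch (not the actual Trainer code) of the fp16 training-step pattern being described: when `GradScaler` detects inf/NaN gradients it skips the optimizer step, and a loop that only advances the scheduler after a successful optimizer step therefore also skips the LR update on those steps. The model, learning rate, and schedule are placeholders.

```python
import torch
from torch import nn

model = nn.Linear(4, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.0, total_iters=20
)
scaler = torch.cuda.amp.GradScaler()

for step in range(20):
    x = torch.randn(8, 4, device="cuda")
    with torch.autocast("cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean()

    optimizer.zero_grad()
    scaler.scale(loss).backward()
    scale_before = scaler.get_scale()
    scaler.step(optimizer)   # internally skipped if inf/NaN gradients were found
    scaler.update()          # the loss scale shrinks after an overflow
    overflowed = scaler.get_scale() < scale_before
    if not overflowed:
        scheduler.step()     # so the LR schedule effectively "pauses" on overflow steps
    print(step, scheduler.get_last_lr(), "overflow" if overflowed else "")
```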
Hello, I tried reproducing the issue with fp16 and the undesired behaviour was present. The code is the same as posted before, except for a new part I added to set the seed to zero. I first ran the 20 steps while saving results, then deleted the checkpoints from step 12 onwards and re-ran with the resume flag set so that training continued from checkpoint 11.

As reported by @LBJ6666, when there is gradient overflow, the optimizer does not run and, as a consequence, the LR scheduler does not get updated. However, when reproducing, the issue was a little worse than that in my case: when running the 20 steps without interruption, gradient overflow did not occur on my machine, but it did occur when I loaded checkpoint 11. After some inspection, I noticed that there was an object from a class called `GradScaler` whose state was not being saved and restored with the checkpoint, so its dynamic loss scale was lost on resume.

Still, there are some nuances to saving and reloading that I'm not sure how to handle, such as when using DeepSpeed, SageMaker or XLA. I just copied what was done for the scheduler for now and left the code as a draft so others can look at it and say how best to change it (the structure is a little unnecessarily duplicated, but this was, again, just a draft to see if it would work). Also, when loading the …
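A hedged sketch of the idea behind the draft fix: persist the `GradScaler` state next to the optimizer and scheduler state so the dynamic loss scale survives a resume. The `scaler.pt` file name and the helper function names below are assumptions for illustration, not necessarily what the PR uses.

```python
import os

import torch

scaler = torch.cuda.amp.GradScaler()

def save_scaler(checkpoint_dir: str) -> None:
    # Save the dynamic loss-scale state (scale, growth tracker, ...) with the checkpoint.
    torch.save(scaler.state_dict(), os.path.join(checkpoint_dir, "scaler.pt"))

def load_scaler(checkpoint_dir: str) -> None:
    # Restore it on resume so the first resumed step does not start from the default scale.
    path = os.path.join(checkpoint_dir, "scaler.pt")
    if os.path.isfile(path):
        scaler.load_state_dict(torch.load(path))
```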
Thanks for the nice investigation. This does indeed look like a bug. cc @muellerzr. Could you check, @Knight7561 and @LBJ6666, whether this is the issue?
Sure @SunMarc, let me debug this and try to reproduce the above scenario to see what's going on. Before I dig into the issue, these are the test results I got, and I think they match the excerpts above. I am just making sure my steps of reproduction are right. Can you confirm, @hsilva664 @LBJ6666?
My results are close to yours, slightly different because of the seed. I set the seed to zero at the very beginning with:

```python
import random

import numpy as np
import torch

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
```

These are my results when running uninterrupted (I got them from the saved checkpoints). They look similar to yours.

Below are the results from before I modified the script, when running from the model saved at step 11. The overflow at step 11 happens because the scaler state is not reloaded.

Below are the results after the draft modifications from my pull request, where I save and load the scaler as well. These results also start from the checkpoint of step 11 and are closer to the initial result. Still, one thing that caught my attention is that they are close, but not equal. As the seed is set to the same value and the state is saved/reloaded from the checkpoint, I would have expected the two runs to match exactly.
Thanks for the results @hsilva664. So, if I understand correctly, when you ran uninterrupted you got no gradient overflow, but after deleting the saved checkpoints and resuming from checkpoint 11 the overflow appeared and the learning rates no longer matched the uninterrupted run. Am I understanding correctly? @hsilva664
I did not delete checkpoint 11, only checkpoints 12 onwards. Other than this, your explanation is correct.
@hsilva664 Thank you for the further testing. I have successfully tested the code on the PR #34932, and the results show that the logs from first running the 20 training steps and the logs from resuming training from checkpoint 11 (training steps 12–20) are identical, including the loss, gradient, and learning rate. Regarding result reproducibility, the slight differences observed between two runs on the same machine are due to non-determinism. To ensure consistent results, the following configuration changes can reproduce the results:
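The exact settings from the comment above are not shown here, so the sketch below is only a guess at the kind of PyTorch determinism configuration typically involved; treat every flag and value as an assumption rather than the configuration actually used.

```python
import os
import random

import numpy as np
import torch

# Required by some deterministic cuBLAS kernels when use_deterministic_algorithms is on.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

def seed_everything(seed: int = 0) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

seed_everything(0)
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
```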
So, @LBJ6666, does the PR solve your issue?
Hello, from the previous discussion it seems so. Still, the PR is not ready to be merged. As requested, I tried adding some tests that would detect the save/reload errors better than the tests that were already there. However, these new tests required setting deterministic behaviour differently, which can be done via flags and environment variables.

My new tests generate consistent results when run in isolation or together with the tests from the same class, but not when the whole test file runs in parallel. I assume this might be because another test, from another class in the same file, running on some parallel worker, somehow changes the global state in a way that breaks the deterministic behaviour I had set. Since I'm not sure how to proceed, I left it here until somebody takes a look. More details are in the PR discussion.
@Knight7561, yes, after testing with my code, it resolved my issue. I look forward to more comprehensive testing.
System Info
Who can help?
No response
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
If the Trainer does not set a warmup and the lr_scheduler is set to linear, and training is continued from an interruption to complete all steps, the learning rates will be different from those when training all steps from the beginning. Here are the specific learning rates:
Learning rates for training from the beginning for each step:
If training is continued from a checkpoint at step 5, the learning rates for each step are:
Why are the learning rates for step 6 and step 7 different when training continues from a checkpoint compared to training from the start?
Reproduction steps:
Run `trainer.train(resume_from_checkpoint=True)` to continue training from step 5, and after training is completed, record the learning rate in the new checkpoints.

Expected behavior
Please explain why the learning rate is not continuous as it is when training from the beginning, for example:
Step 6: "learning_rate": 6e-06,
Step 7: "learning_rate": 5e-06.
........
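For completeness, a sketch of a reproduction setup along the lines described above (fp16, linear schedule, no warmup, a fixed number of steps, a checkpoint every step). The model name, dataset, and argument values are placeholders chosen for illustration, not the exact script that was used.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tiny synthetic dataset, just enough to run 20 steps.
dataset = Dataset.from_dict({"text": ["example sentence"] * 320, "label": [0, 1] * 160})
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length", max_length=32)
)

args = TrainingArguments(
    output_dir="output",
    max_steps=20,
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    lr_scheduler_type="linear",   # linear decay, no warmup
    warmup_steps=0,
    logging_steps=1,              # log the learning rate at every step
    save_strategy="steps",
    save_steps=1,                 # save a checkpoint at every step
    fp16=True,
    seed=0,
)

trainer = Trainer(model=model, args=args, train_dataset=dataset)

# First run: train all 20 steps from scratch.
trainer.train()

# Second run (after deleting the checkpoints beyond step 5): resume from the
# last remaining checkpoint and compare the recorded learning rates.
# trainer.train(resume_from_checkpoint=True)
```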