cannot access local variable 'gradient_norm_before_clip' where it is not associated with a value #41
Comments
Hmm, that is very weird... It means that there is no gradient update for an entire epoch. I think that could be the case when training with a very small number of videos? How many videos were you using when this occurred, and what were your batch size and gradient accumulation steps?
For example, using 2 GPUs with something like the following:
Same error when using 8 GPUs.
I don't think this is an issue of having too few training samples?
Do you see the error when using gradient accumulation steps of 1? I haven't really experimented with higher gradient_accumulation_steps yet, so I might have missed something; most of my experiments use a value of 1. The script can be improved a little by only logging the grad norm if it was computed. However, I fail to see how no gradient step occurs with the configurations you shared. Will take a deeper look over the weekend.
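A minimal sketch of that improvement, assuming a loop like the one described in the issue body below (variable names such as `logs`, `loss`, and `global_step` are illustrative, not the repository's exact code):

```python
# Sketch only: gate the logging on accelerator.sync_gradients so the norm is
# reported only on steps where it was actually computed.
logs = {"loss": loss.detach().item()}
if accelerator.sync_gradients:
    logs["gradient_norm_before_clip"] = gradient_norm_before_clip
accelerator.log(logs, step=global_step)
```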
Yep, I just confirmed that everything works fine with gradient accumulation steps = 1. As long as it is not 1, the error occurs.
Okay thanks, that helps. Will try debugging over the weekend.
@Yuancheng-Xu I found the issue, thanks for reporting! Somehow, every time I looked at the code before today, I thought that the gradient norm calculation was happening after each epoch instead of each step. It is correct to do it at each step, but only when the gradient sync happens, which is now addressed in the PR. I'm running an experiment to verify that it works at the moment, and will merge the PR later on. Please give it a review too when free!
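To illustrate the cadence (a simplified stand-in for Accelerate's internal bookkeeping, not its actual implementation): with gradient accumulation, the sync/clip/optimizer-step path runs only once every N micro-steps, so anything computed only on sync steps is undefined on the others.

```python
# Simplified illustration of when accelerator.sync_gradients is True with
# gradient accumulation (not Accelerate's real internals).
gradient_accumulation_steps = 4

for step in range(8):
    sync_gradients = (step + 1) % gradient_accumulation_steps == 0
    print(f"step={step} sync_gradients={sync_gradients}")

# sync_gradients is True only on steps 3 and 7, so a gradient norm computed
# under `if accelerator.sync_gradients:` exists on 1 of every 4 steps.
```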
I think there might still be some other mistakes or a source of randomness coming from somewhere, because for the exact same training parameters, seeds, and initializations, and only varying
Thanks @a-r-r-o-w ! I left a comment here. As for the actual training outcomes (learning curve, validation generated samples), I would expect the results to stay the same.
During both I2V and T2V training, I sometimes encountered the error:
This probably comes from the following code, where somehow accelerator.sync_gradients is False sometimes. Is there a quick fix? Is it only used for logging?
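For reference, a rough sketch of the kind of loop where this happens, assuming an Accelerate-style training step (helper names such as compute_loss, get_gradient_norm, transformer, and max_grad_norm are placeholders, not the repository's actual functions):

```python
# Illustrative sketch of the failing pattern, not the repository's exact code.
for step, batch in enumerate(dataloader):
    with accelerator.accumulate(transformer):
        loss = compute_loss(batch)  # placeholder helper
        accelerator.backward(loss)

        if accelerator.sync_gradients:
            # Only computed on the micro-step where gradients are synced.
            gradient_norm_before_clip = get_gradient_norm(transformer.parameters())
            accelerator.clip_grad_norm_(transformer.parameters(), max_grad_norm)

        optimizer.step()
        optimizer.zero_grad()

    # With gradient_accumulation_steps > 1, sync_gradients is False on most
    # steps, so reading the variable unconditionally here raises
    # "cannot access local variable 'gradient_norm_before_clip' ...".
    accelerator.log(
        {"loss": loss.item(), "grad_norm": gradient_norm_before_clip},
        step=step,
    )
```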