Stuck at epoch 1 iter 1 when training vqvae with multi-gpu #70
Comments
I also have the same problem. Have you worked it out?
Well, I think I solved this by adding
May I ask where you added it?
Original / Fixed code comparison (snippets not shown)
@JamesZhutheThird Thank you so much for your quick reply. I believe that the model is still stuck at calculating
I am not sure whether I understood the changes to be made correctly, so your advice would be much appreciated. Thanks.
Perhaps you can try moving
I don't see anything else different from my code, but since I have changed other parts of the code and added extra functions, I'm not sure whether they are related to this bug. Anyway, please keep in touch with me about this. @ekyy2
I tried the change you suggested, but it does not seem to work. No worries; as you said, it could be something else. Maybe it is because I am using CUDA 11.6 but installed the PyTorch build for CUDA 11.3 (pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113). Another possibility is the blowing up of the latent error mentioned here: #65 (comment), but it is weird that it only happens with multiple GPUs. Anyway, keep in touch.
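As a quick sanity check on the CUDA mismatch idea, here is a minimal sketch (plain PyTorch, nothing specific to this repo) that prints which CUDA build PyTorch was compiled against and which GPUs it can see:

```python
# Sanity check for a CUDA / PyTorch build mismatch; not specific to vq-vae-2.
import torch

print("torch version:", torch.__version__)          # e.g. 1.12.1+cu113
print("built against CUDA:", torch.version.cuda)    # toolkit version PyTorch was compiled with
print("CUDA available:", torch.cuda.is_available())
print("visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} ->", torch.cuda.get_device_name(i))
```

If `torch.version.cuda` reports 11.3 while the driver supports 11.6, that is usually fine on its own (newer drivers run older CUDA builds), so the mismatch is probably not the culprit.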
I've tested on two environments with different PyTorch versions, both with CUDA 11.2. I am pretty sure the exact versions are not so important. By the way, I checked files in
For anyone who has this problem, I'll share my solution: please set the
@ekyy2, I hope this solution works for you.
I can confirm that this works. Thank you so much! The issue can be closed.
Still having this problem.
The single-gpu training process works just fine for me, and the output samples are satisfactory. However, when I set `--n_gpu 2` or `--n_gpu 4`, the training process gets stuck at the beginning (Epoch 1 Iter 1), and this very first iteration takes much longer (34 seconds) than an iteration in single-gpu training (3 seconds per iteration). I would be grateful if someone could help me see what might be wrong with this.
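For debugging hangs like this, a minimal standalone script along the following lines can show whether basic NCCL communication works on the machine at all. This is a sketch only: it assumes a torchrun-style launch rather than this repo's `--n_gpu` flag, and the file name `ddp_smoke_test.py` is just illustrative.

```python
# ddp_smoke_test.py -- hypothetical standalone check, not part of this repository.
# Launch with, for example:  NCCL_DEBUG=INFO torchrun --nproc_per_node=2 ddp_smoke_test.py
import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT in the environment.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)  # one process per GPU on a single node

    x = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(x)  # if inter-GPU communication is broken, this is where it hangs
    print(f"rank {rank}: all_reduce ok, value = {x.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If this completes instantly but the actual training still stalls at the first iteration, the hang is more likely inside the training code's collective calls than in the GPU/NCCL setup.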