DDP/DP training - multigpu #12
Hi @chrockey, great work!

Can you guide me on how to set up multi-GPU training? I only have 20 GB GPUs available, and with a batch size of 2 I get poor performance (~6% lower mIoU and mAcc), probably due to batch norm with such a small batch size. If I add multi-GPU (DDP) support according to the example from the ME repository, the training is blocked, i.e., it never starts.

Any help will be appreciated. You commented "multi-GPU training is currently not supported" in the code. Have you had similar issues to the ones I mentioned? Thanks!

Comments

Hi @helen1c,

No, I haven't. I was able to use DDP with PyTorch Lightning and ME together. However, I found a weird issue: the model's performance gets slightly worse (~1%), which is why I do not use multi-GPU training in this repo. Anyway, two changes are needed to support DDP training: convert the BN modules into synchronized BN before line 39 (at commit 9d8793a), and then set the DDP-related keyword arguments at line 61 (same commit).
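A minimal sketch of what those two changes can look like is below; the tiny model is a hypothetical stand-in for the repo's network, and the exact `Trainer` argument names depend on your PyTorch Lightning version:

```python
import MinkowskiEngine as ME
import pytorch_lightning as pl
import torch.nn as nn

# Hypothetical stand-in for the repo's network; any ME model with
# MinkowskiBatchNorm layers is handled the same way.
model = nn.Sequential(
    ME.MinkowskiConvolution(3, 32, kernel_size=3, dimension=3),
    ME.MinkowskiBatchNorm(32),
    ME.MinkowskiReLU(),
)

# 1) Convert every MinkowskiBatchNorm into its synchronized variant so
#    batch statistics are aggregated across GPUs (do this before line 39).
model = ME.MinkowskiSyncBatchNorm.convert_sync_batchnorm(model)

# 2) Pass the DDP-related keyword arguments to the Trainer (line 61).
#    Older PL 1.x releases use accelerator="ddp"; newer ones use
#    strategy="ddp" instead.
trainer = pl.Trainer(gpus=2, accelerator="ddp")
```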
I hope this helps your experiments.

@chrockey Unfortunately, this doesn't help; I hit the same problem again. Could you tell me which versions of torch, CUDA, and pytorch-lightning you are using? Thanks for the quick reply though! :)

Sorry for the late reply.
FYI, I've just uploaded the environment.yaml file to the master branch, which you can refer to.
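To compare a local setup against that file, printing the relevant versions is a quick check (a minimal sketch; `torch.version.cuda` reports the CUDA version the torch build was compiled with):

```python
import torch
import pytorch_lightning as pl

# The versions most relevant to DDP behavior, for comparison against
# the environment.yaml on the master branch.
print("torch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("pytorch-lightning:", pl.__version__)
```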
If you have further questions, please feel free to re-open this issue.

Hi @chrockey, I ran into the same problem: training worked well with a single GPU, but when I tried multi-GPU training as you suggested, the process stalled at epoch 0 with no errors reported. The GPU memory is occupied, but GPU utilization stays at 0.
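For a silent multi-GPU hang like this, a generic first diagnostic (standard NCCL/PyTorch environment variables, not specific to this repo) is to enable NCCL debug logging before launching training:

```python
import os

# Make NCCL report what each rank is doing, so a stuck collective
# (a common cause of silent DDP hangs) becomes visible in the logs.
# Set these before the process group is initialized.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"  # raise instead of hanging forever
```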
Hi @lishuai-97, I met the same problem as you described. Could you please give me some suggestions on how you solved it? Thanks a lot!

Hi @Charlie839242, sorry for the late reply. Unfortunately, I never solved the problem in the end, but I think it may be an issue with the pytorch-lightning setup. I have since moved to a new point cloud processing repository, https://github.com/Pointcept/Pointcept, which is also an amazing piece of work and includes many SOTA methods.