
DDP/DP training - multigpu #12

Open
helen1c opened this issue Jan 28, 2023 · 7 comments

helen1c commented Jan 28, 2023

Hi @chrockey, great work!

Can you guide me on how to set up multi-GPU training? I only have 20GB GPUs available, and with a batch size of 2 I get poor performance (~6% lower mIoU and mAcc), probably due to batch norm statistics with such a small batch size.

If I add multi-GPU support (DDP) following the example from the ME repository, training is blocked, i.e., it never starts.

Any help will be appreciated. You commented "multi-GPU training is currently not supported" in the code. Have you had similar issues as I mentioned?

Thanks!

chrockey (Collaborator) commented Jan 29, 2023

Hi @helen1c,

> Have you had similar issues as I mentioned?

No, I haven't. I was able to use DDP with PyTorch Lightning and ME together. However, I found a weird issue: the model's performance gets a little bit worse (~1%). That's why I do not use multi-GPU training in this repo. Anyway, here is a code snippet to support DDP training:

You need to convert the BN modules into synchronized BN before this line:

pl_module = get_lightning_module(lightning_module_name)(model=model, max_steps=max_step)

as follows:

if gpus > 1:
    # Convert MinkowskiEngine BatchNorm layers to synchronized BatchNorm.
    model = ME.MinkowskiSyncBatchNorm.convert_sync_batchnorm(model)
    # Also convert any remaining standard PyTorch BatchNorm layers.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

Then, set the DDP-related keyword arguments in the existing "if gpus > 1:" block:

if gpus > 1:
    # Let Lightning inject a DistributedSampler into the dataloaders.
    kwargs["replace_sampler_ddp"] = True
    # BN was already converted manually above, so skip Lightning's conversion.
    kwargs["sync_batchnorm"] = False
    # Skip the costly search for unused parameters during backward.
    kwargs["strategy"] = "ddp_find_unused_parameters_false"

I hope this helps your experiments.

helen1c (Author) commented Jan 29, 2023

@chrockey Unfortunately, this doesn't help; I hit the same problem again.

Can you provide the versions of PyTorch, CUDA, and PyTorch Lightning you are using?

Thanks for the quick reply though! :)

chrockey (Collaborator) commented Feb 1, 2023

Sorry for the late reply.
Here are the versions:

  • CUDA: 11.3
  • PyTorch: 1.12.1
  • PyTorch Lightning: 1.8.2
  • TorchMetrics: 0.11.0

FYI, I've just uploaded the environment.yaml file to the master branch, which you can refer to.
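
To double-check a local environment against these versions, they can be printed directly (standard version attributes for these packages):

import torch
import pytorch_lightning
import torchmetrics

print(torch.__version__)              # expected: 1.12.1
print(torch.version.cuda)             # expected: 11.3
print(pytorch_lightning.__version__)  # expected: 1.8.2
print(torchmetrics.__version__)       # expected: 0.11.0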

chrockey closed this as completed Feb 7, 2023
chrockey (Collaborator) commented Feb 7, 2023

If you have further questions, please feel free to re-open this issue.

lishuai-97 commented

Hi @chrockey,

I ran into the same problem: training works well on a single GPU, but with multi-GPU training set up as you suggested, the process stalls at epoch 0 with no errors reported. GPU memory is occupied, but GPU utilization stays at 0%.
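
For anyone debugging a similar stall, a minimal diagnostic sketch; these are standard PyTorch and NCCL environment variables, not part of this repo, and they should be set before the training processes start:

import os

# Verbose logs from torch.distributed and NCCL often reveal where the
# rendezvous or a collective op is stuck.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # PyTorch >= 1.9
os.environ["NCCL_DEBUG"] = "INFO"
# If the logs point to a peer-to-peer transfer hang, disabling P2P is a
# common (hardware-dependent) workaround:
# os.environ["NCCL_P2P_DISABLE"] = "1"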

chrockey reopened this Feb 8, 2023
Charlie839242 commented


Hi @lishuai-97, I met the same problem as you described. Could you please give me some suggestions on how you solved it? Thanks a lot!

lishuai-97 commented


Hi @Charlie839242, sorry for the late reply. Unfortunately, I never solved the problem; I suspect it is an issue with the PyTorch Lightning setup. I have since moved to a new point cloud processing repository, https://github.com/Pointcept/Pointcept, which is also amazing work and includes many SOTA methods.
