about training with multi GPU #32

Closed

boxuLibrary opened this issue May 29, 2024 · 4 comments

@boxuLibrary

Hi, I've run into a problem. When I run the code with multiple GPUs distributed across different nodes on SLURM, I find that I cannot use the GPUs on the other nodes. Does this code support distributed training across multiple nodes?

@donydchen
Owner

Hi @boxuLibrary, ideally this project should support multi-node training since it is built on top of pytorch_lightning. However, I don't have an environment to test that; I can only confirm that this project works fine with multiple GPUs on the same node.

As previously mentioned, have you checked here to ensure all related parameters are correctly set? For example, in your sbatch script, make sure that

#SBATCH --nodes=4             # This needs to match Trainer(num_nodes=...)
#SBATCH --ntasks-per-node=8   # This needs to match Trainer(devices=...)

Besides, make sure that you run this project with srun, as detailed here.
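For reference, a fuller sketch of such an sbatch script might look like the following (the --gres line, job layout, and the train.py entry point are placeholders for illustration, not this project's exact launch command):

#!/bin/bash
#SBATCH --nodes=4             # must match Trainer(num_nodes=4)
#SBATCH --ntasks-per-node=8   # must match Trainer(devices=8)
#SBATCH --gres=gpu:8          # one GPU per task on each node

# pytorch_lightning reads SLURM's environment variables (e.g. SLURM_NTASKS,
# SLURM_PROCID) to set up the distributed process group, so the training
# script must be launched with srun rather than plain python.
srun python train.py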

@boxuLibrary
Author

Oh, thank you for your reply. It is really helpful.

@donydchen
Owner

No worries @boxuLibrary. If you manage to make it work on multiple nodes, feel free to open a pull request (if some code needs to be modified) or leave some notes on this issue (I will pin it so that others can easily refer to it).

@donydchen
Owner

Hi @boxuLibrary, on another project I work on, I found it easy to train on multiple nodes with pytorch_lightning; just ensure that num_nodes and devices are correctly set, as indicated in my response above.

Besides, make sure that the nodes can 'ping' each other. In my case, I needed to set export NCCL_SOCKET_IFNAME=XXX in the sbatch script, where XXX is the network interface name, which can be obtained from ifconfig. If the interface is not set properly, you will encounter errors like NCCL WARN socketStartConnect: Connect to x.x.x.x failed : Software caused connection abort.
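For reference, the relevant lines look roughly like this (eth0 is only an example; substitute the interface name that ifconfig reports for your cluster's internal network):

# On a compute node, list interfaces and pick the one that carries
# the node's cluster-facing IP address.
ifconfig

# In the sbatch script, before the srun command (eth0 is an example;
# replace it with the interface name found above):
export NCCL_SOCKET_IFNAME=eth0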

I will find time to test multi-node training on this project shortly.
