about training with multi GPU #32

Closed

boxuLibrary opened this issue May 29, 2024 · 4 comments

@boxuLibrary

Hi, I've run into a problem. When I run the code with multiple GPUs distributed across different nodes on SLURM, I find that I cannot use the GPUs on the other nodes. Does this code support distributed training across multiple nodes?

@donydchen
Owner

Hi @boxuLibrary, ideally this project should support multi-node training since it is built on top of pytorch_lightning. However, I don't have an environment to test that; I can only confirm that this project works fine with multiple GPUs on the same node.

As previously mentioned, have you checked here to ensure all related parameters are correctly set? For example, in your sbatch script, make sure that

#SBATCH --nodes=4             # This needs to match Trainer(num_nodes=...)
#SBATCH --ntasks-per-node=8   # This needs to match Trainer(devices=...)

Besides, make sure that you run this project with srun, as detailed here.
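For reference, a fuller sketch of such an sbatch script might look like the following (the --gres line, job layout, and the train.py entry point are placeholders for illustration, not this project's exact launch command):

#!/bin/bash
#SBATCH --nodes=4             # must match Trainer(num_nodes=4)
#SBATCH --ntasks-per-node=8   # must match Trainer(devices=8)
#SBATCH --gres=gpu:8          # one GPU per task on each node

# pytorch_lightning reads SLURM's environment variables (e.g. SLURM_NTASKS,
# SLURM_PROCID) to set up the distributed process group, so the training
# script must be launched with srun rather than plain python.
srun python train.py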

@boxuLibrary
Author

Oh, thank you for your reply. It is really helpful.

@donydchen
Owner

No worries @boxuLibrary. If you manage to make it work on multiple nodes, feel free to open a pull request (if some code needs to be modified) or leave some notes on this issue (I will pin it so that others can easily refer to it).

@donydchen
Owner

Hi @boxuLibrary, on another project I work on, I found it easy to train on multiple nodes with pytorch_lightning; just ensure that num_nodes and devices are correctly set, as indicated in my response above.

Besides, make sure that the nodes can 'ping' each other. In my case, I needed to set export NCCL_SOCKET_IFNAME=XXX in the sbatch script, where XXX is the network interface name, which can be obtained from ifconfig. If the interface is not set properly, you will encounter errors like NCCL WARN socketStartConnect: Connect to x.x.x.x failed : Software caused connection abort.
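For reference, the relevant lines look roughly like this (eth0 is only an example; substitute the interface name that ifconfig reports for your cluster's internal network):

# On a compute node, list interfaces and pick the one that carries
# the node's cluster-facing IP address.
ifconfig

# In the sbatch script, before the srun command (eth0 is an example;
# replace it with the interface name found above):
export NCCL_SOCKET_IFNAME=eth0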

I will find time to test multi-node training on this project shortly.
