About training with multi-GPU #32
Comments
Hi @boxuLibrary, ideally this project should support multi-node training since it is built on top of pytorch_lightning. However, I don't have an environment to test that; I can only confirm that this project works fine with multiple GPUs on the same node. As previously mentioned, have you checked here to ensure all related parameters are correctly set? For example, in your sbatch script, make sure that
#SBATCH --nodes=4              # This needs to match Trainer(num_nodes=...)
#SBATCH --ntasks-per-node=8    # This needs to match Trainer(devices=...)
Besides, make sure that you run this project with
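For reference, a minimal sketch (not from the original thread) of how the Trainer arguments are expected to line up with the SLURM allocation above, assuming a recent pytorch_lightning API; `MyModel` and `train_loader` are placeholders:

import pytorch_lightning as pl

# Corresponding sbatch directives:
#   #SBATCH --nodes=4            -> num_nodes=4
#   #SBATCH --ntasks-per-node=8  -> devices=8
trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,        # one task (process) per GPU on each node
    num_nodes=4,      # total number of nodes in the SLURM job
    strategy="ddp",   # distributed data parallel across all processes
)
# trainer.fit(MyModel(), train_loader)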
Oh, thank you for your reply. It is really helpful for me.
No worries @boxuLibrary. If you manage to make it work on multiple nodes, feel free to make a pull request (if some code needs to be modified) or leave some notes on this issue (I will pin it so that others can easily refer to it).
Hi @boxuLibrary, on another project I work on, I find it easy to train on multiple nodes with pytorch-lightning; just ensure that the related parameters are set correctly. Besides, make sure that the nodes can 'ping' each other. In my case, I also need to set a few options in the sbatch script. I will find time to test multi-node training on this project shortly.
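As a hypothetical sanity check (not part of this project), one can print the standard SLURM environment variables at the start of the training script to confirm that the allocation actually spans the expected nodes and tasks before the Trainer is constructed:

import os

# Standard SLURM environment variables set inside an sbatch/srun job.
print("nodes      :", os.environ.get("SLURM_NNODES"))
print("node list  :", os.environ.get("SLURM_NODELIST"))
print("tasks/node :", os.environ.get("SLURM_NTASKS_PER_NODE"))
print("this rank  :", os.environ.get("SLURM_PROCID"))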
Hi, I encountered a problem. When I run the code with multiple GPUs distributed across different nodes on SLURM, I find that I cannot use the GPUs on the other nodes. I wonder, does the code support distribution across different nodes?