Add multi-node-training #103
Thanks @SimonKamuk for putting this together! I like this simple approach a lot. @joeloskarsson, you were previously not too impressed by me adding devices as a new flag, but would you agree that it makes sense in a non-SLURM world?
I suppose I have viewed it as better to use |
I wanted to be able to select devices because we at DMI also have some shared GPUs that are not on a SLURM cluster, and I wanted to run some things there without reserving the entire system. I didn't think of using CUDA_VISIBLE_DEVICES, and I have no strong opinion either way. If you prefer, I'll just remove the devices argument again and describe how to do it in the readme 😄
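For reference, the CUDA_VISIBLE_DEVICES alternative mentioned above could look roughly like the sketch below. It only builds an environment for a child process; the helper name `env_with_visible_gpus` is hypothetical and not part of this PR.

```python
import os


def env_with_visible_gpus(gpu_ids, base_env=None):
    """Return a copy of the environment that exposes only the given GPUs.

    Hypothetical helper for illustration. CUDA reads CUDA_VISIBLE_DEVICES
    when it initialises, so the variable has to be set before the training
    process starts using the GPU.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


# Example: expose only GPUs 0 and 2 to a child process.
env = env_with_visible_gpus([0, 2], base_env={})
print(env["CUDA_VISIBLE_DEVICES"])
```

The resulting dictionary could be passed as `env=` to `subprocess.run` when launching a training script on a shared machine.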
No strong opinions on my side either, really. It doesn't hurt to have the --devices option as long as it defaults to auto.
Alright, cool, then I think this is ready to merge. FYI, I also changed the code to print only on rank 0 everywhere to keep the logs clearer.
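A minimal sketch of rank-0-only printing, assuming the process rank is available through the `RANK` environment variable (set by launchers such as torchrun) or `SLURM_PROCID` (set by SLURM). The helper names are hypothetical, and the PR itself may obtain the rank differently.

```python
import os


def get_global_rank(environ=None):
    """Best-effort lookup of the global process rank (assumed env vars)."""
    environ = os.environ if environ is None else environ
    for key in ("RANK", "SLURM_PROCID"):
        if key in environ:
            return int(environ[key])
    # No launcher information found: treat as a single-process run.
    return 0


def rank_zero_print(*args, environ=None, **kwargs):
    """Print only on the process with global rank 0."""
    if get_global_rank(environ) == 0:
        print(*args, **kwargs)


rank_zero_print("printed once per job", environ={"RANK": "0"})
rank_zero_print("never printed", environ={"RANK": "3"})
```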
Looks good to merge. The whole printing/logging setup will soon be redone anyway, so it makes sense to keep it simple here. Please go ahead with the merge 👍
Describe your changes
This PR adds support for multi-node GPU training using the SLURM job scheduler. The changes allow setting the number of nodes with the CLI argument num_nodes. It is also possible to select a subset of visible GPUs using the argument devices (only when not using SLURM).
Replaces #26 with a simpler method based on advice from @sadamov
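As an illustration of the two arguments described above, here is a minimal sketch of how they might be parsed and forwarded. It assumes a trainer that accepts `num_nodes` and `devices` keyword arguments (as PyTorch Lightning's `Trainer` does). The parsing format for `--devices` (a comma-separated list) and the helper names are assumptions, not taken from the PR.

```python
import argparse


def parse_devices(value):
    """Parse "auto" or a comma-separated list of GPU indices (assumed format)."""
    if value == "auto":
        return "auto"
    return [int(v) for v in value.split(",")]


def build_parser():
    parser = argparse.ArgumentParser(description="Training options (sketch)")
    parser.add_argument(
        "--num_nodes", type=int, default=1,
        help="Number of nodes to train on",
    )
    parser.add_argument(
        "--devices", type=parse_devices, default="auto",
        help='GPU indices such as "0,1", or "auto" to use all visible GPUs',
    )
    return parser


def trainer_kwargs(args):
    """Collect the keyword arguments that would be passed on to the trainer."""
    return {"num_nodes": args.num_nodes, "devices": args.devices}


args = build_parser().parse_args(["--num_nodes", "2", "--devices", "0,1"])
print(trainer_kwargs(args))
```

Keeping `auto` as the default means that runs which do not pass `--devices` behave as before.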