
Add multi-node-training #103

Merged 8 commits into mllam:main on Jan 23, 2025

Conversation

@SimonKamuk (Contributor) commented Jan 22, 2025

Describe your changes

This PR adds support for multi-node GPU training using the SLURM job scheduler. The changes allow setting the number of nodes with the CLI argument num_nodes. It is also possible to select a subset of visible GPUs using the argument devices (only when not using SLURM).

Replaces #26 with a simpler method based on advice from @sadamov
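For illustration only, not code from this PR: a minimal sketch of how a num_nodes CLI argument and an optional devices argument could be forwarded to a PyTorch Lightning Trainer. The argument wiring and defaults shown here are assumptions, not taken from neural-lam; Lightning itself detects SLURM job information through its SLURMEnvironment plugin, so no extra launcher code is needed when submitting via sbatch/srun.

```python
# Hypothetical sketch of wiring --num_nodes / --devices into a Lightning Trainer.
import argparse

import pytorch_lightning as pl

parser = argparse.ArgumentParser()
parser.add_argument("--num_nodes", type=int, default=1,
                    help="number of nodes used for distributed training")
parser.add_argument("--devices", nargs="*", type=int, default=None,
                    help="subset of GPU indices to use (non-SLURM only)")
args = parser.parse_args()

trainer = pl.Trainer(
    accelerator="gpu",
    strategy="ddp",
    num_nodes=args.num_nodes,
    # "auto" lets Lightning use all visible GPUs and respects SLURM allocations.
    devices=args.devices if args.devices else "auto",
)
```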

Type of change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

  • My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
  • I have performed a self-review of my code
  • For any new/modified functions/classes I have added docstrings that clearly describe their purpose, expected inputs and returned values
  • I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
  • I have updated the README to cover introduced code changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have given the PR a name that clearly describes the change, written in imperative form (context).
  • I have requested a reviewer and an assignee (assignee is responsible for merging). This applies only if you have write access to the repo, otherwise feel free to tag a maintainer to add a reviewer and assignee.

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

  • the code is readable
  • the code is well tested
  • the code is documented (including return types and parameters)
  • the code is easy to maintain

Author checklist after completed review

  • I have added a line to the CHANGELOG describing this change, in a section
    reflecting type of change (add section where missing):
    • added: when you have added new functionality
    • changed: when default behaviour of the code has been changed
    • fixes: when your contribution fixes a bug

Checklist for assignee

  • PR is up to date with the base branch
  • the tests pass
  • author has added an entry to the changelog (and designated the change as added, changed or fixed)
  • Once the PR is ready to be merged, squash commits and merge the PR.

@SimonKamuk requested a review from @sadamov on January 22, 2025, 14:47
@sadamov (Collaborator) left a comment

Thanks @SimonKamuk for putting this together! I like this simple approach a lot. @joeloskarsson, you previously were not too impressed with me adding device as a new flag, but would you agree that in a non-SLURM world it makes sense?

Review comment on tests/datastore_examples/mdp/danra_100m_winds/config.yaml (outdated, resolved)
@sadamov linked an issue on Jan 22, 2025 that may be closed by this pull request
@sadamov added this to the v0.4.0 milestone on Jan 22, 2025
@sadamov (Collaborator) commented Jan 22, 2025

I have closed the legacy #26 PR and linked the corresponding issue #1 and milestone here. I am not sure if we have a procedure in place, but you mentioned reducing the scope of v0.4.0. We should at least post this in the Slack channel if we can't wait until the next dev meeting.

@joeloskarsson (Collaborator)

I suppose I have viewed it as better to use the CUDA_VISIBLE_DEVICES environment variable than to integrate device selection into neural-lam, but I don't know if there are any advantages or disadvantages to this. It gets tricky when we start talking about HPC setups, as they can be quite different for different users, and I'd prefer not to make assumptions about the user's environment. But as I see this change, I don't think it makes any more assumptions than Lightning does, which should be the minimum.
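As a side-by-side sketch of the two approaches being compared (the GPU indices here are hypothetical): Lightning's "auto" device selection respects CUDA_VISIBLE_DEVICES, so restricting visibility through the environment behaves much like passing an explicit devices subset.

```python
# Hypothetical comparison; GPU indices are placeholders.
import os

import pytorch_lightning as pl

# Option A: restrict visibility through the environment. This must happen
# before CUDA is initialised, so in practice it is set in the shell or job
# script rather than inside the training code.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1")
trainer_a = pl.Trainer(accelerator="gpu", devices="auto")

# Option B: select the same subset explicitly via the Trainer argument.
trainer_b = pl.Trainer(accelerator="gpu", devices=[0, 1])
```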

@SimonKamuk (Contributor, Author)

The reason I wanted to select devices is that we at DMI also have some shared GPUs that are not on a SLURM cluster, and I wanted to be able to run some things there without reserving the entire system. I didn't think of using CUDA_VISIBLE_DEVICES, and I have no strong opinion either way, so if you prefer, I'll just remove the devices argument again and describe that approach in the README 😄

@joeloskarsson (Collaborator)

No strong opinions on my side either, really; it doesn't hurt to have the --devices option as long as it defaults to auto.

@SimonKamuk (Contributor, Author)

Alright, cool, then I think this is ready to merge. FYI, I also changed the code to print only on rank 0 everywhere, to keep the logs clearer.
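A minimal sketch of rank-zero-only printing, assumed to be roughly what such a change looks like rather than the PR's exact code: pytorch_lightning ships a rank_zero_only decorator, so a message is emitted once per run instead of once per process.

```python
# Hypothetical example of printing only from the rank-0 process.
from pytorch_lightning.utilities import rank_zero_only


@rank_zero_only
def rank_zero_print(*args, **kwargs):
    """Print only from the global rank-0 process."""
    print(*args, **kwargs)


rank_zero_print("Starting training")  # printed once, not once per GPU/node
```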

@sadamov (Collaborator) commented Jan 23, 2025

Looks good to merge. The whole printing/logging setup will soon be redone anyway, so it makes sense to keep it simple here. Please go ahead with the merge 👍

@SimonKamuk merged commit d2e26a9 into mllam:main on Jan 23, 2025
8 checks passed
Linked issue that may be closed by this pull request: Multi-GPU training