
Add multi-node-training #103

Merged 8 commits into mllam:main on Jan 23, 2025

Conversation

@SimonKamuk (Contributor) commented Jan 22, 2025

Describe your changes

This PR adds support for multi-node GPU training using the SLURM job scheduler. The changes allow setting the number of nodes with the CLI argument num_nodes. It is also possible to select a subset of visible GPUs using the argument devices (only when not using SLURM).

Replaces #26 with a simpler method based on advice from @sadamov
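For illustration only, not code from this PR: a minimal sketch of how a num_nodes CLI argument and an optional devices argument could be forwarded to a PyTorch Lightning Trainer. The argument wiring and defaults shown here are assumptions, not taken from neural-lam; Lightning itself detects SLURM job information through its SLURMEnvironment plugin, so no extra launcher code is needed when submitting via sbatch/srun.

```python
# Hypothetical sketch of wiring --num_nodes / --devices into a Lightning Trainer.
import argparse

import pytorch_lightning as pl

parser = argparse.ArgumentParser()
parser.add_argument("--num_nodes", type=int, default=1,
                    help="number of nodes used for distributed training")
parser.add_argument("--devices", nargs="*", type=int, default=None,
                    help="subset of GPU indices to use (non-SLURM only)")
args = parser.parse_args()

trainer = pl.Trainer(
    accelerator="gpu",
    strategy="ddp",
    num_nodes=args.num_nodes,
    # "auto" lets Lightning use all visible GPUs and respects SLURM allocations.
    devices=args.devices if args.devices else "auto",
)
```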

Type of change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

  • My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
  • I have performed a self-review of my code
  • For any new/modified functions/classes I have added docstrings that clearly describe their purpose, expected inputs and returned values
  • I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
  • I have updated the README to cover introduced code changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have given the PR a name that clearly describes the change, written in imperative form (context).
  • I have requested a reviewer and an assignee (assignee is responsible for merging). This applies only if you have write access to the repo, otherwise feel free to tag a maintainer to add a reviewer and assignee.

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

  • the code is readable
  • the code is well tested
  • the code is documented (including return types and parameters)
  • the code is easy to maintain

Author checklist after completed review

  • I have added a line to the CHANGELOG describing this change, in a section
    reflecting type of change (add section where missing):
    • added: when you have added new functionality
    • changed: when default behaviour of the code has been changed
    • fixes: when your contribution fixes a bug

Checklist for assignee

  • PR is up to date with the base branch
  • the tests pass
  • author has added an entry to the changelog (and designated the change as added, changed or fixed)
  • Once the PR is ready to be merged, squash commits and merge the PR.

@SimonKamuk requested a review from @sadamov on January 22, 2025, 14:47
@sadamov (Collaborator) left a comment

Thanks @SimonKamuk for putting this together! I like this simple approach a lot. @joeloskarsson, you previously were not too impressed with me adding device as a new flag, but would you agree that in a non-SLURM world it makes sense?

Review comment on tests/datastore_examples/mdp/danra_100m_winds/config.yaml (outdated, resolved)
@sadamov linked an issue on Jan 22, 2025 that may be closed by this pull request
@sadamov added this to the v0.4.0 milestone on Jan 22, 2025
@sadamov (Collaborator) commented Jan 22, 2025

I have closed the legacy #26 PR and linked the corresponding issue #1 and milestone here. I am not sure if we have a procedure in place, but you mentioned reducing the scope of v0.4.0. We should at least post this in the Slack channel if we can't wait until the next dev meeting.

@joeloskarsson (Collaborator)

I suppose I have viewed it as better to use the CUDA_VISIBLE_DEVICES environment variable than to integrate device selection into neural-lam, but I don't know if there are any advantages or disadvantages to this. It gets tricky when we start talking about HPC setups, as they can be quite different for different users, and I'd prefer not to make assumptions about the user's environment. But as I see this change, I don't think it makes any more assumptions than Lightning does, which should be the minimum.
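As a side-by-side sketch of the two approaches being compared (the GPU indices here are hypothetical): Lightning's "auto" device selection respects CUDA_VISIBLE_DEVICES, so restricting visibility through the environment behaves much like passing an explicit devices subset.

```python
# Hypothetical comparison; GPU indices are placeholders.
import os

import pytorch_lightning as pl

# Option A: restrict visibility through the environment. This must happen
# before CUDA is initialised, so in practice it is set in the shell or job
# script rather than inside the training code.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1")
trainer_a = pl.Trainer(accelerator="gpu", devices="auto")

# Option B: select the same subset explicitly via the Trainer argument.
trainer_b = pl.Trainer(accelerator="gpu", devices=[0, 1])
```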

@SimonKamuk (Contributor, Author)

The reason I wanted to select devices is that we at DMI also have some shared GPUs that are not on a SLURM cluster, and I wanted to be able to run some things there without reserving the entire system. I didn't think of using CUDA_VISIBLE_DEVICES, and I have no strong opinion either way, so if you prefer, I'll just remove the devices argument again and describe that approach in the README 😄

@joeloskarsson (Collaborator)

No strong opinions on my side either, really; it doesn't hurt to have the --devices option as long as it defaults to auto.

@SimonKamuk (Contributor, Author)

Alright, cool, then I think this is ready to merge. FYI, I also changed the code to print only on rank 0 everywhere, to keep the logs clearer.
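A minimal sketch of rank-zero-only printing, assumed to be roughly what such a change looks like rather than the PR's exact code: pytorch_lightning ships a rank_zero_only decorator, so a message is emitted once per run instead of once per process.

```python
# Hypothetical example of printing only from the rank-0 process.
from pytorch_lightning.utilities import rank_zero_only


@rank_zero_only
def rank_zero_print(*args, **kwargs):
    """Print only from the global rank-0 process."""
    print(*args, **kwargs)


rank_zero_print("Starting training")  # printed once, not once per GPU/node
```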

@sadamov (Collaborator) commented Jan 23, 2025

Looks good to merge. The whole printing/logging setup will soon be redone anyway, so it makes sense to keep it simple here. Please go ahead with the merge 👍

@SimonKamuk merged commit d2e26a9 into mllam:main on Jan 23, 2025
8 checks passed
Linked issue that may be closed by this pull request: Multi-GPU training