(3/n) Support 2D Parallelism - Efficient loading of full-state checkpoints #19870

awaelchli · 2024-05-15T12:14:44Z

What does this PR do?

Follow-up to #19852.

I found that loading large full-state-dict checkpoints into a distributed model can lead to OOM (e.g. Llama 3 70B). The cause is that PyTorch's approach of loading on rank 0, then broadcasting and redistributing, is applied to the entire checkpoint at once instead of on a per-parameter or per-module basis (see comment).

In this PR, I load the checkpoint per-parameter, which seems to work as it should.
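
To illustrate why per-parameter loading lowers peak memory, here is a toy, pure-Python model. It is not the code from this PR and not PyTorch's implementation: parameters are plain lists, "broadcasting" is simulated by copying, and the helper names (`shard`, `load_all_at_once`, `load_per_parameter`) are hypothetical. It also assumes, as a simplification, that each rank holds an unsharded copy of whatever is being redistributed.

```python
from typing import Dict, List, Tuple

Checkpoint = Dict[str, List[float]]


def shard(values: List[float], rank: int, world_size: int) -> List[float]:
    """Return the contiguous slice of `values` owned by `rank`."""
    chunk = -(-len(values) // world_size)  # ceiling division
    return values[rank * chunk:(rank + 1) * chunk]


def load_all_at_once(ckpt: Checkpoint, rank: int, world_size: int) -> Tuple[Checkpoint, int]:
    """Hold an unsharded copy of the whole checkpoint, then shard it."""
    full_copy = {name: list(values) for name, values in ckpt.items()}
    # Peak number of unsharded elements alive at the same time.
    transient_peak = sum(len(values) for values in full_copy.values())
    shards = {name: shard(values, rank, world_size) for name, values in full_copy.items()}
    return shards, transient_peak


def load_per_parameter(ckpt: Checkpoint, rank: int, world_size: int) -> Tuple[Checkpoint, int]:
    """Hold an unsharded copy of only one parameter at a time."""
    shards: Checkpoint = {}
    transient_peak = 0
    for name, values in ckpt.items():
        full_param = list(values)  # stands in for receiving one full parameter
        transient_peak = max(transient_peak, len(full_param))
        shards[name] = shard(full_param, rank, world_size)
        del full_param  # released before the next parameter is processed
    return shards, transient_peak


# Hypothetical checkpoint with three parameters of 8, 16 and 4 elements.
checkpoint = {
    "embed.weight": [float(i) for i in range(8)],
    "linear.weight": [float(i) for i in range(16)],
    "linear.bias": [float(i) for i in range(4)],
}

shards_a, peak_a = load_all_at_once(checkpoint, rank=1, world_size=4)
shards_b, peak_b = load_per_parameter(checkpoint, rank=1, world_size=4)

print(shards_a == shards_b)  # both strategies produce the same shards
print(peak_a, peak_b)        # unsharded elements held at once: whole checkpoint vs. largest parameter
```

In this model the transient footprint drops from the size of the whole checkpoint to the size of the largest single parameter, while the resulting shards are identical.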

Mini inference benchmark on Llama 3 8B (8xA100)
Before:
4.64 GB (peak memory usage), 13.43 seconds to load
Now:
3.20 GB (peak memory usage), 13.72 seconds to load

Llama 3 70B (8xA100)
Before:
OOM
Now:
20.01 GB (peak memory usage), 40.73 seconds to load

Benchmarks done with this LitGPT branch.
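
For a quick read of the Llama 3 8B numbers above, the relative changes can be computed as follows. This is only arithmetic on the reported figures; `relative_change` is a helper name introduced here for illustration.

```python
def relative_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after` (negative means a reduction)."""
    return (after - before) / before * 100.0


# Figures reported above for Llama 3 8B on 8xA100.
memory_change = relative_change(4.64, 3.20)   # peak memory in GB
time_change = relative_change(13.43, 13.72)   # load time in seconds

print(f"peak memory: {memory_change:+.1f}%")
print(f"load time:   {time_change:+.1f}%")
```

Peak memory falls by roughly 31% while load time rises by about 2%. No such comparison is possible for the 70B model, since the "before" run ran out of memory.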

cc @justusschock @awaelchli @carmocca @Borda

github-actions · 2024-05-15T13:52:08Z

⚡ Required checks status: All passing 🟢

Groups summary

🟢 pytorch_lightning: Tests workflow

Check ID	Status
pl-cpu (macOS-11, lightning, 3.8, 2.0, oldest)	success	✅
pl-cpu (macOS-11, lightning, 3.10, 2.0)	success	✅
pl-cpu (macOS-11, lightning, 3.10, 2.1)	success	✅
pl-cpu (macOS-11, lightning, 3.10, 2.2)	success	✅
pl-cpu (macOS-14, lightning, 3.10, 2.3)	success	✅
pl-cpu (ubuntu-20.04, lightning, 3.8, 2.0, oldest)	success	✅
pl-cpu (ubuntu-20.04, lightning, 3.10, 2.0)	success	✅
pl-cpu (ubuntu-20.04, lightning, 3.10, 2.1)	success	✅
pl-cpu (ubuntu-20.04, lightning, 3.10, 2.2)	success	✅
pl-cpu (ubuntu-20.04, lightning, 3.10, 2.3)	success	✅
pl-cpu (windows-2022, lightning, 3.8, 2.0, oldest)	success	✅
pl-cpu (windows-2022, lightning, 3.10, 2.0)	success	✅
pl-cpu (windows-2022, lightning, 3.10, 2.1)	success	✅
pl-cpu (windows-2022, lightning, 3.10, 2.2)	success	✅
pl-cpu (windows-2022, lightning, 3.10, 2.3)	success	✅
pl-cpu (macOS-11, pytorch, 3.8, 2.0)	success	✅
pl-cpu (ubuntu-20.04, pytorch, 3.8, 2.0)	success	✅
pl-cpu (windows-2022, pytorch, 3.8, 2.0)	success	✅
pl-cpu (macOS-12, pytorch, 3.11, 2.0)	success	✅
pl-cpu (macOS-12, pytorch, 3.11, 2.1)	success	✅
pl-cpu (ubuntu-22.04, pytorch, 3.11, 2.0)	success	✅
pl-cpu (ubuntu-22.04, pytorch, 3.11, 2.1)	success	✅
pl-cpu (windows-2022, pytorch, 3.11, 2.0)	success	✅
pl-cpu (windows-2022, pytorch, 3.11, 2.1)	success	✅

These checks are required after the changes to src/lightning/fabric/strategies/model_parallel.py.

🟢 pytorch_lightning: Azure GPU

Check ID	Status
pytorch-lightning (GPUs) (testing Lightning | latest)	success	✅
pytorch-lightning (GPUs) (testing PyTorch | latest)	success	✅

These checks are required after the changes to src/lightning/fabric/strategies/model_parallel.py.

🟢 pytorch_lightning: Benchmarks

Check ID	Status
lightning.Benchmarks	success	✅

These checks are required after the changes to src/lightning/fabric/strategies/model_parallel.py.

🟢 fabric: Docs

Check ID	Status
docs-make (fabric, doctest)	success	✅
docs-make (fabric, html)	success	✅

These checks are required after the changes to src/lightning/fabric/strategies/model_parallel.py.

🟢 lightning_fabric: CPU workflow

Check ID	Status
fabric-cpu (macOS-11, lightning, 3.8, 2.0, oldest)	success	✅
fabric-cpu (macOS-11, lightning, 3.10, 2.0)	success	✅
fabric-cpu (macOS-11, lightning, 3.11, 2.1)	success	✅
fabric-cpu (macOS-11, lightning, 3.11, 2.2)	success	✅
fabric-cpu (macOS-14, lightning, 3.10, 2.3)	success	✅
fabric-cpu (ubuntu-20.04, lightning, 3.8, 2.0, oldest)	success	✅
fabric-cpu (ubuntu-20.04, lightning, 3.10, 2.0)	success	✅
fabric-cpu (ubuntu-20.04, lightning, 3.11, 2.1)	success	✅
fabric-cpu (ubuntu-20.04, lightning, 3.11, 2.2)	success	✅
fabric-cpu (ubuntu-20.04, lightning, 3.11, 2.3)	success	✅
fabric-cpu (windows-2022, lightning, 3.8, 2.0, oldest)	success	✅
fabric-cpu (windows-2022, lightning, 3.10, 2.0)	success	✅
fabric-cpu (windows-2022, lightning, 3.11, 2.1)	success	✅
fabric-cpu (windows-2022, lightning, 3.11, 2.2)	success	✅
fabric-cpu (windows-2022, lightning, 3.11, 2.3)	success	✅
fabric-cpu (macOS-11, fabric, 3.8, 2.0)	success	✅
fabric-cpu (ubuntu-20.04, fabric, 3.8, 2.0)	success	✅
fabric-cpu (windows-2022, fabric, 3.8, 2.0)	success	✅
fabric-cpu (macOS-12, fabric, 3.11, 2.0)	success	✅
fabric-cpu (macOS-12, fabric, 3.11, 2.1)	success	✅
fabric-cpu (ubuntu-22.04, fabric, 3.11, 2.0)	success	✅
fabric-cpu (ubuntu-22.04, fabric, 3.11, 2.1)	success	✅
fabric-cpu (windows-2022, fabric, 3.11, 2.0)	success	✅
fabric-cpu (windows-2022, fabric, 3.11, 2.1)	success	✅

These checks are required after the changes to src/lightning/fabric/strategies/model_parallel.py, tests/tests_fabric/strategies/test_model_parallel_integration.py.

🟢 lightning_fabric: Azure GPU

Check ID	Status
lightning-fabric (GPUs) (testing Fabric | latest)	success	✅
lightning-fabric (GPUs) (testing Lightning | latest)	success	✅

These checks are required after the changes to src/lightning/fabric/strategies/model_parallel.py, tests/tests_fabric/strategies/test_model_parallel_integration.py.

🟢 mypy

Check ID	Status
mypy	success	✅

These checks are required after the changes to src/lightning/fabric/strategies/model_parallel.py.

🟢 install

Check ID	Status
install-pkg (ubuntu-22.04, app, 3.8)	success	✅
install-pkg (ubuntu-22.04, app, 3.11)	success	✅
install-pkg (ubuntu-22.04, fabric, 3.8)	success	✅
install-pkg (ubuntu-22.04, fabric, 3.11)	success	✅
install-pkg (ubuntu-22.04, pytorch, 3.8)	success	✅
install-pkg (ubuntu-22.04, pytorch, 3.11)	success	✅
install-pkg (ubuntu-22.04, lightning, 3.8)	success	✅
install-pkg (ubuntu-22.04, lightning, 3.11)	success	✅
install-pkg (ubuntu-22.04, notset, 3.8)	success	✅
install-pkg (ubuntu-22.04, notset, 3.11)	success	✅
install-pkg (macOS-12, app, 3.8)	success	✅
install-pkg (macOS-12, app, 3.11)	success	✅
install-pkg (macOS-12, fabric, 3.8)	success	✅
install-pkg (macOS-12, fabric, 3.11)	success	✅
install-pkg (macOS-12, pytorch, 3.8)	success	✅
install-pkg (macOS-12, pytorch, 3.11)	success	✅
install-pkg (macOS-12, lightning, 3.8)	success	✅
install-pkg (macOS-12, lightning, 3.11)	success	✅
install-pkg (macOS-12, notset, 3.8)	success	✅
install-pkg (macOS-12, notset, 3.11)	success	✅
install-pkg (windows-2022, app, 3.8)	success	✅
install-pkg (windows-2022, app, 3.11)	success	✅
install-pkg (windows-2022, fabric, 3.8)	success	✅
install-pkg (windows-2022, fabric, 3.11)	success	✅
install-pkg (windows-2022, pytorch, 3.8)	success	✅
install-pkg (windows-2022, pytorch, 3.11)	success	✅
install-pkg (windows-2022, lightning, 3.8)	success	✅
install-pkg (windows-2022, lightning, 3.11)	success	✅
install-pkg (windows-2022, notset, 3.8)	success	✅
install-pkg (windows-2022, notset, 3.11)	success	✅

These checks are required after the changes to src/lightning/fabric/strategies/model_parallel.py.

Thank you for your contribution! 💜

Note
This comment is automatically generated and updates for 60 minutes every 180 seconds. If you have any other questions, contact carmocca for help.

codecov · 2024-05-15T14:02:47Z

Codecov Report

Attention: Patch coverage is 21.05263%, with 15 lines in your changes missing coverage. Please review.

Project coverage is 59%. Comparing base (9455871) to head (bd2843f).

Additional details and impacted files

@@            Coverage Diff            @@
##           master   #19870     +/-   ##
=========================================
- Coverage      84%      59%    -25%     
=========================================
  Files         425      420      -5     
  Lines       35010    34925     -85     
=========================================
- Hits        29369    20527   -8842     
- Misses       5641    14398   +8757

lantiga

Looks great

github-actions bot added the fabric lightning.fabric.Fabric label May 15, 2024

awaelchli added refactor performance and removed fabric lightning.fabric.Fabric labels May 15, 2024

awaelchli added this to the 2.3 milestone May 15, 2024

github-actions bot added the fabric lightning.fabric.Fabric label May 15, 2024

Base automatically changed from examples/tp-ckpt to master May 15, 2024 12:19

github-actions bot added the pl Generic label for PyTorch Lightning package label May 15, 2024

awaelchli force-pushed the feature/tp-full-ckpt-load branch from 86d8d16 to 1170618 Compare May 15, 2024 12:21

memory-optimized loading of full checkpoints into dist model

1a0887e

awaelchli force-pushed the feature/tp-full-ckpt-load branch from 1170618 to 1a0887e Compare May 15, 2024 12:22

awaelchli and others added 6 commits May 15, 2024 14:26

simplify

d1e5bd1

handle buffers

6371999

[pre-commit.ci] auto fixes from pre-commit.com hooks

91664a1

for more information, see https://pre-commit.ci

handle strict loading, buffers, and add test

4f861f6

[pre-commit.ci] auto fixes from pre-commit.com hooks

fcb7fdb

for more information, see https://pre-commit.ci

chlog

bd2843f

awaelchli marked this pull request as ready for review May 15, 2024 13:51

awaelchli requested review from carmocca and justusschock as code owners May 15, 2024 13:51

awaelchli requested a review from lantiga May 15, 2024 13:52

awaelchli changed the title from "(3/n) Support 2D Parallelism - More efficient loading of full-state checkpoints" to "(3/n) Support 2D Parallelism - Efficient loading of full-state checkpoints" May 15, 2024

awaelchli commented May 15, 2024

src/lightning/fabric/strategies/model_parallel.py (resolved)

justusschock approved these changes May 15, 2024

lantiga approved these changes May 15, 2024

lantiga merged commit cd8acc2 into master May 15, 2024
127 of 132 checks passed

lantiga deleted the feature/tp-full-ckpt-load branch May 15, 2024 17:07

mergify bot added the ready PRs ready to be merged label May 15, 2024
