[ckpt-rewr] Save state dict API #3372

Merged
merged 48 commits into from
Jun 17, 2024
Conversation

@eracah eracah commented Jun 6, 2024

What does this PR do?

Implements an API for saving a state dict to disk given a sharded or full state dict:

  • save_state_dict_to_disk
  • _save_sharded_state_dict_to_disk
  • _save_full_state_dict_to_disk

Adds tests for both CPU and GPU.
Also modifies the testing infrastructure for comparing DTensors and ShardedTensors and for generating sharded test models.
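
For readers unfamiliar with the full versus sharded split, here is a minimal sketch of how a public entry point could dispatch to the two private helpers. Only the three function names come from the description above. The parameters, the explicit `sharded` flag, the file naming, and the use of `pickle` (as a stand-in for `torch.save` and `torch.distributed.checkpoint`) are all assumptions made for illustration, not the code in this PR.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional

# Illustrative only: function names mirror the PR description, but every
# parameter and body below is an assumption, not Composer's implementation.


def _save_full_state_dict_to_disk(
    state_dict: Dict[str, Any],
    destination_file_path: str,
    rank: int = 0,
) -> Optional[str]:
    """Write the whole state dict to a single file, from rank 0 only."""
    if rank != 0:
        return None  # other ranks hold the same data, so they skip the write
    path = Path(destination_file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(state_dict, f)  # stand-in for torch.save
    return str(path)


def _save_sharded_state_dict_to_disk(
    state_dict: Dict[str, Any],
    destination_dir_path: str,
    rank: int = 0,
) -> str:
    """Write this rank's shard to its own file inside a shared directory."""
    directory = Path(destination_dir_path)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"shard_rank{rank}.pkl"  # hypothetical naming scheme
    with path.open("wb") as f:
        pickle.dump(state_dict, f)  # stand-in for torch.distributed.checkpoint
    return str(path)


def save_state_dict_to_disk(
    state_dict: Dict[str, Any],
    destination: str,
    sharded: bool,
    rank: int = 0,
) -> Optional[str]:
    """Dispatch to the sharded or full helper (explicit flag is an assumption)."""
    if sharded:
        return _save_sharded_state_dict_to_disk(state_dict, destination, rank)
    return _save_full_state_dict_to_disk(state_dict, destination, rank)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        full = save_state_dict_to_disk(
            {"weight": [1.0, 2.0]}, str(Path(tmp) / "full" / "model.pkl"), sharded=False
        )
        shards = [
            save_state_dict_to_disk(
                {"weight": [float(r)]}, str(Path(tmp) / "sharded"), sharded=True, rank=r
            )
            for r in range(2)
        ]
        print(full)
        print(shards)
```

In this sketch the full path produces one file written by a single rank, while the sharded path produces one file per rank in a common directory, which is the distinction the two private helpers are named for.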

@eracah eracah marked this pull request as draft June 6, 2024 01:30
@eracah eracah marked this pull request as ready for review June 7, 2024 21:25
@eracah eracah requested review from bigning and mvpatel2000 June 7, 2024 21:25
@eracah eracah requested review from bigning and mvpatel2000 June 12, 2024 00:31
bigning commented Jun 12, 2024

Please wait for the code freeze

@mvpatel2000 mvpatel2000 left a comment

Holding for freeze

@mvpatel2000 mvpatel2000 left a comment

LGTM besides hold

@eracah eracah requested review from mvpatel2000 and bigning June 17, 2024 20:33
eracah commented Jun 17, 2024

> Holding for freeze

please release

@mvpatel2000 mvpatel2000 dismissed their stale review June 17, 2024 20:43

Dismissing

@eracah eracah enabled auto-merge (squash) June 17, 2024 22:44
@eracah eracah merged commit f1cfc64 into mosaicml:dev Jun 17, 2024
17 checks passed
mvpatel2000 pushed a commit to mvpatel2000/composer that referenced this pull request Jul 21, 2024
mvpatel2000 pushed a commit that referenced this pull request Jul 21, 2024

3 participants