
Pass correct node size for ZeRO++ #4085

Merged 4 commits into master, Aug 9, 2023

Conversation

@cmikeh2 (Contributor) commented Aug 3, 2023

No description provided.

@cmikeh2 cmikeh2 enabled auto-merge August 3, 2023 21:18
@samadejacobs (Contributor) left a comment

LGTM

@cmikeh2 cmikeh2 added this pull request to the merge queue Aug 9, 2023
Merged via the queue into master with commit f0463b4 Aug 9, 2023
hughpu pushed a commit to hughpu/DeepSpeed that referenced this pull request Aug 9, 2023
* Pass correct node size

* formatting

---------

Co-authored-by: Connor Holmes <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
github-merge-queue bot pushed a commit that referenced this pull request Aug 29, 2023
…port inputs require no grad (#4118)

* feat: add `non_reentrant_checkpoint`

* feat: add missing output postprocess and change the hook to record leaf forward tensor refs

* fix: make the multi_grad_hook registered after graph construction

* fix: backward compatibility for multi_tensor_hook

* fix: nonlocal reference error of deepspeed_saved_tensors

* fix: reduce repeating hook registration

* test: add test for `activation_checkpointing.checkpointing.non_reentrant_checkpoint`

* Pass correct node size for ZeRO++ (#4085)

* Pass correct node size

* formatting

---------

Co-authored-by: Connor Holmes <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>

* add deepspeed chat arxiv report (#4110)

* add deepspeed chat arxiv report

* add zeroquant v2 and fp

* add selective enhancement

* add ignore for 'Youn' in spell checker

---------

Co-authored-by: yaozhewei <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>

* style: fix flake8-detected style mismatch

* test: hack to clone the `test_activation_checkpointing` module for reuse and add regression tests

* doc: explain the introduction of `non_reentrant_checkpoint`

* doc: explain the test of `non_reentrant_checkpoint`

---------

Co-authored-by: Connor Holmes <[email protected]>
Co-authored-by: Connor Holmes <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
Co-authored-by: Conglong Li <[email protected]>
Co-authored-by: yaozhewei <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
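The commit series above adds a `non_reentrant_checkpoint` to DeepSpeed's activation checkpointing and later applies it to pipeline-parallel training. As a rough illustration of the general technique (not DeepSpeed's implementation), PyTorch's built-in checkpoint utility exposes the same reentrant vs. non-reentrant choice via its `use_reentrant` flag:

```python
import torch
from torch.utils.checkpoint import checkpoint

# A small module whose activations we checkpoint: its intermediate
# outputs are discarded in the forward pass and recomputed in backward.
class Block(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(4, 4)

    def forward(self, x):
        return torch.relu(self.lin(x))

block = Block()
x = torch.randn(2, 4, requires_grad=True)

# Non-reentrant checkpointing: implemented with saved-tensor hooks
# rather than a custom autograd Function, so backward flows through
# the normal autograd graph.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```

The non-reentrant variant is what allows checkpointed segments whose inputs do not require grad (one of the cases the #4118 commit messages mention), since gradients reach the segment through the surrounding graph instead of requiring a differentiable input to trigger recomputation.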
github-merge-queue bot pushed a commit that referenced this pull request Sep 6, 2023
… true for activation checkpoint layer in pipeline train. (#4224)

* feat: add `non_reentrant_checkpoint`

* feat: add missing output postprocess and change the hook to record leaf forward tensor refs

* fix: make the multi_grad_hook registered after graph construction

* fix: backward compatibility for multi_tensor_hook

* fix: nonlocal reference error of deepspeed_saved_tensors

* fix: reduce repeating hook registration

* test: add test for `activation_checkpointing.checkpointing.non_reentrant_checkpoint`

* Pass correct node size for ZeRO++ (#4085)

* Pass correct node size

* formatting

---------

Co-authored-by: Connor Holmes <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>

* add deepspeed chat arxiv report (#4110)

* add deepspeed chat arxiv report

* add zeroquant v2 and fp

* add selective enhancement

* add ignore for 'Youn' in spell checker

---------

Co-authored-by: yaozhewei <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>

* style: fix flake8-detected style mismatch

* test: hack to clone the `test_activation_checkpointing` module for reuse and add regression tests

* doc: explain the introduction of `non_reentrant_checkpoint`

* doc: explain the test of `non_reentrant_checkpoint`

* apply non_reentrant_checkpoint in pipeline parallel training

* ut pass

* fix ci

* reduce check level for ci

---------

Co-authored-by: hughpu <[email protected]>
Co-authored-by: Hugh Pu <[email protected]>
Co-authored-by: Connor Holmes <[email protected]>
Co-authored-by: Connor Holmes <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
Co-authored-by: Conglong Li <[email protected]>
Co-authored-by: yaozhewei <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
3 participants