Using explicit GPU upcast for ZeRO-Offload #6962

xylian86 · 2025-01-20T13:25:37Z

Following the discussion in PR-6670, explicit upcast is much more efficient than implicit upcast. This PR replaces the implicit upcast with an explicit one.

Results on a 3B model are shown below:

| Option | BWD (ms) | Speedup |
|------------|-----|------|
| Before PR-6670 | 25603.30 | 1x |
| After PR-6670 | 1174.31 | 21.8x |
| After this PR | 309.2 | 82.8x |
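
For readers unfamiliar with the two terms, here is a minimal PyTorch sketch of the difference. It is an illustration only, not the code changed in this PR: the function names and tensor sizes are hypothetical, and it falls back to CPU when no GPU is available. In the implicit form, `copy_` performs the dtype conversion as part of the copy itself. In the explicit form, the low-precision tensor is first cast to fp32 on its own device, so the copy that follows is a same-dtype transfer.

```python
import torch


def implicit_upcast_copy(dst_fp32: torch.Tensor, src_lowp: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: copy_ converts bf16 -> fp32 as part of the copy.
    dst_fp32.copy_(src_lowp)
    return dst_fp32


def explicit_upcast_copy(dst_fp32: torch.Tensor, src_lowp: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: cast on the source device first (the GPU, when
    # one is present), then do a same-dtype copy into the destination.
    dst_fp32.copy_(src_lowp.to(dtype=dst_fp32.dtype))
    return dst_fp32


# Assumption: gradients live on the GPU in bf16 and the fp32 buffer lives on the host.
device = "cuda" if torch.cuda.is_available() else "cpu"
grad_lowp = torch.randn(1024, device=device).to(torch.bfloat16)

host_a = torch.empty(1024, dtype=torch.float32)
host_b = torch.empty(1024, dtype=torch.float32)

implicit_upcast_copy(host_a, grad_lowp)
explicit_upcast_copy(host_b, grad_lowp)

# Both paths produce identical fp32 values; only where the cast runs differs.
results_match = torch.equal(host_a, host_b)
```

The two paths are numerically equivalent, so the choice between them is purely a performance question. The timings in the table above come from the PR author's measurements, not from this sketch.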

using explicit GPU upcast

06d4135

xylian86 requested review from tjruwase and tohtana as code owners January 20, 2025 13:25

xylian86 mentioned this pull request Jan 20, 2025

fix memcpy issue on backward for zero-infinity #6670

Merged

tjruwase approved these changes Jan 20, 2025

loadams enabled auto-merge January 21, 2025 18:16

loadams added this pull request to the merge queue Jan 21, 2025

Merged via the queue into deepspeedai:master with commit c17dc33 Jan 21, 2025
13 checks passed
