ZeRO-Inference refresh #722

Merged
Merged 19 commits into master from staging-zero-inference-v1 on Sep 11, 2023

Conversation

tjruwase (Contributor) commented Sep 8, 2023

Refresh with 2 new optimizations: weight quantization and kv cache offloading to CPU.
Companion DS PR: microsoft/DeepSpeed#4197
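
For readers unfamiliar with the first optimization, here is a minimal sketch of what group-wise weight quantization looks like in principle. This is illustrative PyTorch only, not the actual DeepSpeed implementation (see the companion PR for that); the function names, the 4-bit width, and the group size of 64 are assumptions chosen for the example.

```python
# Minimal sketch of group-wise symmetric weight quantization -- illustrative
# only, NOT the actual DeepSpeed implementation (see microsoft/DeepSpeed#4197).
import torch

def quantize_groupwise(weight: torch.Tensor, num_bits: int = 4, group_size: int = 64):
    """Quantize a 2-D weight matrix with one scale per group of columns."""
    out_f, in_f = weight.shape
    w = weight.reshape(out_f, in_f // group_size, group_size)
    qmax = 2 ** (num_bits - 1) - 1                      # e.g. 7 for 4-bit
    scale = (w.abs().amax(dim=-1, keepdim=True) / qmax).clamp_min(1e-8)
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize_groupwise(q: torch.Tensor, scale: torch.Tensor, shape):
    """Recover an approximate float weight on the fly before the matmul."""
    return (q.float() * scale).reshape(shape)

w = torch.randn(8, 128)
q, s = quantize_groupwise(w)
w_hat = dequantize_groupwise(q, s, w.shape)
print((w - w_hat).abs().max())  # small per-element quantization error
```

The payoff is memory: 4-bit storage shrinks the weight footprint roughly 4x relative to fp16, at the cost of a dequantization step before each use.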

tjruwase (Contributor, Author) commented Sep 8, 2023

FYI, @donglinz @cli99

tjruwase force-pushed the staging-zero-inference-v1 branch from 9f072a8 to 9974256 on September 8, 2023 at 13:32
tjruwase force-pushed the staging-zero-inference-v1 branch from 0e51a19 to 6baa73b on September 9, 2023 at 18:50
tjruwase merged commit 12f78ec into master on Sep 11, 2023
3 checks passed
Ying1123 commented

@tjruwase It's wonderful to see the optimizations you've integrated into DeepSpeed ZeRO-Inference, especially without the need for custom APIs. This is great work!

One minor suggestion: if you used ideas from FlexGen (e.g., cache offloading), it would be good to also add FlexGen to the reference section. I'm also happy to discuss if you have any problems with the other optimizations (partial offloading, cache quantization).
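
For context, the cache offloading being discussed works roughly as follows: keep the full KV cache in pinned host memory and stream only the layer currently being computed onto the GPU. The sketch below is an illustrative PyTorch outline under that assumption; OffloadedKVCache and its methods are hypothetical names, not FlexGen's or DeepSpeed's actual code.

```python
# Rough illustration of KV cache offloading to CPU -- hypothetical sketch,
# not FlexGen's or DeepSpeed's actual implementation.
import torch

class OffloadedKVCache:
    def __init__(self, num_layers, max_tokens, num_heads, head_dim):
        # The full cache lives in pinned CPU memory; pinning enables fast,
        # asynchronous host<->device copies.
        shape = (num_layers, 2, max_tokens, num_heads, head_dim)
        self.cache = torch.empty(shape, dtype=torch.float16, pin_memory=True)
        self.copy_stream = torch.cuda.Stream()

    def fetch(self, layer: int, num_tokens: int) -> torch.Tensor:
        # Bring one layer's K/V to the GPU; issuing the copy on a side
        # stream lets it overlap with compute on other layers.
        with torch.cuda.stream(self.copy_stream):
            kv = self.cache[layer, :, :num_tokens].to("cuda", non_blocking=True)
        torch.cuda.current_stream().wait_stream(self.copy_stream)
        return kv  # shape: (2, num_tokens, num_heads, head_dim)

    def store(self, layer: int, pos: int, k: torch.Tensor, v: torch.Tensor):
        # Write the newly generated token's K/V back to the CPU buffer.
        self.cache[layer, 0, pos].copy_(k, non_blocking=True)
        self.cache[layer, 1, pos].copy_(v, non_blocking=True)
```

This trades PCIe bandwidth for GPU memory: only one layer's K/V needs to be resident on the device at a time, which is what makes long-sequence inference fit on a single GPU.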

tjruwase (Contributor, Author) commented
@Ying1123, thanks for the kind words. You are right that we should add FlexGen to the references, since it was our inspiration for exploring cache offloading and weight quantization. Sorry for the oversight.

Is the following the best reference? https://arxiv.org/abs/2303.06865

Ying1123 commented

> @Ying1123, thanks for the kind words. You are right that we should add FlexGen to the references, since it was our inspiration for exploring cache offloading and weight quantization. Sorry for the oversight.
>
> Is the following the best reference? https://arxiv.org/abs/2303.06865

Yes, this is the most up-to-date paper. Thanks!

LeetJoe pushed a commit to LeetJoe/DeepSpeedExamples that referenced this pull request Sep 15, 2023
* Add zero inference

* Fix scripts

* Fix scripts

* Fix scripts

* Fix versioning text

* Shrink figure

* Shrink figure

* Shrink figure

* Generality

* :q

* Tweak repro scripts and README

* Fix versions

* Fix rqmts

* README tweak

* Cleanup

* Rearrange README

* Versioning

* cleanup