ZeRO-Inference refresh #722
Conversation
@tjruwase It's wonderful to see the optimizations you've integrated into DeepSpeed ZeRO-Inference, especially without the need for custom APIs. This is great work! One minor suggestion: if you used ideas from FlexGen (e.g., cache offloading), it might be better to also add FlexGen to the reference section. I'm also happy to discuss the other optimizations (partial offloading, cache quantization) if you have any questions.
@Ying1123, thanks for the kind words. You are correct that we should add FlexGen to the references, since it was our inspiration for exploring cache offloading and weight quantization. Sorry about the oversight. Is the following the best reference? https://arxiv.org/abs/2303.06865
Yes, this is the most up-to-date paper. Thanks!
Commits:

* Add zero inference
* Fix scripts
* Fix scripts
* Fix scripts
* Fix versioning text
* Shrink figure
* Shrink figure
* Shrink figure
* Generality
* :q
* Tweak repro scripts and README
* Fix versions
* Fix rqmts
* README tweak
* Cleanup
* Rearrange README
* Versioning
* cleanup
Refresh with two new optimizations: weight quantization and KV cache offloading to CPU.
Companion DS PR: microsoft/DeepSpeed#4197
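
For readers trying this out, here is a minimal sketch of how ZeRO-Inference is typically driven from a Hugging Face model. The ZeRO-3 `offload_param` section and the `deepspeed.initialize` pattern are standard DeepSpeed usage; the `weight_quantization`/`quantized_initialization` keys are an assumed shape for the new weight-quantization path and should be verified against the companion DeepSpeed PR, and the model name is just an example.

```python
# Minimal sketch of ZeRO-Inference on a Hugging Face model.
# NOTE: the "weight_quantization" section below is an ASSUMED config
# shape for the new weight quantization; KV cache offloading is driven
# by the example scripts rather than by a config key shown here.
# Verify both against microsoft/DeepSpeed#4197.
import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # example; the repro scripts cover larger models

ds_config = {
    "fp16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,  # ZeRO-3 parameter partitioning
        "offload_param": {"device": "cpu", "pin_memory": True},  # host weights on CPU
    },
    # Assumed keys for the new weight-quantization path:
    "weight_quantization": {
        "quantized_initialization": {"num_bits": 4, "group_size": 64},
    },
}

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# deepspeed.initialize returns (engine, optimizer, dataloader, scheduler);
# only the engine is needed for inference.
engine = deepspeed.initialize(model=model, config=ds_config)[0]
engine.module.eval()

prompt = "DeepSpeed ZeRO-Inference makes it possible to"
inputs = tokenizer(prompt, return_tensors="pt").to(engine.device)
with torch.no_grad():
    output = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

With ZeRO-3 parameter offload, only the working set of weights resides on the GPU at any time, which is what lets a single GPU serve models much larger than its device memory.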