[REQUEST] Add more device-agnostic compression algorithms #2894
Comments
Thanks for the great proposal; we appreciate your contribution here :). We will discuss it internally and get back to you soon. Best,
Hi there, The proposal looks great to us. For the pruning/sparsification proposal, we wonder if the return of a callback is needed or if we can just use something simpler. For the quantization proposal, post-training quantization is in some sense already implemented. Users can use the static activation quantization method with a few batches of inference to get the calibration. See here for more details: Static Act Quantization. Am I missing something here? Looking forward to hearing from you,
@yaozhewei thanks for the valuable feedback. We are evaluating whether we can remove or enhance the callback as you suggested, and will get back to you soon. As for post-training quantization support, from the code and DeepSpeedExamples, all we saw was related to compression_training; we didn't see a pure post-training quantization example. Could you please point us to the link to check?
@ftian1 It is always good to have more examples, and we would appreciate it if you added more :)
@yaozhewei per our investigation, we found it's doable to remove those explicit callbacks. We are preparing a PR for further review and will ping you when it's ready. As for the calibration-based PTQ example, we will take a look at that. Thanks for the advice.
Thanks a lot @ftian1. Looking forward to the PR :)
@yaozhewei I took a look at the post-training static quantization code implementation here. From the logic, DeepSpeed Static Act Quantization MUST rely on training to collect calibration data. I am wondering if it would be valuable to contribute to DeepSpeed a pure post-training static quantization, which only involves the inference phase and requires the user to explicitly pass in a calibration dataset?
That sounds good to me :)
Summary
This is a design-discussion RFC for contributing into DeepSpeed some device-agnostic compression algorithms supported by Intel(R) Neural Compressor, such as post-training quantization (QDQ quant format) and structural sparsity.
Motivation
As we know, DeepSpeed Compression already supports many useful compression methods, such as layer reduction via knowledge distillation, weight quantization, activation quantization, sparse pruning, row pruning, head pruning, and channel pruning.
However, it still lacks some mainstream compression algorithms, such as post-training static quantization and structural sparsity, which have been demonstrated by the industry to be efficient and popular.
Intel(R) Neural Compressor has implemented such device-agnostic compression algorithms, and we would like to contribute them to DeepSpeed.
Proposal Details on Pruning
In this proposal, we would like to introduce structural pruning functionality by enabling the "N in M" and "N x M" block sparsity patterns with the snip_momentum criterion and progressive pruning.
We propose two-phase support for the structural pruning method.
Phase 1: structural pruning with global sparse ratio
This approach leverages the existing DeepSpeed sparsity design, which has global sparse ratio control. If the accuracy doesn't meet expectations, the user has to tune the training process as they do on DeepSpeed today, by manually specifying and exploring the proper sparse ratio per layer.
We extend the JSON config file format and implement the structural sparsity algorithm in the `compression` dir, as sketched below.
As for the structural sparsity implementation in the `compression` dir, let's take the `LinearLayer_Compress` class in `deepspeed/compression/basic_layer.py` as an example: this class is enhanced to support the structural sparsity algorithm, as sketched after the note below.
NOTE: In this phase 1, the DeepSpeed user-facing API stays unchanged. The only change the user needs to be aware of is the extended JSON file format.
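The class-level changes were likewise shown as code in the original post. Below is a minimal, hypothetical sketch of how `LinearLayer_Compress` could be extended; the method names (`enable_structured_sparse`, `update_sparse_mask`) and the simplified block-magnitude scoring are assumptions, since the real snip_momentum criterion also folds in gradient/momentum statistics.

```python
# Hypothetical sketch only; method and attribute names are illustrative,
# not the actual DeepSpeed implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearLayer_Compress(nn.Linear):
    """Linear layer extended with structural (block) sparsity support."""

    def enable_structured_sparse(self, block_rows=4, dense_ratio=0.1):
        # Remember the "N x M" pattern (here: blocks of 4 rows x 1 col)
        # and the fraction of blocks to keep dense.
        self.block_rows = block_rows
        self.dense_ratio = dense_ratio
        self.register_buffer('sparse_mask', torch.ones_like(self.weight))

    def update_sparse_mask(self):
        # Simplified scoring: rank blocks by L1 magnitude and keep the top
        # dense_ratio fraction. snip_momentum would also use gradient stats.
        rows, cols = self.weight.shape  # assumes rows % block_rows == 0
        scores = self.weight.detach().abs().reshape(
            rows // self.block_rows, self.block_rows, cols).sum(dim=1)
        k = max(1, int(scores.numel() * self.dense_ratio))
        threshold = scores.flatten().topk(k).values.min()
        self.sparse_mask = (scores >= threshold).float().repeat_interleave(
            self.block_rows, dim=0)

    def forward(self, input):
        # Apply the (block-structured) mask to the weights on the fly.
        weight = self.weight
        if hasattr(self, 'sparse_mask'):
            weight = weight * self.sparse_mask
        return F.linear(input, weight, self.bias)
```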
Phase 2: Advanced structural pruning with fine-grained sparse ratio control per layer
This advanced algorithm adaptively adjusts the sparse ratio per layer to reach higher accuracy.
This approach extends the `initialize()` API to return one more value, `callbacks`, besides `engine`, `optimizer`, `training_dataloader`, and `lr_scheduler`. The JSON config file needs to be adjusted accordingly.
The `callbacks` object returned by `initialize()` is used to register hooks for the user into the normal training process. The user needs to manually insert such hooks into their training code for fine-grained sparsity control per layer, as sketched below.
Structural Sparsity Results
Recommendation
We recommend splitting this contribution into two phases:
The first phase focuses on adding the complete set of structural sparsity methods supported by Intel(R) Neural Compressor into DeepSpeed with minor changes.
This provides the complete structural sparsity capability with a global sparse ratio setting, making it easy for customers to pilot the structural sparsity feature.
The second phase focuses on productivity improvement by supporting adaptive sparse ratio adjustment to cover a broader range of pruning algorithms.
This adds the capability of automatically adjusting the sparse ratio per layer for better accuracy. It can greatly improve productivity for customers who want high sparsity but must meet a strict accuracy goal.
Proposal Details on Quantization
In this proposal, we would like to enhance the quantization functionality by integrating the device-agnostic post-training static & dynamic quantization (QDQ quant format) supported by Intel(R) Neural Compressor into DeepSpeed.
As the current DeepSpeed implementation focuses on simulating quantization behavior during training, we propose adding post-training quantization through the changes below.
Besides the changes in the compression config file, we also need to introduce a new function, `quantize`, to support post-training quantization. The function and its usage would look like the sketch below.
Quantization Results
As for the post-training quantization results, please refer to this link.
Future Work
We have enabled new quantization algorithms like SmoothQuant in Intel(R) Neural Compressor and applied them to popular large language models such as BLOOM-176B. We plan to bring these new features into the DeepSpeed compression library as part of our future work.