This repository accompanies the NeurIPS 2020 paper on WoodFisher: Efficient Second-Order Approximation for Neural Network Compression (Singh & Alistarh, 2020).
The code is based on PyTorch and has been tested on version 1.2.0+cu92.
- `woodfisher`: Full WoodFisher; computes both the off-diagonal and diagonal entries of the Hessian inverse.
- `woodtaylor`: Full WoodTaylor; computes both the off-diagonal and diagonal entries of the Hessian inverse. In addition to WoodFisher (which is based on the second-order term of the Taylor series), it also considers the first-order term (i.e., the one containing the gradient) in the Taylor series.
- `woodfisherblock`: Same as `woodfisher`, except that it only considers blocks of size `--fittable-params` along the diagonal.
- `woodtaylorblock`: Same as `woodtaylor`, except that it only considers blocks of size `--fittable-params` along the diagonal.
- `woodfisherblockdynamic`: Same functionality as `woodfisherblock`, but more scalable. Instead of zeroing the inactive/pruned parameters while estimating the Hessian inverse during gradual pruning, it only maintains the active parameters. This is done by storing their indices, which can be thought of as doing computations on an adjacency-graph representation instead of the Hessian matrix.
- `woodfisherblock_flops`: Takes FLOP costs into account when deciding whether to prune a parameter. It comes in two modes: in one, you specify `--flops-power` to reweigh the pruning statistic; the other directly takes the level of FLOPs you require for the final model via `--flops-target`.
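The WoodFisher variants above all need an estimate of the inverse Hessian. As described in the paper, the Hessian is approximated by the dampened empirical Fisher, and its inverse is built one gradient at a time with the Woodbury (Sherman-Morrison) identity. Below is a minimal, dependency-free sketch of that recursion on plain Python lists. The function name, the `damp` parameter, and the dense `d`-by-`d` layout are illustrative assumptions, not the repository's PyTorch implementation.

```python
from typing import List

Matrix = List[List[float]]


def woodbury_inverse_fisher(grads: List[List[float]], damp: float = 1e-4) -> Matrix:
    """Inverse of F = damp * I + (1/N) * sum_n g_n g_n^T, built one gradient at a time.

    Hypothetical helper for illustration only; not part of the repository.
    """
    n, d = len(grads), len(grads[0])
    # Start from the inverse of the dampening term alone: (damp * I)^-1.
    inv = [[(1.0 / damp if i == j else 0.0) for j in range(d)] for i in range(d)]
    for g in grads:
        # v = F^-1 g, using the current estimate of the inverse.
        v = [sum(inv[i][j] * g[j] for j in range(d)) for i in range(d)]
        denom = n + sum(g[i] * v[i] for i in range(d))
        # Rank-one correction; inv is symmetric, so F^-1 g g^T F^-1 = v v^T.
        for i in range(d):
            for j in range(d):
                inv[i][j] -= v[i] * v[j] / denom
    return inv
```

Each step costs `O(d^2)` and never forms or inverts the Fisher matrix explicitly. Restricting the same recursion to blocks along the diagonal is what the `*block` variants correspond to.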
- `magni`: Magnitude-based automated gradual pruner.
- `globalmagni`: Global-magnitude-based automated gradual pruner.
- `naivehess`: Naive Hessian-based automated gradual pruner (computes the diagonal of the Hessian via double backprop).
- `diagfisher`: Diagonal Fisher-based automated gradual pruner.
- `kfac`: K-FAC-based automated gradual pruner (blockwise with respect to layers).
- `main.py` is the main file to run pruning from. You will need to provide data and config paths and specify dataset and architecture names.
- `configs/` contains the yaml config files we use for specifying training and pruning schedules.
- `models/` contains the currently available models. To add a new model, have it be loaded by the `get_model` function in `models/__init__.py`.
- `policies/` contains `pruner`-specific policies, `trainer`-specific policies, and a `Manager` class which executes these policies as specified in a given config.
- `tests/` contains some tests for the above functionalities.
- `utils/` contains utilities for loading datasets, masking layers for pruning, and performing helper computations.
- `scripts/` contains bash scripts for executing the corresponding `WoodFisher` results mentioned in the paper. Some examples for `Global Magnitude` are also provided, but you can get the `Global Magnitude` version of the other scripts by simply setting the `PRUNER` flag in the bash file: `PRUNER=globalmagni`.
We use a yaml config file to specify all things related to training and pruning: choice of pruners and optimizers, scheduling, and so on. Writing this file is best explained with an example, which you can find in `configs/example_config.yaml`. As shown there, the file should be split into two top-level sections: `pruners` and `trainers` (in any order). Second-level entries for both `pruners` and `trainers` list names of instances that are used to refer to them from the runner scripts. These names can be arbitrarily chosen by the user; however, if you are using only one model for training and pruning (which is most often the case), it is easiest to name your single trainer `default_trainer` to avoid having to specify a custom name for the runner script. For example, `example_config.yaml` has two pruners, `pruner_1` and `pruner_2`, and a `default_trainer`.
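The layout described above can be checked on an already-parsed config with a few lines of Python. This is only a sketch: `check_config_layout` is a hypothetical helper, not something the repository provides, and the contents of each named entry are left empty because their schema is defined by `configs/example_config.yaml`, not by this example.

```python
from typing import Dict, List


def check_config_layout(config: Dict) -> List[str]:
    """Return notes about the top-level layout; an empty list means it looks fine.

    Hypothetical helper for illustration only; not part of the repository.
    """
    notes = []
    # Both top-level sections must exist and contain at least one named instance.
    for section in ("pruners", "trainers"):
        entries = config.get(section)
        if not isinstance(entries, dict) or not entries:
            notes.append(f"missing or empty top-level section: {section}")
    # A single trainer is easiest to use when it is called default_trainer.
    trainers = config.get("trainers")
    if isinstance(trainers, dict) and len(trainers) == 1 and "default_trainer" not in trainers:
        notes.append("single trainer is not named default_trainer")
    return notes


# Same instance names as example_config.yaml; the entry bodies are placeholders.
example = {
    "pruners": {"pruner_1": {}, "pruner_2": {}},
    "trainers": {"default_trainer": {}},
}
print(check_config_layout(example))
```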
For convenience, you can also update the config file on the fly via command-line arguments. To do so, pass `--update-config` together with whichever arguments you want to override, e.g., `--prune-wdecay 0` to set the weight decay to zero.
For the results in the paper, we fix a budget of 100 epochs for the entire gradual pruning process (i.e., including finetuning). However, you might want to adjust this schedule depending on your needs, either compressing or lengthening it.
This can be achieved by using the flags `--prune-start`, `--prune-freq`, `--prune-end`, and `--epochs`.
More details on these flags can be found in the argument descriptions contained in the `options.py` file.
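To see how the four schedule flags interact, the sketch below lists the epochs at which a pruning step would fire and how many epochs remain for finetuning. It assumes that pruning happens every `--prune-freq` epochs starting at `--prune-start` and that `--prune-end` is exclusive; the actual semantics are defined in `options.py` and may differ. The function names are hypothetical.

```python
from typing import List


def pruning_epochs(prune_start: int, prune_freq: int, prune_end: int) -> List[int]:
    """Epochs at which a pruning step fires (assumption: prune_end is exclusive)."""
    if prune_freq <= 0:
        raise ValueError("prune_freq must be positive")
    return list(range(prune_start, prune_end, prune_freq))


def finetune_epochs(prune_end: int, epochs: int) -> int:
    """Epochs left for finetuning once the pruning window has closed."""
    return max(epochs - prune_end, 0)


# Hypothetical schedule within a 100-epoch budget.
print(pruning_epochs(prune_start=10, prune_freq=10, prune_end=50))  # [10, 20, 30, 40]
print(finetune_epochs(prune_end=50, epochs=100))                    # 50
```

Shortening `--epochs` while keeping the pruning window fixed only cuts into the finetuning tail, whereas moving `--prune-end` earlier trades pruning steps for finetuning time.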
- `--disable-wdecay-after-prune`, bool: Disables weight decay once the pruning part has finished and only retraining remains. Empirically, this seems to further improve performance, even beyond the numbers reported in the paper.
- `--scale-prune-update`, float: Scales the readjustment of the other weights in the OBS (Optimal Brain Surgeon) formulation. The default value is `1.0`. Setting it to `0` should enable the OBD (Optimal Brain Damage) formulation instead of OBS. You can also experiment with other values.
- `--label-smoothing`, float: Enables label smoothing in the loss. The default value is `0`, so label smoothing is disabled. Our reported results don't use label smoothing, but you can try it for slight further improvements.
- Recomputation of Hessian: An additional feature that breaks a single pruning step into multiple parts, with the inverse Hessian recomputed after each part. The motivation is that the second-order Taylor series approximation holds only in a small neighborhood of the current parameters. To use this, append additional flags such as the ones below to the usual command.
--recompute-num 2 --recompute-schedule poly --recompute-degree 2
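For reference, the OBS formulation that `--scale-prune-update` refers to can be written out as follows. The first line is the standard OBS pruning statistic and weight update for removing weight `q`, with `H` the Hessian and `e_q` the `q`-th basis vector. The second line is only an assumed reading of how the scaling factor (written here as `\alpha`, our notation) enters; the code is the authority on the exact form.

```latex
% Standard OBS: statistic for removing weight q, and the compensating update
\rho_q = \frac{w_q^2}{2\,[\mathbf{H}^{-1}]_{qq}},
\qquad
\delta\mathbf{w} = -\frac{w_q}{[\mathbf{H}^{-1}]_{qq}}\,\mathbf{H}^{-1}\mathbf{e}_q

% Assumed effect of --scale-prune-update = \alpha on the other weights (i \neq q)
\delta w_i = -\alpha\,\frac{w_q}{[\mathbf{H}^{-1}]_{qq}}\,[\mathbf{H}^{-1}]_{iq},
\qquad
\delta w_q = -w_q
```

Under this reading, `\alpha = 1` recovers the full OBS update, while `\alpha = 0` removes weight `q` without readjusting the remaining weights, as in OBD.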
If you are pruning moderate to large-sized networks, there are some additional things you might want to take care of in order to run `WoodFisher`-based pruning variants. The flags below will likely be useful in this scenario:
- `--fittable-params`: One of the key flags; it controls the block/chunk size to consider along the diagonal of the Hessian. If this chunk size is `c`, then the memory consumed is of the order `O(cd)`, where `d` denotes the total number of parameters in the model. Hence, choose `c` based on the model size and the amount of memory available. E.g., for `ResNet50`, which has `~25M` params, a chunk size of `1000-2000` should be good, while for `MobileNetV1`, which has `~4.5M` params, a decent chunk size would be between `5000-20000`.
- `--offload-inv`: Offloads the Hessian inverse, computed at an intermediate step during pruning, onto the CPU.
- `--offload-grads`: Offloads the loss gradients, computed over the various samples, onto the CPU.
- `--fisher-cpu`: Shifts even the computation of some parts onto the CPU.
- `--fisher-split-grads`: Trades off speed and memory while doing the Woodbury updates on the GPU. This is not required if you are using the CPU anyway. The flag expects an integer argument via an additional flag, `--fisher-parts`, and reduces the memory requirement from `2 * Hessian size` to `(1 + 1/k) * Hessian size`, where `k` stands for the value of the `--fisher-parts` flag (default value is 5).
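The two memory rules of thumb in the list above, `O(cd)` for chunk size `c` and `(1 + 1/k)` for `--fisher-parts`, can be sanity-checked with a little arithmetic. The sketch below assumes the inverse is stored as dense `c`-by-`c` blocks along the diagonal. The helper names are hypothetical and not part of the repository, and the real footprint also depends on the data type and on which layers are pruned.

```python
def block_inverse_entries(num_params: int, chunk_size: int) -> int:
    """Entries stored when the inverse is kept as c-by-c diagonal blocks (at most c * d)."""
    full_blocks, remainder = divmod(num_params, chunk_size)
    # Full blocks hold c^2 entries each; a smaller trailing block covers the remainder.
    return full_blocks * chunk_size ** 2 + remainder ** 2


def split_grads_factor(parts: int) -> float:
    """Memory in units of the Hessian size when the update is split into `parts` pieces."""
    if parts <= 0:
        raise ValueError("parts must be positive")
    return 1.0 + 1.0 / parts


# Toy sizes: d = 10 parameters, c = 4 gives two 4x4 blocks plus one 2x2 block.
print(block_inverse_entries(num_params=10, chunk_size=4))  # 36, bounded by c * d = 40
print(split_grads_factor(parts=5))                         # 1.2, versus 2 without splitting
```

Because the entry count grows linearly in `c` for a fixed model, doubling the chunk size roughly doubles the memory needed for the inverse, which is why larger models call for smaller chunks.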
First, clone the repository by running the following command:
$ git clone https://github.com/IST-DASLab/WoodFisher
After that, do not forget to run `source setup.sh`.
First of all, ensure that your `torch` package has version 1.1.0 or above. Then install the nightly release of `tensorboard`:
$ pip install tb-nightly
After that, ensure that the `future` package is installed, or install it by typing the following command in a terminal:
$ pip install future
We welcome contributions from the community to further enrich the current codebase, from improving efficiency and adding support for additional network types to matters of aesthetics. Feel free to send a pull request in such a scenario, possibly alongside the csv file generated by running the tests contained in the `tests/` folder. An example `results.csv` is also provided, against which you can match your results to see if everything is still alright (expect only a rough match, as the exact numbers depend on the PyTorch version, the inherent randomness across platforms, etc.).
Some examples of pull requests are labelled in the code as PR.
We thank Alex Shevchenko and Ksenia Korovina for providing an initial framework that we could re-purpose for the implementation of our work.
This codebase corresponds to the paper: WoodFisher: Efficient Second-Order Approximation for Neural Network Compression. If you use any of the code or provided models for your research, please consider citing the paper as:
@inproceedings{NEURIPS2020_d1ff1ec8,
author = {Singh, Sidak Pal and Alistarh, Dan},
booktitle = {Advances in Neural Information Processing Systems},
editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin},
pages = {18098--18109},
publisher = {Curran Associates, Inc.},
title = {WoodFisher: Efficient Second-Order Approximation for Neural Network Compression},
url = {https://proceedings.neurips.cc/paper/2020/file/d1ff1ec86b62cd5f3903ff19c3a326b2-Paper.pdf},
volume = {33},
year = {2020}
}