Replies: 2 comments
-
Hi @March-08, we will be moving this post (originally found in our bugs and feature request queue) to our support forum discussion, and @markurtz will respond soon. Great questions!
-
Hi @March-08, yes, excellent questions. Our core implementation for pruning lives in the mask_pruning.py file. Inside, you'll see that the class is invoked with a list of layers and param_names. It then creates a new parameter for each original weight with the same shape and type. The forward pass for the layer and the backward pass for the parameter are also overwritten here and here. On the forward pass, we register a pre-hook, and in that pre-hook we set the weight equal to the weight times the current mask. Overwriting the forward pass ensures the activations are correct once the layer runs. On the backward pass, we multiply the gradient by the mask so the pruned weights are not updated by the optimizer (especially important for optimizers with momentum enabled).

In terms of how we get speed from unstructured pruning, it's a reasonably tricky problem. In general, though, there are specific ways to use the vector instructions and JITs on CPUs to order the compute within each layer and across multiple layers such that the removed zeros result in a significant speedup. I say across multiple layers because once the compute is reduced by the removed zeros, most neural networks quickly become memory bound. Ordering the operations across multiple layers rather than layer by layer lets a layer's output stay in the CPU's caches, where the following layer can access it much faster. I hope those answer your questions, and let me know if there's anything you'd like me to clarify!
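For readers who want to see the masking mechanism in code, here is a minimal PyTorch sketch of the idea described above. This is not the actual mask_pruning.py implementation: the `MaskedLayer` class, the `set_sparsity` method, and the magnitude-based thresholding are illustrative assumptions used only to show how a forward pre-hook and a gradient hook keep pruned weights at zero.

```python
import torch
import torch.nn as nn


class MaskedLayer:
    """Illustrative sketch: attach a pruning mask to one parameter of a layer.

    A forward pre-hook multiplies the weight by the mask before the layer
    runs, and a gradient hook multiplies the incoming gradient by the same
    mask so the optimizer never updates pruned weights (important when
    momentum is enabled).
    """

    def __init__(self, layer: nn.Module, param_name: str = "weight"):
        self.layer = layer
        self.param_name = param_name
        param = getattr(layer, param_name)
        # New tensor with the same shape and dtype as the original weight,
        # initialized to all ones (nothing pruned yet).
        self.mask = torch.ones_like(param.data)

        # Forward pre-hook: zero out the masked weights before the forward
        # pass so the activations reflect the pruned model.
        layer.register_forward_pre_hook(self._apply_mask)
        # Gradient hook: zero the gradient of pruned weights so the
        # optimizer leaves them at zero.
        param.register_hook(lambda grad: grad * self.mask)

    def _apply_mask(self, module, inputs):
        param = getattr(module, self.param_name)
        param.data.mul_(self.mask)

    def set_sparsity(self, sparsity: float):
        """Rebuild the mask to prune the smallest-magnitude weights."""
        param = getattr(self.layer, self.param_name)
        num_prune = int(param.numel() * sparsity)
        if num_prune == 0:
            self.mask.fill_(1.0)
            return
        # Threshold at the num_prune-th smallest absolute weight.
        threshold = param.data.abs().flatten().kthvalue(num_prune).values
        self.mask = (param.data.abs() > threshold).to(param.dtype)


# Usage: prune 90% of a linear layer's weights, then train as usual.
layer = nn.Linear(512, 512)
pruned = MaskedLayer(layer, "weight")
pruned.set_sparsity(0.9)
out = layer(torch.randn(8, 512))  # pre-hook applies the mask to the weight
out.sum().backward()              # gradient hook zeros pruned gradients
```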
-
Hi,
I was wondering, how did you implement the unstructured pruning method?
Did you use a mask matrix to set the parameters to zero? And how did you prevent the update of the parameters that were set to zero?
And in general, how do you get a speedup from compression and make evaluation on CPU as fast as on GPU?
Thank you in advance!