Replies: 2 comments
-
Hi @March-08, we will be moving this post (originally found in our bugs and feature request queue) to our support forum discussion, and @markurtz will respond soon. Great questions!
-
Hi @March-08, yes, excellent questions. Our core implementation for pruning lives in the mask_pruning.py file. Inside, you'll see that the class is invoked with a list of layers and param_names. It then creates a new parameter for each original weight with the same shape and type. The forward pass for the layer and the backward pass for the parameter are also overwritten here and here. On the forward pass, we register a pre-hook, and in that pre-hook we set the weight equal to the weight times the current mask. Overwriting the forward pass ensures the activations are correct once the layer runs. On the backward pass, we multiply the gradient by the mask so the pruned weights are not updated by the optimizer (especially important for optimizers with momentum enabled).

In terms of how we get speed from unstructured pruning, it's a reasonably tricky problem. In general, though, there are specific ways to use the vector instructions and JITs on CPUs to order the compute within each layer and across multiple layers such that the removed zeros result in a significant speedup. I say across multiple layers because once the compute is reduced by the removed zeros, most neural networks quickly become memory bound. Ordering the operations across multiple layers rather than layer by layer lets a layer's output stay in the CPU's caches, where the following layer can access it much faster. I hope those answer your questions, and let me know if there's anything you'd like me to clarify!
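For readers who want to see the masking mechanism in code, here is a minimal PyTorch sketch of the idea described above. This is not the actual mask_pruning.py implementation: the `MaskedLayer` class, the `set_sparsity` method, and the magnitude-based thresholding are illustrative assumptions used only to show how a forward pre-hook and a gradient hook keep pruned weights at zero.

```python
import torch
import torch.nn as nn


class MaskedLayer:
    """Illustrative sketch: attach a pruning mask to one parameter of a layer.

    A forward pre-hook multiplies the weight by the mask before the layer
    runs, and a gradient hook multiplies the incoming gradient by the same
    mask so the optimizer never updates pruned weights (important when
    momentum is enabled).
    """

    def __init__(self, layer: nn.Module, param_name: str = "weight"):
        self.layer = layer
        self.param_name = param_name
        param = getattr(layer, param_name)
        # New tensor with the same shape and dtype as the original weight,
        # initialized to all ones (nothing pruned yet).
        self.mask = torch.ones_like(param.data)

        # Forward pre-hook: zero out the masked weights before the forward
        # pass so the activations reflect the pruned model.
        layer.register_forward_pre_hook(self._apply_mask)
        # Gradient hook: zero the gradient of pruned weights so the
        # optimizer leaves them at zero.
        param.register_hook(lambda grad: grad * self.mask)

    def _apply_mask(self, module, inputs):
        param = getattr(module, self.param_name)
        param.data.mul_(self.mask)

    def set_sparsity(self, sparsity: float):
        """Rebuild the mask to prune the smallest-magnitude weights."""
        param = getattr(self.layer, self.param_name)
        num_prune = int(param.numel() * sparsity)
        if num_prune == 0:
            self.mask.fill_(1.0)
            return
        # Threshold at the num_prune-th smallest absolute weight.
        threshold = param.data.abs().flatten().kthvalue(num_prune).values
        self.mask = (param.data.abs() > threshold).to(param.dtype)


# Usage: prune 90% of a linear layer's weights, then train as usual.
layer = nn.Linear(512, 512)
pruned = MaskedLayer(layer, "weight")
pruned.set_sparsity(0.9)
out = layer(torch.randn(8, 512))  # pre-hook applies the mask to the weight
out.sum().backward()              # gradient hook zeros pruned gradients
```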
-
Hi,
I was wondering, how did you implement the unstructured pruning method?
Did you use a mask matrix to set the parameters to zero? And how did you prevent the update of the parameters that were set to zero?
And in general, how do you get a speedup from compression and make evaluation on CPU as fast as on GPU?
Thank you in advance!