Add a Blocked Convolution Proof of Concept #97
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master      #97      +/-  ##
=========================================
- Coverage   79.15%    72.5%    -6.65%
=========================================
  Files          24       25        +1
  Lines         753      822       +69
=========================================
  Hits          596      596
- Misses        157      226       +69

Continue to review full report at Codecov.
@staticfloat pulled a couple of utilities out of #94 to make the transition to a single kernel for 1D/2D/3D convolution easier. Thank you.
@mbrookhart now that #94 is merged, I intend to take a look at this and see if we can't automatically choose this over other approaches for small images, assuming the performance works out. I'm especially excited to see how much the lack of allocations (for the …
Closing due to lack of interest. |
As in you don't have the time, or lack of interest on our side? I think we're still pretty interested in this, but at least personally I wasn't sure what the status is – is it ready to go from your side, or does it need integration with other NNlib parts that have been refactored lately, or help/review etc. Bumping with @staticfloat as well.
A little bit of both: I haven't had time on my end to extend it to other operations, and I haven't heard from @staticfloat for many months now. If you guys are still interested, I can reopen; I didn't delete my fork.
Disclaimer: I work for Intel, but this work is purely my own and in no way reflects Intel's positions.
Inspired by a number of projects, including MKLDNN and https://arxiv.org/abs/1809.10170, I thought it would be interesting to try to improve NNlib's CPU performance by writing a direct convolution in pure Julia.
The basic idea is that convolution is effectively a large reduction over a number of channels, and if we can use SIMD to accelerate the reduction over those channels, we can speed up the convolution as a whole.
To this end, I take an image batch of shape (W, H, C, N), reshape it to (W, H, c, C', N), and then permute dims to (c, W, H, C', N), where c = 8 or 16 is the CPU's SIMD width for Float32. Similarly, weights of shape (w, h, I, O) are reshaped to (c_i, c_o, w, h, I', O'). I then perform a tight inner-loop convolution, relying on SIMD.jl's fused multiply/add instructions for performance.
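To make the layout concrete, here is a minimal sketch of the blocking step and the kind of SIMD reduction it enables, assuming Float32 data, an 8-lane SIMD width, and a channel count divisible by the block size. The names `block_image` and `block_dot` are illustrative only and are not part of this PR:

```julia
using SIMD  # provides Vec{N,T}, vload, and vectorized muladd

const BLOCK = 8  # assumed SIMD width for Float32 on AVX2

# (W, H, C, N) -> (BLOCK, W, H, C ÷ BLOCK, N): split the channel dimension and
# move the small block to the fastest-varying position so one SIMD vector can
# load BLOCK contiguous channels at a time.
function block_image(x::Array{Float32,4})
    W, H, C, N = size(x)
    @assert C % BLOCK == 0
    xr = reshape(x, W, H, BLOCK, C ÷ BLOCK, N)
    return permutedims(xr, (3, 1, 2, 4, 5))
end

# Flavour of the inner loop: reduce over blocked channels with fused
# multiply/adds on SIMD vectors, then horizontally sum the accumulator.
function block_dot(x::Vector{Float32}, w::Vector{Float32})
    acc = Vec{BLOCK,Float32}(0f0)
    @inbounds for i in 1:BLOCK:length(x)
        xv = vload(Vec{BLOCK,Float32}, x, i)
        wv = vload(Vec{BLOCK,Float32}, w, i)
        acc = muladd(xv, wv, acc)
    end
    return sum(acc)
end
```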
Testing on my 4-core i7-6700K with AVX2 shows a 2-2.5x speedup over the im2col path, tested against NNlib 0.4.3.
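For reference, a comparison along these lines could be run with a harness like the one below. Here `conv_blocked` is a placeholder for the PoC's entry point rather than an existing NNlib function, and the exact numbers will depend on hardware and problem size:

```julia
using NNlib, BenchmarkTools

x = rand(Float32, 64, 64, 32, 16)   # image batch (W, H, C, N)
w = rand(Float32, 3, 3, 32, 64)     # weights (w, h, I, O)

@btime NNlib.conv($x, $w)           # im2col-based baseline
# @btime conv_blocked($x, $w)       # blocked direct convolution from this PR (placeholder name)
```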
The algorithm's output remains in blocked form, so we can chain multiple convolutions together to amortize the overhead of blocking the tensors. Similar optimizations can be applied to batchnorm and pooling to further increase the number of chained blocked operations.
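As an illustration of that amortization argument, a chained pipeline might look like the sketch below; all function names here are hypothetical stand-ins for the blocked kernels:

```julia
# Pay the blocking cost once, keep intermediate results in blocked layout,
# and only convert back at the end. All names are placeholders.
xb = block_image(x)            # (W, H, C, N) -> (c, W, H, C', N)
yb = conv_blocked(xb, wb1)     # output stays in blocked form
yb = conv_blocked(yb, wb2)     # no re-blocking between layers
y  = unblock_image(yb)         # back to (W, H, C, N) when a consumer needs it
```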
@MikeInnes, @staticfloat and I have chatted about possibilities for NNlib, and we agree that we should move towards operations on a typed `BlockedArray` kind of structure and refactor to fit the new API in #94, but in the short term we thought it would be good to get the PoC to the wider community. Thanks!
Matthew