
Add a Blocked Convolution Proof of Concept #97

Closed · wants to merge 10 commits

Conversation

mbrookhart

Disclaimer: I work for Intel, but this work is purely my own and in no way reflects Intel's positions.

Inspired by a number of projects, including MKLDNN and https://arxiv.org/abs/1809.10170, I thought it would be interesting to try to improve NNlib's CPU performance by writing a direct convolution in pure Julia.

The basic idea here is that convolution is effectively a large reduction onto a number of channels, and if we can use SIMD to accelerate the reduction onto those channels, we can speed up convolution.
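For concreteness, here is a toy sketch of that lane-wise accumulation idea using SIMD.jl. It is illustrative only, not the PR's actual kernel: the name `accumulate_lanes!` and the fixed width of 8 lanes are assumptions for the example.

```julia
using SIMD  # provides Vec, vload/vstore, and fused multiply/add on lanes

# Toy illustration: out[1:8] += w * x[1:8] in a single fused multiply/add,
# i.e. one scalar weight applied to an 8-channel block of the image.
function accumulate_lanes!(out::Vector{Float32}, x::Vector{Float32}, w::Float32)
    acc = vload(Vec{8,Float32}, out, 1)    # load 8 accumulator lanes
    xv  = vload(Vec{8,Float32}, x, 1)      # load 8 input-channel lanes
    acc = fma(xv, Vec{8,Float32}(w), acc)  # lane-wise fused multiply/add
    vstore(acc, out, 1)                    # write the updated lanes back
    return out
end
```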

To this end, I take an image batch of shape (W, H, C, N), reshape it to (W, H, c, C', N), and then permute dims to (c, W, H, C', N), where c = 8 or 16 is the CPU's SIMD width for Float32. Similarly, weights of shape (w, h, I, O) are reshaped to (c_i, c_o, w, h, I', O'). I then perform a tight inner-loop convolution, relying on SIMD.jl's fused multiply/add instructions for performance.
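A minimal sketch of this blocking step in plain Julia, for reference. The helper names `block_image` and `block_weights` and the fixed `BLOCK = 8` are illustrative assumptions, not the PR's actual API; it also assumes the channel counts are divisible by the block size.

```julia
const BLOCK = 8  # SIMD width for Float32 on AVX2; 16 on AVX-512

# Illustrative: (W, H, C, N) -> (c, W, H, C', N), assuming BLOCK divides C.
function block_image(x::Array{Float32,4})
    W, H, C, N = size(x)
    y = reshape(x, W, H, BLOCK, C ÷ BLOCK, N)   # split C into (c, C')
    return permutedims(y, (3, 1, 2, 4, 5))      # bring the lane dim to the front
end

# Illustrative: (w, h, I, O) -> (c_i, c_o, w, h, I', O').
function block_weights(w::Array{Float32,4})
    kw, kh, I, O = size(w)
    y = reshape(w, kw, kh, BLOCK, I ÷ BLOCK, BLOCK, O ÷ BLOCK)
    return permutedims(y, (3, 5, 1, 2, 4, 6))
end
```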

Testing on my 4-core i7-6700K with AVX2 shows a 2-2.5x speedup over the im2col path, tested against NNlib 0.4.3.

The algorithm's output remains in blocked form, so we can chain multiple convolutions together to amortize the overhead of blocking the tensors. Similar optimizations can be done on batchnorm and pooling to further increase the number of chained blocked operations.
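A sketch of the inverse transform, to show what chaining buys us. Again the names are illustrative; `blocked_conv` and `blocked_pool` in the comment are hypothetical stand-ins, not the PR's exports.

```julia
# Illustrative inverse of `block_image`: (c, W, H, C', N) -> (W, H, C, N).
function deblock_image(y::Array{Float32,5})
    c, W, H, Cb, N = size(y)
    x = permutedims(y, (2, 3, 1, 4, 5))   # back to (W, H, c, C', N)
    return reshape(x, W, H, c * Cb, N)    # merge (c, C') back into C
end

# With outputs kept blocked, a chain of layers pays the permutation cost only
# at the boundaries (hypothetical `blocked_conv` / `blocked_pool` names):
#   xb = block_image(x)
#   xb = blocked_conv(xb, block_weights(w1))
#   xb = blocked_pool(xb)
#   xb = blocked_conv(xb, block_weights(w2))
#   y  = deblock_image(xb)
```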

@MikeInnes, @staticfloat, and I have chatted about possibilities for NNlib. We agree that we should move towards operations on a typed BlockedArray kind of structure and refactor to fit the new API in #94, but in the short term we thought it would be good to get the PoC out to the wider community.

Thanks!
Matthew

@codecov-io

codecov-io commented Mar 1, 2019

Codecov Report

Merging #97 into master will decrease coverage by 6.64%.
The diff coverage is 0%.


@@            Coverage Diff            @@
##           master     #97      +/-   ##
=========================================
- Coverage   79.15%   72.5%   -6.65%     
=========================================
  Files          24      25       +1     
  Lines         753     822      +69     
=========================================
  Hits          596     596              
- Misses        157     226      +69
Impacted Files             Coverage    Δ
src/NNlib.jl               100% <ø>    (ø) ⬆️
src/impl/blocked_conv.jl   0% <0%>     (ø)


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@MikeInnes MikeInnes requested a review from staticfloat March 4, 2019 14:46
@mbrookhart
Author

@staticfloat pulled a couple of utilities out of #94 to make the transition to a single kernel for 1D/2D/3D convolution easier. Thank you.

@MikeInnes MikeInnes mentioned this pull request Mar 26, 2019
@staticfloat
Contributor

@mbrookhart now that #94 is merged, I intend to take a look at this and see if we can't automatically choose this over other approaches for small images, assuming the performance works out. I'm especially excited to see how much the lack of allocations (for the col matrices in our im2col examples) helps here, as I think that is a not-insignificant amount of time we spend in our im2col convolution pipeline. I may not get to this for a few days or so, but it is on my TODO list, never fear. :)

@mbrookhart
Author

Closing due to lack of interest.

@mbrookhart mbrookhart closed this Sep 5, 2019
@MikeInnes
Member

As in you don't have the time, or lack of interest on our side?

I think we're still pretty interested in this, but at least personally I wasn't sure what the status is – is it ready to go from your side, or does it need integration with other NNlib parts that have been refactored lately, or help/review etc. Bumping with @staticfloat as well.

@mbrookhart
Author

A little bit of both: I haven't had time on my end to extend it to other operations, and I haven't heard from @staticfloat for many months now. If you're still interested, I can reopen; I didn't delete my fork.
