Add a Blocked Convolution Proof of Concept #97
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master      #97      +/-  ##
=========================================
- Coverage   79.15%    72.5%    -6.65%
=========================================
  Files          24       25        +1
  Lines         753      822       +69
=========================================
  Hits          596      596
- Misses        157      226       +69

Continue to review full report at Codecov.
@staticfloat pulled a couple of utilities out of #94 to make the transition to a single kernel for 1D/2D/3D convolution easier. Thank you.
@mbrookhart now that #94 is merged, I intend to take a look at this and see if we can't automatically choose this over other approaches for small images, assuming the performance works out. I'm especially excited to see how much the lack of allocations (for the …
Closing due to lack of interest. |
As in you don't have the time, or lack of interest on our side? I think we're still pretty interested in this, but at least personally I wasn't sure what the status is – is it ready to go from your side, or does it need integration with other NNlib parts that have been refactored lately, or help/review etc. Bumping with @staticfloat as well.
A little bit of both: I haven't had time on my end to extend it to other operations, and I haven't heard from @staticfloat for many months now. If you guys are still interested, I can reopen; I didn't delete my fork.
Disclaimer: I work for Intel, but this work is purely my own and in no way reflects Intel's positions.
Inspired by a number of projects, including MKLDNN and https://arxiv.org/abs/1809.10170, I thought it would be interesting to try to improve NNlib's CPU performance by writing a direct convolution in pure Julia.
The basic idea is that convolution is effectively a large reduction over a number of channels, and if we can use SIMD to accelerate the reduction over those channels, we can speed up the convolution as a whole.
To this end, I take an image batch of shape (W, H, C, N), reshape it to (W, H, c, C', N), and then permute dims to (c, W, H, C', N), where c = 8 or 16 is the CPU's SIMD width for Float32. Similarly, weights of shape (w, h, I, O) are reshaped to (c_i, c_o, w, h, I', O'). I then perform a tight inner-loop convolution, relying on SIMD.jl's fused multiply/add instructions for performance.
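To make the layout concrete, here is a minimal sketch of the blocking step and the kind of SIMD reduction it enables, assuming Float32 data, an 8-lane SIMD width, and a channel count divisible by the block size. The names `block_image` and `block_dot` are illustrative only and are not part of this PR:

```julia
using SIMD  # provides Vec{N,T}, vload, and vectorized muladd

const BLOCK = 8  # assumed SIMD width for Float32 on AVX2

# (W, H, C, N) -> (BLOCK, W, H, C ÷ BLOCK, N): split the channel dimension and
# move the small block to the fastest-varying position so one SIMD vector can
# load BLOCK contiguous channels at a time.
function block_image(x::Array{Float32,4})
    W, H, C, N = size(x)
    @assert C % BLOCK == 0
    xr = reshape(x, W, H, BLOCK, C ÷ BLOCK, N)
    return permutedims(xr, (3, 1, 2, 4, 5))
end

# Flavour of the inner loop: reduce over blocked channels with fused
# multiply/adds on SIMD vectors, then horizontally sum the accumulator.
function block_dot(x::Vector{Float32}, w::Vector{Float32})
    acc = Vec{BLOCK,Float32}(0f0)
    @inbounds for i in 1:BLOCK:length(x)
        xv = vload(Vec{BLOCK,Float32}, x, i)
        wv = vload(Vec{BLOCK,Float32}, w, i)
        acc = muladd(xv, wv, acc)
    end
    return sum(acc)
end
```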
Testing on my 4-core i7-6700K with AVX2 shows a 2-2.5x speedup over the im2col path, tested against NNlib 0.4.3.
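For reference, a comparison along these lines could be run with a harness like the one below. Here `conv_blocked` is a placeholder for the PoC's entry point rather than an existing NNlib function, and the exact numbers will depend on hardware and problem size:

```julia
using NNlib, BenchmarkTools

x = rand(Float32, 64, 64, 32, 16)   # image batch (W, H, C, N)
w = rand(Float32, 3, 3, 32, 64)     # weights (w, h, I, O)

@btime NNlib.conv($x, $w)           # im2col-based baseline
# @btime conv_blocked($x, $w)       # blocked direct convolution from this PR (placeholder name)
```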
The algorithm's output remains in blocked form, so we can chain multiple convolutions together to amortize the overhead of blocking the tensors. Similar optimizations can be applied to batchnorm and pooling to further increase the number of chained blocked operations.
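As an illustration of that amortization argument, a chained pipeline might look like the sketch below; all function names here are hypothetical stand-ins for the blocked kernels:

```julia
# Pay the blocking cost once, keep intermediate results in blocked layout,
# and only convert back at the end. All names are placeholders.
xb = block_image(x)            # (W, H, C, N) -> (c, W, H, C', N)
yb = conv_blocked(xb, wb1)     # output stays in blocked form
yb = conv_blocked(yb, wb2)     # no re-blocking between layers
y  = unblock_image(yb)         # back to (W, H, C, N) when a consumer needs it
```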
@MikeInnes, @staticfloat and I have chatted about possibilities for NNlib, and we agree that we should move towards operations on a typed `BlockedArray` kind of structure and refactor to fit the new API in #94, but in the short term we thought it would be good to get the PoC to the wider community. Thanks!
Matthew