
Added FFTs. Still need to add iFFTs (straightforward) and... #363

Closed
wants to merge 3 commits into from

Conversation

@chriselrod (Contributor) commented Feb 5, 2018

work on performance.
Just wanted this to be here for review / criticism / etc.
The style probably needs a lot of cleaning up.
"fft.jl" probably shouldn't be the first test.
Etc.

Besides style, critical features to add before this gets accepted:

  • inverse FFTs. I'll do that when I have free time over the next few days.
  • Last I tried it, it errors on prime lengths. Also a straightforward fix, since it already solves prime-length FFTs as part of solving larger ones.
  • Go after low-hanging performance fruit.
  • Possibly -- break the computation up into separate real and imaginary components, instead of complex numbers, to more easily skip all the no-ops. An obvious case: each time sin would evaluate to 0, use a real number instead of a complex one (see the sketch below the footnote). If not going full-split, I'd have to watch out for type stability when doing label-swapping on each iteration.*
  • Improve compile times. One issue is that I call the transcendental functions way more than I have to. Reordering computations could greatly reduce that number.
  • Extension to matrices?
  • Clean things up. Reflect the style of the package.
  • More to be added based on suggestions.

*Do I have to swap labels? That is, how smart is the compiler about abandoning / clearing out memory that is no longer used? Right now I'm repeatedly overwriting the same sets of variables (which means I can't switch from real to complex), because I don't want to clutter caches with junk.
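For concreteness, here is the kind of thing I mean by skipping the no-ops (hypothetical helper, not in this PR): when the imaginary part of a twiddle factor is exactly zero, emit a plain real multiply instead of a complex one.

# Hypothetical helper: emit a real multiply whenever the twiddle factor's
# imaginary part is exactly zero, and a complex multiply otherwise.
function twiddle_term(w::Complex{Float64}, x::Symbol)
    if iszero(imag(w))
        return :( $(real(w)) * $x )   # e.g. :(1.0 * x_16) -- stays real
    else
        return :( $w * $x )           # general complex multiply
    end
end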

Anyway, a basic overview: I just copied the Cooley Tukey algorithm from Wikipedia, minus the bit twiddling.

The vast majority of the functions just work to build up an expression.
The input is a vector (or tuple -- straightforward to allow) whose size, N, is known at compile time.
First, each element of the input is extracted into a variable named "x_1", "x_2", etc, corresponding to the index via Base.Cartesian.@nextract.
Then, based on how it factors, we organize each of these into mini prime ffts. For example, when N=30, this is the generated expression:

julia> StaticArrays.fft_expression(30, Float64)
quote  # /home/celrod/.julia/v0.6/StaticArrays/src/fft.jl, line 123:
    @fastmath begin  # /home/celrod/.julia/v0.6/StaticArrays/src/fft.jl, line 123:
            o_1 = x_1 + 1.0 + 0.0im * x_16
            o_2 = x_1 + -1.0 - 0.0im * x_16
            o_3 = x_6 + 1.0 + 0.0im * x_21
            o_4 = x_6 + -1.0 - 0.0im * x_21
            o_5 = x_11 + 1.0 + 0.0im * x_26
            o_6 = x_11 + -1.0 - 0.0im * x_26
            o_7 = x_2 + 1.0 + 0.0im * x_17
            o_8 = x_2 + -1.0 - 0.0im * x_17
            o_9 = x_7 + 1.0 + 0.0im * x_22
            o_10 = x_7 + -1.0 - 0.0im * x_22
            o_11 = x_12 + 1.0 + 0.0im * x_27
            o_12 = x_12 + -1.0 - 0.0im * x_27
            o_13 = x_3 + 1.0 + 0.0im * x_18
            o_14 = x_3 + -1.0 - 0.0im * x_18
            o_15 = x_8 + 1.0 + 0.0im * x_23
            o_16 = x_8 + -1.0 - 0.0im * x_23
            o_17 = x_13 + 1.0 + 0.0im * x_28
            o_18 = x_13 + -1.0 - 0.0im * x_28
            o_19 = x_4 + 1.0 + 0.0im * x_19
            o_20 = x_4 + -1.0 - 0.0im * x_19
            o_21 = x_9 + 1.0 + 0.0im * x_24
            o_22 = x_9 + -1.0 - 0.0im * x_24
            o_23 = x_14 + 1.0 + 0.0im * x_29
            o_24 = x_14 + -1.0 - 0.0im * x_29
            o_25 = x_5 + 1.0 + 0.0im * x_20
            o_26 = x_5 + -1.0 - 0.0im * x_20
            o_27 = x_10 + 1.0 + 0.0im * x_25
            o_28 = x_10 + -1.0 - 0.0im * x_25
            o_29 = x_15 + 1.0 + 0.0im * x_30
            o_30 = x_15 + -1.0 - 0.0im * x_30
            u_1 = o_1 + 1.0 + 0.0im * o_3
            u_1 += 1.0 + 0.0im * o_5
            u_2 = o_2 + 0.5 - 0.8660254037844387im * o_4
            u_2 += -0.5 - 0.8660254037844387im * o_6
            u_3 = o_1 + -0.5 - 0.8660254037844387im * o_3
            u_3 += -0.5 + 0.8660254037844387im * o_5
            u_4 = o_2 + -1.0 - 0.0im * o_4
            u_4 += 1.0 - 0.0im * o_6
            u_5 = o_1 + -0.5 + 0.8660254037844387im * o_3
            u_5 += -0.5 - 0.8660254037844387im * o_5
            u_6 = o_2 + 0.5 + 0.8660254037844387im * o_4
            u_6 += -0.5 + 0.8660254037844387im * o_6
            u_7 = o_7 + 1.0 + 0.0im * o_9
            u_7 += 1.0 + 0.0im * o_11
            u_8 = o_8 + 0.5 - 0.8660254037844387im * o_10
            u_8 += -0.5 - 0.8660254037844387im * o_12
            u_9 = o_7 + -0.5 - 0.8660254037844387im * o_9
            u_9 += -0.5 + 0.8660254037844387im * o_11
            u_10 = o_8 + -1.0 - 0.0im * o_10
            u_10 += 1.0 - 0.0im * o_12
            u_11 = o_7 + -0.5 + 0.8660254037844387im * o_9
            u_11 += -0.5 - 0.8660254037844387im * o_11
            u_12 = o_8 + 0.5 + 0.8660254037844387im * o_10
            u_12 += -0.5 + 0.8660254037844387im * o_12
            u_13 = o_13 + 1.0 + 0.0im * o_15
            u_13 += 1.0 + 0.0im * o_17
            u_14 = o_14 + 0.5 - 0.8660254037844387im * o_16
            u_14 += -0.5 - 0.8660254037844387im * o_18
            u_15 = o_13 + -0.5 - 0.8660254037844387im * o_15
            u_15 += -0.5 + 0.8660254037844387im * o_17
            u_16 = o_14 + -1.0 - 0.0im * o_16
            u_16 += 1.0 - 0.0im * o_18
            u_17 = o_13 + -0.5 + 0.8660254037844387im * o_15
            u_17 += -0.5 - 0.8660254037844387im * o_17
            u_18 = o_14 + 0.5 + 0.8660254037844387im * o_16
            u_18 += -0.5 + 0.8660254037844387im * o_18
            u_19 = o_19 + 1.0 + 0.0im * o_21
            u_19 += 1.0 + 0.0im * o_23
            u_20 = o_20 + 0.5 - 0.8660254037844387im * o_22
            u_20 += -0.5 - 0.8660254037844387im * o_24
            u_21 = o_19 + -0.5 - 0.8660254037844387im * o_21
            u_21 += -0.5 + 0.8660254037844387im * o_23
            u_22 = o_20 + -1.0 - 0.0im * o_22
            u_22 += 1.0 - 0.0im * o_24
            u_23 = o_19 + -0.5 + 0.8660254037844387im * o_21
            u_23 += -0.5 - 0.8660254037844387im * o_23
            u_24 = o_20 + 0.5 + 0.8660254037844387im * o_22
            u_24 += -0.5 + 0.8660254037844387im * o_24
            u_25 = o_25 + 1.0 + 0.0im * o_27
            u_25 += 1.0 + 0.0im * o_29
            u_26 = o_26 + 0.5 - 0.8660254037844387im * o_28
            u_26 += -0.5 - 0.8660254037844387im * o_30
            u_27 = o_25 + -0.5 - 0.8660254037844387im * o_27
            u_27 += -0.5 + 0.8660254037844387im * o_29
            u_28 = o_26 + -1.0 - 0.0im * o_28
            u_28 += 1.0 - 0.0im * o_30
            u_29 = o_25 + -0.5 + 0.8660254037844387im * o_27
            u_29 += -0.5 - 0.8660254037844387im * o_29
            u_30 = o_26 + 0.5 + 0.8660254037844387im * o_28
            u_30 += -0.5 + 0.8660254037844387im * o_30
            o_1 = u_1 + 1.0 + 0.0im * u_7
            o_1 += 1.0 + 0.0im * u_13
            o_1 += 1.0 + 0.0im * u_19
            o_1 += 1.0 + 0.0im * u_25
            o_2 = u_2 + 0.9781476007338057 - 0.20791169081775934im * u_8
            o_2 += 0.9135454576426009 - 0.4067366430758002im * u_14
            o_2 += 0.8090169943749475 - 0.5877852522924731im * u_20
            o_2 += 0.6691306063588582 - 0.7431448254773942im * u_26
            o_3 = u_3 + 0.9135454576426009 - 0.4067366430758002im * u_9
            o_3 += 0.6691306063588582 - 0.7431448254773942im * u_15
            o_3 += 0.30901699437494745 - 0.9510565162951535im * u_21
            o_3 += -0.10452846326765347 - 0.9945218953682733im * u_27
            o_4 = u_4 + 0.8090169943749475 - 0.5877852522924731im * u_10
            o_4 += 0.30901699437494745 - 0.9510565162951535im * u_16
            o_4 += -0.30901699437494745 - 0.9510565162951535im * u_22
            o_4 += -0.8090169943749475 - 0.5877852522924731im * u_28
            o_5 = u_5 + 0.6691306063588582 - 0.7431448254773942im * u_11
            o_5 += -0.10452846326765347 - 0.9945218953682733im * u_17
            o_5 += -0.8090169943749475 - 0.5877852522924731im * u_23
            o_5 += -0.9781476007338057 + 0.20791169081775934im * u_29
            o_6 = u_6 + 0.5 - 0.8660254037844387im * u_12
            o_6 += -0.5 - 0.8660254037844387im * u_18
            o_6 += -1.0 - 0.0im * u_24
            o_6 += -0.5 + 0.8660254037844387im * u_30
            o_7 = u_1 + 0.30901699437494745 - 0.9510565162951535im * u_7
            o_7 += -0.8090169943749475 - 0.5877852522924731im * u_13
            o_7 += -0.8090169943749475 + 0.5877852522924731im * u_19
            o_7 += 0.30901699437494745 + 0.9510565162951535im * u_25
            o_8 = u_2 + 0.10452846326765347 - 0.9945218953682733im * u_8
            o_8 += -0.9781476007338057 - 0.20791169081775934im * u_14
            o_8 += -0.30901699437494745 + 0.9510565162951535im * u_20
            o_8 += 0.9135454576426009 + 0.4067366430758002im * u_26
            o_9 = u_3 + -0.10452846326765347 - 0.9945218953682733im * u_9
            o_9 += -0.9781476007338057 + 0.20791169081775934im * u_15
            o_9 += 0.30901699437494745 + 0.9510565162951535im * u_21
            o_9 += 0.9135454576426009 - 0.4067366430758002im * u_27
            o_10 = u_4 + -0.30901699437494745 - 0.9510565162951535im * u_10
            o_10 += -0.8090169943749475 + 0.5877852522924731im * u_16
            o_10 += 0.8090169943749475 + 0.5877852522924731im * u_22
            o_10 += 0.30901699437494745 - 0.9510565162951535im * u_28
            o_11 = u_5 + -0.5 - 0.8660254037844387im * u_11
            o_11 += -0.5 + 0.8660254037844387im * u_17
            o_11 += 1.0 - 0.0im * u_23
            o_11 += -0.5 - 0.8660254037844387im * u_29
            o_12 = u_6 + -0.6691306063588582 - 0.7431448254773942im * u_12
            o_12 += -0.10452846326765347 + 0.9945218953682733im * u_18
            o_12 += 0.8090169943749475 - 0.5877852522924731im * u_24
            o_12 += -0.9781476007338057 - 0.20791169081775934im * u_30
            o_13 = u_1 + -0.8090169943749475 - 0.5877852522924731im * u_7
            o_13 += 0.30901699437494745 + 0.9510565162951535im * u_13
            o_13 += 0.30901699437494745 - 0.9510565162951535im * u_19
            o_13 += -0.8090169943749475 + 0.5877852522924731im * u_25
            o_14 = u_2 + -0.9135454576426009 - 0.4067366430758002im * u_8
            o_14 += 0.6691306063588582 + 0.7431448254773942im * u_14
            o_14 += -0.30901699437494745 - 0.9510565162951535im * u_20
            o_14 += -0.10452846326765347 + 0.9945218953682733im * u_26
            o_15 = u_3 + -0.9781476007338057 - 0.20791169081775934im * u_9
            o_15 += 0.9135454576426009 + 0.4067366430758002im * u_15
            o_15 += -0.8090169943749475 - 0.5877852522924731im * u_21
            o_15 += 0.6691306063588582 + 0.7431448254773942im * u_27
            o_16 = u_4 + -1.0 - 0.0im * u_10
            o_16 += 1.0 - 0.0im * u_16
            o_16 += -1.0 - 0.0im * u_22
            o_16 += 1.0 - 0.0im * u_28
            o_17 = u_5 + -0.9781476007338057 + 0.20791169081775934im * u_11
            o_17 += 0.9135454576426009 - 0.4067366430758002im * u_17
            o_17 += -0.8090169943749475 + 0.5877852522924731im * u_23
            o_17 += 0.6691306063588582 - 0.7431448254773942im * u_29
            o_18 = u_6 + -0.9135454576426009 + 0.4067366430758002im * u_12
            o_18 += 0.6691306063588582 - 0.7431448254773942im * u_18
            o_18 += -0.30901699437494745 + 0.9510565162951535im * u_24
            o_18 += -0.10452846326765347 - 0.9945218953682733im * u_30
            o_19 = u_1 + -0.8090169943749475 + 0.5877852522924731im * u_7
            o_19 += 0.30901699437494745 - 0.9510565162951535im * u_13
            o_19 += 0.30901699437494745 + 0.9510565162951535im * u_19
            o_19 += -0.8090169943749475 - 0.5877852522924731im * u_25
            o_20 = u_2 + -0.6691306063588582 + 0.7431448254773942im * u_8
            o_20 += -0.10452846326765347 - 0.9945218953682733im * u_14
            o_20 += 0.8090169943749475 + 0.5877852522924731im * u_20
            o_20 += -0.9781476007338057 + 0.20791169081775934im * u_26
            o_21 = u_3 + -0.5 + 0.8660254037844387im * u_9
            o_21 += -0.5 - 0.8660254037844387im * u_15
            o_21 += 1.0 - 0.0im * u_21
            o_21 += -0.5 + 0.8660254037844387im * u_27
            o_22 = u_4 + -0.30901699437494745 + 0.9510565162951535im * u_10
            o_22 += -0.8090169943749475 - 0.5877852522924731im * u_16
            o_22 += 0.8090169943749475 - 0.5877852522924731im * u_22
            o_22 += 0.30901699437494745 + 0.9510565162951535im * u_28
            o_23 = u_5 + -0.10452846326765347 + 0.9945218953682733im * u_11
            o_23 += -0.9781476007338057 - 0.20791169081775934im * u_17
            o_23 += 0.30901699437494745 - 0.9510565162951535im * u_23
            o_23 += 0.9135454576426009 + 0.4067366430758002im * u_29
            o_24 = u_6 + 0.10452846326765347 + 0.9945218953682733im * u_12
            o_24 += -0.9781476007338057 + 0.20791169081775934im * u_18
            o_24 += -0.30901699437494745 - 0.9510565162951535im * u_24
            o_24 += 0.9135454576426009 - 0.4067366430758002im * u_30
            o_25 = u_1 + 0.30901699437494745 + 0.9510565162951535im * u_7
            o_25 += -0.8090169943749475 + 0.5877852522924731im * u_13
            o_25 += -0.8090169943749475 - 0.5877852522924731im * u_19
            o_25 += 0.30901699437494745 - 0.9510565162951535im * u_25
            o_26 = u_2 + 0.5 + 0.8660254037844387im * u_8
            o_26 += -0.5 + 0.8660254037844387im * u_14
            o_26 += -1.0 - 0.0im * u_20
            o_26 += -0.5 - 0.8660254037844387im * u_26
            o_27 = u_3 + 0.6691306063588582 + 0.7431448254773942im * u_9
            o_27 += -0.10452846326765347 + 0.9945218953682733im * u_15
            o_27 += -0.8090169943749475 + 0.5877852522924731im * u_21
            o_27 += -0.9781476007338057 - 0.20791169081775934im * u_27
            o_28 = u_4 + 0.8090169943749475 + 0.5877852522924731im * u_10
            o_28 += 0.30901699437494745 + 0.9510565162951535im * u_16
            o_28 += -0.30901699437494745 + 0.9510565162951535im * u_22
            o_28 += -0.8090169943749475 + 0.5877852522924731im * u_28
            o_29 = u_5 + 0.9135454576426009 + 0.4067366430758002im * u_11
            o_29 += 0.6691306063588582 + 0.7431448254773942im * u_17
            o_29 += 0.30901699437494745 + 0.9510565162951535im * u_23
            o_29 += -0.10452846326765347 + 0.9945218953682733im * u_29
            o_30 = u_6 + 0.9781476007338057 + 0.20791169081775934im * u_12
            o_30 += 0.9135454576426009 + 0.4067366430758002im * u_18
            o_30 += 0.8090169943749475 + 0.5877852522924731im * u_24
            o_30 += 0.6691306063588582 + 0.7431448254773942im * u_30
        end
    SVector((o_1, o_2, o_3, o_4, o_5, o_6, o_7, o_8, o_9, o_10, o_11, o_12, o_13, o_14, o_15, o_16, o_17, o_18, o_19, o_20, o_21, o_22, o_23, o_24, o_25, o_26, o_27, o_28, o_29, o_30))
end
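That expression is what eventually gets spliced into the generated function. A minimal sketch of the glue (illustrative only -- the fft_expression call is the one shown above, the rest is not the PR's exact code):

using StaticArrays, Base.Cartesian

@generated function unrolled_fft(x::SVector{N,T}) where {N,T}
    body = StaticArrays.fft_expression(N, T)   # precomputed unrolled expression, as above
    quote
        @nextract $N x x    # defines x_1, ..., x_N from the input vector
        $body               # the unrolled butterflies; ends by building the output SVector
    end
end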

Imagine drawing a tree, following the recursive definition from the largest to smallest prime factors.
That is, 30 = 5 * 3 * 2 -- so we start with the smallest prime: size-2 FFTs on pairs of variables 15 (5x3) apart.
From there, I simplified things by reorganizing the indices to place FFTs next to each other. Each iteration of the assembler contracts the number of FFTs (while they grow longer by the same factor), and the output vector is built at the end.

The reason for all the += is that the functions will otherwise allocate memory and slow down dramatically. Splitting the accumulation up into several short statements solves that problem.
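For illustration (hypothetical helper, not the PR's code), the accumulation is emitted as one short statement per term rather than one long sum:

function accumulate_expr(out::Symbol, terms::Vector)
    ex = Expr(:block, :( $out = $(terms[1]) ))
    for t in terms[2:end]
        push!(ex.args, :( $out += $t ))   # one short statement per added term
    end
    ex
end

# e.g. accumulate_expr(:u_1, [:o_1, :((1.0 + 0.0im) * o_3), :((1.0 + 0.0im) * o_5)])
# yields the  u_1 = ...; u_1 += ...  pattern seen in the expression above.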

iFFTs are basically the same thing, just with negated exponents in exp (or negated sin/cos arguments) and a 1/N scaling factor.
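As a plain reference for what that means (naive definitions, not the planned implementation):

# Naive DFT / inverse DFT, just to show the sign flip in the exponent and the 1/N scale.
dft(x)  = [sum(x[j] * exp(-2π*im*(j-1)*(k-1) / length(x)) for j in eachindex(x)) for k in eachindex(x)]
idft(X) = [sum(X[k] * exp(+2π*im*(j-1)*(k-1) / length(X)) for k in eachindex(X)) for j in eachindex(X)] ./ length(X)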

Performance of the algorithm may need general improvement. I haven't tried it much beyond very small sizes, but it does well at the small sizes where SVectors shine, as promised.
But it is pure Julia, and it has all the "just works" benefits that come with that (especially if it is also defined to work on NTuples).

julia> using StaticArrays, BenchmarkTools
INFO: Recompiling stale cache file /home/celrod/.julia/lib/v0.6/StaticArrays.ji for module StaticArrays.

julia> x = @SVector randn(12);

julia> xc = Complex.([x...]);

julia> xcs = similar(xc); px = plan_fft(xc);

julia> @benchmark fft($x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     52.028 ns (0.00% GC)
  median time:      52.100 ns (0.00% GC)
  mean time:        53.262 ns (0.00% GC)
  maximum time:     89.666 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     986

julia> @benchmark A_mul_B!($xcs, $px, $xc)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     89.470 ns (0.00% GC)
  median time:      91.162 ns (0.00% GC)
  mean time:        94.105 ns (0.00% GC)
  maximum time:     152.291 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     955

julia> x = @SVector randn(16);

julia> xc = Complex.([x...]);

julia> xcs = similar(xc); px = plan_fft(xc);

julia> @benchmark fft($x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     96.183 ns (0.00% GC)
  median time:      98.878 ns (0.00% GC)
  mean time:        102.202 ns (0.00% GC)
  maximum time:     164.407 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     946

julia> @benchmark A_mul_B!($xcs, $px, $xc)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     117.662 ns (0.00% GC)
  median time:      119.859 ns (0.00% GC)
  mean time:        122.359 ns (0.00% GC)
  maximum time:     175.610 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     918

julia> x = @SVector randn(18);

julia> xc = Complex.([x...]);

julia> xcs = similar(xc); px = plan_fft(xc);

julia> @benchmark fft($x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     95.259 ns (0.00% GC)
  median time:      97.339 ns (0.00% GC)
  mean time:        100.163 ns (0.00% GC)
  maximum time:     149.081 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     946

julia> @benchmark A_mul_B!($xcs, $px, $xc)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     139.492 ns (0.00% GC)
  median time:      141.441 ns (0.00% GC)
  mean time:        144.569 ns (0.00% GC)
  maximum time:     206.916 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     861

julia> x = @SVector randn(24);

julia> xc = Complex.([x...]);

julia> xcs = similar(xc); px = plan_fft(xc);

julia> @benchmark fft($x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     116.136 ns (0.00% GC)
  median time:      116.193 ns (0.00% GC)
  mean time:        118.957 ns (0.00% GC)
  maximum time:     169.975 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     919

julia> @benchmark A_mul_B!($xcs, $px, $xc)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     162.604 ns (0.00% GC)
  median time:      164.030 ns (0.00% GC)
  mean time:        168.654 ns (0.00% GC)
  maximum time:     255.530 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     788

@c42f (Member) commented Feb 6, 2018

This is pretty cool, thanks for taking the time to put it into a PR. It's great that you can get a speedup from a really simple unrolled version of the original algorithm. It's also kind of remarkable how competitive Base.fft (I assume fftw) is here, as naively specializing and inlining everything usually wins by a larger factor.

I only had a very quick look at the code - a few thoughts:

  • Be aware that in @generated function code generators you shouldn't call any functions which might be extended after the generated function is defined. Generally this limits us mostly to doing loop unrolling, with other information (such as the output eltype) computed inside the code which is generated (rough sketch after this list).
  • Did you benchmark against plain old matrix multiplication (ie, the naive DFT) for small precomputed static matrices of coefficients? In the past I found this competitive for small sizes but I can't remember whether the environment I had at the time contained a decent fft implementation.
  • As I recall (just vaguely) libraries like fftw have optimized codelets for a whole heap of small sizes, so examining the optimization tricks which are pulled there might be interesting/useful.
  • I wonder whether this would be better in its own package which depends on StaticArrays? I bring this up because maintaining and upgrading StaticArrays is already quite unwieldy, implementing as it does some random assortment of half of the standard library linear algebra and arrays functionality :-) Not to discourage you from continuing your big list to make this PR great, but a point to consider.
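To illustrate the first point above (a hedged sketch with a made-up function, not from this PR): keep the generator to structure-only decisions, and compute anything type-dependent inside the code that gets returned.

using StaticArrays

@generated function scale_by_inv_n(v::SVector{N}) where {N}
    terms = [:(w * v[$i]) for i in 1:N]   # generator: only decides how far to unroll
    quote
        w = one(eltype(v)) / N            # generated code: eltype / promotion handled here
        SVector(tuple($(terms...)))
    end
end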

By the way, what is your use case for implementing these tiny transformations?

@andyferris (Member) commented:

I agree, this is cool :)

I'm not sure if you're aware, but the fft function has moved out of Base for Julia v0.7/v1.0 and into the FFTW.jl and AbstractFFTs.jl packages. Moving forward to the next version of Julia, your best option would be to start a new package (called something like StaticFFTs.jl) which depends on StaticArrays and AbstractFFTs. (For a Julia v0.6-only release you can just depend on StaticArrays.)

CC @stevengj - just in case statically sized FFTs in Julia pique your interest?

@chriselrod (Contributor, Author) commented:

This is pretty cool, thanks for taking the time to put it into a PR. It's great that you can get a speedup from a really simple unrolled version of the original algorithm.

I'm not sure if you're aware, but the fft function has moved out of Base for Julia v0.7/v1.0 and into the FFTW.jl and AbstractFFTs.jl packages. Moving forward to the next version of Julia, your best option would be to start a new package (called something like StaticFFTs.jl) which depends on StaticArrays and AbstractFFTs. (For a Julia v0.6-only release you can just depend on StaticArrays.)

Is there any proper channel or protocol for deciding whether it's better to submit a pull request or create a new package?
I asked on Gitter and then figured it wouldn't hurt to try a pull request.
If you guys think it'd be better as a separate package, I won't object to someone closing this.

Anyway, I hope to make a more serious post and add revisions next weekend (either here, or in StaticFFTs.jl).
Of course, StaticFFTs aren't far removed from creating a "plan" function that wraps a dynamic dispatch (rough sketch below).
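Something like this, just to illustrate the idea (hypothetical names, and it assumes the unrolled fft from this PR):

struct StaticFFTPlan{N} end

plan_static_fft(x::AbstractVector) = StaticFFTPlan{length(x)}()   # pay the dynamic dispatch once, up front

(p::StaticFFTPlan{N})(x::AbstractVector) where {N} = fft(SVector{N}(x...))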

Julia's generated functions are probably an ideal way to actually set up ffts, so I wouldn't be surprised if there's a great pure Julia implementation someday.

It's also kind of remarkable how competitive Base.fft (I assume fftw) is here as naively specializing and inlining everything usually wins by a larger factor.

Did you benchmark against plain old matrix multiplication (ie, the naive DFT) for small precomputed static matrices of coefficients? In the past I found this competitive for small sizes but I can't remember whether the environment I had at the time contained a decent fft implementation.

function MatrixDFT(n::Int, ::Type{T} = Float64) where T
    out = fill(one(Complex{T}), n, n)
    for j ∈ 2:n
        out[j, 2] = exp( -2π*im*(j-1) / n )   # second column: the n-th roots of unity
    end
    for i ∈ 3:n, j ∈ i:n
        out[j,i] = out[j,i-1] * out[j,2]      # fill the lower triangle; the DFT matrix is symmetric
    end
    Symmetric(out, :L) #./= √n
end
function MatrixDFT(::Val{n}, ::Type{T} = Float64) where {n,T}
    out = fill(one(Complex{T}), n, n)
    for j ∈ 2:n
        out[j, 2] = exp( -2π*im*(j-1) / n )
    end
    for i ∈ 3:n, j ∈ i:n
        out[j,i] = out[j,i-1] * out[j,2]
    end
    SMatrix{n,n}(out) #./= √n
end

Then (I did start Julia with -O3):

julia> n = 4
4

julia> x = @SVector randn(n);

julia> xc = Complex.([x...]);

julia> xcs = similar(xc);

julia> px = plan_fft(xc);

julia> symmetric = MatrixDFT(n);

julia> regular_dft = Array(symmetric);

julia> static = MatrixDFT(Val(n));

julia> @benchmark $static * $x
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.597 ns (0.00% GC)
  median time:      15.603 ns (0.00% GC)
  mean time:        16.307 ns (0.00% GC)
  maximum time:     55.080 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> @benchmark A_mul_B!($xcs, $symmetric, $x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     226.052 ns (0.00% GC)
  median time:      234.811 ns (0.00% GC)
  mean time:        242.141 ns (0.00% GC)
  maximum time:     363.591 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     460

julia> @benchmark A_mul_B!($xcs, $regular_dft, $x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     47.675 ns (0.00% GC)
  median time:      47.923 ns (0.00% GC)
  mean time:        49.421 ns (0.00% GC)
  maximum time:     91.303 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     988

julia> @benchmark fft($x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     19.239 ns (0.00% GC)
  median time:      19.242 ns (0.00% GC)
  mean time:        19.665 ns (0.00% GC)
  maximum time:     62.975 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

julia> @benchmark A_mul_B!($xcs, $px, $xc)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     62.733 ns (0.00% GC)
  median time:      65.173 ns (0.00% GC)
  mean time:        67.553 ns (0.00% GC)
  maximum time:     128.734 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     973

julia> n = 8
8

julia> x = @SVector randn(n);

julia> xc = Complex.([x...]);

julia> xcs = similar(xc);

julia> px = plan_fft(xc);

julia> symmetric = MatrixDFT(n);

julia> regular_dft = Array(symmetric);

julia> static = MatrixDFT(Val(n));

julia> @benchmark $static * $x
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     40.854 ns (0.00% GC)
  median time:      43.045 ns (0.00% GC)
  mean time:        43.855 ns (0.00% GC)
  maximum time:     63.999 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     990

julia> @benchmark A_mul_B!($xcs, $symmetric, $x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     764.807 ns (0.00% GC)
  median time:      773.959 ns (0.00% GC)
  mean time:        791.216 ns (0.00% GC)
  maximum time:     1.403 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     109

julia> @benchmark A_mul_B!($xcs, $regular_dft, $x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     159.669 ns (0.00% GC)
  median time:      167.005 ns (0.00% GC)
  mean time:        171.965 ns (0.00% GC)
  maximum time:     245.055 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     788

julia> @benchmark fft($x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     38.905 ns (0.00% GC)
  median time:      38.907 ns (0.00% GC)
  mean time:        39.807 ns (0.00% GC)
  maximum time:     73.683 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     992

julia> @benchmark A_mul_B!($xcs, $px, $xc)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     78.316 ns (0.00% GC)
  median time:      82.653 ns (0.00% GC)
  mean time:        83.989 ns (0.00% GC)
  maximum time:     155.556 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     968

julia> n = 9
9

julia> x = @SVector randn(n);

julia> xc = Complex.([x...]);

julia> xcs = similar(xc);

julia> px = plan_fft(xc);

julia> symmetric = MatrixDFT(n);

julia> regular_dft = Array(symmetric);

julia> static = MatrixDFT(Val(n));

julia> @benchmark $static * $x
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     51.537 ns (0.00% GC)
  median time:      55.418 ns (0.00% GC)
  mean time:        56.333 ns (0.00% GC)
  maximum time:     103.198 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     986

julia> @benchmark A_mul_B!($xcs, $symmetric, $x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     971.227 ns (0.00% GC)
  median time:      1.000 μs (0.00% GC)
  mean time:        1.016 μs (0.00% GC)
  maximum time:     2.660 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     22

julia> @benchmark A_mul_B!($xcs, $regular_dft, $x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     183.720 ns (0.00% GC)
  median time:      190.713 ns (0.00% GC)
  mean time:        193.549 ns (0.00% GC)
  maximum time:     291.679 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     626

julia> @benchmark fft($x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     70.955 ns (0.00% GC)
  median time:      70.965 ns (0.00% GC)
  mean time:        73.333 ns (0.00% GC)
  maximum time:     109.058 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     975

julia> @benchmark A_mul_B!($xcs, $px, $xc)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     105.001 ns (0.00% GC)
  median time:      108.898 ns (0.00% GC)
  mean time:        114.061 ns (0.00% GC)
  maximum time:     189.999 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     937

julia> n = 12
12

julia> x = @SVector randn(n);

julia> xc = Complex.([x...]);

julia> xcs = similar(xc);

julia> px = plan_fft(xc);

julia> symmetric = MatrixDFT(n);

julia> regular_dft = Array(symmetric);

julia> static = MatrixDFT(Val(n));

julia> @benchmark $static * $x
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     95.963 ns (0.00% GC)
  median time:      99.998 ns (0.00% GC)
  mean time:        102.809 ns (0.00% GC)
  maximum time:     177.366 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     946

julia> @benchmark A_mul_B!($xcs, $symmetric, $x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.974 μs (0.00% GC)
  median time:      2.103 μs (0.00% GC)
  mean time:        2.161 μs (0.00% GC)
  maximum time:     6.809 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark A_mul_B!($xcs, $regular_dft, $x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     290.830 ns (0.00% GC)
  median time:      297.380 ns (0.00% GC)
  mean time:        308.262 ns (0.00% GC)
  maximum time:     549.749 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     271

julia> @benchmark fft($x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     52.028 ns (0.00% GC)
  median time:      52.042 ns (0.00% GC)
  mean time:        55.162 ns (0.00% GC)
  maximum time:     115.460 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     986

julia> @benchmark A_mul_B!($xcs, $px, $xc)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     89.946 ns (0.00% GC)
  median time:      97.067 ns (0.00% GC)
  mean time:        101.421 ns (0.00% GC)
  maximum time:     207.158 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     956

julia> n = 24
24

julia> x = @SVector randn(n);

julia> xc = Complex.([x...]);

julia> xcs = similar(xc);

julia> px = plan_fft(xc);

julia> symmetric = MatrixDFT(n);

julia> regular_dft = Array(symmetric);

julia> static = MatrixDFT(Val(n));

julia> @benchmark $static * $x
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     443.359 ns (0.00% GC)
  median time:      451.833 ns (0.00% GC)
  mean time:        469.978 ns (0.00% GC)
  maximum time:     832.616 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     198

julia> @benchmark A_mul_B!($xcs, $symmetric, $x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.534 μs (0.00% GC)
  median time:      8.146 μs (0.00% GC)
  mean time:        8.374 μs (0.00% GC)
  maximum time:     20.068 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     4

julia> @benchmark A_mul_B!($xcs, $regular_dft, $x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.043 μs (0.00% GC)
  median time:      1.056 μs (0.00% GC)
  mean time:        1.086 μs (0.00% GC)
  maximum time:     4.266 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark fft($x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     116.139 ns (0.00% GC)
  median time:      116.647 ns (0.00% GC)
  mean time:        119.742 ns (0.00% GC)
  maximum time:     168.846 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     921

julia> @benchmark A_mul_B!($xcs, $px, $xc)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     161.727 ns (0.00% GC)
  median time:      163.455 ns (0.00% GC)
  mean time:        170.398 ns (0.00% GC)
  maximum time:     352.685 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     772

julia> n = 64
64

julia> x = @SVector randn(n);

julia> xc = Complex.([x...]);

julia> xcs = similar(xc);

julia> px = plan_fft(xc);

julia> symmetric = MatrixDFT(n);

julia> regular_dft = Array(symmetric);

julia> @benchmark A_mul_B!($xcs, $symmetric, $x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     53.327 μs (0.00% GC)
  median time:      54.325 μs (0.00% GC)
  mean time:        55.412 μs (0.00% GC)
  maximum time:     109.277 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark A_mul_B!($xcs, $regular_dft, $x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.344 μs (0.00% GC)
  median time:      6.354 μs (0.00% GC)
  mean time:        6.514 μs (0.00% GC)
  maximum time:     21.284 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> @benchmark fft($x)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     743.430 ns (0.00% GC)
  median time:      745.383 ns (0.00% GC)
  mean time:        763.079 ns (0.00% GC)
  maximum time:     1.180 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     128

julia> @benchmark A_mul_B!($xcs, $px, $xc)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     247.975 ns (0.00% GC)
  median time:      254.828 ns (0.00% GC)
  mean time:        259.313 ns (0.00% GC)
  maximum time:     377.162 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     401

It falls behind FFTW by n = 64, but by then it is clearly well ahead of the DFT matrices and stays there. O(n log n) scaling vs. theoretically O(n^2)*, so it would probably have to run into major memory issues before it falls behind them.
*Is it? By the same logic matrix multiplication is O(n^3), but I know algorithms with exponents down around n^2.4 exist.

I'm surprised at the matrix multiply's performance for n=4 and n=9. The unrolled fft must not be SIMD vectorizing as well?
That seems to be the case, based on the number of <4 x double> and <2 x double> statements:

julia> x4 = @SVector randn(4);

julia> dft4 = MatrixDFT(Val(4));

julia> @code_llvm dft4 * x4

define void @"julia_*_63855"(%SArray.73* noalias nocapture sret, %SArray.70* nocapture readonly dereferenceable(256), %SArray.68* nocapture readonly dereferenceable(32)) #0 !dbg !5 {
top:
  %3 = getelementptr inbounds %SArray.68, %SArray.68* %2, i64 0, i32 0, i64 0
  %4 = getelementptr inbounds %SArray.68, %SArray.68* %2, i64 0, i32 0, i64 1
  %5 = load double, double* %3, align 8
  %6 = getelementptr inbounds %SArray.70, %SArray.70* %1, i64 0, i32 0, i64 4, i32 0
  %7 = load double, double* %4, align 8
  %8 = getelementptr inbounds %SArray.68, %SArray.68* %2, i64 0, i32 0, i64 2
  %9 = getelementptr inbounds %SArray.70, %SArray.70* %1, i64 0, i32 0, i64 8, i32 0
  %10 = load double, double* %8, align 8
  %11 = getelementptr inbounds %SArray.68, %SArray.68* %2, i64 0, i32 0, i64 3
  %12 = getelementptr inbounds %SArray.70, %SArray.70* %1, i64 0, i32 0, i64 12, i32 0
  %13 = load double, double* %11, align 8
  %14 = bitcast %SArray.70* %1 to <4 x double>*
  %15 = load <4 x double>, <4 x double>* %14, align 8
  %16 = insertelement <4 x double> undef, double %5, i32 0
  %17 = insertelement <4 x double> %16, double %5, i32 1
  %18 = insertelement <4 x double> %17, double %5, i32 2
  %19 = insertelement <4 x double> %18, double %5, i32 3
  %20 = fmul <4 x double> %19, %15
  %21 = bitcast double* %6 to <4 x double>*
  %22 = load <4 x double>, <4 x double>* %21, align 8
  %23 = insertelement <4 x double> undef, double %7, i32 0
  %24 = insertelement <4 x double> %23, double %7, i32 1
  %25 = insertelement <4 x double> %24, double %7, i32 2
  %26 = insertelement <4 x double> %25, double %7, i32 3
  %27 = fmul <4 x double> %26, %22
  %28 = fadd <4 x double> %20, %27
  %29 = bitcast double* %9 to <4 x double>*
  %30 = load <4 x double>, <4 x double>* %29, align 8
  %31 = insertelement <4 x double> undef, double %10, i32 0
  %32 = insertelement <4 x double> %31, double %10, i32 1
  %33 = insertelement <4 x double> %32, double %10, i32 2
  %34 = insertelement <4 x double> %33, double %10, i32 3
  %35 = fmul <4 x double> %34, %30
  %36 = fadd <4 x double> %28, %35
  %37 = bitcast double* %12 to <4 x double>*
  %38 = load <4 x double>, <4 x double>* %37, align 8
  %39 = insertelement <4 x double> undef, double %13, i32 0
  %40 = insertelement <4 x double> %39, double %13, i32 1
  %41 = insertelement <4 x double> %40, double %13, i32 2
  %42 = insertelement <4 x double> %41, double %13, i32 3
  %43 = fmul <4 x double> %42, %38
  %44 = getelementptr inbounds %SArray.70, %SArray.70* %1, i64 0, i32 0, i64 2, i32 0
  %45 = getelementptr inbounds %SArray.70, %SArray.70* %1, i64 0, i32 0, i64 6, i32 0
  %46 = getelementptr inbounds %SArray.70, %SArray.70* %1, i64 0, i32 0, i64 10, i32 0
  %47 = getelementptr inbounds %SArray.70, %SArray.70* %1, i64 0, i32 0, i64 14, i32 0
  %48 = bitcast double* %44 to <4 x double>*
  %49 = load <4 x double>, <4 x double>* %48, align 8
  %50 = fmul <4 x double> %19, %49
  %51 = bitcast double* %45 to <4 x double>*
  %52 = load <4 x double>, <4 x double>* %51, align 8
  %53 = fmul <4 x double> %26, %52
  %54 = fadd <4 x double> %50, %53
  %55 = bitcast double* %46 to <4 x double>*
  %56 = load <4 x double>, <4 x double>* %55, align 8
  %57 = fmul <4 x double> %34, %56
  %58 = fadd <4 x double> %54, %57
  %59 = bitcast double* %47 to <4 x double>*
  %60 = load <4 x double>, <4 x double>* %59, align 8
  %61 = fmul <4 x double> %42, %60
  %62 = fadd <4 x double> %36, %43
  %63 = fadd <4 x double> %58, %61
  %64 = bitcast %SArray.73* %0 to <4 x double>*
  store <4 x double> %62, <4 x double>* %64, align 8
  %"#temp#.sroa.0.sroa.4.sroa.0.0.#temp#.sroa.0.sroa.4.0.#temp#.sroa.0.0..sroa_cast1.sroa_cast.sroa_idx" = getelementptr inbounds %SArray.73, %SArray.73* %0, i64 0, i32 0, i64 2, i32 0
  %65 = bitcast double* %"#temp#.sroa.0.sroa.4.sroa.0.0.#temp#.sroa.0.sroa.4.0.#temp#.sroa.0.0..sroa_cast1.sroa_cast.sroa_idx" to <4 x double>*
  store <4 x double> %63, <4 x double>* %65, align 8
  ret void
}

julia> @code_llvm fft(x4)

define void @julia_fft_63446(%SArray.73* noalias nocapture sret, %SArray.68* nocapture readonly dereferenceable(32)) #0 !dbg !5 {
top:
  %2 = getelementptr inbounds %SArray.68, %SArray.68* %1, i64 0, i32 0, i64 0
  %3 = load double, double* %2, align 8
  %4 = getelementptr inbounds %SArray.68, %SArray.68* %1, i64 0, i32 0, i64 1
  %5 = load double, double* %4, align 8
  %6 = getelementptr inbounds %SArray.68, %SArray.68* %1, i64 0, i32 0, i64 2
  %7 = load double, double* %6, align 8
  %8 = getelementptr inbounds %SArray.68, %SArray.68* %1, i64 0, i32 0, i64 3
  %9 = load double, double* %8, align 8
  %10 = fadd fast double %7, %3
  %11 = fadd fast double %9, %5
  %12 = fsub fast double %5, %9
  %13 = fadd fast double %11, %10
  %14 = insertelement <2 x double> undef, double %3, i32 0
  %15 = insertelement <2 x double> %14, double -0.000000e+00, i32 1
  %16 = insertelement <2 x double> undef, double %7, i32 0
  %17 = insertelement <2 x double> %16, double %12, i32 1
  %18 = fsub fast <2 x double> %15, %17
  %19 = fsub fast double %10, %11
  %.sroa.016.sroa.0.0..sroa.016.0..sroa_cast.sroa_idx = getelementptr inbounds %SArray.73, %SArray.73* %0, i64 0, i32 0, i64 0, i32 0
  store double %13, double* %.sroa.016.sroa.0.0..sroa.016.0..sroa_cast.sroa_idx, align 8
  %.sroa.016.sroa.2.0..sroa.016.0..sroa_cast.sroa_idx28 = getelementptr inbounds %SArray.73, %SArray.73* %0, i64 0, i32 0, i64 0, i32 1
  store double 0.000000e+00, double* %.sroa.016.sroa.2.0..sroa.016.0..sroa_cast.sroa_idx28, align 8
  %.sroa.016.sroa.3.0..sroa.016.0..sroa_cast.sroa_idx29 = getelementptr inbounds %SArray.73, %SArray.73* %0, i64 0, i32 0, i64 1, i32 0
  %20 = bitcast double* %.sroa.016.sroa.3.0..sroa.016.0..sroa_cast.sroa_idx29 to <2 x double>*
  store <2 x double> %18, <2 x double>* %20, align 8
  %.sroa.016.sroa.5.0..sroa.016.0..sroa_cast.sroa_idx31 = getelementptr inbounds %SArray.73, %SArray.73* %0, i64 0, i32 0, i64 2, i32 0
  store double %19, double* %.sroa.016.sroa.5.0..sroa.016.0..sroa_cast.sroa_idx31, align 8
  %.sroa.016.sroa.6.0..sroa.016.0..sroa_cast.sroa_idx32 = getelementptr inbounds %SArray.73, %SArray.73* %0, i64 0, i32 0, i64 2, i32 1
  store double 0.000000e+00, double* %.sroa.016.sroa.6.0..sroa.016.0..sroa_cast.sroa_idx32, align 8
  %.sroa.016.sroa.7.0..sroa.016.0..sroa_cast.sroa_idx33 = getelementptr inbounds %SArray.73, %SArray.73* %0, i64 0, i32 0, i64 3, i32 0
  %21 = extractelement <2 x double> %18, i32 0
  store double %21, double* %.sroa.016.sroa.7.0..sroa.016.0..sroa_cast.sroa_idx33, align 8
  %.sroa.016.sroa.8.0..sroa.016.0..sroa_cast.sroa_idx34 = getelementptr inbounds %SArray.73, %SArray.73* %0, i64 0, i32 0, i64 3, i32 1
  store double %12, double* %.sroa.016.sroa.8.0..sroa.016.0..sroa_cast.sroa_idx34, align 8
  ret void
}

julia> x9 = @SVector randn(9);

julia> dft9 = MatrixDFT(Val(9));

julia> @code_llvm dft9 * x9

define void @"julia_*_63857"(%SArray.87* noalias nocapture sret, %SArray.83* nocapture readonly dereferenceable(1296), %SArray.82* nocapture readonly dereferenceable(72)) #0 !dbg !5 {
top:
  %3 = getelementptr inbounds %SArray.82, %SArray.82* %2, i64 0, i32 0, i64 0
  %4 = getelementptr inbounds %SArray.82, %SArray.82* %2, i64 0, i32 0, i64 1
  %5 = load double, double* %3, align 8
  %6 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 9, i32 0
  %7 = load double, double* %4, align 8
  %8 = getelementptr inbounds %SArray.82, %SArray.82* %2, i64 0, i32 0, i64 2
  %9 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 18, i32 0
  %10 = load double, double* %8, align 8
  %11 = getelementptr inbounds %SArray.82, %SArray.82* %2, i64 0, i32 0, i64 3
  %12 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 27, i32 0
  %13 = load double, double* %11, align 8
  %14 = getelementptr inbounds %SArray.82, %SArray.82* %2, i64 0, i32 0, i64 4
  %15 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 36, i32 0
  %16 = load double, double* %14, align 8
  %17 = getelementptr inbounds %SArray.82, %SArray.82* %2, i64 0, i32 0, i64 5
  %18 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 45, i32 0
  %19 = load double, double* %17, align 8
  %20 = getelementptr inbounds %SArray.82, %SArray.82* %2, i64 0, i32 0, i64 6
  %21 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 54, i32 0
  %22 = load double, double* %20, align 8
  %23 = getelementptr inbounds %SArray.82, %SArray.82* %2, i64 0, i32 0, i64 7
  %24 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 63, i32 0
  %25 = load double, double* %23, align 8
  %26 = getelementptr inbounds %SArray.82, %SArray.82* %2, i64 0, i32 0, i64 8
  %27 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 72, i32 0
  %28 = load double, double* %26, align 8
  %29 = bitcast %SArray.83* %1 to <4 x double>*
  %30 = load <4 x double>, <4 x double>* %29, align 8
  %31 = insertelement <4 x double> undef, double %5, i32 0
  %32 = insertelement <4 x double> %31, double %5, i32 1
  %33 = insertelement <4 x double> %32, double %5, i32 2
  %34 = insertelement <4 x double> %33, double %5, i32 3
  %35 = fmul <4 x double> %34, %30
  %36 = bitcast double* %6 to <4 x double>*
  %37 = load <4 x double>, <4 x double>* %36, align 8
  %38 = insertelement <4 x double> undef, double %7, i32 0
  %39 = insertelement <4 x double> %38, double %7, i32 1
  %40 = insertelement <4 x double> %39, double %7, i32 2
  %41 = insertelement <4 x double> %40, double %7, i32 3
  %42 = fmul <4 x double> %41, %37
  %43 = fadd <4 x double> %35, %42
  %44 = bitcast double* %9 to <4 x double>*
  %45 = load <4 x double>, <4 x double>* %44, align 8
  %46 = insertelement <4 x double> undef, double %10, i32 0
  %47 = insertelement <4 x double> %46, double %10, i32 1
  %48 = insertelement <4 x double> %47, double %10, i32 2
  %49 = insertelement <4 x double> %48, double %10, i32 3
  %50 = fmul <4 x double> %49, %45
  %51 = fadd <4 x double> %43, %50
  %52 = bitcast double* %12 to <4 x double>*
  %53 = load <4 x double>, <4 x double>* %52, align 8
  %54 = insertelement <4 x double> undef, double %13, i32 0
  %55 = insertelement <4 x double> %54, double %13, i32 1
  %56 = insertelement <4 x double> %55, double %13, i32 2
  %57 = insertelement <4 x double> %56, double %13, i32 3
  %58 = fmul <4 x double> %57, %53
  %59 = fadd <4 x double> %51, %58
  %60 = bitcast double* %15 to <4 x double>*
  %61 = load <4 x double>, <4 x double>* %60, align 8
  %62 = insertelement <4 x double> undef, double %16, i32 0
  %63 = insertelement <4 x double> %62, double %16, i32 1
  %64 = insertelement <4 x double> %63, double %16, i32 2
  %65 = insertelement <4 x double> %64, double %16, i32 3
  %66 = fmul <4 x double> %65, %61
  %67 = fadd <4 x double> %59, %66
  %68 = bitcast double* %18 to <4 x double>*
  %69 = load <4 x double>, <4 x double>* %68, align 8
  %70 = insertelement <4 x double> undef, double %19, i32 0
  %71 = insertelement <4 x double> %70, double %19, i32 1
  %72 = insertelement <4 x double> %71, double %19, i32 2
  %73 = insertelement <4 x double> %72, double %19, i32 3
  %74 = fmul <4 x double> %73, %69
  %75 = fadd <4 x double> %67, %74
  %76 = bitcast double* %21 to <4 x double>*
  %77 = load <4 x double>, <4 x double>* %76, align 8
  %78 = insertelement <4 x double> undef, double %22, i32 0
  %79 = insertelement <4 x double> %78, double %22, i32 1
  %80 = insertelement <4 x double> %79, double %22, i32 2
  %81 = insertelement <4 x double> %80, double %22, i32 3
  %82 = fmul <4 x double> %81, %77
  %83 = fadd <4 x double> %75, %82
  %84 = bitcast double* %24 to <4 x double>*
  %85 = load <4 x double>, <4 x double>* %84, align 8
  %86 = insertelement <4 x double> undef, double %25, i32 0
  %87 = insertelement <4 x double> %86, double %25, i32 1
  %88 = insertelement <4 x double> %87, double %25, i32 2
  %89 = insertelement <4 x double> %88, double %25, i32 3
  %90 = fmul <4 x double> %89, %85
  %91 = fadd <4 x double> %83, %90
  %92 = bitcast double* %27 to <4 x double>*
  %93 = load <4 x double>, <4 x double>* %92, align 8
  %94 = insertelement <4 x double> undef, double %28, i32 0
  %95 = insertelement <4 x double> %94, double %28, i32 1
  %96 = insertelement <4 x double> %95, double %28, i32 2
  %97 = insertelement <4 x double> %96, double %28, i32 3
  %98 = fmul <4 x double> %97, %93
  %99 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 2, i32 0
  %100 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 11, i32 0
  %101 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 20, i32 0
  %102 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 29, i32 0
  %103 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 38, i32 0
  %104 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 47, i32 0
  %105 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 56, i32 0
  %106 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 65, i32 0
  %107 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 74, i32 0
  %108 = bitcast double* %99 to <4 x double>*
  %109 = load <4 x double>, <4 x double>* %108, align 8
  %110 = fmul <4 x double> %34, %109
  %111 = bitcast double* %100 to <4 x double>*
  %112 = load <4 x double>, <4 x double>* %111, align 8
  %113 = fmul <4 x double> %41, %112
  %114 = fadd <4 x double> %110, %113
  %115 = bitcast double* %101 to <4 x double>*
  %116 = load <4 x double>, <4 x double>* %115, align 8
  %117 = fmul <4 x double> %49, %116
  %118 = fadd <4 x double> %114, %117
  %119 = bitcast double* %102 to <4 x double>*
  %120 = load <4 x double>, <4 x double>* %119, align 8
  %121 = fmul <4 x double> %57, %120
  %122 = fadd <4 x double> %118, %121
  %123 = bitcast double* %103 to <4 x double>*
  %124 = load <4 x double>, <4 x double>* %123, align 8
  %125 = fmul <4 x double> %65, %124
  %126 = fadd <4 x double> %122, %125
  %127 = bitcast double* %104 to <4 x double>*
  %128 = load <4 x double>, <4 x double>* %127, align 8
  %129 = fmul <4 x double> %73, %128
  %130 = fadd <4 x double> %126, %129
  %131 = bitcast double* %105 to <4 x double>*
  %132 = load <4 x double>, <4 x double>* %131, align 8
  %133 = fmul <4 x double> %81, %132
  %134 = fadd <4 x double> %130, %133
  %135 = bitcast double* %106 to <4 x double>*
  %136 = load <4 x double>, <4 x double>* %135, align 8
  %137 = fmul <4 x double> %89, %136
  %138 = fadd <4 x double> %134, %137
  %139 = bitcast double* %107 to <4 x double>*
  %140 = load <4 x double>, <4 x double>* %139, align 8
  %141 = fmul <4 x double> %97, %140
  %142 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 4, i32 0
  %143 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 13, i32 0
  %144 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 22, i32 0
  %145 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 31, i32 0
  %146 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 40, i32 0
  %147 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 49, i32 0
  %148 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 58, i32 0
  %149 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 67, i32 0
  %150 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 76, i32 0
  %151 = bitcast double* %142 to <4 x double>*
  %152 = load <4 x double>, <4 x double>* %151, align 8
  %153 = fmul <4 x double> %34, %152
  %154 = bitcast double* %143 to <4 x double>*
  %155 = load <4 x double>, <4 x double>* %154, align 8
  %156 = fmul <4 x double> %41, %155
  %157 = fadd <4 x double> %153, %156
  %158 = bitcast double* %144 to <4 x double>*
  %159 = load <4 x double>, <4 x double>* %158, align 8
  %160 = fmul <4 x double> %49, %159
  %161 = fadd <4 x double> %157, %160
  %162 = bitcast double* %145 to <4 x double>*
  %163 = load <4 x double>, <4 x double>* %162, align 8
  %164 = fmul <4 x double> %57, %163
  %165 = fadd <4 x double> %161, %164
  %166 = bitcast double* %146 to <4 x double>*
  %167 = load <4 x double>, <4 x double>* %166, align 8
  %168 = fmul <4 x double> %65, %167
  %169 = fadd <4 x double> %165, %168
  %170 = bitcast double* %147 to <4 x double>*
  %171 = load <4 x double>, <4 x double>* %170, align 8
  %172 = fmul <4 x double> %73, %171
  %173 = fadd <4 x double> %169, %172
  %174 = bitcast double* %148 to <4 x double>*
  %175 = load <4 x double>, <4 x double>* %174, align 8
  %176 = fmul <4 x double> %81, %175
  %177 = fadd <4 x double> %173, %176
  %178 = bitcast double* %149 to <4 x double>*
  %179 = load <4 x double>, <4 x double>* %178, align 8
  %180 = fmul <4 x double> %89, %179
  %181 = fadd <4 x double> %177, %180
  %182 = bitcast double* %150 to <4 x double>*
  %183 = load <4 x double>, <4 x double>* %182, align 8
  %184 = fmul <4 x double> %97, %183
  %185 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 6, i32 0
  %186 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 15, i32 0
  %187 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 24, i32 0
  %188 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 33, i32 0
  %189 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 42, i32 0
  %190 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 51, i32 0
  %191 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 60, i32 0
  %192 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 69, i32 0
  %193 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 78, i32 0
  %194 = bitcast double* %185 to <4 x double>*
  %195 = load <4 x double>, <4 x double>* %194, align 8
  %196 = fmul <4 x double> %34, %195
  %197 = bitcast double* %186 to <4 x double>*
  %198 = load <4 x double>, <4 x double>* %197, align 8
  %199 = fmul <4 x double> %41, %198
  %200 = fadd <4 x double> %196, %199
  %201 = bitcast double* %187 to <4 x double>*
  %202 = load <4 x double>, <4 x double>* %201, align 8
  %203 = fmul <4 x double> %49, %202
  %204 = fadd <4 x double> %200, %203
  %205 = bitcast double* %188 to <4 x double>*
  %206 = load <4 x double>, <4 x double>* %205, align 8
  %207 = fmul <4 x double> %57, %206
  %208 = fadd <4 x double> %204, %207
  %209 = bitcast double* %189 to <4 x double>*
  %210 = load <4 x double>, <4 x double>* %209, align 8
  %211 = fmul <4 x double> %65, %210
  %212 = fadd <4 x double> %208, %211
  %213 = bitcast double* %190 to <4 x double>*
  %214 = load <4 x double>, <4 x double>* %213, align 8
  %215 = fmul <4 x double> %73, %214
  %216 = fadd <4 x double> %212, %215
  %217 = bitcast double* %191 to <4 x double>*
  %218 = load <4 x double>, <4 x double>* %217, align 8
  %219 = fmul <4 x double> %81, %218
  %220 = fadd <4 x double> %216, %219
  %221 = bitcast double* %192 to <4 x double>*
  %222 = load <4 x double>, <4 x double>* %221, align 8
  %223 = fmul <4 x double> %89, %222
  %224 = fadd <4 x double> %220, %223
  %225 = bitcast double* %193 to <4 x double>*
  %226 = load <4 x double>, <4 x double>* %225, align 8
  %227 = fmul <4 x double> %97, %226
  %228 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 8, i32 0
  %229 = bitcast double* %228 to <2 x double>*
  %230 = load <2 x double>, <2 x double>* %229, align 8
  %231 = insertelement <2 x double> undef, double %5, i32 0
  %232 = insertelement <2 x double> %231, double %5, i32 1
  %233 = fmul <2 x double> %232, %230
  %234 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 17, i32 0
  %235 = bitcast double* %234 to <2 x double>*
  %236 = load <2 x double>, <2 x double>* %235, align 8
  %237 = insertelement <2 x double> undef, double %7, i32 0
  %238 = insertelement <2 x double> %237, double %7, i32 1
  %239 = fmul <2 x double> %238, %236
  %240 = fadd <2 x double> %233, %239
  %241 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 26, i32 0
  %242 = bitcast double* %241 to <2 x double>*
  %243 = load <2 x double>, <2 x double>* %242, align 8
  %244 = insertelement <2 x double> undef, double %10, i32 0
  %245 = insertelement <2 x double> %244, double %10, i32 1
  %246 = fmul <2 x double> %245, %243
  %247 = fadd <2 x double> %240, %246
  %248 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 35, i32 0
  %249 = bitcast double* %248 to <2 x double>*
  %250 = load <2 x double>, <2 x double>* %249, align 8
  %251 = insertelement <2 x double> undef, double %13, i32 0
  %252 = insertelement <2 x double> %251, double %13, i32 1
  %253 = fmul <2 x double> %252, %250
  %254 = fadd <2 x double> %247, %253
  %255 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 44, i32 0
  %256 = bitcast double* %255 to <2 x double>*
  %257 = load <2 x double>, <2 x double>* %256, align 8
  %258 = insertelement <2 x double> undef, double %16, i32 0
  %259 = insertelement <2 x double> %258, double %16, i32 1
  %260 = fmul <2 x double> %259, %257
  %261 = fadd <2 x double> %254, %260
  %262 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 53, i32 0
  %263 = bitcast double* %262 to <2 x double>*
  %264 = load <2 x double>, <2 x double>* %263, align 8
  %265 = insertelement <2 x double> undef, double %19, i32 0
  %266 = insertelement <2 x double> %265, double %19, i32 1
  %267 = fmul <2 x double> %266, %264
  %268 = fadd <2 x double> %261, %267
  %269 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 62, i32 0
  %270 = bitcast double* %269 to <2 x double>*
  %271 = load <2 x double>, <2 x double>* %270, align 8
  %272 = insertelement <2 x double> undef, double %22, i32 0
  %273 = insertelement <2 x double> %272, double %22, i32 1
  %274 = fmul <2 x double> %273, %271
  %275 = fadd <2 x double> %268, %274
  %276 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 71, i32 0
  %277 = bitcast double* %276 to <2 x double>*
  %278 = load <2 x double>, <2 x double>* %277, align 8
  %279 = insertelement <2 x double> undef, double %25, i32 0
  %280 = insertelement <2 x double> %279, double %25, i32 1
  %281 = fmul <2 x double> %280, %278
  %282 = fadd <2 x double> %275, %281
  %283 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 80, i32 0
  %284 = bitcast double* %283 to <2 x double>*
  %285 = load <2 x double>, <2 x double>* %284, align 8
  %286 = insertelement <2 x double> undef, double %28, i32 0
  %287 = insertelement <2 x double> %286, double %28, i32 1
  %288 = fmul <2 x double> %287, %285
  %289 = fadd <4 x double> %91, %98
  %290 = fadd <4 x double> %138, %141
  %291 = fadd <4 x double> %181, %184
  %292 = fadd <4 x double> %224, %227
  %293 = fadd <2 x double> %282, %288
  %294 = bitcast %SArray.87* %0 to <4 x double>*
  store <4 x double> %289, <4 x double>* %294, align 8
  %"#temp#.sroa.0.sroa.4.sroa.0.0.#temp#.sroa.0.sroa.4.0.#temp#.sroa.0.0..sroa_cast1.sroa_cast.sroa_idx" = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 2, i32 0
  %295 = bitcast double* %"#temp#.sroa.0.sroa.4.sroa.0.0.#temp#.sroa.0.sroa.4.0.#temp#.sroa.0.0..sroa_cast1.sroa_cast.sroa_idx" to <4 x double>*
  store <4 x double> %290, <4 x double>* %295, align 8
  %"#temp#.sroa.0.sroa.6.sroa.0.0.#temp#.sroa.0.sroa.6.0.#temp#.sroa.0.0..sroa_cast1.sroa_cast.sroa_idx" = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 4, i32 0
  %296 = bitcast double* %"#temp#.sroa.0.sroa.6.sroa.0.0.#temp#.sroa.0.sroa.6.0.#temp#.sroa.0.0..sroa_cast1.sroa_cast.sroa_idx" to <4 x double>*
  store <4 x double> %291, <4 x double>* %296, align 8
  %"#temp#.sroa.0.sroa.8.sroa.0.0.#temp#.sroa.0.sroa.8.0.#temp#.sroa.0.0..sroa_cast1.sroa_cast.sroa_idx" = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 6, i32 0
  %297 = bitcast double* %"#temp#.sroa.0.sroa.8.sroa.0.0.#temp#.sroa.0.sroa.8.0.#temp#.sroa.0.0..sroa_cast1.sroa_cast.sroa_idx" to <4 x double>*
  store <4 x double> %292, <4 x double>* %297, align 8
  %"#temp#.sroa.0.sroa.10.sroa.0.0.#temp#.sroa.0.sroa.10.0.#temp#.sroa.0.0..sroa_cast1.sroa_cast.sroa_idx" = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 8, i32 0
  %298 = bitcast double* %"#temp#.sroa.0.sroa.10.sroa.0.0.#temp#.sroa.0.sroa.10.0.#temp#.sroa.0.0..sroa_cast1.sroa_cast.sroa_idx" to <2 x double>*
  store <2 x double> %293, <2 x double>* %298, align 8
  ret void
}

julia> @code_llvm fft(x9)

define void @julia_fft_63737(%SArray.87* noalias nocapture sret, %SArray.82* nocapture readonly dereferenceable(72)) #0 !dbg !5 {
top:
  %2 = getelementptr inbounds %SArray.82, %SArray.82* %1, i64 0, i32 0, i64 0
  %3 = load double, double* %2, align 8
  %4 = getelementptr inbounds %SArray.82, %SArray.82* %1, i64 0, i32 0, i64 1
  %5 = load double, double* %4, align 8
  %6 = getelementptr inbounds %SArray.82, %SArray.82* %1, i64 0, i32 0, i64 2
  %7 = load double, double* %6, align 8
  %8 = getelementptr inbounds %SArray.82, %SArray.82* %1, i64 0, i32 0, i64 3
  %9 = load double, double* %8, align 8
  %10 = getelementptr inbounds %SArray.82, %SArray.82* %1, i64 0, i32 0, i64 4
  %11 = load double, double* %10, align 8
  %12 = getelementptr inbounds %SArray.82, %SArray.82* %1, i64 0, i32 0, i64 5
  %13 = load double, double* %12, align 8
  %14 = getelementptr inbounds %SArray.82, %SArray.82* %1, i64 0, i32 0, i64 6
  %15 = load double, double* %14, align 8
  %16 = getelementptr inbounds %SArray.82, %SArray.82* %1, i64 0, i32 0, i64 7
  %17 = load double, double* %16, align 8
  %18 = getelementptr inbounds %SArray.82, %SArray.82* %1, i64 0, i32 0, i64 8
  %19 = load double, double* %18, align 8
  %20 = fadd fast double %9, %3
  %21 = fadd fast double %20, %15
  %22 = fmul fast double %9, 5.000000e-01
  %23 = fsub fast double %3, %22
  %.neg355 = fmul fast double %15, -5.000000e-01
  %24 = fadd fast double %.neg355, %23
  %25 = fsub fast double %15, %9
  %26 = fmul fast double %25, 0x3FEBB67AE8584CAB
  %27 = fsub fast double %9, %15
  %28 = fmul fast double %27, 0x3FEBB67AE8584CAB
  %29 = fadd fast double %11, %5
  %30 = fadd fast double %29, %17
  %31 = fmul fast double %11, 5.000000e-01
  %32 = fsub fast double %5, %31
  %.neg353 = fmul fast double %17, -5.000000e-01
  %33 = fadd fast double %.neg353, %32
  %34 = fsub fast double %17, %11
  %35 = fsub fast double %11, %17
  %36 = fadd fast double %13, %7
  %37 = fadd fast double %36, %19
  %38 = fmul fast double %13, 5.000000e-01
  %39 = fsub fast double %7, %38
  %.neg351 = fmul fast double %19, -5.000000e-01
  %40 = fadd fast double %.neg351, %39
  %41 = fsub fast double %19, %13
  %42 = fsub fast double %13, %19
  %43 = fadd fast double %30, %21
  %44 = fadd fast double %43, %37
  %45 = fmul fast double %34, 0x3FE1D03E70EA84FF
  %46 = fmul fast double %34, 0x3FE53AAFE3631109
  %.neg = fmul fast double %33, 0xBFE491B7523C161C
  %47 = fmul fast double %41, 0x3FEB4AB2F290BEAB
  %48 = fmul fast double %41, 0x3FC33FC62FEE2044
  %.neg333 = fmul fast double %40, 0xBFEF838B8C811C17
  %49 = fadd fast double %45, %24
  %50 = insertelement <2 x double> <double 0x3FE8836FA2CF5039, double undef>, double %35, i32 1
  %51 = insertelement <2 x double> undef, double %33, i32 0
  %52 = insertelement <2 x double> %51, double 0x3FE53AAFE3631109, i32 1
  %53 = fmul fast <2 x double> %50, %52
  %54 = extractelement <2 x double> %53, i32 0
  %55 = fadd fast double %49, %54
  %56 = fadd fast double %55, %47
  %57 = insertelement <2 x double> <double 0x3FC63A1A7E0B7389, double undef>, double %42, i32 1
  %58 = insertelement <2 x double> undef, double %40, i32 0
  %59 = insertelement <2 x double> %58, double 0x3FC33FC62FEE2044, i32 1
  %60 = fmul fast <2 x double> %57, %59
  %61 = extractelement <2 x double> %60, i32 0
  %62 = fadd fast double %56, %61
  %63 = fadd fast double %.neg, %26
  %64 = fadd fast double %63, %46
  %65 = fadd fast double %64, %.neg333
  %66 = fadd fast double %65, %48
  %67 = fmul fast double %33, 0x3FC63A1A7E0B7389
  %68 = fmul fast double %35, 0x3FEB4AB2F290BEAB
  %69 = fmul fast double %35, 0x3FC33FC62FEE2044
  %.neg334 = fmul fast double %33, 0xBFEF838B8C811C17
  %70 = fmul fast double %40, 0x3FEE11F642522D1C
  %71 = fmul fast double %42, 0x3FD2F4E9034C7357
  %72 = fmul fast double %40, 0x3FD5E3A8748A0BF5
  %.neg335 = fmul fast double %42, 0xBFEA0AA16F5E991B
  %73 = fadd fast double %68, %24
  %74 = fadd fast double %73, %67
  %75 = fadd fast double %74, %71
  %76 = fsub fast double %75, %70
  %77 = fadd fast double %28, %.neg334
  %78 = fadd fast double %69, %77
  %79 = fadd fast double %78, %.neg335
  %80 = fsub fast double %79, %72
  %.neg336 = fmul fast double %30, -5.000000e-01
  %81 = fadd fast double %21, %.neg336
  %82 = fmul fast double %37, 5.000000e-01
  %83 = fsub fast double %81, %82
  %84 = fsub fast double %37, %30
  %85 = fmul fast double %84, 0x3FEBB67AE8584CAB
  %86 = fmul fast double %33, 0x3FEE11F642522D1C
  %87 = fmul fast double %34, 0x3FD2F4E9034C7357
  %88 = fmul fast double %33, 0x3FD5E3A8748A0BF5
  %.neg337 = fmul fast double %34, 0xBFEA0AA16F5E991B
  %.neg338 = fmul fast double %41, 0xBFE1D03E70EA84FF
  %89 = fmul fast double %40, 0x3FE491B7523C161C
  %90 = fadd fast double %87, %24
  %91 = fsub fast double %.neg338, %86
  %92 = fadd fast double %.neg337, %26
  %93 = fsub fast double %89, %88
  %94 = fmul fast double %33, 0xBFEE11F642522D1C
  %.neg339 = fmul fast double %35, 0xBFD2F4E9034C7357
  %.neg340 = fmul fast double %35, 0xBFEA0AA16F5E991B
  %95 = fmul fast double %42, 0x3FE53AAFE3631109
  %96 = insertelement <4 x double> <double 0x3FE8836FA2CF5039, double undef, double undef, double undef>, double %41, i32 1
  %97 = insertelement <4 x double> %96, double %42, i32 2
  %98 = insertelement <4 x double> %97, double %40, i32 3
  %99 = insertelement <4 x double> undef, double %40, i32 0
  %100 = insertelement <4 x double> %99, double 0x3FE53AAFE3631109, i32 1
  %101 = insertelement <4 x double> %100, double 0x3FE1D03E70EA84FF, i32 2
  %102 = insertelement <4 x double> %101, double 0xBFE491B7523C161C, i32 3
  %103 = fmul fast <4 x double> %98, %102
  %104 = fadd fast double %.neg339, %24
  %105 = fadd fast double %104, %94
  %106 = fadd fast double %.neg340, %28
  %107 = fadd fast double %106, %88
  %108 = insertelement <4 x double> undef, double %91, i32 0
  %109 = insertelement <4 x double> %108, double %93, i32 1
  %110 = insertelement <4 x double> %109, double %105, i32 2
  %111 = insertelement <4 x double> %110, double %107, i32 3
  %112 = fadd fast <4 x double> %111, %103
  %113 = insertelement <4 x double> undef, double %90, i32 0
  %114 = insertelement <4 x double> %113, double %92, i32 1
  %115 = extractelement <4 x double> %103, i32 0
  %116 = insertelement <4 x double> %114, double %115, i32 2
  %117 = insertelement <4 x double> %116, double %95, i32 3
  %118 = fadd fast <4 x double> %112, %117
  %119 = fsub fast double %30, %37
  %120 = fmul fast double %119, 0x3FEBB67AE8584CAB
  %.neg343 = fmul fast double %34, 0xBFEB4AB2F290BEAB
  %121 = fmul fast double %34, 0x3FC33FC62FEE2044
  %122 = fmul fast double %33, 0x3FEF838B8C811C17
  %123 = fmul fast double %40, 0xBFEE11F642522D1C
  %.neg344 = fmul fast double %41, 0xBFD2F4E9034C7357
  %.neg345 = fmul fast double %41, 0xBFEA0AA16F5E991B
  %124 = fadd fast double %.neg343, %24
  %125 = fadd fast double %124, %67
  %126 = fadd fast double %125, %.neg344
  %127 = fadd fast double %126, %123
  %128 = fadd fast double %122, %26
  %129 = fadd fast double %128, %121
  %130 = fadd fast double %129, %.neg345
  %131 = fadd fast double %130, %72
  %132 = insertelement <2 x double> <double 0xBFE1D03E70EA84FF, double undef>, double %33, i32 1
  %133 = insertelement <2 x double> undef, double %35, i32 0
  %134 = insertelement <2 x double> %133, double 0x3FE491B7523C161C, i32 1
  %135 = fmul fast <2 x double> %132, %134
  %136 = insertelement <2 x double> <double 0xBFEB4AB2F290BEAB, double undef>, double %40, i32 1
  %137 = insertelement <2 x double> undef, double %42, i32 0
  %138 = insertelement <2 x double> %137, double 0x3FEF838B8C811C17, i32 1
  %139 = fmul fast <2 x double> %136, %138
  %140 = insertelement <2 x double> undef, double %24, i32 0
  %141 = insertelement <2 x double> %140, double %28, i32 1
  %142 = fadd fast <2 x double> %135, %141
  %143 = fadd fast <2 x double> %142, %53
  %144 = fadd fast <2 x double> %143, %139
  %145 = fadd fast <2 x double> %144, %60
  %.sroa.0294.sroa.0.0..sroa.0294.0..sroa_cast.sroa_idx = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 0, i32 0
  store double %44, double* %.sroa.0294.sroa.0.0..sroa.0294.0..sroa_cast.sroa_idx, align 8
  %.sroa.0294.sroa.2.0..sroa.0294.0..sroa_cast.sroa_idx316 = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 0, i32 1
  store double 0.000000e+00, double* %.sroa.0294.sroa.2.0..sroa.0294.0..sroa_cast.sroa_idx316, align 8
  %.sroa.0294.sroa.3.0..sroa.0294.0..sroa_cast.sroa_idx317 = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 1, i32 0
  store double %62, double* %.sroa.0294.sroa.3.0..sroa.0294.0..sroa_cast.sroa_idx317, align 8
  %.sroa.0294.sroa.4.0..sroa.0294.0..sroa_cast.sroa_idx318 = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 1, i32 1
  store double %66, double* %.sroa.0294.sroa.4.0..sroa.0294.0..sroa_cast.sroa_idx318, align 8
  %.sroa.0294.sroa.5.0..sroa.0294.0..sroa_cast.sroa_idx319 = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 2, i32 0
  store double %76, double* %.sroa.0294.sroa.5.0..sroa.0294.0..sroa_cast.sroa_idx319, align 8
  %.sroa.0294.sroa.6.0..sroa.0294.0..sroa_cast.sroa_idx320 = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 2, i32 1
  store double %80, double* %.sroa.0294.sroa.6.0..sroa.0294.0..sroa_cast.sroa_idx320, align 8
  %.sroa.0294.sroa.7.0..sroa.0294.0..sroa_cast.sroa_idx321 = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 3, i32 0
  store double %83, double* %.sroa.0294.sroa.7.0..sroa.0294.0..sroa_cast.sroa_idx321, align 8
  %.sroa.0294.sroa.8.0..sroa.0294.0..sroa_cast.sroa_idx322 = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 3, i32 1
  store double %85, double* %.sroa.0294.sroa.8.0..sroa.0294.0..sroa_cast.sroa_idx322, align 8
  %.sroa.0294.sroa.9.0..sroa.0294.0..sroa_cast.sroa_idx323 = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 4, i32 0
  %146 = bitcast double* %.sroa.0294.sroa.9.0..sroa.0294.0..sroa_cast.sroa_idx323 to <4 x double>*
  store <4 x double> %118, <4 x double>* %146, align 8
  %.sroa.0294.sroa.13.0..sroa.0294.0..sroa_cast.sroa_idx327 = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 6, i32 0
  store double %83, double* %.sroa.0294.sroa.13.0..sroa.0294.0..sroa_cast.sroa_idx327, align 8
  %.sroa.0294.sroa.14.0..sroa.0294.0..sroa_cast.sroa_idx328 = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 6, i32 1
  store double %120, double* %.sroa.0294.sroa.14.0..sroa.0294.0..sroa_cast.sroa_idx328, align 8
  %.sroa.0294.sroa.15.0..sroa.0294.0..sroa_cast.sroa_idx329 = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 7, i32 0
  store double %127, double* %.sroa.0294.sroa.15.0..sroa.0294.0..sroa_cast.sroa_idx329, align 8
  %.sroa.0294.sroa.16.0..sroa.0294.0..sroa_cast.sroa_idx330 = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 7, i32 1
  store double %131, double* %.sroa.0294.sroa.16.0..sroa.0294.0..sroa_cast.sroa_idx330, align 8
  %.sroa.0294.sroa.17.0..sroa.0294.0..sroa_cast.sroa_idx331 = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 8, i32 0
  %147 = bitcast double* %.sroa.0294.sroa.17.0..sroa.0294.0..sroa_cast.sroa_idx331 to <2 x double>*
  store <2 x double> %145, <2 x double>* %147, align 8
  ret void
}

Is there some trick that would help?

It would be good to know before I seriously consider splitting the numbers up into their real and imaginary components.
More likely, I'll try wrapping let or begin blocks around at least the extraction of the "x_i"s and the chunk that uses them (so they can be freed from memory earlier? is that right?), and then try something along those lines to keep things type stable while some chunk of variables initially remains real instead of complex, etc.
I'll also add an rfft (and of course the ifft and irfft).

By the way, what is your use case for implementing these tiny transformations?

Ideally, I would have liked for this to stay optimal into the ~64 range.
My application involves manipulating a lot of polynomials. I'm roughly following "A Basis for Multivariate Hermite Interpolation" (links on google seem dead), where I can express basis functions as the product of lots of univariate polynomials, each one with a different set of roots factored out -- ffts help with that factoring (sketch below). Depending on how many points I decide to interpolate at a time and the dimensionality of the problem, the univariate polynomials could be as small as degree 16. There, static ffts would perform well.
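
For concreteness, a minimal sketch (hypothetical polymul_fft, assuming FFTW.jl, or Base fft on 0.6) of the kind of thing ffts buy here: multiplying two polynomial coefficient vectors via the convolution theorem.

using FFTW  # on Julia 0.6, fft/ifft live in Base

# Multiply two coefficient-first polynomial vectors by pointwise-multiplying
# their zero-padded DFTs. Illustrative only, not tuned for performance.
function polymul_fft(p::Vector{Float64}, q::Vector{Float64})
    n = length(p) + length(q) - 1            # length of the product's coefficient vector
    P = fft(vcat(p, zeros(n - length(p))))   # zero-pad so circular convolution equals linear convolution
    Q = fft(vcat(q, zeros(n - length(q))))
    real.(ifft(P .* Q))                      # coefficients of p*q (real up to rounding)
end

polymul_fft([1.0, 1.0], [1.0, -1.0])  # (1 + x)(1 - x)  =>  [1.0, 0.0, -1.0]
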
If I'm interpolating many more points, I may also want to do all the ffts within a GPU kernel (to avoid the overhead of repeated calls, i.e. (1) preprocess -> (2) CLFFT -> (3) postprocess), necessitating some sort of "pure Julia" approach.**
Lots of unknowns -- but I wanted to at least get far enough to have some idea of how well this will work. To help with one of the "ifs", I am planning on following "Tutorial: OpenCL SGEMM tuning" (but in Julia, and perhaps HIP)
to play around and get a better idea of what it takes to actually get good performance out of a GPU. (While CLBLAS is way faster than OpenBLAS for huge matrices, OpenCL.jl did not speed up some early tests...)
Another unknown is that I really ought to rush to get a working test case, to get a better idea of which algorithms are actually worth optimizing. No need to have a fast dud.
**Stepping back further, this is to approximate probability distributions. I think that, at least for low-dimensional problems, I can get accurate answers orders of magnitude faster than MCMC -- great for simulations, like powering studies or adaptive designs. Another plus is that these approximations are conjugate, meaning some higher-dimensional problems could be factored and then solved via a sum-product algorithm.

[Also, GPUs don't have SIMD instructions, but ideally it would have great CPU performance too.]

Be aware that in @generated function code generators you shouldn't call any functions which might be extended after the generated function is defined. Generally this limits us mostly to doing loop unrolling, with other information (such as the output eltype) computed inside the code which is generated.
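
A toy illustration of that constraint (hypothetical naive_dft, not code from this PR): the generator body only unrolls over the static size N, and the output eltype is computed inside the generated code rather than while generating it.

using StaticArrays

@generated function naive_dft(x::SVector{N}) where {N}
    # Generation time: pure unrolling, driven only by the size N.
    outs = map(1:N) do k
        terms = [:(x[$j] * cis(-2π * $((j - 1) * (k - 1)) / $N)) for j in 1:N]
        :(+($(terms...)))
    end
    quote
        T = complex(float(eltype(x)))   # eltype logic runs in the generated code
        SVector{$N,T}($(outs...))
    end
end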

I don't think this is an issue anywhere -- do you have something specific in mind?
The arranging and recombining of the prime ffts is basically just loop unrolling.

FFTW has optimized codelets for a whole heap of small sizes

Definitely a lot of literature there that I'm unfamiliar with, and optimizations that go into something like FFTW.

CC @stevengj - just in case statically sized FFTs in Julia pique your interest?

Ha ha. That name comes up often on Wikipedia's FFT-related pages!


stevengj commented Feb 7, 2018

In case it is helpful, I wrote a static-size FFT generator in Julia a while back: JuliaLang/julia#6193


andyferris commented Feb 7, 2018

Is there any proper channel or protocol to see if it's better to submit a pull request or create a new package?
I asked on gitter and then figured it wouldn't hurt to try a pull request.
If you guys think it'd be better as a separate package, I won't object to someone closing this.

I suppose this is the proper channel? (There's also a #staticarrays channel on Slack.) I'm happy that you tried a PR here, but given the changes to Base, the best thing seems to be a separate package. A separate package is also good for experimentation -- it lets you merge faster, run the tests more quickly, etc.


c42f commented Feb 8, 2018

Yes, there's no "proper" channel as such; the community is mostly loosely coupled through the package ecosystem. It's pretty easy to create a new package - I'd definitely encourage it for fft work, as it should be pretty decoupled from the core StaticArrays functionality.

Thanks for your big description of the use case, this sounds really interesting and a good use for statically sized transformations if you're going to stick to several fixed sizes.

The reason for all the += is that the functions will otherwise allocate memory and slow down dramatically. Splitting them up into several lines solves that problem.

This sounds odd. It's probably due to a+b+c+d+... parsing as +(a,b,c,d,...), which has been a problem in the past. An alternative workaround is to explicitly emit pairwise operations, e.g. ((a+b)+(c+d))+....
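
For illustration, a small hypothetical helper a generator could use to emit such pairwise sums (pairwise_sum is not from this PR):

# Turn a list of expressions into an explicitly pairwise sum,
# e.g. ((a + b) + (c + d)), instead of one n-ary +(a, b, c, d, ...) call.
function pairwise_sum(terms::Vector)
    length(terms) == 1 && return terms[1]
    mid = length(terms) ÷ 2
    :($(pairwise_sum(terms[1:mid])) + $(pairwise_sum(terms[mid+1:end])))
end

pairwise_sum([:a, :b, :c, :d])   # => :((a + b) + (c + d))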

More likely, I'll try wrapping let or begin blocks around at least the extraction of the "x_i"s and the chunk that uses them (so they can be freed from memory earlier? is that right?),

The compiler shouldn't need any help with this - LLVM optimization passes do a great job of reasoning about the lifetime of values which need to be retained, optimizing the order of operations to keep intermediate results in registers rather than storing them on the stack, reusing stack slots to keep the frame small, etc. What you can do to help is generate straight-line code which is easy for the optimizer to reason about, which I think you're already doing. If I had to guess, I'd say the naive DFT only wins in some cases because it maps very nicely to the hardware and has few data dependencies (potentially fewer pipeline stalls). Data dependencies don't look like they'd be a problem for your bigger unrolled FFTs because there's a lot of arithmetic to do between each stage.

In general, another thing you can do to help the compiler is logical rearrangement of operations -- rewrites that are equivalent for real numbers but forbidden by IEEE floating point semantics, or simply transformations the compiler can't discover on its own. I see you've already got fastmath in there, so that may allow some of these anyway.

and then try something along those lines to keep things type stable while some chunk of variables initially remains real instead of complex, etc.

Type stability probably shouldn't be a problem for unrolled loops - even if a variable would change type in the equivalent looped version - as each variable will have a well defined type at each point in the unrolled source. But I'm not sure the compiler deals with all such cases properly in 0.6. You might get some value out of using reals where possible in the initial stages, I'd say it's worth trying.
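
For instance, a hand-written (hypothetical) size-4 example of what that could look like: every variable has one fixed type throughout, and im only enters in the final stage.

using StaticArrays

# Hand-unrolled size-4 transform, illustrative only (not this PR's generated code).
function fft4_real_first(A::SVector{4,Float64})
    e_1 = A[1] + A[3];  e_2 = A[1] - A[3]   # stage 1: purely real butterflies
    e_3 = A[2] + A[4];  e_4 = A[2] - A[4]
    o_1 = e_1 + e_3                          # still real
    o_2 = e_2 - im * e_4                     # first complex value
    o_3 = e_1 - e_3
    o_4 = e_2 + im * e_4
    SVector{4,Complex{Float64}}(o_1, o_2, o_3, o_4)
end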

By the way, for expression generation the neatest technique I've seen contributed to StaticArrays is the code in qr_unrolled -- it looks very much like the original qr algorithm, but acts to transform the loops into a big unrolled expression. Worth a quick look as a nice example.
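
A toy sketch of that pattern (hypothetical unrolled_dot, not the actual qr_unrolled code): the generator reads like the looped algorithm, but the loop runs at generation time and pushes statements into a block.

using StaticArrays

@generated function unrolled_dot(a::SVector{N}, b::SVector{N}) where {N}
    code = Expr(:block, :(acc = a[1] * b[1]))
    for i in 2:N                    # this loop runs while generating the expression
        push!(code.args, :(acc += a[$i] * b[$i]))
    end
    push!(code.args, :(return acc))
    code
end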


milankl commented May 2, 2019

Can I just ask whether it would be possible to put the fft efforts presented here into a separate package? I think there would be much appreciation for a pure-Julia fft implementation, which would also support data types other than Float32/Float64. I would be very interested in running an rfft/irfft with Float16, BFloat16, and other low-precision formats.
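
For what it's worth, a minimal generic sketch along those lines (hypothetical fft_generic; radix-2, power-of-two lengths only, unoptimized) that only assumes Complex{<:AbstractFloat}, so it should also run with Float16 and, in principle, BFloat16 from a package defining the required trig functions:

# Recursive radix-2 Cooley-Tukey, generic over the floating-point type.
function fft_generic(x::AbstractVector{Complex{T}}) where {T<:AbstractFloat}
    n = length(x)
    n == 1 && return [x[1]]
    even = fft_generic(x[1:2:end])
    odd  = fft_generic(x[2:2:end])
    out = similar(x, n)
    for k in 1:n÷2
        tw = cis(T(-2π * (k - 1) / n)) * odd[k]   # twiddle factor computed in precision T
        out[k]       = even[k] + tw
        out[k + n÷2] = even[k] - tw
    end
    out
end

fft_generic(Complex{Float16}.([1, 2, 3, 4]))   # => Complex{Float16}[10+0im, -2+2im, -2+0im, -2-2im]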

ωcos(N, i) = cospi(-2i/N)
ωsin(N, i) = sinpi(-2i/N)

ω(N, i) = ωcos(N, i) + im*ωsin(N, i)

Does the generated code take advantage of the many trig factors that are +/-1 or +/-i?
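
Not in this PR as it stands, but one way a generator could exploit that -- a hypothetical twiddle_expr helper that emits a copy, a negation, or a multiply by ±im when ω(N, i) is exactly ±1 or ±im, and a constant-folded complex multiply otherwise:

# Hypothetical helper, not part of this PR. `x` is the symbol of the value
# being multiplied by the twiddle factor ω(N, i) = cospi(-2i/N) + im*sinpi(-2i/N).
function twiddle_expr(N, i, x::Symbol)
    q = mod(i, N)
    if 4q % N == 0                       # ω(N, i) lands exactly on 1, -im, -1, or im
        quarter = 4q ÷ N
        return (:($x), :(-im * $x), :(-$x), :(im * $x))[quarter + 1]
    end
    # General case: the cospi/sinpi coefficients are folded to literals at generation time.
    :(($(cospi(-2i / N)) + im * $(sinpi(-2i / N))) * $x)
end

twiddle_expr(8, 2, :x_3)   # => :(-im * x_3)
twiddle_expr(8, 6, :x_3)   # => :(im * x_3)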

c42f added the feature label Aug 1, 2019

c42f commented Aug 1, 2019

I'm torn on this. On the one hand, we should close it because Static FFTs really deserve their own package and this PR is incomplete. On the other hand, it's a cool idea and a neat prototype.

On balance I think I should close this issue because StaticArrays is already somewhat unwieldy with all the specialized linear algebra stuff in here. But I really encourage you to scrub this code up and make your own StaticFFTs.jl package; I think it would be super cool.

c42f closed this Aug 1, 2019