-
Notifications
You must be signed in to change notification settings - Fork 149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added FFTs. Still need to add iFFTs (straightforward) and... #363
Conversation
…formance (less straightforward).
This is pretty cool, thanks for taking the time to put it into a PR. It's great that you can get a speedup from a really simple unrolled version of the original algorithm. It's also kind of remarkable how competitive Base.fft (I assume fftw) is here as naively specializing and inlining everything usually wins by a larger factor. I only had a very quick look at the code - a few thoughts:
By the way, what is your use case for implementing these tiny transformations? |
I agree, this is cool :) I'm not sure if you're aware but the CC @stevengj - just in case statically sized FFTs in Julia piques your interest? |
Is there any proper channel or protocol to see if it's better to submit a pull request or create a new package? Anyway, I hope to make a more serious post and add revisions next weekend (either here, or in StaticFFTs.jl). Julia's generated functions are probably an ideal way to actually set up ffts, so I wouldn't be surprised if there's a great pure Julia implementation someday.
function MatrixDFT(n::Int, ::Type{T} = Float64) where T
out = fill(one(Complex{T}), n, n)
for j ∈ 2:n
out[j, 2] = exp( -2π*im*(j-1) / n )
end
for i ∈ 3:n, j ∈ i:n
out[j,i] = out[j,i-1] * out[j,2]
end
Symmetric(out, :L) #./= √n
end
function MatrixDFT(::Val{n}, ::Type{T} = Float64) where {T,b}
out = fill(one(Complex{T}), n, n)
for j ∈ 2:n
out[j, 2] = exp( -2π*im*(j-1) / n )
end
for i ∈ 3:n, j ∈ i:n
out[j,i] = out[j,i-1] * out[j,2]
end
SMatrix{n,n}(out) #./= √n
end Then (I did start Julia with julia> n = 4
4
julia> x = @SVector randn(n);
julia> xc = Complex.([x...]);
julia> xcs = similar(xc);
julia> px = plan_fft(xc);
julia> symmetric = MatrixDFT(n);
julia> regular_dft = Array(symmetric);
julia> static = MatrixDFT(Val(n));
julia> @benchmark $static * $x
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 15.597 ns (0.00% GC)
median time: 15.603 ns (0.00% GC)
mean time: 16.307 ns (0.00% GC)
maximum time: 55.080 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 998
julia> @benchmark A_mul_B!($xcs, $symmetric, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 226.052 ns (0.00% GC)
median time: 234.811 ns (0.00% GC)
mean time: 242.141 ns (0.00% GC)
maximum time: 363.591 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 460
julia> @benchmark A_mul_B!($xcs, $regular_dft, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 47.675 ns (0.00% GC)
median time: 47.923 ns (0.00% GC)
mean time: 49.421 ns (0.00% GC)
maximum time: 91.303 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 988
julia> @benchmark fft($x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 19.239 ns (0.00% GC)
median time: 19.242 ns (0.00% GC)
mean time: 19.665 ns (0.00% GC)
maximum time: 62.975 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 997
julia> @benchmark A_mul_B!($xcs, $px, $xc)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 62.733 ns (0.00% GC)
median time: 65.173 ns (0.00% GC)
mean time: 67.553 ns (0.00% GC)
maximum time: 128.734 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 973
julia> n = 8
8
julia> x = @SVector randn(n);
julia> xc = Complex.([x...]);
julia> xcs = similar(xc);
julia> px = plan_fft(xc);
julia> symmetric = MatrixDFT(n);
julia> regular_dft = Array(symmetric);
julia> static = MatrixDFT(Val(n));
julia> @benchmark $static * $x
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 40.854 ns (0.00% GC)
median time: 43.045 ns (0.00% GC)
mean time: 43.855 ns (0.00% GC)
maximum time: 63.999 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 990
julia> @benchmark A_mul_B!($xcs, $symmetric, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 764.807 ns (0.00% GC)
median time: 773.959 ns (0.00% GC)
mean time: 791.216 ns (0.00% GC)
maximum time: 1.403 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 109
julia> @benchmark A_mul_B!($xcs, $regular_dft, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 159.669 ns (0.00% GC)
median time: 167.005 ns (0.00% GC)
mean time: 171.965 ns (0.00% GC)
maximum time: 245.055 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 788
julia> @benchmark fft($x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 38.905 ns (0.00% GC)
median time: 38.907 ns (0.00% GC)
mean time: 39.807 ns (0.00% GC)
maximum time: 73.683 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 992
julia> @benchmark A_mul_B!($xcs, $px, $xc)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 78.316 ns (0.00% GC)
median time: 82.653 ns (0.00% GC)
mean time: 83.989 ns (0.00% GC)
maximum time: 155.556 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 968
julia> n = 9
9
julia> x = @SVector randn(n);
julia> xc = Complex.([x...]);
julia> xcs = similar(xc);
julia> px = plan_fft(xc);
julia> symmetric = MatrixDFT(n);
julia> regular_dft = Array(symmetric);
julia> static = MatrixDFT(Val(n));
julia> @benchmark $static * $x
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 51.537 ns (0.00% GC)
median time: 55.418 ns (0.00% GC)
mean time: 56.333 ns (0.00% GC)
maximum time: 103.198 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 986
julia> @benchmark A_mul_B!($xcs, $symmetric, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 971.227 ns (0.00% GC)
median time: 1.000 μs (0.00% GC)
mean time: 1.016 μs (0.00% GC)
maximum time: 2.660 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 22
julia> @benchmark A_mul_B!($xcs, $regular_dft, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 183.720 ns (0.00% GC)
median time: 190.713 ns (0.00% GC)
mean time: 193.549 ns (0.00% GC)
maximum time: 291.679 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 626
julia> @benchmark fft($x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 70.955 ns (0.00% GC)
median time: 70.965 ns (0.00% GC)
mean time: 73.333 ns (0.00% GC)
maximum time: 109.058 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 975
julia> @benchmark A_mul_B!($xcs, $px, $xc)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 105.001 ns (0.00% GC)
median time: 108.898 ns (0.00% GC)
mean time: 114.061 ns (0.00% GC)
maximum time: 189.999 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 937
julia> n = 12
12
julia> x = @SVector randn(n);
julia> xc = Complex.([x...]);
julia> xcs = similar(xc);
julia> px = plan_fft(xc);
julia> symmetric = MatrixDFT(n);
julia> regular_dft = Array(symmetric);
julia> static = MatrixDFT(Val(n));
julia> @benchmark $static * $x
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 95.963 ns (0.00% GC)
median time: 99.998 ns (0.00% GC)
mean time: 102.809 ns (0.00% GC)
maximum time: 177.366 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 946
julia> @benchmark A_mul_B!($xcs, $symmetric, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.974 μs (0.00% GC)
median time: 2.103 μs (0.00% GC)
mean time: 2.161 μs (0.00% GC)
maximum time: 6.809 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
julia> @benchmark A_mul_B!($xcs, $regular_dft, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 290.830 ns (0.00% GC)
median time: 297.380 ns (0.00% GC)
mean time: 308.262 ns (0.00% GC)
maximum time: 549.749 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 271
julia> @benchmark fft($x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 52.028 ns (0.00% GC)
median time: 52.042 ns (0.00% GC)
mean time: 55.162 ns (0.00% GC)
maximum time: 115.460 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 986
julia> @benchmark A_mul_B!($xcs, $px, $xc)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 89.946 ns (0.00% GC)
median time: 97.067 ns (0.00% GC)
mean time: 101.421 ns (0.00% GC)
maximum time: 207.158 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 956
julia> n = 24
24
julia> x = @SVector randn(n);
julia> xc = Complex.([x...]);
julia> xcs = similar(xc);
julia> px = plan_fft(xc);
julia> symmetric = MatrixDFT(n);
julia> regular_dft = Array(symmetric);
julia> static = MatrixDFT(Val(n));
julia> @benchmark $static * $x
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 443.359 ns (0.00% GC)
median time: 451.833 ns (0.00% GC)
mean time: 469.978 ns (0.00% GC)
maximum time: 832.616 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 198
julia> @benchmark A_mul_B!($xcs, $symmetric, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 7.534 μs (0.00% GC)
median time: 8.146 μs (0.00% GC)
mean time: 8.374 μs (0.00% GC)
maximum time: 20.068 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 4
julia> @benchmark A_mul_B!($xcs, $regular_dft, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.043 μs (0.00% GC)
median time: 1.056 μs (0.00% GC)
mean time: 1.086 μs (0.00% GC)
maximum time: 4.266 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
julia> @benchmark fft($x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 116.139 ns (0.00% GC)
median time: 116.647 ns (0.00% GC)
mean time: 119.742 ns (0.00% GC)
maximum time: 168.846 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 921
julia> @benchmark A_mul_B!($xcs, $px, $xc)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 161.727 ns (0.00% GC)
median time: 163.455 ns (0.00% GC)
mean time: 170.398 ns (0.00% GC)
maximum time: 352.685 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 772
julia> n = 64
64
julia> x = @SVector randn(n);
julia> xc = Complex.([x...]);
julia> xcs = similar(xc);
julia> px = plan_fft(xc);
julia> symmetric = MatrixDFT(n);
julia> regular_dft = Array(symmetric);
julia> @benchmark A_mul_B!($xcs, $symmetric, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 53.327 μs (0.00% GC)
median time: 54.325 μs (0.00% GC)
mean time: 55.412 μs (0.00% GC)
maximum time: 109.277 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark A_mul_B!($xcs, $regular_dft, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 6.344 μs (0.00% GC)
median time: 6.354 μs (0.00% GC)
mean time: 6.514 μs (0.00% GC)
maximum time: 21.284 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 5
julia> @benchmark fft($x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 743.430 ns (0.00% GC)
median time: 745.383 ns (0.00% GC)
mean time: 763.079 ns (0.00% GC)
maximum time: 1.180 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 128
julia> @benchmark A_mul_B!($xcs, $px, $xc)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 247.975 ns (0.00% GC)
median time: 254.828 ns (0.00% GC)
mean time: 259.313 ns (0.00% GC)
maximum time: 377.162 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 401 It falls behind FFTW by 64, but it starts and remains clearly well ahead of the DFT matrices by then. O(nlog(n)) scaling vs theoretically O(n^2)*, so it would probably have to fall into major memory issues before it falls behind them. I'm surprised at the performance for julia> x4 = @SVector randn(4);
julia> dft4 = MatrixDFT(Val(4));
julia> @code_llvm dft4 * x4
define void @"julia_*_63855"(%SArray.73* noalias nocapture sret, %SArray.70* nocapture readonly dereferenceable(256), %SArray.68* nocapture readonly dereferenceable(32)) #0 !dbg !5 {
top:
%3 = getelementptr inbounds %SArray.68, %SArray.68* %2, i64 0, i32 0, i64 0
%4 = getelementptr inbounds %SArray.68, %SArray.68* %2, i64 0, i32 0, i64 1
%5 = load double, double* %3, align 8
%6 = getelementptr inbounds %SArray.70, %SArray.70* %1, i64 0, i32 0, i64 4, i32 0
%7 = load double, double* %4, align 8
%8 = getelementptr inbounds %SArray.68, %SArray.68* %2, i64 0, i32 0, i64 2
%9 = getelementptr inbounds %SArray.70, %SArray.70* %1, i64 0, i32 0, i64 8, i32 0
%10 = load double, double* %8, align 8
%11 = getelementptr inbounds %SArray.68, %SArray.68* %2, i64 0, i32 0, i64 3
%12 = getelementptr inbounds %SArray.70, %SArray.70* %1, i64 0, i32 0, i64 12, i32 0
%13 = load double, double* %11, align 8
%14 = bitcast %SArray.70* %1 to <4 x double>*
%15 = load <4 x double>, <4 x double>* %14, align 8
%16 = insertelement <4 x double> undef, double %5, i32 0
%17 = insertelement <4 x double> %16, double %5, i32 1
%18 = insertelement <4 x double> %17, double %5, i32 2
%19 = insertelement <4 x double> %18, double %5, i32 3
%20 = fmul <4 x double> %19, %15
%21 = bitcast double* %6 to <4 x double>*
%22 = load <4 x double>, <4 x double>* %21, align 8
%23 = insertelement <4 x double> undef, double %7, i32 0
%24 = insertelement <4 x double> %23, double %7, i32 1
%25 = insertelement <4 x double> %24, double %7, i32 2
%26 = insertelement <4 x double> %25, double %7, i32 3
%27 = fmul <4 x double> %26, %22
%28 = fadd <4 x double> %20, %27
%29 = bitcast double* %9 to <4 x double>*
%30 = load <4 x double>, <4 x double>* %29, align 8
%31 = insertelement <4 x double> undef, double %10, i32 0
%32 = insertelement <4 x double> %31, double %10, i32 1
%33 = insertelement <4 x double> %32, double %10, i32 2
%34 = insertelement <4 x double> %33, double %10, i32 3
%35 = fmul <4 x double> %34, %30
%36 = fadd <4 x double> %28, %35
%37 = bitcast double* %12 to <4 x double>*
%38 = load <4 x double>, <4 x double>* %37, align 8
%39 = insertelement <4 x double> undef, double %13, i32 0
%40 = insertelement <4 x double> %39, double %13, i32 1
%41 = insertelement <4 x double> %40, double %13, i32 2
%42 = insertelement <4 x double> %41, double %13, i32 3
%43 = fmul <4 x double> %42, %38
%44 = getelementptr inbounds %SArray.70, %SArray.70* %1, i64 0, i32 0, i64 2, i32 0
%45 = getelementptr inbounds %SArray.70, %SArray.70* %1, i64 0, i32 0, i64 6, i32 0
%46 = getelementptr inbounds %SArray.70, %SArray.70* %1, i64 0, i32 0, i64 10, i32 0
%47 = getelementptr inbounds %SArray.70, %SArray.70* %1, i64 0, i32 0, i64 14, i32 0
%48 = bitcast double* %44 to <4 x double>*
%49 = load <4 x double>, <4 x double>* %48, align 8
%50 = fmul <4 x double> %19, %49
%51 = bitcast double* %45 to <4 x double>*
%52 = load <4 x double>, <4 x double>* %51, align 8
%53 = fmul <4 x double> %26, %52
%54 = fadd <4 x double> %50, %53
%55 = bitcast double* %46 to <4 x double>*
%56 = load <4 x double>, <4 x double>* %55, align 8
%57 = fmul <4 x double> %34, %56
%58 = fadd <4 x double> %54, %57
%59 = bitcast double* %47 to <4 x double>*
%60 = load <4 x double>, <4 x double>* %59, align 8
%61 = fmul <4 x double> %42, %60
%62 = fadd <4 x double> %36, %43
%63 = fadd <4 x double> %58, %61
%64 = bitcast %SArray.73* %0 to <4 x double>*
store <4 x double> %62, <4 x double>* %64, align 8
%"#temp#.sroa.0.sroa.4.sroa.0.0.#temp#.sroa.0.sroa.4.0.#temp#.sroa.0.0..sroa_cast1.sroa_cast.sroa_idx" = getelementptr inbounds %SArray.73, %SArray.73* %0, i64 0, i32 0, i64 2, i32 0
%65 = bitcast double* %"#temp#.sroa.0.sroa.4.sroa.0.0.#temp#.sroa.0.sroa.4.0.#temp#.sroa.0.0..sroa_cast1.sroa_cast.sroa_idx" to <4 x double>*
store <4 x double> %63, <4 x double>* %65, align 8
ret void
}
julia> @code_llvm fft(x4)
define void @julia_fft_63446(%SArray.73* noalias nocapture sret, %SArray.68* nocapture readonly dereferenceable(32)) #0 !dbg !5 {
top:
%2 = getelementptr inbounds %SArray.68, %SArray.68* %1, i64 0, i32 0, i64 0
%3 = load double, double* %2, align 8
%4 = getelementptr inbounds %SArray.68, %SArray.68* %1, i64 0, i32 0, i64 1
%5 = load double, double* %4, align 8
%6 = getelementptr inbounds %SArray.68, %SArray.68* %1, i64 0, i32 0, i64 2
%7 = load double, double* %6, align 8
%8 = getelementptr inbounds %SArray.68, %SArray.68* %1, i64 0, i32 0, i64 3
%9 = load double, double* %8, align 8
%10 = fadd fast double %7, %3
%11 = fadd fast double %9, %5
%12 = fsub fast double %5, %9
%13 = fadd fast double %11, %10
%14 = insertelement <2 x double> undef, double %3, i32 0
%15 = insertelement <2 x double> %14, double -0.000000e+00, i32 1
%16 = insertelement <2 x double> undef, double %7, i32 0
%17 = insertelement <2 x double> %16, double %12, i32 1
%18 = fsub fast <2 x double> %15, %17
%19 = fsub fast double %10, %11
%.sroa.016.sroa.0.0..sroa.016.0..sroa_cast.sroa_idx = getelementptr inbounds %SArray.73, %SArray.73* %0, i64 0, i32 0, i64 0, i32 0
store double %13, double* %.sroa.016.sroa.0.0..sroa.016.0..sroa_cast.sroa_idx, align 8
%.sroa.016.sroa.2.0..sroa.016.0..sroa_cast.sroa_idx28 = getelementptr inbounds %SArray.73, %SArray.73* %0, i64 0, i32 0, i64 0, i32 1
store double 0.000000e+00, double* %.sroa.016.sroa.2.0..sroa.016.0..sroa_cast.sroa_idx28, align 8
%.sroa.016.sroa.3.0..sroa.016.0..sroa_cast.sroa_idx29 = getelementptr inbounds %SArray.73, %SArray.73* %0, i64 0, i32 0, i64 1, i32 0
%20 = bitcast double* %.sroa.016.sroa.3.0..sroa.016.0..sroa_cast.sroa_idx29 to <2 x double>*
store <2 x double> %18, <2 x double>* %20, align 8
%.sroa.016.sroa.5.0..sroa.016.0..sroa_cast.sroa_idx31 = getelementptr inbounds %SArray.73, %SArray.73* %0, i64 0, i32 0, i64 2, i32 0
store double %19, double* %.sroa.016.sroa.5.0..sroa.016.0..sroa_cast.sroa_idx31, align 8
%.sroa.016.sroa.6.0..sroa.016.0..sroa_cast.sroa_idx32 = getelementptr inbounds %SArray.73, %SArray.73* %0, i64 0, i32 0, i64 2, i32 1
store double 0.000000e+00, double* %.sroa.016.sroa.6.0..sroa.016.0..sroa_cast.sroa_idx32, align 8
%.sroa.016.sroa.7.0..sroa.016.0..sroa_cast.sroa_idx33 = getelementptr inbounds %SArray.73, %SArray.73* %0, i64 0, i32 0, i64 3, i32 0
%21 = extractelement <2 x double> %18, i32 0
store double %21, double* %.sroa.016.sroa.7.0..sroa.016.0..sroa_cast.sroa_idx33, align 8
%.sroa.016.sroa.8.0..sroa.016.0..sroa_cast.sroa_idx34 = getelementptr inbounds %SArray.73, %SArray.73* %0, i64 0, i32 0, i64 3, i32 1
store double %12, double* %.sroa.016.sroa.8.0..sroa.016.0..sroa_cast.sroa_idx34, align 8
ret void
}
julia> x9 = @SVector randn(9);
julia> dft9 = MatrixDFT(Val(9));
julia> @code_llvm dft9 * x9
define void @"julia_*_63857"(%SArray.87* noalias nocapture sret, %SArray.83* nocapture readonly dereferenceable(1296), %SArray.82* nocapture readonly dereferenceable(72)) #0 !dbg !5 {
top:
%3 = getelementptr inbounds %SArray.82, %SArray.82* %2, i64 0, i32 0, i64 0
%4 = getelementptr inbounds %SArray.82, %SArray.82* %2, i64 0, i32 0, i64 1
%5 = load double, double* %3, align 8
%6 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 9, i32 0
%7 = load double, double* %4, align 8
%8 = getelementptr inbounds %SArray.82, %SArray.82* %2, i64 0, i32 0, i64 2
%9 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 18, i32 0
%10 = load double, double* %8, align 8
%11 = getelementptr inbounds %SArray.82, %SArray.82* %2, i64 0, i32 0, i64 3
%12 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 27, i32 0
%13 = load double, double* %11, align 8
%14 = getelementptr inbounds %SArray.82, %SArray.82* %2, i64 0, i32 0, i64 4
%15 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 36, i32 0
%16 = load double, double* %14, align 8
%17 = getelementptr inbounds %SArray.82, %SArray.82* %2, i64 0, i32 0, i64 5
%18 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 45, i32 0
%19 = load double, double* %17, align 8
%20 = getelementptr inbounds %SArray.82, %SArray.82* %2, i64 0, i32 0, i64 6
%21 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 54, i32 0
%22 = load double, double* %20, align 8
%23 = getelementptr inbounds %SArray.82, %SArray.82* %2, i64 0, i32 0, i64 7
%24 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 63, i32 0
%25 = load double, double* %23, align 8
%26 = getelementptr inbounds %SArray.82, %SArray.82* %2, i64 0, i32 0, i64 8
%27 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 72, i32 0
%28 = load double, double* %26, align 8
%29 = bitcast %SArray.83* %1 to <4 x double>*
%30 = load <4 x double>, <4 x double>* %29, align 8
%31 = insertelement <4 x double> undef, double %5, i32 0
%32 = insertelement <4 x double> %31, double %5, i32 1
%33 = insertelement <4 x double> %32, double %5, i32 2
%34 = insertelement <4 x double> %33, double %5, i32 3
%35 = fmul <4 x double> %34, %30
%36 = bitcast double* %6 to <4 x double>*
%37 = load <4 x double>, <4 x double>* %36, align 8
%38 = insertelement <4 x double> undef, double %7, i32 0
%39 = insertelement <4 x double> %38, double %7, i32 1
%40 = insertelement <4 x double> %39, double %7, i32 2
%41 = insertelement <4 x double> %40, double %7, i32 3
%42 = fmul <4 x double> %41, %37
%43 = fadd <4 x double> %35, %42
%44 = bitcast double* %9 to <4 x double>*
%45 = load <4 x double>, <4 x double>* %44, align 8
%46 = insertelement <4 x double> undef, double %10, i32 0
%47 = insertelement <4 x double> %46, double %10, i32 1
%48 = insertelement <4 x double> %47, double %10, i32 2
%49 = insertelement <4 x double> %48, double %10, i32 3
%50 = fmul <4 x double> %49, %45
%51 = fadd <4 x double> %43, %50
%52 = bitcast double* %12 to <4 x double>*
%53 = load <4 x double>, <4 x double>* %52, align 8
%54 = insertelement <4 x double> undef, double %13, i32 0
%55 = insertelement <4 x double> %54, double %13, i32 1
%56 = insertelement <4 x double> %55, double %13, i32 2
%57 = insertelement <4 x double> %56, double %13, i32 3
%58 = fmul <4 x double> %57, %53
%59 = fadd <4 x double> %51, %58
%60 = bitcast double* %15 to <4 x double>*
%61 = load <4 x double>, <4 x double>* %60, align 8
%62 = insertelement <4 x double> undef, double %16, i32 0
%63 = insertelement <4 x double> %62, double %16, i32 1
%64 = insertelement <4 x double> %63, double %16, i32 2
%65 = insertelement <4 x double> %64, double %16, i32 3
%66 = fmul <4 x double> %65, %61
%67 = fadd <4 x double> %59, %66
%68 = bitcast double* %18 to <4 x double>*
%69 = load <4 x double>, <4 x double>* %68, align 8
%70 = insertelement <4 x double> undef, double %19, i32 0
%71 = insertelement <4 x double> %70, double %19, i32 1
%72 = insertelement <4 x double> %71, double %19, i32 2
%73 = insertelement <4 x double> %72, double %19, i32 3
%74 = fmul <4 x double> %73, %69
%75 = fadd <4 x double> %67, %74
%76 = bitcast double* %21 to <4 x double>*
%77 = load <4 x double>, <4 x double>* %76, align 8
%78 = insertelement <4 x double> undef, double %22, i32 0
%79 = insertelement <4 x double> %78, double %22, i32 1
%80 = insertelement <4 x double> %79, double %22, i32 2
%81 = insertelement <4 x double> %80, double %22, i32 3
%82 = fmul <4 x double> %81, %77
%83 = fadd <4 x double> %75, %82
%84 = bitcast double* %24 to <4 x double>*
%85 = load <4 x double>, <4 x double>* %84, align 8
%86 = insertelement <4 x double> undef, double %25, i32 0
%87 = insertelement <4 x double> %86, double %25, i32 1
%88 = insertelement <4 x double> %87, double %25, i32 2
%89 = insertelement <4 x double> %88, double %25, i32 3
%90 = fmul <4 x double> %89, %85
%91 = fadd <4 x double> %83, %90
%92 = bitcast double* %27 to <4 x double>*
%93 = load <4 x double>, <4 x double>* %92, align 8
%94 = insertelement <4 x double> undef, double %28, i32 0
%95 = insertelement <4 x double> %94, double %28, i32 1
%96 = insertelement <4 x double> %95, double %28, i32 2
%97 = insertelement <4 x double> %96, double %28, i32 3
%98 = fmul <4 x double> %97, %93
%99 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 2, i32 0
%100 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 11, i32 0
%101 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 20, i32 0
%102 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 29, i32 0
%103 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 38, i32 0
%104 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 47, i32 0
%105 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 56, i32 0
%106 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 65, i32 0
%107 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 74, i32 0
%108 = bitcast double* %99 to <4 x double>*
%109 = load <4 x double>, <4 x double>* %108, align 8
%110 = fmul <4 x double> %34, %109
%111 = bitcast double* %100 to <4 x double>*
%112 = load <4 x double>, <4 x double>* %111, align 8
%113 = fmul <4 x double> %41, %112
%114 = fadd <4 x double> %110, %113
%115 = bitcast double* %101 to <4 x double>*
%116 = load <4 x double>, <4 x double>* %115, align 8
%117 = fmul <4 x double> %49, %116
%118 = fadd <4 x double> %114, %117
%119 = bitcast double* %102 to <4 x double>*
%120 = load <4 x double>, <4 x double>* %119, align 8
%121 = fmul <4 x double> %57, %120
%122 = fadd <4 x double> %118, %121
%123 = bitcast double* %103 to <4 x double>*
%124 = load <4 x double>, <4 x double>* %123, align 8
%125 = fmul <4 x double> %65, %124
%126 = fadd <4 x double> %122, %125
%127 = bitcast double* %104 to <4 x double>*
%128 = load <4 x double>, <4 x double>* %127, align 8
%129 = fmul <4 x double> %73, %128
%130 = fadd <4 x double> %126, %129
%131 = bitcast double* %105 to <4 x double>*
%132 = load <4 x double>, <4 x double>* %131, align 8
%133 = fmul <4 x double> %81, %132
%134 = fadd <4 x double> %130, %133
%135 = bitcast double* %106 to <4 x double>*
%136 = load <4 x double>, <4 x double>* %135, align 8
%137 = fmul <4 x double> %89, %136
%138 = fadd <4 x double> %134, %137
%139 = bitcast double* %107 to <4 x double>*
%140 = load <4 x double>, <4 x double>* %139, align 8
%141 = fmul <4 x double> %97, %140
%142 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 4, i32 0
%143 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 13, i32 0
%144 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 22, i32 0
%145 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 31, i32 0
%146 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 40, i32 0
%147 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 49, i32 0
%148 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 58, i32 0
%149 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 67, i32 0
%150 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 76, i32 0
%151 = bitcast double* %142 to <4 x double>*
%152 = load <4 x double>, <4 x double>* %151, align 8
%153 = fmul <4 x double> %34, %152
%154 = bitcast double* %143 to <4 x double>*
%155 = load <4 x double>, <4 x double>* %154, align 8
%156 = fmul <4 x double> %41, %155
%157 = fadd <4 x double> %153, %156
%158 = bitcast double* %144 to <4 x double>*
%159 = load <4 x double>, <4 x double>* %158, align 8
%160 = fmul <4 x double> %49, %159
%161 = fadd <4 x double> %157, %160
%162 = bitcast double* %145 to <4 x double>*
%163 = load <4 x double>, <4 x double>* %162, align 8
%164 = fmul <4 x double> %57, %163
%165 = fadd <4 x double> %161, %164
%166 = bitcast double* %146 to <4 x double>*
%167 = load <4 x double>, <4 x double>* %166, align 8
%168 = fmul <4 x double> %65, %167
%169 = fadd <4 x double> %165, %168
%170 = bitcast double* %147 to <4 x double>*
%171 = load <4 x double>, <4 x double>* %170, align 8
%172 = fmul <4 x double> %73, %171
%173 = fadd <4 x double> %169, %172
%174 = bitcast double* %148 to <4 x double>*
%175 = load <4 x double>, <4 x double>* %174, align 8
%176 = fmul <4 x double> %81, %175
%177 = fadd <4 x double> %173, %176
%178 = bitcast double* %149 to <4 x double>*
%179 = load <4 x double>, <4 x double>* %178, align 8
%180 = fmul <4 x double> %89, %179
%181 = fadd <4 x double> %177, %180
%182 = bitcast double* %150 to <4 x double>*
%183 = load <4 x double>, <4 x double>* %182, align 8
%184 = fmul <4 x double> %97, %183
%185 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 6, i32 0
%186 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 15, i32 0
%187 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 24, i32 0
%188 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 33, i32 0
%189 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 42, i32 0
%190 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 51, i32 0
%191 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 60, i32 0
%192 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 69, i32 0
%193 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 78, i32 0
%194 = bitcast double* %185 to <4 x double>*
%195 = load <4 x double>, <4 x double>* %194, align 8
%196 = fmul <4 x double> %34, %195
%197 = bitcast double* %186 to <4 x double>*
%198 = load <4 x double>, <4 x double>* %197, align 8
%199 = fmul <4 x double> %41, %198
%200 = fadd <4 x double> %196, %199
%201 = bitcast double* %187 to <4 x double>*
%202 = load <4 x double>, <4 x double>* %201, align 8
%203 = fmul <4 x double> %49, %202
%204 = fadd <4 x double> %200, %203
%205 = bitcast double* %188 to <4 x double>*
%206 = load <4 x double>, <4 x double>* %205, align 8
%207 = fmul <4 x double> %57, %206
%208 = fadd <4 x double> %204, %207
%209 = bitcast double* %189 to <4 x double>*
%210 = load <4 x double>, <4 x double>* %209, align 8
%211 = fmul <4 x double> %65, %210
%212 = fadd <4 x double> %208, %211
%213 = bitcast double* %190 to <4 x double>*
%214 = load <4 x double>, <4 x double>* %213, align 8
%215 = fmul <4 x double> %73, %214
%216 = fadd <4 x double> %212, %215
%217 = bitcast double* %191 to <4 x double>*
%218 = load <4 x double>, <4 x double>* %217, align 8
%219 = fmul <4 x double> %81, %218
%220 = fadd <4 x double> %216, %219
%221 = bitcast double* %192 to <4 x double>*
%222 = load <4 x double>, <4 x double>* %221, align 8
%223 = fmul <4 x double> %89, %222
%224 = fadd <4 x double> %220, %223
%225 = bitcast double* %193 to <4 x double>*
%226 = load <4 x double>, <4 x double>* %225, align 8
%227 = fmul <4 x double> %97, %226
%228 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 8, i32 0
%229 = bitcast double* %228 to <2 x double>*
%230 = load <2 x double>, <2 x double>* %229, align 8
%231 = insertelement <2 x double> undef, double %5, i32 0
%232 = insertelement <2 x double> %231, double %5, i32 1
%233 = fmul <2 x double> %232, %230
%234 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 17, i32 0
%235 = bitcast double* %234 to <2 x double>*
%236 = load <2 x double>, <2 x double>* %235, align 8
%237 = insertelement <2 x double> undef, double %7, i32 0
%238 = insertelement <2 x double> %237, double %7, i32 1
%239 = fmul <2 x double> %238, %236
%240 = fadd <2 x double> %233, %239
%241 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 26, i32 0
%242 = bitcast double* %241 to <2 x double>*
%243 = load <2 x double>, <2 x double>* %242, align 8
%244 = insertelement <2 x double> undef, double %10, i32 0
%245 = insertelement <2 x double> %244, double %10, i32 1
%246 = fmul <2 x double> %245, %243
%247 = fadd <2 x double> %240, %246
%248 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 35, i32 0
%249 = bitcast double* %248 to <2 x double>*
%250 = load <2 x double>, <2 x double>* %249, align 8
%251 = insertelement <2 x double> undef, double %13, i32 0
%252 = insertelement <2 x double> %251, double %13, i32 1
%253 = fmul <2 x double> %252, %250
%254 = fadd <2 x double> %247, %253
%255 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 44, i32 0
%256 = bitcast double* %255 to <2 x double>*
%257 = load <2 x double>, <2 x double>* %256, align 8
%258 = insertelement <2 x double> undef, double %16, i32 0
%259 = insertelement <2 x double> %258, double %16, i32 1
%260 = fmul <2 x double> %259, %257
%261 = fadd <2 x double> %254, %260
%262 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 53, i32 0
%263 = bitcast double* %262 to <2 x double>*
%264 = load <2 x double>, <2 x double>* %263, align 8
%265 = insertelement <2 x double> undef, double %19, i32 0
%266 = insertelement <2 x double> %265, double %19, i32 1
%267 = fmul <2 x double> %266, %264
%268 = fadd <2 x double> %261, %267
%269 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 62, i32 0
%270 = bitcast double* %269 to <2 x double>*
%271 = load <2 x double>, <2 x double>* %270, align 8
%272 = insertelement <2 x double> undef, double %22, i32 0
%273 = insertelement <2 x double> %272, double %22, i32 1
%274 = fmul <2 x double> %273, %271
%275 = fadd <2 x double> %268, %274
%276 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 71, i32 0
%277 = bitcast double* %276 to <2 x double>*
%278 = load <2 x double>, <2 x double>* %277, align 8
%279 = insertelement <2 x double> undef, double %25, i32 0
%280 = insertelement <2 x double> %279, double %25, i32 1
%281 = fmul <2 x double> %280, %278
%282 = fadd <2 x double> %275, %281
%283 = getelementptr inbounds %SArray.83, %SArray.83* %1, i64 0, i32 0, i64 80, i32 0
%284 = bitcast double* %283 to <2 x double>*
%285 = load <2 x double>, <2 x double>* %284, align 8
%286 = insertelement <2 x double> undef, double %28, i32 0
%287 = insertelement <2 x double> %286, double %28, i32 1
%288 = fmul <2 x double> %287, %285
%289 = fadd <4 x double> %91, %98
%290 = fadd <4 x double> %138, %141
%291 = fadd <4 x double> %181, %184
%292 = fadd <4 x double> %224, %227
%293 = fadd <2 x double> %282, %288
%294 = bitcast %SArray.87* %0 to <4 x double>*
store <4 x double> %289, <4 x double>* %294, align 8
%"#temp#.sroa.0.sroa.4.sroa.0.0.#temp#.sroa.0.sroa.4.0.#temp#.sroa.0.0..sroa_cast1.sroa_cast.sroa_idx" = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 2, i32 0
%295 = bitcast double* %"#temp#.sroa.0.sroa.4.sroa.0.0.#temp#.sroa.0.sroa.4.0.#temp#.sroa.0.0..sroa_cast1.sroa_cast.sroa_idx" to <4 x double>*
store <4 x double> %290, <4 x double>* %295, align 8
%"#temp#.sroa.0.sroa.6.sroa.0.0.#temp#.sroa.0.sroa.6.0.#temp#.sroa.0.0..sroa_cast1.sroa_cast.sroa_idx" = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 4, i32 0
%296 = bitcast double* %"#temp#.sroa.0.sroa.6.sroa.0.0.#temp#.sroa.0.sroa.6.0.#temp#.sroa.0.0..sroa_cast1.sroa_cast.sroa_idx" to <4 x double>*
store <4 x double> %291, <4 x double>* %296, align 8
%"#temp#.sroa.0.sroa.8.sroa.0.0.#temp#.sroa.0.sroa.8.0.#temp#.sroa.0.0..sroa_cast1.sroa_cast.sroa_idx" = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 6, i32 0
%297 = bitcast double* %"#temp#.sroa.0.sroa.8.sroa.0.0.#temp#.sroa.0.sroa.8.0.#temp#.sroa.0.0..sroa_cast1.sroa_cast.sroa_idx" to <4 x double>*
store <4 x double> %292, <4 x double>* %297, align 8
%"#temp#.sroa.0.sroa.10.sroa.0.0.#temp#.sroa.0.sroa.10.0.#temp#.sroa.0.0..sroa_cast1.sroa_cast.sroa_idx" = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 8, i32 0
%298 = bitcast double* %"#temp#.sroa.0.sroa.10.sroa.0.0.#temp#.sroa.0.sroa.10.0.#temp#.sroa.0.0..sroa_cast1.sroa_cast.sroa_idx" to <2 x double>*
store <2 x double> %293, <2 x double>* %298, align 8
ret void
}
julia> @code_llvm fft(x9)
define void @julia_fft_63737(%SArray.87* noalias nocapture sret, %SArray.82* nocapture readonly dereferenceable(72)) #0 !dbg !5 {
top:
%2 = getelementptr inbounds %SArray.82, %SArray.82* %1, i64 0, i32 0, i64 0
%3 = load double, double* %2, align 8
%4 = getelementptr inbounds %SArray.82, %SArray.82* %1, i64 0, i32 0, i64 1
%5 = load double, double* %4, align 8
%6 = getelementptr inbounds %SArray.82, %SArray.82* %1, i64 0, i32 0, i64 2
%7 = load double, double* %6, align 8
%8 = getelementptr inbounds %SArray.82, %SArray.82* %1, i64 0, i32 0, i64 3
%9 = load double, double* %8, align 8
%10 = getelementptr inbounds %SArray.82, %SArray.82* %1, i64 0, i32 0, i64 4
%11 = load double, double* %10, align 8
%12 = getelementptr inbounds %SArray.82, %SArray.82* %1, i64 0, i32 0, i64 5
%13 = load double, double* %12, align 8
%14 = getelementptr inbounds %SArray.82, %SArray.82* %1, i64 0, i32 0, i64 6
%15 = load double, double* %14, align 8
%16 = getelementptr inbounds %SArray.82, %SArray.82* %1, i64 0, i32 0, i64 7
%17 = load double, double* %16, align 8
%18 = getelementptr inbounds %SArray.82, %SArray.82* %1, i64 0, i32 0, i64 8
%19 = load double, double* %18, align 8
%20 = fadd fast double %9, %3
%21 = fadd fast double %20, %15
%22 = fmul fast double %9, 5.000000e-01
%23 = fsub fast double %3, %22
%.neg355 = fmul fast double %15, -5.000000e-01
%24 = fadd fast double %.neg355, %23
%25 = fsub fast double %15, %9
%26 = fmul fast double %25, 0x3FEBB67AE8584CAB
%27 = fsub fast double %9, %15
%28 = fmul fast double %27, 0x3FEBB67AE8584CAB
%29 = fadd fast double %11, %5
%30 = fadd fast double %29, %17
%31 = fmul fast double %11, 5.000000e-01
%32 = fsub fast double %5, %31
%.neg353 = fmul fast double %17, -5.000000e-01
%33 = fadd fast double %.neg353, %32
%34 = fsub fast double %17, %11
%35 = fsub fast double %11, %17
%36 = fadd fast double %13, %7
%37 = fadd fast double %36, %19
%38 = fmul fast double %13, 5.000000e-01
%39 = fsub fast double %7, %38
%.neg351 = fmul fast double %19, -5.000000e-01
%40 = fadd fast double %.neg351, %39
%41 = fsub fast double %19, %13
%42 = fsub fast double %13, %19
%43 = fadd fast double %30, %21
%44 = fadd fast double %43, %37
%45 = fmul fast double %34, 0x3FE1D03E70EA84FF
%46 = fmul fast double %34, 0x3FE53AAFE3631109
%.neg = fmul fast double %33, 0xBFE491B7523C161C
%47 = fmul fast double %41, 0x3FEB4AB2F290BEAB
%48 = fmul fast double %41, 0x3FC33FC62FEE2044
%.neg333 = fmul fast double %40, 0xBFEF838B8C811C17
%49 = fadd fast double %45, %24
%50 = insertelement <2 x double> <double 0x3FE8836FA2CF5039, double undef>, double %35, i32 1
%51 = insertelement <2 x double> undef, double %33, i32 0
%52 = insertelement <2 x double> %51, double 0x3FE53AAFE3631109, i32 1
%53 = fmul fast <2 x double> %50, %52
%54 = extractelement <2 x double> %53, i32 0
%55 = fadd fast double %49, %54
%56 = fadd fast double %55, %47
%57 = insertelement <2 x double> <double 0x3FC63A1A7E0B7389, double undef>, double %42, i32 1
%58 = insertelement <2 x double> undef, double %40, i32 0
%59 = insertelement <2 x double> %58, double 0x3FC33FC62FEE2044, i32 1
%60 = fmul fast <2 x double> %57, %59
%61 = extractelement <2 x double> %60, i32 0
%62 = fadd fast double %56, %61
%63 = fadd fast double %.neg, %26
%64 = fadd fast double %63, %46
%65 = fadd fast double %64, %.neg333
%66 = fadd fast double %65, %48
%67 = fmul fast double %33, 0x3FC63A1A7E0B7389
%68 = fmul fast double %35, 0x3FEB4AB2F290BEAB
%69 = fmul fast double %35, 0x3FC33FC62FEE2044
%.neg334 = fmul fast double %33, 0xBFEF838B8C811C17
%70 = fmul fast double %40, 0x3FEE11F642522D1C
%71 = fmul fast double %42, 0x3FD2F4E9034C7357
%72 = fmul fast double %40, 0x3FD5E3A8748A0BF5
%.neg335 = fmul fast double %42, 0xBFEA0AA16F5E991B
%73 = fadd fast double %68, %24
%74 = fadd fast double %73, %67
%75 = fadd fast double %74, %71
%76 = fsub fast double %75, %70
%77 = fadd fast double %28, %.neg334
%78 = fadd fast double %69, %77
%79 = fadd fast double %78, %.neg335
%80 = fsub fast double %79, %72
%.neg336 = fmul fast double %30, -5.000000e-01
%81 = fadd fast double %21, %.neg336
%82 = fmul fast double %37, 5.000000e-01
%83 = fsub fast double %81, %82
%84 = fsub fast double %37, %30
%85 = fmul fast double %84, 0x3FEBB67AE8584CAB
%86 = fmul fast double %33, 0x3FEE11F642522D1C
%87 = fmul fast double %34, 0x3FD2F4E9034C7357
%88 = fmul fast double %33, 0x3FD5E3A8748A0BF5
%.neg337 = fmul fast double %34, 0xBFEA0AA16F5E991B
%.neg338 = fmul fast double %41, 0xBFE1D03E70EA84FF
%89 = fmul fast double %40, 0x3FE491B7523C161C
%90 = fadd fast double %87, %24
%91 = fsub fast double %.neg338, %86
%92 = fadd fast double %.neg337, %26
%93 = fsub fast double %89, %88
%94 = fmul fast double %33, 0xBFEE11F642522D1C
%.neg339 = fmul fast double %35, 0xBFD2F4E9034C7357
%.neg340 = fmul fast double %35, 0xBFEA0AA16F5E991B
%95 = fmul fast double %42, 0x3FE53AAFE3631109
%96 = insertelement <4 x double> <double 0x3FE8836FA2CF5039, double undef, double undef, double undef>, double %41, i32 1
%97 = insertelement <4 x double> %96, double %42, i32 2
%98 = insertelement <4 x double> %97, double %40, i32 3
%99 = insertelement <4 x double> undef, double %40, i32 0
%100 = insertelement <4 x double> %99, double 0x3FE53AAFE3631109, i32 1
%101 = insertelement <4 x double> %100, double 0x3FE1D03E70EA84FF, i32 2
%102 = insertelement <4 x double> %101, double 0xBFE491B7523C161C, i32 3
%103 = fmul fast <4 x double> %98, %102
%104 = fadd fast double %.neg339, %24
%105 = fadd fast double %104, %94
%106 = fadd fast double %.neg340, %28
%107 = fadd fast double %106, %88
%108 = insertelement <4 x double> undef, double %91, i32 0
%109 = insertelement <4 x double> %108, double %93, i32 1
%110 = insertelement <4 x double> %109, double %105, i32 2
%111 = insertelement <4 x double> %110, double %107, i32 3
%112 = fadd fast <4 x double> %111, %103
%113 = insertelement <4 x double> undef, double %90, i32 0
%114 = insertelement <4 x double> %113, double %92, i32 1
%115 = extractelement <4 x double> %103, i32 0
%116 = insertelement <4 x double> %114, double %115, i32 2
%117 = insertelement <4 x double> %116, double %95, i32 3
%118 = fadd fast <4 x double> %112, %117
%119 = fsub fast double %30, %37
%120 = fmul fast double %119, 0x3FEBB67AE8584CAB
%.neg343 = fmul fast double %34, 0xBFEB4AB2F290BEAB
%121 = fmul fast double %34, 0x3FC33FC62FEE2044
%122 = fmul fast double %33, 0x3FEF838B8C811C17
%123 = fmul fast double %40, 0xBFEE11F642522D1C
%.neg344 = fmul fast double %41, 0xBFD2F4E9034C7357
%.neg345 = fmul fast double %41, 0xBFEA0AA16F5E991B
%124 = fadd fast double %.neg343, %24
%125 = fadd fast double %124, %67
%126 = fadd fast double %125, %.neg344
%127 = fadd fast double %126, %123
%128 = fadd fast double %122, %26
%129 = fadd fast double %128, %121
%130 = fadd fast double %129, %.neg345
%131 = fadd fast double %130, %72
%132 = insertelement <2 x double> <double 0xBFE1D03E70EA84FF, double undef>, double %33, i32 1
%133 = insertelement <2 x double> undef, double %35, i32 0
%134 = insertelement <2 x double> %133, double 0x3FE491B7523C161C, i32 1
%135 = fmul fast <2 x double> %132, %134
%136 = insertelement <2 x double> <double 0xBFEB4AB2F290BEAB, double undef>, double %40, i32 1
%137 = insertelement <2 x double> undef, double %42, i32 0
%138 = insertelement <2 x double> %137, double 0x3FEF838B8C811C17, i32 1
%139 = fmul fast <2 x double> %136, %138
%140 = insertelement <2 x double> undef, double %24, i32 0
%141 = insertelement <2 x double> %140, double %28, i32 1
%142 = fadd fast <2 x double> %135, %141
%143 = fadd fast <2 x double> %142, %53
%144 = fadd fast <2 x double> %143, %139
%145 = fadd fast <2 x double> %144, %60
%.sroa.0294.sroa.0.0..sroa.0294.0..sroa_cast.sroa_idx = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 0, i32 0
store double %44, double* %.sroa.0294.sroa.0.0..sroa.0294.0..sroa_cast.sroa_idx, align 8
%.sroa.0294.sroa.2.0..sroa.0294.0..sroa_cast.sroa_idx316 = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 0, i32 1
store double 0.000000e+00, double* %.sroa.0294.sroa.2.0..sroa.0294.0..sroa_cast.sroa_idx316, align 8
%.sroa.0294.sroa.3.0..sroa.0294.0..sroa_cast.sroa_idx317 = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 1, i32 0
store double %62, double* %.sroa.0294.sroa.3.0..sroa.0294.0..sroa_cast.sroa_idx317, align 8
%.sroa.0294.sroa.4.0..sroa.0294.0..sroa_cast.sroa_idx318 = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 1, i32 1
store double %66, double* %.sroa.0294.sroa.4.0..sroa.0294.0..sroa_cast.sroa_idx318, align 8
%.sroa.0294.sroa.5.0..sroa.0294.0..sroa_cast.sroa_idx319 = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 2, i32 0
store double %76, double* %.sroa.0294.sroa.5.0..sroa.0294.0..sroa_cast.sroa_idx319, align 8
%.sroa.0294.sroa.6.0..sroa.0294.0..sroa_cast.sroa_idx320 = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 2, i32 1
store double %80, double* %.sroa.0294.sroa.6.0..sroa.0294.0..sroa_cast.sroa_idx320, align 8
%.sroa.0294.sroa.7.0..sroa.0294.0..sroa_cast.sroa_idx321 = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 3, i32 0
store double %83, double* %.sroa.0294.sroa.7.0..sroa.0294.0..sroa_cast.sroa_idx321, align 8
%.sroa.0294.sroa.8.0..sroa.0294.0..sroa_cast.sroa_idx322 = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 3, i32 1
store double %85, double* %.sroa.0294.sroa.8.0..sroa.0294.0..sroa_cast.sroa_idx322, align 8
%.sroa.0294.sroa.9.0..sroa.0294.0..sroa_cast.sroa_idx323 = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 4, i32 0
%146 = bitcast double* %.sroa.0294.sroa.9.0..sroa.0294.0..sroa_cast.sroa_idx323 to <4 x double>*
store <4 x double> %118, <4 x double>* %146, align 8
%.sroa.0294.sroa.13.0..sroa.0294.0..sroa_cast.sroa_idx327 = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 6, i32 0
store double %83, double* %.sroa.0294.sroa.13.0..sroa.0294.0..sroa_cast.sroa_idx327, align 8
%.sroa.0294.sroa.14.0..sroa.0294.0..sroa_cast.sroa_idx328 = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 6, i32 1
store double %120, double* %.sroa.0294.sroa.14.0..sroa.0294.0..sroa_cast.sroa_idx328, align 8
%.sroa.0294.sroa.15.0..sroa.0294.0..sroa_cast.sroa_idx329 = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 7, i32 0
store double %127, double* %.sroa.0294.sroa.15.0..sroa.0294.0..sroa_cast.sroa_idx329, align 8
%.sroa.0294.sroa.16.0..sroa.0294.0..sroa_cast.sroa_idx330 = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 7, i32 1
store double %131, double* %.sroa.0294.sroa.16.0..sroa.0294.0..sroa_cast.sroa_idx330, align 8
%.sroa.0294.sroa.17.0..sroa.0294.0..sroa_cast.sroa_idx331 = getelementptr inbounds %SArray.87, %SArray.87* %0, i64 0, i32 0, i64 8, i32 0
%147 = bitcast double* %.sroa.0294.sroa.17.0..sroa.0294.0..sroa_cast.sroa_idx331 to <2 x double>*
store <2 x double> %145, <2 x double>* %147, align 8
ret void
} Is there some trick that would help? Would be good to know before I bother seriously consider splitting up the numbers into their real and complex components.
Ideally, I would have liked for this to stay optimal into the ~64 range. [Also, GPUs don't have SIMD instructions, but ideally it would have great CPU performance too.]
I don't think this is an issue anywhere -- do you have something specific in mind?
Definitely a lot of literature there that I'm unfamiliar with, and optimizations that go into something like FFTW.
Ha ha. That name comes up often on Wikiedia's FFT-related pages! |
In case it is helpful, I wrote a static-size FFT generator in Julia a while back: JuliaLang/julia#6193 |
I suppose this is the proper channel? (There's also a #staticarrays channel on Slack.) I'm happy that you tried for a PR here, but given the changes to |
Yes, there's no "proper" channel as such, the community is mostly loosely coupled in the package ecosystem. It's pretty easy to create a new package - I'd definitely encourage it for fft work as it should be pretty decoupled from the core StaticArrays functionality. Thanks for your big description of the use case, this sounds really interesting and a good use for statically sized transformations if you're going to stick to several fixed sizes.
This sounds odd. It's probably due to
The compiler shouldn't need any help with this - LLVM optimization passes do a great job of reasoning about the lifetime of values which need to be retained, optimizing the order of operations to keep intermediate results in registers rather than storing them on the stack, reusing stack slots to keep the frame small, etc. What you can do to help is generate straight line code which is easy for the optimizer to reason about, which I think you're already doing. If I had to guess I'd say the naive DFT only wins in some cases because it maps very nicely to the hardware and has few data dependencies (potentially less pipeline stalls). Data dependencies don't look like they'd be a problem for your bigger unrolled FFTs because there's a lot of arithmetic to do between each stage. In general another thing you can do to help the compiler is logical rearrangement of operations which are equivalent for real numbers but forbidden by IEEE floating point semantics. Or simply transformations which the compiler can't discover. I see you've already got fastmath in there so that may allow some of these anyway.
Type stability probably shouldn't be a problem for unrolled loops - even if a variable would change type in the equivalent looped version - as each variable will have a well defined type at each point in the unrolled source. But I'm not sure the compiler deals with all such cases properly in 0.6. You might get some value out of using reals where possible in the initial stages, I'd say it's worth trying. By the way, for expression generation the neatest technique I've seen contributed to static arrays is the code in |
Can I just ask, whether it would be possible to put the fft efforts presented here in a separate package? I think there would be much appreciation for a pure Julia fft implementation that therefore also supports non-Float32/Float64 data types. I would be very interested to run a rfft/irfft with Float16,Bfloat16 and other low precision formats. |
ωcos(N, i) = cospi(-2i/N) | ||
ωsin(N, i) = sinpi(-2i/N) | ||
|
||
ω(N, i) = ωcos(N, i) + im*ωsin(N, i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the generated code take advantage of the many trig factors that are +/-1 or +/-i?
I'm torn on this. On the one hand, we should close it because Static FFTs really deserve their own package and this PR is incomplete. On the other hand, it's a cool idea and a neat prototype. On balance I think I should close this issue because |
work on performance.
Just wanted this to be here for review / criticism / etc.
The style probably needs a lot of cleaning up.
"fft.jl" probably shouldn't be the first test.
Etc.
Besides style, critical features to add before this gets accepted:
sin
would evaluate to 0, use a real instead of a complex number. If not going full-split, I'd have to watch out for type stability when doing label-swapping on each iteration.**Do I have to swap labels? That is, how smart is the compiler about abandoning / clearing out memory that is no longer used? Right now I'm repeatedly overwriting the same sets (which means I can't switch from real to complex) because I don't want to clutter caches with junk.
Anyway, a basic overview: I just copied the Cooley Tukey algorithm from Wikipedia, minus the bit twiddling.
The vast majority of the functions just work to build up an expression.
The input is a vector (or tuple -- straightforward to allow) whose size, N, is known at compile time.
First, each element of the input is extracted into a variable named "x_1", "x_2", etc, corresponding to the index via
Base.Cartesian.@nextract
.Then, based on how it factors, we organize each of these into mini prime ffts. For example, when N=30, this is the generated expression:
Imagine drawing a tree, following the recursive definition from the largest to smallest prime factors.
That is, 5 * 3 * 2 -- so we start with the smallest (2x 2ffts) of variables (5x3) apart.
From there, I simplified things by reorganizing the indices to place ffts next to each other. Each iteration of the assembler contracts the number of ffts (but they grow longer by the same margin), and then building the output vector in the end.
The reason for all the
+=
is because the functions will otherwise allocate memory and slow down dramatically. Splitting them up into several lines solves that problem.iffts are basically the same thing, just negative of
exp
's exponents (orsin
/cos
's parameters) and a scaling factor.Performance of the algorithm may need general improvement. I haven't tried it much beyond very small sizes, but it does well at small sizes like
SVectors
promise.But it is pure Julia, and has all the "just works" examples that come with it (especially if it is also defined to work on NTuples).