
Initial implementation of butterfly FFT with various ways of indexing #6

Closed
wants to merge 2 commits

Conversation

@daanhb commented Feb 26, 2016

This test is a comparison of several naive implementations of the recursive butterfly algorithm for the FFT. Large numbers of views are created in the recursion. Currently, the cost of the actual computations seems to dominate the overhead of creating views; whether or not data is copied does not seem to matter much.

The example does show that views allocate memory (quite a lot, in this case). The test compares implementations in which all data is copied, in which views/subs are created, and in which views are simulated manually by passing around either strides or ranges as extra parameters. The latter approach does not allocate memory.

Perhaps by optimizing the implementation (precomputing twiddle factors comes to mind), the difference between the various ways of indexing can be made more pronounced.
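
For context, a minimal sketch (assumed code, not the PR's actual implementation) of the indexing styles under comparison, written as a radix-2 decimation-in-time butterfly: one variant slices, which copies, and one simulates a view by passing an offset and a stride explicitly. The view variant is identical to the copying one except that each slice `x[a:s:b]` becomes `view(x, a:s:b)`.

```julia
# Copying style: x[1:2:n] and x[2:2:n] allocate fresh arrays at
# every level of the recursion.
function fft_copy(x::AbstractVector{ComplexF64})
    n = length(x)
    n == 1 && return copy(x)
    even = fft_copy(x[1:2:n])
    odd  = fft_copy(x[2:2:n])
    y = similar(x, n)
    for k in 1:n÷2
        w = cis(-2pi * (k - 1) / n) * odd[k]   # twiddle factor
        y[k]       = even[k] + w
        y[k + n÷2] = even[k] - w
    end
    return y
end

# Manual style: no subarray objects at all, only index arithmetic.
function fft_stride(x::Vector{ComplexF64}, offset::Int, stride::Int, n::Int)
    n == 1 && return ComplexF64[x[offset]]
    even = fft_stride(x, offset,          2stride, n ÷ 2)
    odd  = fft_stride(x, offset + stride, 2stride, n ÷ 2)
    y = Vector{ComplexF64}(undef, n)
    for k in 1:n÷2
        w = cis(-2pi * (k - 1) / n) * odd[k]
        y[k]       = even[k] + w
        y[k + n÷2] = even[k] - w
    end
    return y
end

fft_stride(x::Vector{ComplexF64}) = fft_stride(x, 1, 1, length(x))
```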

@stevengj

Might be worthwhile to test some simplified version of my pure-Julia FFT from JuliaLang/julia#6193, too (@yuyichao has an updated version of this somewhere, I forget where); we also need benchmarks of optimized code.

@yuyichao

The updated version is on the yyc/dftnew_rebase branch, and I've just rebased it on the current master.

@daanhb (Author) commented Feb 26, 2016

The current test here simply measures differences between ways of indexing, not between ways of computing the FFT. What would be an appropriate benchmark for the optimized FFT (which I'm looking forward to using, by the way)? Of course it is worthwhile in any case to benchmark the Julia implementation against the external library, but I'm not sure what that says about indexing performance.

Another use case for views I've been wondering about is tensor-product generalizations. I see the multidimensional Julia FFT is based on StridedArrays, and it seems to use some explicit calculations with strides. I guess one could compare this to an implementation using 1-D FFTs applied to views/slices, as well as to an implementation that copies each row/column/mode? I use n-dimensional FFTs quite a lot in FrameFuns.jl, and currently we use the copy approach. Would that make sense as an indexing benchmark?
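
As a hedged illustration of this comparison (the function names are hypothetical, and `fft` is assumed to come from the FFTW.jl package on current Julia), a 2-D transform built from 1-D FFTs applied along each dimension, once on copied columns/rows and once on views:

```julia
using FFTW  # assumed available; provides fft

function fft2_copy(A::AbstractMatrix{ComplexF64})
    B = copy(A)
    for j in axes(B, 2)
        B[:, j] = fft(B[:, j])   # B[:, j] on the right-hand side copies
    end
    for i in axes(B, 1)
        B[i, :] = fft(B[i, :])
    end
    return B
end

function fft2_view(A::AbstractMatrix{ComplexF64})
    B = copy(A)
    for j in axes(B, 2)
        c = view(B, :, j)
        c .= fft(c)              # no slice copy; fft's output still allocates
    end
    for i in axes(B, 1)
        r = view(B, i, :)
        r .= fft(r)
    end
    return B
end
```

(A production code would simply call `fft(A)`, which handles all dimensions natively.)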

@stevengj

@daanhb, my general concern is that benchmarking a slow and unrealistic way to compute something is not always meaningful. An ideal benchmark, to me, is an algorithm for a realistic problem that is (or should be) competitive with "serious" production implementations. (Radix-2 FFTs have not been very competitive for decades to begin with, and making lots of subarray copies and computing trig functions in inner loops only makes things worse.)

(fft_sub! is 10x slower than using plan_fft for length 2^10 and 20x slower for 2^15.)
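
(For reference, a sketch of how such a planned-vs-unplanned measurement can be set up, assuming the FFTW.jl and BenchmarkTools.jl packages; a precomputed plan amortizes FFTW's setup cost across repeated transforms:)

```julia
using FFTW, BenchmarkTools  # both assumed available

x = rand(ComplexF64, 2^10)
p = plan_fft(x)      # one-time planning step
@btime $p * $x       # apply the precomputed plan
@btime fft($x)       # plans and executes on every call
```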

(In FrameFuns, why aren't you just using the built-in (FFTW-based) multidimensional FFTs?)

@stevengj

That being said, I agree that we need benchmarks of indexing etcetera. ~~But in that case, why bother to compute an FFT? Why not just do a bunch of indexing in a tight loop, e.g. just summing different subarrays chosen arbitrarily? That way you get a cleaner result that only measures indexing and nothing else.~~

Oh, I see, the whole purpose of this repo is to collect quasi-realistic user codes that test array-view performance. Still, you might consider precomputing the trigonometric factors so that you aren't benchmarking complex exp.
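
A minimal sketch of that suggestion (assumed code, not from the PR): tabulate the twiddle factors w_n^k = exp(-2πik/n) once, so the butterfly's inner loop does a table lookup instead of calling `cis`.

```julia
# Precompute the n/2 twiddle factors needed at transform length n.
twiddles(n::Int) = [cis(-2pi * k / n) for k in 0:(n ÷ 2 - 1)]

# In the combine loop, `cis(-2pi*(k-1)/n)` becomes `W[k]` with
# W = twiddles(n). One length-n table also serves the recursion:
# since w_{n/2}^k = w_n^{2k}, the half-length stage reads every
# second entry of W.
```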

@daanhb (Author) commented Feb 28, 2016

It's true that this example is synthetic, and it remains so (even more so, in fact) after precomputing the trigonometric factors. I did this example recently simply to try out views, and I'm not using it in any code. Perhaps this repository is not the place for it :-)

For completeness: after precomputing all trigonometric factors, the test mainly benchmarks the overhead of using sub/view in a recursive algorithm. Both are a little faster than copying, and view wins over sub by a small margin. Manually passing offsets and strides has no memory-allocation overhead and is much faster; it is faster than copying even for small N, so there is hope that stack-allocated views might win overall as well. At the very least, in the current tests views were never worse than copying.

@daanhb (Author) commented Feb 28, 2016

@stevengj Regarding FrameFuns: currently the 1-D FFTs are hidden from each other behind a few layers of abstraction, so copying was a practical choice. It will take some well-placed multiple-dispatch magic to call a multidimensional FFT instead. I was initially not unhappy with the performance of copying slices, but that was for small problems. We will do some more systematic tests (including with the Julia FFT) in the next couple of weeks. With algorithmic improvements elsewhere, the FFTs have become our bottleneck, so there is a clear motivation.

@daanhb closed this Feb 28, 2016
@ViralBShah (Contributor)

Would be great to have this in for exactly these reasons, with the precomputed trig factors.

@daanhb reopened this Feb 28, 2016
@daanhb (Author) commented Feb 28, 2016

In any case it was also good to revisit the experiment, because the initial BigFloat implementation was wrong (I must always remember that `2pi` converts to Float64 by default).
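
(The pitfall in question, illustrated at the REPL: multiplying the Irrational `pi` by an integer promotes to Float64, so BigFloat code silently loses precision unless `pi` is converted first.)

```julia
julia> typeof(2pi)                # Int * Irrational promotes to Float64
Float64

julia> typeof(2 * BigFloat(pi))   # pi evaluated at BigFloat precision
BigFloat
```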

@daanhb (Author) commented Oct 6, 2016

@stevengj Closing this old pull request again, but just for the record: using multidimensional FFTs was (of course) way, way faster than the iterated 1-D version with copying that we had before.

@daanhb closed this Oct 6, 2016