
Initial implementation of butterfly FFT with various ways of indexing #6

Closed
wants to merge 2 commits

Conversation

@daanhb commented Feb 26, 2016

This test is a comparison of several naive implementations of the recursive butterfly algorithm for the FFT. Large numbers of views are created in the recursion. Currently, the cost of the actual computations seems to dominate the overhead of creating views; whether or not data is copied does not seem to matter much.

The example does show that views allocate memory (quite a lot, in this case). The test compares implementations in which all data is copied, in which views/subs are created, and in which views are simulated manually by passing around either strides or ranges as extra parameters. The latter approach does not allocate memory.

Perhaps by optimizing the implementation (precomputing twiddle factors comes to mind), the difference between the various ways of indexing can be made more pronounced.
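
For context, a minimal sketch (assumed code, not the PR's actual implementation) of the indexing styles under comparison, written as a radix-2 decimation-in-time butterfly: one variant slices, which copies, and one simulates a view by passing an offset and a stride explicitly. The view variant is identical to the copying one except that each slice `x[a:s:b]` becomes `view(x, a:s:b)`.

```julia
# Copying style: x[1:2:n] and x[2:2:n] allocate fresh arrays at
# every level of the recursion.
function fft_copy(x::AbstractVector{ComplexF64})
    n = length(x)
    n == 1 && return copy(x)
    even = fft_copy(x[1:2:n])
    odd  = fft_copy(x[2:2:n])
    y = similar(x, n)
    for k in 1:n÷2
        w = cis(-2pi * (k - 1) / n) * odd[k]   # twiddle factor
        y[k]       = even[k] + w
        y[k + n÷2] = even[k] - w
    end
    return y
end

# Manual style: no subarray objects at all, only index arithmetic.
function fft_stride(x::Vector{ComplexF64}, offset::Int, stride::Int, n::Int)
    n == 1 && return ComplexF64[x[offset]]
    even = fft_stride(x, offset,          2stride, n ÷ 2)
    odd  = fft_stride(x, offset + stride, 2stride, n ÷ 2)
    y = Vector{ComplexF64}(undef, n)
    for k in 1:n÷2
        w = cis(-2pi * (k - 1) / n) * odd[k]
        y[k]       = even[k] + w
        y[k + n÷2] = even[k] - w
    end
    return y
end

fft_stride(x::Vector{ComplexF64}) = fft_stride(x, 1, 1, length(x))
```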

@stevengj

Might be worthwhile to test some simplified version of my pure-Julia FFT from JuliaLang/julia#6193, too (@yuyichao has an updated version of this somewhere, I forget where); we also need benchmarks of optimized code.

@yuyichao

The updated version is on the yyc/dftnew_rebase branch, and I've just rebased it on the current master.

@daanhb (Author) commented Feb 26, 2016

The current test here simply measures differences between ways of indexing, not between ways of computing the FFT. What would be an appropriate benchmark for the optimized FFT (which I'm looking forward to using, by the way)? Of course it is worthwhile in any case to benchmark the Julia implementation against the external library, but I'm not sure what that says about indexing performance.

Another use case for views I've been wondering about is tensor-product generalizations. I see the multidimensional Julia FFT is based on StridedArrays, and it seems to use some explicit calculations with strides. I guess one could compare this to an implementation using 1-D FFTs applied to views/slices, as well as to an implementation that copies each row/column/mode? I use n-dimensional FFTs quite a lot in FrameFuns.jl, and currently we use the copy approach. Would that make sense as an indexing benchmark?
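
As a hedged illustration of this comparison (the function names are hypothetical, and `fft` is assumed to come from the FFTW.jl package on current Julia), a 2-D transform built from 1-D FFTs applied along each dimension, once on copied columns/rows and once on views:

```julia
using FFTW  # assumed available; provides fft

function fft2_copy(A::AbstractMatrix{ComplexF64})
    B = copy(A)
    for j in axes(B, 2)
        B[:, j] = fft(B[:, j])   # B[:, j] on the right-hand side copies
    end
    for i in axes(B, 1)
        B[i, :] = fft(B[i, :])
    end
    return B
end

function fft2_view(A::AbstractMatrix{ComplexF64})
    B = copy(A)
    for j in axes(B, 2)
        c = view(B, :, j)
        c .= fft(c)              # no slice copy; fft's output still allocates
    end
    for i in axes(B, 1)
        r = view(B, i, :)
        r .= fft(r)
    end
    return B
end
```

(A production code would simply call `fft(A)`, which handles all dimensions natively.)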

@stevengj

@daanhb, my general concern is that benchmarking a slow and unrealistic way to compute something is not always meaningful. An ideal benchmark, to me, is an algorithm for a realistic problem that is (or should be) competitive with "serious" production implementations. (Radix-2 FFTs have not been very competitive for decades to begin with, and making lots of subarray copies and computing trig functions in inner loops only makes things worse.)

(fft_sub! is 10x slower than using plan_fft for length 2^10 and 20x slower for 2^15.)
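
(For reference, a sketch of how such a planned-vs-unplanned measurement can be set up, assuming the FFTW.jl and BenchmarkTools.jl packages; a precomputed plan amortizes FFTW's setup cost across repeated transforms:)

```julia
using FFTW, BenchmarkTools  # both assumed available

x = rand(ComplexF64, 2^10)
p = plan_fft(x)      # one-time planning step
@btime $p * $x       # apply the precomputed plan
@btime fft($x)       # plans and executes on every call
```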

(In FrameFuns, why aren't you just using the built-in (FFTW-based) multidimensional FFTs?)

@stevengj

That being said, I agree that we need benchmarks of indexing etcetera. ~~But in that case, why bother to compute an FFT? Why not just do a bunch of indexing in a tight loop, e.g. just summing different subarrays chosen arbitrarily? That way you get a cleaner result that only measures indexing and nothing else.~~

Oh, I see, the whole purpose of this repo is to collect quasi-realistic user codes that test array-view performance. Still, you might consider precomputing the trigonometric factors so that you aren't benchmarking complex exp.
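
A minimal sketch of that suggestion (assumed code, not from the PR): tabulate the twiddle factors w_n^k = exp(-2πik/n) once, so the butterfly's inner loop does a table lookup instead of calling `cis`.

```julia
# Precompute the n/2 twiddle factors needed at transform length n.
twiddles(n::Int) = [cis(-2pi * k / n) for k in 0:(n ÷ 2 - 1)]

# In the combine loop, `cis(-2pi*(k-1)/n)` becomes `W[k]` with
# W = twiddles(n). One length-n table also serves the recursion:
# since w_{n/2}^k = w_n^{2k}, the half-length stage reads every
# second entry of W.
```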

@daanhb (Author) commented Feb 28, 2016

It's true that this example is synthetic, and it remains so (even more so, in fact) after precomputing the trigonometric factors. I did this example recently simply to try out views, and I'm not using it in any code. Perhaps this repository is not the place for it :-)

For completeness: after precomputing all trigonometric factors, the test mainly benchmarks the overhead of using sub/view in a recursive algorithm. Both are a little faster than copying, and view wins over sub by a small margin. Manually passing offsets and strides has no memory-allocation overhead and is much faster; it is faster than copying even for small N, so there is hope that stack-allocated views might win overall as well. At the very least, in the current tests views were never worse than copying.

@daanhb (Author) commented Feb 28, 2016

@stevengj Regarding FrameFuns: currently the 1-D FFTs are hidden from each other behind a few layers of abstraction, so copying was a practical choice. It will take some well-placed multiple-dispatch magic to call a multidimensional FFT instead. I was initially not unhappy with the performance of copying slices, but that was for small problems. We will do some more systematic tests (including with the Julia FFT) in the next couple of weeks. With algorithmic improvements elsewhere, the FFTs have become our bottleneck, so there is a clear motivation.

@daanhb closed this Feb 28, 2016
@ViralBShah (Contributor)

Would be great to have this in for exactly these reasons, with the precomputed trig factors.

@daanhb reopened this Feb 28, 2016
@daanhb (Author) commented Feb 28, 2016

In any case it was also good to revisit the experiment, because the initial BigFloat implementation was wrong (I must always remember that `2pi` converts to Float64 by default).
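
(The pitfall in question, illustrated at the REPL: multiplying the Irrational `pi` by an integer promotes to Float64, so BigFloat code silently loses precision unless `pi` is converted first.)

```julia
julia> typeof(2pi)                # Int * Irrational promotes to Float64
Float64

julia> typeof(2 * BigFloat(pi))   # pi evaluated at BigFloat precision
BigFloat
```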

@daanhb (Author) commented Oct 6, 2016

@stevengj Closing this old pull request again, but just for the record: using multidimensional FFTs was (of course) way, way faster than the iterated 1-D version with copying that we had before.

@daanhb closed this Oct 6, 2016