Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tech demo: switch from FFTW to pocketfft/ducc.fft #287

Closed
wants to merge 30 commits into from

Conversation

mreineck
Copy link
Collaborator

I'm not sure if this is of much general interest, but here is a small experiment that switches all of finufft's FFTs from FFTW to ducc.fft (formerly known as pocketfft and used by scipy).

Advantages:

  • all FFT-related sources can be directly included into finufft (if so desired); this should make configuration and compilation easier.
  • no FFT planning is necessary, removing potentially a lot of housekeeping code
  • performance is better than FFTW if FFTW_ESTIMATE was used before; for FFTW_MEASURE, FFTW may still win in most cases, but differences should be fairly small.

The changes to use ducc.fft are minimal (just a few lines inside finufft.cpp and small adjustments to the makefile)

In its current state, the PR is just meant as a demo. It does not remove any FFTW-related code except the actual fftw_execute call, which means that all FFTW planning etc. is still done (uselessly). Also, I only adjusted the makefile, since I'm not at all familiar with cmake.

If this looks interesting to you, please give it a try, run a few benchmarks and let me know if you have any questions!

@mreineck mreineck marked this pull request as draft June 14, 2023 15:57
@mreineck
Copy link
Collaborator Author

I think the CMake build should almost work now ... I just can't convince the Cuda compiler to use the C++17 standard. Giving up on this for now.

@mreineck
Copy link
Collaborator Author

I just noticed that you have actually been discussing this topic in devel/cufinufft_tasks_meeting_Jun2023.txt ;-)

@blackwer
Copy link
Member

This was actually part of the inspiration for my fft_bench repo. There are licensing issues (GPL) with FFTW that we were potentially trying to avoid, so we were looking into distributions with MKL et. al. It looks like ducc is also GPL, so this doesn't avoid that issue, but the lack of planning is a huge plus. It's also why I was curious about improving the performance of the 1D ffts of ducc.

Anyway, this looks great. I'm sure we'll talk more about it soon.

@mreineck
Copy link
Collaborator Author

Actually, the FFT part of ducc is also available under the 3-clause BSD license,so this would not be an issue (see "Licensing terms" in https://gitlab.mpcdf.mpg.de/mtr/ducc/-/blob/ducc0/README.md).

@mreineck mreineck marked this pull request as ready for review June 27, 2023 18:57
@mreineck
Copy link
Collaborator Author

mreineck commented Jun 30, 2023

BTW, implementing #304 on this branch should be fairly simple, if you can tell me which parts of fwBatch are zero on input / not needed on output.

@ahbarnett
Copy link
Collaborator

ahbarnett commented Jun 30, 2023

Let's do the 2D case, for upsampfac=2. Then the four corners of the 2D FFT array are the ones that copy to the user's uniform I/O data. (output for type1, input for type2).
Ie the numbers in the 4x4 pattern here, each representing a N1/2 by N2/2 block,

3..4
....
....
1..2

get stacked (also applying "deconvolve" correction) to the U array [2,1;4,3] in matlab notation.
This means the y-1dffts (done first since they have the worse stride) form two contiguous blocks
namely indices {0,1,.. N1/2-1} and {nf1-N1/2,...,nf1-1}.
That's assuming N1 even, zero-indexing, and nf1 is the fine grid size, approx upsampfac*N1.
Then you do all x-1dffts (stride 1).

3D is analogous.

The best way would be to use the indices I coded into src/finufft.cpp: deconvolveshuffle{1,2,3}d
since they work for any upsampfac. Sorry, there are factors of two due to handling complex #s as real,imag here. Let me know if stuck.

It would be nice to implement this idea using fftw interface too, so we can test fftw and mkl.
I hope that's not too hard to see from your ducc0 implementation.

@mreineck
Copy link
Collaborator Author

Thanks, that's what I arrived at as well! (There might still be off-by-one errors in some cases, but the test suite works ...

I hope that's not too hard to see from your ducc0 implementation.

It's not hard, but unfortunately it will be quite verbose ...

@ahbarnett
Copy link
Collaborator

Hi Martin, Robert, Libin & co,

Before you spend too much effort on this (and I notice you removed FFTW from all the make/cmake/CI files in your PR, which is quite a lot of work) we should strategize, since we would not be able to simply bring in your PR. The main reason is we don't want to stop having FFTW as an option.
Another thing you should know is we only intend to support cmake going forward, for users and for CI, so effort on make.inc.* files is wasted.

There are two main improvements on the table for the CPU code (for now), which are orthogonal to each other:

  1. switchable FFTs. We see variations between FFTW, MKL, and ducc0 by factors of 2 often, but with different winners in different settings. We had not prioritized being switchable, since if there is enthusiasm we could continue with this.

  2. exploiting zero-patterns in 2D and 3D transforms, obtaining rough speedups of 1.5x to 2x for any FFT, perhaps, at least for upsampfac=2. For upsampfac=1.25 a smaller but still useful speedup would result.
    Assuming we get the indexing right for a single transform, a complication is that we have this "many vector" version, that, as you can see in

    p->fftwPlan = FFTW_PLAN_MANY_DFT(dim, ns, p->batchSize, p->fwBatch,

    batches together in a single plan a stack of nthreads nD transforms.
    This gave impressive speedups for smaller sizes (see the test/finufft{1,2,3}d_many examples).
    If we are instead doing strided 1d FFTs for everything, we have to decide how the multithreading is organized. There is probably no reason to batch by transform number, and instead could batch by 1d line transforms. This requires some testing how best to do this.
    It may vary between FFTW, MKL, and ducc0, also.

Sorry, I have to go for now, but any thoughts about this? Best, Alex

@mreineck
Copy link
Collaborator Author

Dear Alex,

just one thing: please don't worry that I'm spending too much time on this and would be unhappy if it isn't merged! I have been doing this strictly for fun up to now, and I admit I was perhaps a bit over-eager when I started stripping the "-lfftw3 -lfftw3_omp" commands fron the demos etc :-)
From my point of view it's no problem to select any subset of the functionality of this patch; I can do this quickly once you have decided what should go in.

Cheers,
Martin

@janden janden force-pushed the master branch 3 times, most recently from 2e637b9 to 0e5f3f3 Compare August 30, 2023 11:26
@DiamonDinoia
Copy link
Collaborator

Hi @mreineck,

We are thinking of providing a switchable fft interface.
I noticed that you worked out a way to avoid computing the zeros in the fft which can make things faster.

At least, I would like to provide the option to the users to test both.

Second, ducc license is permissive no? It might be worth spending a bit of time manually vectorizing the bottlenecks to fill the gap with fftw? What do you think? depending on the size of the task I might be able to help a bit.

Cheers,
Marco

@mreineck
Copy link
Collaborator Author

If you like, I can bring this branch up to date; shouldn't be too much work!

Second, ducc license is permissive no? It might be worth spending a bit of time manually vectorizing the bottlenecks to fill the gap with fftw? What do you think? depending on the size of the task I might be able to help a bit.

The full ducc package is released under GPLv2+, which isn't considered permissive any more by most people. But the FFT part (and all ducc source code it depends on) is also available under BSD3, which should be fine.

Still, I recommend very thorough benchmarking before you decide to tweak the ducc FFT code any further. Most of the advantage that FFTW and MKL FFT have over ducc FFT comes (I'm pretty sure) from special passes for higher powers of 2 (length 16, 32, 64) that are not in ducc simply because their source code is huge. Vectorization should be pretty good overall, especially in multi-D transforms. My personal gut feeling is that the current implementation strikes a fairly good balance between performance and maintainability.

@DiamonDinoia
Copy link
Collaborator

If you like, I can bring this branch up to date; shouldn't be too much work!

Second, ducc license is permissive no? It might be worth spending a bit of time manually vectorizing the bottlenecks to fill the gap with fftw? What do you think? depending on the size of the task I might be able to help a bit.

The full ducc package is released under GPLv2+, which isn't considered permissive any more by most people. But the FFT part (and all ducc source code it depends on) is also available under BSD3, which should be fine.

Still, I recommend very thorough benchmarking before you decide to tweak the ducc FFT code any further. Most of the advantage that FFTW and MKL FFT have over ducc FFT comes (I'm pretty sure) from special passes for higher powers of 2 (length 16, 32, 64) that are not in ducc simply because their source code is huge. Vectorization should be pretty good overall, especially in multi-D transforms. My personal gut feeling is that the current implementation strikes a fairly good balance between performance and maintainability.

If you can bring this up to date it would be great. I personally have two requirements:

  1. possibillity of switching between fftw and ducc (defines, a wapper, compile time switch )
  2. ducc code is not copy pasted here but downloaded using cpm cmake to be consistent with the other dependencies, I can help you with this as it is quick.

I have not looked at the code for the powers of 2, it might be possible to generate them at compile time using templates instead of hardcoding no? It might require c++17 but to mantain ducc c++11 compatible these can be enabled only if c++17 is supported by the compiler. 

@mreineck
Copy link
Collaborator Author

If you help me with the automatic installation of the ducc sources, then I'm pretty sure I can make this work. I'll probably start work on a new branch though, otherwise the diffs become too large.

Ducc requires C++17 already, so no special measures are needed if you want to use it in your FFT experiments.

@DiamonDinoia
Copy link
Collaborator

Sure. I will start a new branch for this and I will change cmake so that it pulls the sources.

@mreineck
Copy link
Collaborator Author

Perfect, thanks a lot!

@DiamonDinoia
Copy link
Collaborator

DiamonDinoia commented Jun 21, 2024

Hi Martin,

I made the following fork: https://github.com/DiamonDinoia/finufft/tree/switchable-fft
I only support cmake as I am not a make person.

The way this works: here
basically what I usually do:

mkdir build && cd build
cmake ../ -DFINUFFT_BUILD_EXAMPLES:BOOL=ON \
-DFINUFFT_BUILD_TESTS:BOOL=ON \
-DFINUFFT_ENABLE_SANITIZERS:BOOL=ON \
-DFINUFFT_USE_OPENMP:BOOL=ON \
-DFINUFFT_USE_DUCC0:BOOL=ON \
-DCMAKE_BUILD_TYPE=Release

I sometimes do -DCMAKE_BUILD_TYPE=Debug to enable debug symbols and disable some optimizations

this will automatically fetch ducc0 and create the define FINUFFT_USE_DUCC0 that we can use to differentiate between calling fftw or ducc0.

I would keep an eye out as we are about to merge the new vectorized spreader, it should not affect this as the changes will be in separate files.

The way I envision it is to write a wrapper for the various fft calls:
fft_wrapper.[h,cpp]

fft_makeplan
fft_execute

and inside with a define we switch between the various immplementation

@mreineck
Copy link
Collaborator Author

Thanks for the instructions; I'm not really a cmake person, so they are really helpful!

Not urgent, but I'm interested for the future: how do I enable generating the Python bindings with cmake?

Also I'm currently running into a small problem when compiling:

[ 98%] Building CXX object devel/CMakeFiles/foldrescale.dir/foldrescale.cpp.o
In file included from /usr/lib/gcc/x86_64-linux-gnu/13/include/immintrin.h:109,
                 from /home/martin/codes/finufft/devel/foldrescale.cpp:4:
/usr/lib/gcc/x86_64-linux-gnu/13/include/fmaintrin.h: In function ‘__m256d foldRescaleVec(__m256d, int64_t)’:
/usr/lib/gcc/x86_64-linux-gnu/13/include/fmaintrin.h:47:1: error: inlining failed in call to ‘always_inline’ ‘__m256d _mm256_fmadd_pd(__m256d, __m256d, __m256d)’: target specific option mismatch
   47 | _mm256_fmadd_pd (__m256d __A, __m256d __B, __m256d __C)
      | ^~~~~~~~~~~~~~~
/home/martin/codes/finufft/devel/foldrescale.cpp:83:46: note: called from here
   83 |   result                    = _mm256_fmadd_pd(x, x2pi, half);
      |                               ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~
make[2]: *** [devel/CMakeFiles/foldrescale.dir/build.make:76: devel/CMakeFiles/foldrescale.dir/foldrescale.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:2163: devel/CMakeFiles/foldrescale.dir/all] Error 2
make: *** [Makefile:146: all] Error 2

This looks like a gcc issue which I should be able to sort out myself.

@mreineck
Copy link
Collaborator Author

Nevermind :-) I found the answer to the Python installation in the docs!

The compilation problem goes away if I comment out the line
target_compile_options(foldrescale PRIVATE -mavx2)
in devel/CMakeLists.txt.
However this is a bit strange, since my computer supports AVX2. No idea what's exactly going on here.

@mreineck
Copy link
Collaborator Author

I'm encoutering a small cmake related problem ... could you please help me with compiling a few object files for ducc0? For a start I simply tried to append the relevant cc files to FINUFFT_PRECISION_DEPENDENT_SOURCES. That almost worked, but I'm using SINGLE as an enum identifier in ducc0, and compiling with -DSINGLE will of course produce utter chaos :-)

I'll probably just need infra/string_utils.cc, infra/threading.cc, infra/mav.cc, and math/gridding_kernel.cc compiled. That should be done with the same flags as finufft and used as a precision-independent library.

@DiamonDinoia
Copy link
Collaborator

Hi Martin,

I pushed the changes you requested. More files can be added in cmake/setupDUCC.cmake.

Cheers,
Marco

@mreineck
Copy link
Collaborator Author

I think we can close this now; it is superseded by #463

@mreineck mreineck closed this Jun 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants