Tech demo: switch from FFTW to pocketfft/ducc.fft #287

mreineck · 2023-06-14T15:57:00Z

I'm not sure if this is of much general interest, but here is a small experiment that switches all of finufft's FFTs from FFTW to ducc.fft (formerly known as pocketfft and used by scipy).

Advantages:

- all FFT-related sources can be directly included into finufft (if so desired); this should make configuration and compilation easier.
- no FFT planning is necessary, removing potentially a lot of housekeeping code
- performance is better than FFTW if FFTW_ESTIMATE was used before; for FFTW_MEASURE, FFTW may still win in most cases, but differences should be fairly small.

The changes to use ducc.fft are minimal (just a few lines inside finufft.cpp and small adjustments to the makefile)

In its current state, the PR is just meant as a demo. It does not remove any FFTW-related code except the actual fftw_execute call, which means that all FFTW planning etc. is still done (uselessly). Also, I only adjusted the makefile, since I'm not at all familiar with cmake.

If this looks interesting to you, please give it a try, run a few benchmarks and let me know if you have any questions!

mreineck · 2023-06-14T16:58:13Z

I think the CMake build should almost work now ... I just can't convince the Cuda compiler to use the C++17 standard. Giving up on this for now.

mreineck · 2023-06-15T21:28:08Z

I just noticed that you have actually been discussing this topic in devel/cufinufft_tasks_meeting_Jun2023.txt ;-)

blackwer · 2023-06-27T18:23:20Z

This was actually part of the inspiration for my fft_bench repo. There are licensing issues (GPL) with FFTW that we were potentially trying to avoid, so we were looking into distributions with MKL et al. It looks like ducc is also GPL, so this doesn't avoid that issue, but the lack of planning is a huge plus. It's also why I was curious about improving the performance of the 1D FFTs of ducc.

Anyway, this looks great. I'm sure we'll talk more about it soon.

mreineck · 2023-06-27T18:28:16Z

Actually, the FFT part of ducc is also available under the 3-clause BSD license, so this would not be an issue (see "Licensing terms" in https://gitlab.mpcdf.mpg.de/mtr/ducc/-/blob/ducc0/README.md).

mreineck · 2023-06-30T11:04:47Z

BTW, implementing #304 on this branch should be fairly simple, if you can tell me which parts of fwBatch are zero on input / not needed on output.

ahbarnett · 2023-06-30T14:52:31Z

Let's do the 2D case, for upsampfac=2. Then the four corners of the 2D FFT array are the ones that copy to the user's uniform I/O data (output for type 1, input for type 2).
I.e., the numbers in the 4x4 pattern here, each representing an N1/2 by N2/2 block,

3..4
....
....
1..2

get stacked (also applying "deconvolve" correction) to the U array [2,1;4,3] in matlab notation.
This means the y-1dffts (done first since they have the worse stride) form two contiguous blocks
namely indices {0,1,.. N1/2-1} and {nf1-N1/2,...,nf1-1}.
That's assuming N1 even, zero-indexing, and nf1 is the fine grid size, approx upsampfac*N1.
Then you do all x-1dffts (stride 1).
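
A minimal sketch of the pruning described above, for the type-2 direction where the fine-grid input is zero outside the four corners. It is not FINUFFT or ducc code: `dft_line` is a naive O(n^2) DFT standing in for a real 1D FFT, the names `nonzero_lines`, `fft2d_full`, `fft2d_pruned` and `pruned_vs_full_error` are made up for illustration, and the layout (x fastest, element (x, y) at `a[x + nf1*y]`) and even N1 are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Naive in-place DFT of one strided line; a stand-in for a real 1D FFT call.
void dft_line(cplx *data, std::size_t n, std::size_t stride) {
  const double pi = std::acos(-1.0);
  std::vector<cplx> out(n);
  for (std::size_t k = 0; k < n; ++k) {
    cplx acc(0.0, 0.0);
    for (std::size_t j = 0; j < n; ++j)
      acc += data[j * stride] *
             std::polar(1.0, -2.0 * pi * double((j * k) % n) / double(n));
    out[k] = acc;
  }
  for (std::size_t k = 0; k < n; ++k) data[k * stride] = out[k];
}

// Indices {0,...,N/2-1} and {nf-N/2,...,nf-1}: the lines that can be nonzero.
std::vector<std::size_t> nonzero_lines(std::size_t N, std::size_t nf) {
  assert(N % 2 == 0 && N <= nf);
  std::vector<std::size_t> idx;
  for (std::size_t i = 0; i < N / 2; ++i) idx.push_back(i);
  for (std::size_t i = nf - N / 2; i < nf; ++i) idx.push_back(i);
  return idx;
}

// Reference: y-transforms for every x, then x-transforms for every y.
void fft2d_full(std::vector<cplx> &a, std::size_t nf1, std::size_t nf2) {
  for (std::size_t x = 0; x < nf1; ++x) dft_line(&a[x], nf2, nf1);
  for (std::size_t y = 0; y < nf2; ++y) dft_line(&a[nf1 * y], nf1, 1);
}

// Pruned: y-transforms only for the x indices that hold data, then all x-transforms.
void fft2d_pruned(std::vector<cplx> &a, std::size_t N1, std::size_t nf1,
                  std::size_t nf2) {
  for (std::size_t x : nonzero_lines(N1, nf1)) dft_line(&a[x], nf2, nf1);
  for (std::size_t y = 0; y < nf2; ++y) dft_line(&a[nf1 * y], nf1, 1);
}

// Fill only the four corners, run both variants, return the max abs difference.
double pruned_vs_full_error(std::size_t N1, std::size_t N2, std::size_t nf1,
                            std::size_t nf2) {
  std::vector<cplx> a(nf1 * nf2, cplx(0.0, 0.0));
  for (std::size_t y : nonzero_lines(N2, nf2))
    for (std::size_t x : nonzero_lines(N1, nf1))
      a[x + nf1 * y] = cplx(1.0 + double(x), 0.5 * double(y));
  std::vector<cplx> b = a;
  fft2d_full(a, nf1, nf2);
  fft2d_pruned(b, N1, nf1, nf2);
  double err = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i)
    err = std::max(err, std::abs(a[i] - b[i]));
  return err;
}
```

With N1 = N2 = 8 and nf1 = nf2 = 16, the pruned variant runs 8 instead of 16 strided y-transforms and should agree with the full one, since the skipped lines contain only zeros. For type 1 the roles reverse: every line of the first pass is needed, and the second pass can be restricted to the output modes that are kept.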

3D is analogous.

The best way would be to use the indices I coded into src/finufft.cpp: deconvolveshuffle{1,2,3}d
since they work for any upsampfac. Sorry, there are factors of two due to handling complex #s as real,imag here. Let me know if stuck.
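
To make the corner-to-U mapping concrete, here is a simplified sketch of the index arithmetic only. It is not the actual `deconvolveshuffle2d` routine: the deconvolve factors are left out, complex numbers are used directly instead of real/imag pairs, and the names `fine_index` and `gather_corners` are hypothetical. It assumes modes are stored in increasing order starting from -N/2, and x-fastest layout for both arrays.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <vector>

using cplx = std::complex<double>;

// Fine-grid index that holds mode k: nonnegative modes sit at the start of
// the array, negative modes wrap around to the end.
inline std::int64_t fine_index(std::int64_t k, std::int64_t nf) {
  return k >= 0 ? k : nf + k;
}

// Copy the four corners of fw (nf1 x nf2) into U (N1 x N2), modes ascending.
// No deconvolution correction is applied in this sketch.
std::vector<cplx> gather_corners(const std::vector<cplx> &fw, std::int64_t nf1,
                                 std::int64_t nf2, std::int64_t N1,
                                 std::int64_t N2) {
  assert(N1 <= nf1 && N2 <= nf2);
  assert(fw.size() == static_cast<std::size_t>(nf1 * nf2));
  const std::int64_t k1min = -(N1 / 2), k2min = -(N2 / 2);
  std::vector<cplx> U(static_cast<std::size_t>(N1 * N2));
  for (std::int64_t i2 = 0; i2 < N2; ++i2)
    for (std::int64_t i1 = 0; i1 < N1; ++i1)
      U[i1 + N1 * i2] = fw[fine_index(k1min + i1, nf1) +
                           nf1 * fine_index(k2min + i2, nf2)];
  return U;
}
```

The same `fine_index` mapping works for any ratio nf/N, which is why going through the mode indices is more robust than hard-coding the upsampfac=2 block layout.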

It would be nice to implement this idea using fftw interface too, so we can test fftw and mkl.
I hope that's not too hard to see from your ducc0 implementation.

mreineck · 2023-06-30T14:56:48Z

Thanks, that's what I arrived at as well! (There might still be off-by-one errors in some cases, but the test suite works ...)

> I hope that's not too hard to see from your ducc0 implementation.

It's not hard, but unfortunately it will be quite verbose ...

ahbarnett · 2023-06-30T15:25:23Z

Hi Martin, Robert, Libin & co,

Before you spend too much effort on this (and I notice you removed FFTW from all the make/cmake/CI files in your PR, which is quite a lot of work) we should strategize, since we would not be able to simply bring in your PR. The main reason is we don't want to stop having FFTW as an option.
Another thing you should know is we only intend to support cmake going forward, for users and for CI, so effort on make.inc.* files is wasted.

There are two main improvements on the table for the CPU code (for now), which are orthogonal to each other:

1. Switchable FFTs. We often see variations by factors of 2 between FFTW, MKL, and ducc0, but with different winners in different settings. We had not prioritized being switchable, but if there is enthusiasm we could continue with this.
2. Exploiting zero-patterns in 2D and 3D transforms, obtaining rough speedups of perhaps 1.5x to 2x for any FFT, at least for upsampfac=2. For upsampfac=1.25 a smaller but still useful speedup would result.
Assuming we get the indexing right for a single transform, a complication is that we have this "many vector" version, that, as you can see in

finufft/src/finufft.cpp, line 721 in a554aea:

p->fftwPlan = FFTW_PLAN_MANY_DFT(dim, ns, p->batchSize, p->fwBatch,

batches together in a single plan a stack of nthreads nD transforms.
This gave impressive speedups for smaller sizes (see the test/finufft{1,2,3}d_many examples).
If we are instead doing strided 1d FFTs for everything, we have to decide how the multithreading is organized. There is probably no reason to batch by transform number; we could instead batch by 1d line transforms. Finding the best way to do this requires some testing.
It may also vary between FFTW, MKL, and ducc0.
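
One possible way to organize batching by 1d line transforms, as a rough sketch only: flatten all (transform, line) pairs into a single job list and give each thread a contiguous chunk. The names `line_chunk` and `decode_job` are hypothetical, and whether static chunks like this are a good choice for any of the three libraries is exactly the kind of thing that would need benchmarking.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

// Half-open range [begin, end) of line jobs assigned to worker t.
// Chunk sizes differ by at most one; the first (njobs % nworkers) workers
// get the extra job.
inline std::pair<std::size_t, std::size_t> line_chunk(std::size_t njobs,
                                                      std::size_t nworkers,
                                                      std::size_t t) {
  assert(nworkers > 0 && t < nworkers);
  const std::size_t base = njobs / nworkers;
  const std::size_t extra = njobs % nworkers;
  const std::size_t begin = t * base + std::min(t, extra);
  const std::size_t end = begin + base + (t < extra ? 1 : 0);
  return {begin, end};
}

// A flat job index decoded into (transform number, line within that transform).
struct line_job {
  std::size_t transform;
  std::size_t line;
};

inline line_job decode_job(std::size_t job, std::size_t lines_per_transform) {
  assert(lines_per_transform > 0);
  return line_job{job / lines_per_transform, job % lines_per_transform};
}
```

With this layout a worker does not care which transform in the stack a line belongs to, so small batches and large single transforms are split the same way.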

Sorry, I have to go for now, but any thoughts about this? Best, Alex

mreineck · 2023-06-30T15:37:18Z

Dear Alex,

just one thing: please don't worry that I'm spending too much time on this and would be unhappy if it isn't merged! I have been doing this strictly for fun up to now, and I admit I was perhaps a bit over-eager when I started stripping the "-lfftw3 -lfftw3_omp" commands from the demos etc :-)
From my point of view it's no problem to select any subset of the functionality of this patch; I can do this quickly once you have decided what should go in.

Cheers,
Martin

DiamonDinoia · 2024-06-21T14:58:46Z

Hi @mreineck,

We are thinking of providing a switchable fft interface.
I noticed that you worked out a way to avoid computing the zeros in the fft which can make things faster.

At least, I would like to provide the option to the users to test both.

Second, ducc license is permissive no? It might be worth spending a bit of time manually vectorizing the bottlenecks to fill the gap with fftw? What do you think? depending on the size of the task I might be able to help a bit.

Cheers,
Marco

mreineck · 2024-06-21T16:25:49Z

If you like, I can bring this branch up to date; shouldn't be too much work!

> Second, ducc license is permissive no? It might be worth spending a bit of time manually vectorizing the bottlenecks to fill the gap with fftw? What do you think? depending on the size of the task I might be able to help a bit.

The full ducc package is released under GPLv2+, which isn't considered permissive any more by most people. But the FFT part (and all ducc source code it depends on) is also available under BSD3, which should be fine.

Still, I recommend very thorough benchmarking before you decide to tweak the ducc FFT code any further. Most of the advantage that FFTW and MKL FFT have over ducc FFT comes (I'm pretty sure) from special passes for higher powers of 2 (length 16, 32, 64) that are not in ducc simply because their source code is huge. Vectorization should be pretty good overall, especially in multi-D transforms. My personal gut feeling is that the current implementation strikes a fairly good balance between performance and maintainability.

DiamonDinoia · 2024-06-21T16:40:12Z

> If you like, I can bring this branch up to date; shouldn't be too much work!
>
> > Second, ducc license is permissive no? It might be worth spending a bit of time manually vectorizing the bottlenecks to fill the gap with fftw? What do you think? depending on the size of the task I might be able to help a bit.
>
> The full ducc package is released under GPLv2+, which isn't considered permissive any more by most people. But the FFT part (and all ducc source code it depends on) is also available under BSD3, which should be fine.
>
> Still, I recommend very thorough benchmarking before you decide to tweak the ducc FFT code any further. Most of the advantage that FFTW and MKL FFT have over ducc FFT comes (I'm pretty sure) from special passes for higher powers of 2 (length 16, 32, 64) that are not in ducc simply because their source code is huge. Vectorization should be pretty good overall, especially in multi-D transforms. My personal gut feeling is that the current implementation strikes a fairly good balance between performance and maintainability.

If you can bring this up to date it would be great. I personally have two requirements:

- the possibility of switching between fftw and ducc (defines, a wrapper, a compile-time switch)
- the ducc code is not copy-pasted here but downloaded via cmake using CPM, to be consistent with the other dependencies. I can help you with this, as it is quick.

I have not looked at the code for the powers of 2; it might be possible to generate them at compile time using templates instead of hardcoding, no? It might require C++17, but to maintain ducc's C++11 compatibility these could be enabled only if C++17 is supported by the compiler.
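
To illustrate the template idea, a minimal sketch, assuming C++17 (`if constexpr`) and a textbook radix-2 decimation-in-time recursion. This is not ducc's or FFTW's code, the name `fft_pow2` is made up, and a naive recursion like this is unlikely to match hand-tuned passes; it only shows that the length-2, 4, 8, ... kernels can be instantiated from one template instead of being written out by hand.

```cpp
#include <array>
#include <cmath>
#include <complex>
#include <cstddef>

using cplx = std::complex<double>;

// In-place forward DFT of compile-time length N (a power of 2).
// Each instantiation calls the half-length one, so the compiler sees the
// whole recursion tree with fixed sizes and can unroll and inline it.
template <std::size_t N>
void fft_pow2(cplx *a) {
  static_assert(N >= 1 && (N & (N - 1)) == 0, "N must be a power of 2");
  if constexpr (N > 1) {
    constexpr std::size_t H = N / 2;
    std::array<cplx, H> even, odd;
    for (std::size_t i = 0; i < H; ++i) {
      even[i] = a[2 * i];
      odd[i] = a[2 * i + 1];
    }
    fft_pow2<H>(even.data());
    fft_pow2<H>(odd.data());
    const double pi = std::acos(-1.0);
    for (std::size_t k = 0; k < H; ++k) {
      // Twiddle factor exp(-2*pi*i*k/N) applied to the odd half.
      const cplx t = std::polar(1.0, -2.0 * pi * double(k) / double(N)) * odd[k];
      a[k] = even[k] + t;
      a[k + H] = even[k] - t;
    }
  }
}
```

A call such as `fft_pow2<16>(ptr)` would then play the role of a dedicated length-16 pass; whether it closes any of the gap to FFTW or MKL would have to be measured.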

mreineck · 2024-06-21T19:59:55Z

If you help me with the automatic installation of the ducc sources, then I'm pretty sure I can make this work. I'll probably start work on a new branch though, otherwise the diffs become too large.

Ducc requires C++17 already, so no special measures are needed if you want to use it in your FFT experiments.

DiamonDinoia · 2024-06-21T20:33:06Z

Sure. I will start a new branch for this and I will change cmake so that it pulls the sources.

mreineck · 2024-06-21T20:39:24Z

Perfect, thanks a lot!

DiamonDinoia · 2024-06-21T21:04:43Z

Hi Martin,

I made the following fork: https://github.com/DiamonDinoia/finufft/tree/switchable-fft
I only support cmake as I am not a make person.

Here is how this works; it is basically what I usually do:

mkdir build && cd build
cmake ../ -DFINUFFT_BUILD_EXAMPLES:BOOL=ON \
-DFINUFFT_BUILD_TESTS:BOOL=ON \
-DFINUFFT_ENABLE_SANITIZERS:BOOL=ON \
-DFINUFFT_USE_OPENMP:BOOL=ON \
-DFINUFFT_USE_DUCC0:BOOL=ON \
-DCMAKE_BUILD_TYPE=Release

I sometimes do -DCMAKE_BUILD_TYPE=Debug to enable debug symbols and disable some optimizations

This will automatically fetch ducc0 and create the define FINUFFT_USE_DUCC0, which we can use to switch between calling fftw and ducc0.

I would keep an eye out, as we are about to merge the new vectorized spreader. It should not affect this, as the changes will be in separate files.

The way I envision it is to write a wrapper for the various fft calls:
fft_wrapper.[h,cpp]

fft_makeplan
fft_execute

and inside, with a define, we switch between the various implementations.
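
A rough sketch of what such a wrapper could look like, keyed on the FINUFFT_USE_DUCC0 define. Everything beyond the two function names is an assumption: the signatures and the `fft_plan` struct are invented for illustration, and both branches call a naive DFT as a placeholder where the real fftw or ducc0 calls would go, so that the sketch compiles without either library.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

#ifdef FINUFFT_USE_DUCC0

// Plan-free backend: the "plan" only remembers size and direction.
struct fft_plan {
  std::size_t n;
  int sign;
};

inline fft_plan fft_makeplan(std::size_t n, int sign) { return fft_plan{n, sign}; }

inline void fft_execute(const fft_plan &p, std::vector<cplx> &data) {
  // Placeholder: a real build would call the ducc0 FFT here.
  const double pi = std::acos(-1.0);
  std::vector<cplx> out(p.n);
  for (std::size_t k = 0; k < p.n; ++k)
    for (std::size_t j = 0; j < p.n; ++j)
      out[k] += data[j] * std::polar(1.0, p.sign * 2.0 * pi *
                                              double((j * k) % p.n) / double(p.n));
  data = out;
}

#else

// Planned backend: the plan owns precomputed state (here, twiddle factors),
// mimicking the housekeeping a planning library needs.
struct fft_plan {
  std::size_t n;
  std::vector<cplx> twiddle;
};

inline fft_plan fft_makeplan(std::size_t n, int sign) {
  fft_plan p{n, std::vector<cplx>(n)};
  const double pi = std::acos(-1.0);
  for (std::size_t k = 0; k < n; ++k)
    p.twiddle[k] = std::polar(1.0, sign * 2.0 * pi * double(k) / double(n));
  return p;
}

inline void fft_execute(const fft_plan &p, std::vector<cplx> &data) {
  // Placeholder: a real build would execute the fftw plan here.
  std::vector<cplx> out(p.n);
  for (std::size_t k = 0; k < p.n; ++k)
    for (std::size_t j = 0; j < p.n; ++j)
      out[k] += data[j] * p.twiddle[(j * k) % p.n];
  data = out;
}

#endif
```

Calling code would only ever see `fft_makeplan` and `fft_execute`, so the rest of the library would not need to know which backend was selected at configure time.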

mreineck · 2024-06-22T05:44:44Z

Thanks for the instructions; I'm not really a cmake person, so they are really helpful!

Not urgent, but I'm interested for the future: how do I enable generating the Python bindings with cmake?

Also I'm currently running into a small problem when compiling:

[ 98%] Building CXX object devel/CMakeFiles/foldrescale.dir/foldrescale.cpp.o
In file included from /usr/lib/gcc/x86_64-linux-gnu/13/include/immintrin.h:109,
                 from /home/martin/codes/finufft/devel/foldrescale.cpp:4:
/usr/lib/gcc/x86_64-linux-gnu/13/include/fmaintrin.h: In function ‘__m256d foldRescaleVec(__m256d, int64_t)’:
/usr/lib/gcc/x86_64-linux-gnu/13/include/fmaintrin.h:47:1: error: inlining failed in call to ‘always_inline’ ‘__m256d _mm256_fmadd_pd(__m256d, __m256d, __m256d)’: target specific option mismatch
   47 | _mm256_fmadd_pd (__m256d __A, __m256d __B, __m256d __C)
      | ^~~~~~~~~~~~~~~
/home/martin/codes/finufft/devel/foldrescale.cpp:83:46: note: called from here
   83 |   result                    = _mm256_fmadd_pd(x, x2pi, half);
      |                               ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~
make[2]: *** [devel/CMakeFiles/foldrescale.dir/build.make:76: devel/CMakeFiles/foldrescale.dir/foldrescale.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:2163: devel/CMakeFiles/foldrescale.dir/all] Error 2
make: *** [Makefile:146: all] Error 2

This looks like a gcc issue which I should be able to sort out myself.

mreineck · 2024-06-22T06:04:56Z

Never mind :-) I found the answer to the Python installation in the docs!

The compilation problem goes away if I comment out the line
target_compile_options(foldrescale PRIVATE -mavx2)
in devel/CMakeLists.txt.
However, this is a bit strange, since my computer supports AVX2. No idea what exactly is going on here.

mreineck · 2024-06-22T08:30:33Z

I'm encountering a small cmake-related problem ... could you please help me with compiling a few object files for ducc0? For a start I simply tried to append the relevant cc files to FINUFFT_PRECISION_DEPENDENT_SOURCES. That almost worked, but I'm using SINGLE as an enum identifier in ducc0, and compiling with -DSINGLE will of course produce utter chaos :-)

I'll probably just need infra/string_utils.cc, infra/threading.cc, infra/mav.cc, and math/gridding_kernel.cc compiled. That should be done with the same flags as finufft and used as a precision-independent library.

DiamonDinoia · 2024-06-22T23:38:50Z

Hi Martin,

I pushed the changes you requested. More files can be added in cmake/setupDUCC.cmake.

Cheers,
Marco

mreineck · 2024-06-23T14:44:54Z

I think we can close this now; it is superseded by #463

showcase ducc FFT

7ebbf69

mreineck marked this pull request as draft June 14, 2023 15:57

mreineck added 4 commits June 14, 2023 18:01

forgot to add the actual FINUFFT changes ...

000fb37

fix for duplicate symbol

4ca29b4

try to adjust CMake system

d310c76

one more try

6009afc

mreineck added 7 commits June 14, 2023 20:10

fix newly-introduced memory leak; remove fftwPlan

f3db8be

more FFTW removal

7d0ad6e

skip FFTW installation

6443b2f

try to fix MacOS_clang

bbff2ae

one more try

11e94af

Merge remote-tracking branch 'origin/master' into switch_fft

4692202

more cleanup

e2222c8

mreineck added 10 commits June 15, 2023 23:59

remove more unused files

94cf684

more cleanups

dccfd1c

port bug fix from ducc.fft (irrelevant for finufft, just for consistency)

92e182d

re-add -lm for C executables

2abae38

doc cosmetics

e666d8d

merge master

10eb34e

Merge remote-tracking branch 'origin/master' into switch_fft

ed4d7fc

Merge remote-tracking branch 'origin/master' into switch_fft

2574fc1

merge master branch

7876bce

sync with ducc0 master sources; might improve multithreaded FFT performance a bit

c215a1f

mreineck marked this pull request as ready for review June 27, 2023 18:57

merge master

4e96bf7

mreineck added 5 commits June 30, 2023 13:10

fix overlooked conflict

6a84236

fix overlooked conflict

2959238

sync with ducc0 sources

7bebbf7

mostly cosmetic adjustments

1768bcb

implement partial FFTs

6c3f9d2

mreineck mentioned this pull request Jun 30, 2023

Exploit structure of upsampled input data when doing FFTs #304

Closed

janden force-pushed the master branch 3 times, most recently from 2e637b9 to 0e5f3f3 Compare August 30, 2023 11:26

mreineck added 2 commits November 23, 2023 09:55

merge master and update ducc sources

8c523ba

cleanup and fix a corner case

66c725c

mreineck closed this Jun 23, 2024
