Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance Status #77

Open
tknopp opened this issue Jan 30, 2022 · 4 comments
Open

Performance Status #77

tknopp opened this issue Jan 30, 2022 · 4 comments

Comments

@tknopp
Copy link
Member

tknopp commented Jan 30, 2022

I made several major performance improvements over the last couple of weeks and thought that it makes sense to open an issue to describe where we stand and what I have found out. This issue can also be used for performance tracking in the future. I guess that I squeezed out about a factor of 3-5 and we are now on the same level as the fastest NFFT package on the CPU (FINUFFT).

  • The most important improvement was achieved by blocking the domain where the equidistant sampling nodes lay. It was clear that this will enable multi-threading of the adjoint but it turns out that it also improves the serial NFFT. The reason is that is allows to avoid cache misses a lot.
  • Blocking does increase precomputation time but we are still quite fast and in particular way faster than the transforms itself.
  • Within a block it is important to not touch the non-equidistant sampling nodes since that would again lead to cache misses. This can be avoided by precalculating the required offsets for each block separately. The cache misses are still there but we move it to precomputation time
  • I removed quite some type instabilities and the Julia profiler helped me a lot to find the bottlenecks.
  • The full precomputation is now much slower than using the LUT. This is not too surprising since our B matrix has quite some structure that we cannot be easily exploited when treating it just as a sparse matrix. However, we keep it as a backend since it enables GPU computing
  • I needed to switch Julia multi-threading libraries. Polyester.jl turned out to not play well with FFTW and I now use FLoops.jl.

There are certainly some smaller improvements possible. Here are some ideas:

  • precomputation is not optimized yet
  • In trafo/adjoint it might be possible to exploit SIMD instructions. Simply using the @simd macro did not help. Probably we already use SIMD?
  • Right now the block size is hardcoded to 64 in each direction, which is certainly not the best value in all possible situations. Probably we need some NFFT.MEASURE at some point.
@tknopp
Copy link
Member Author

tknopp commented Feb 12, 2022

Status Update:

  • I managed to make the FINUFFT benchmarks not include the precomputation time. Its a complete hack since I pull out the timing from the FINUFFTPlan https://github.com/JuliaMath/NFFT.jl/blob/master/benchmark/performance.jl#L69
    This means this number is kind of the lower bound but one cannot reach it in practice since FINUFFT does not allow to cache plans.
  • Of course this made FINUFFT a little bit faster requiring me to optimize even more.
  • We now implement LUT and TENSOR. The later is basically as fast as one can get (IMHO).
  • I moved from (2m+1) kernel size (2m) after great advice from NFFT3 colleagues. This makes the NFFT slightly less accurate but improves the runtime.
  • I added new graphs "performance vs accuracy", see: https://juliamath.github.io/NFFT.jl/dev/performance/

The only benchmark where we are slightly slower than FINUFFT is the 3D one. Here are two remarks:

  • When switching from m=7 to m=8 one can see a sudden performance drop. This is likely because we are using tuples and there seems to be some magic number for L=2*m = 16 entries, where things are slowed down.
  • In the adjoint for m<8 we are very slightly slower. Its not so clear why since the C++ code from FINUFFT and our Julia code look nearly identical.

@tknopp
Copy link
Member Author

tknopp commented Feb 13, 2022

Status update: Improved toBlock and addBlock quite a bit, which basically closed the gap to FINUFFT for the 3D NFFT. The m>8 issue above was worked around by switching from a tuple-implementation to an array implementation for larger m. Furthermore a pure real implementation seems to be slightly faster for large m while small m are faster with the complex tuple implementation. Now we have this ugly heuristic: https://github.com/JuliaMath/NFFT.jl/blob/master/src/multidimensional.jl#L376

Since we are faster or on the same level as FINUFFT, the performance work now is basically done.

@roflmaostc
Copy link

I just hook in here.
I noticed, large allocations which were caused by the automatic threading settings of NFFT.jl but ultimately were caused by FFTW.
See here.
What's your opinion on that? By default you always do multi threading, right?

@tknopp
Copy link
Member Author

tknopp commented Jun 19, 2022

@roflmaostc: That is a little bit more complicated:

  • We do automatically thread if, and only if, the user starts Julia with multiple threads. Unfortunately, Julia does not allow to set the number of threads during runtime
  • We right now do not change the number of FFTW threads because I did not want to change global state. If you look at my test script, you can see that I set it manually https://github.com/JuliaMath/NFFT.jl/blob/master/benchmark/performance_simple.jl. However, given that Add get_num_threads FFTW.jl#171 has landed we can now properly chose the number of threads. In your concrete situation you might want to set FFTW threads globally to 1 and get better performance.
  • You can disable threading by setting NFFT._use_threads[] = false. But this works only if you do this once before calling any function, i.e. this is not a runtime option. I am not happy with this situation and we need to discuss this with Julia core threading devs.

Independently of all that: I am benchmarking large 1D / 2D / 3D transforms and they all benefit from multi-threading. May here be the issue that your transforms are 1D and short (< 10000)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants