
An Idea for performance and memory improvement #293

Open
paquiteau opened this issue Jun 18, 2023 · 9 comments
@paquiteau

Hi there, I am doing a PhD on (functional) MRI reconstruction, and found a possible optimization for the algorithm of (cu)finufft.

This happens in the call to FFTW, in the case of type 1 (non-uniform to uniform points). Instead of computing the full oversampled grid (with generally osf = 2, so 8x more memory in 3D), the internal call to FFTW can be planned to perform a pruned transform (https://www.fftw.org/pruned.html): the last stage of the butterfly is done by hand, and only one half is computed.
If osf ≠ 2, I don't think it's worth it (the butterfly cannot be split evenly).
This would also require the deconvolveshuffle?d functions to be modified.
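For intuition, the split described above can be checked numerically: one decimation-in-frequency butterfly stage expresses the even- and odd-indexed outputs of a length-2N FFT as two independent length-N FFTs, so whichever half is not needed can be skipped entirely. A toy numpy sketch (illustrative only, not finufft's actual code path):

```python
import numpy as np

# One decimation-in-frequency butterfly stage done "by hand":
# the even- and odd-indexed outputs of a length-2N FFT each come from a
# single length-N FFT, so the unneeded half can be skipped.
rng = np.random.default_rng(0)
N = 8
x = rng.standard_normal(2 * N) + 1j * rng.standard_normal(2 * N)

full = np.fft.fft(x)  # reference: full 2N-point FFT

even = np.fft.fft(x[:N] + x[N:])  # X[0], X[2], ..., X[2N-2]
twiddle = np.exp(-1j * np.pi * np.arange(N) / N)
odd = np.fft.fft((x[:N] - x[N:]) * twiddle)  # X[1], X[3], ..., X[2N-1]

assert np.allclose(even, full[0::2])
assert np.allclose(odd, full[1::2])
```

Of course, the needed finufft outputs are a contiguous frequency band rather than an even/odd subset, which is why the real bookkeeping is more involved.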

From my understanding this would not apply to type 2 (uniform to non-uniform), because you want the fine frequency grid to do the interpolation.

Unfortunately, I don't have the time nor the competence to propose an implementation. Yet, I think this could help reduce the memory footprint and improve performance. Let me know if this idea sparks anything.

(BTW, thank you for the great work on (cu)finufft, I could not do my PhD without it.)

@ahbarnett
Collaborator

ahbarnett commented Jun 19, 2023 via email

@paquiteau
Author

Hi,

The core of my work relies on mind-inria/mri-nufft, where we provide interfaces for all the NUFFT libraries out there, including (cu)finufft, which is the most stable, actively maintained, and fastest. In MRI the NUFFT is wrapped with multi-coil handling (using sensitivity maps) and density compensation1. The coil dimension is essentially the batch dimension (typically N_coils = 32). The case of functional MRI is even more demanding: multiple volumes (≈ 400 volumes of typical size 192x192x128, with NU_points ≈ 150k) are acquired, potentially with a different sampling pattern for each. For compressed-sensing-based reconstruction where temporal information is shared, you need to set up/execute/destroy (when there is not enough memory) that many plans. I am not completely there yet, but if you want more detail about those methods, everything lies in paquiteau/pysap-fmri.

Regarding the pruned FFT, I am not the only one to have thought about it, actually. People working on BART have proposed a similar idea in an ISMRM abstract (sadly not open access), and a pointer to the code is here. In fact, BART provides its own implementation of the NUFFT; I haven't had the chance to test/benchmark it, however.

I guess it's a matter of how badly the strides would kill the performance benefit; if in the end the wall time remains the same and we get to use less memory, that's still a win (especially for GPU usage).

I am looking forward to the future development of (cu)finufft and will happily battle-test it,

Pierre-Antoine

Footnotes

  1. Actually, there is some trouble estimating the density compensation vector with finufft, but that might deserve its own issue.

@mreineck
Collaborator

There may be other ways than pruning to speed up the FFT. In a uniform-to-nonuniform 2D NUFFT with, say, an oversampling factor of 2, you only need to transform over the first axis for half of the array, since the other half of the array contains vectors that are identically zero. This saves 25% of the FFT cost without having to use complicated hand-crafted algorithms. In 3D, you have to transform a quarter of the array along the first axis and half of the array along the second axis, saving an even larger fraction of FFT time.
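This bookkeeping is easy to verify numerically. A toy 3D sketch in numpy (the corner placement of the nonzero block is illustrative; finufft's actual fine-grid layout differs):

```python
import numpy as np

# Check of the "skip the zero part" idea in 3D: only the nonzero N x N x N
# block of the (2N)^3 fine grid needs first- and second-axis transforms.
# (Toy layout: block in one corner; the real layout is wrapped differently.)
rng = np.random.default_rng(1)
N = 8
fine = np.zeros((2 * N,) * 3, dtype=complex)
fine[:N, :N, :N] = rng.standard_normal((N, N, N))

reference = np.fft.fftn(fine)

out = np.zeros_like(fine)
# axis 0: only the quarter of the "pencils" that are nonzero
out[:, :N, :N] = np.fft.fft(fine[:, :N, :N], axis=0)
# axis 1: only the half of the array where data is now nonzero
out[:, :, :N] = np.fft.fft(out[:, :, :N], axis=1)
# axis 2: everything
out = np.fft.fft(out, axis=2)

assert np.allclose(out, reference)
```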

In the nonuniform-to-uniform direction things are not as obvious, but the same amount of time can be saved, since you don't need accurate results on the entire oversampled grid, just in a part of it.

I'm using this approach pretty successfully in my own code.

@ahbarnett
Collaborator

ahbarnett commented Jun 19, 2023 via email

@mreineck
Collaborator

> and you'd write a custom 2d and 3d FFT call composed of many 1d transforms, just the subsets needed. (I'm assuming that wouldn't be slower than what FFTW does for 2d or 3d... I haven't looked at their code).

For 2D it would probably be a set of three plans generated with fftw_plan_many_dft (two for the first axis, one for the second). Except for the planning overhead (and yes, this may be a problem), this should not cause any unexpected slowdowns. For 3D, it's seven plans.
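In numpy terms, the 2D plan structure might look like this (a sketch only; the wrapped layout with nonzero data in the first and last column blocks is an assumption for illustration, not finufft's actual layout):

```python
import numpy as np

# Sketch of the three-transform decomposition for 2D: two batched first-axis
# FFTs over the nonzero column blocks (the role of the two
# fftw_plan_many_dft plans), then one full second-axis FFT.
# Assumed wrapped layout: nonzero data in the first and last N/2 columns.
rng = np.random.default_rng(2)
N = 8
fine = np.zeros((2 * N, 2 * N), dtype=complex)
fine[:, : N // 2] = rng.standard_normal((2 * N, N // 2))
fine[:, -(N // 2):] = rng.standard_normal((2 * N, N // 2))

reference = np.fft.fft2(fine)

tmp = np.zeros_like(fine)
tmp[:, : N // 2] = np.fft.fft(fine[:, : N // 2], axis=0)    # "plan" 1
tmp[:, -(N // 2):] = np.fft.fft(fine[:, -(N // 2):], axis=0)  # "plan" 2
result = np.fft.fft(tmp, axis=1)                             # "plan" 3

assert np.allclose(result, reference)
```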

@paquiteau
Author

To foster more discussion, here is the abstract I was referring to: https://perso.crans.org/comby/ISMRM2023/ISMRM%202023.html (the abstracts are only correctly viewable as HTML pages, so I scraped it). The BART NUFFT is indeed slower but more memory efficient, thanks to their trick. Note that the abstract does not compare BART and cufinufft, probably to avoid some embarrassment.

The abstract does not refer to the stride problem you mention; the only one I can see happens on the kernel (which is much smaller) rather than on the data. I think this trick is complementary to the one of @mreineck, but I won't consider myself a black belt of FFT-jitsu, so I will be happy to have more insight on this.

@mreineck
Collaborator

mreineck commented Jun 20, 2023

I agree that the two optimizations are complementary, so they can be combined.

There is one more aspect that should be considered when doing large FFTs: if the strides of some array dimensions are critical (i.e. they are a multiple of 4096 bytes on most current CPUs), cache re-use will be extremely bad, and the FFT will be very slow. Sadly, this situation mostly turns up for FFT sizes that are considered optimal, i.e. large powers of 2. Some trickery with array strides can work around the problem, and it can give huge performance boosts.

Here is an example that needs to be run in ipython (because of the %timeit magic). Please note that this problem exists with all FFT implementations; I just chose ducc here, because this allows for a very short example code.

```python
import numpy as np
import ducc0

shape = (4096, 4096)

# unpadded test
a = np.zeros(shape, dtype=np.complex128)
print(a.shape)
print(a.strides)
%timeit ducc0.fft.c2c(a, axes=(0,), out=a)

# padded test, avoiding critical strides
a = ducc0.misc.make_noncritical(a)
print(a.shape)
print(a.strides)
a[()] = 0
%timeit ducc0.fft.c2c(a, axes=(0,), out=a)
```

Making arrays non-critical in finufft is not trivial, since multi-D data are assumed to be stored in compact form, so I have not tried to change this yet. When working with power-of-2 grids and oversampling factors of 2, it should give substantial speedups, however.
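For reference, the padding trick itself is easy to reproduce without ducc: allocate a slightly wider buffer and work on a view, so the row stride is no longer a multiple of the 4096-byte critical size (a hand-rolled analogue of make_noncritical; the pad of 8 elements is an arbitrary choice):

```python
import numpy as np

# Hand-rolled analogue of ducc0.misc.make_noncritical: embed the logical
# n x n array in a buffer with a few extra elements per row, so the row
# stride stops being a multiple of 4096 bytes (the critical size on most
# current CPUs). The pad of 8 complex elements is arbitrary.
n = 1024          # any power of two >= 256 complex elements hits the critical stride
pad = 8
buf = np.zeros((n, n + pad), dtype=np.complex128)
a = buf[:, :n]    # view with non-critical strides; FFTs can operate on this view

print(a.strides)  # ((n + pad) * 16, 16) bytes: 16512 is not a multiple of 4096
assert a.strides[0] % 4096 != 0                                  # padded: non-critical
assert np.zeros((n, n), np.complex128).strides[0] % 4096 == 0    # compact: critical
```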

@ahbarnett
Collaborator

ahbarnett commented Jun 20, 2023 via email

@mreineck
Collaborator

Ah, of course, you have the choice of just increasing the array dimensions a little bit, I forgot about that!
make_noncritical does not change the actual array dimensions, it just adds minimal padding where necessary; as a method for obtaining actual FFT lengths this would be quite bad...
You can update next235even such that it takes an additional argument: the unit stride in bytes along the direction for which you want the length. Then you just ignore all candidate results where length*stride/4096 is an integer. You have to start from the last dimension (in C ordering) and work your way forward.
Overall this approach might be slightly worse than the padding one, since it avoids large power-of-2 lengths, which are in principle really advantageous.
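A sketch of that modified search in Python (next235even is reimplemented here from its description as the smallest even 2-3-5-smooth number >= n; the stride-rejection rule is the one described above):

```python
# Sketch of a stride-aware grid-size search: like finufft's next235even
# (smallest even number >= n with only prime factors 2, 3, 5), but rejecting
# lengths whose resulting byte stride is a multiple of 4096.

def is_235(n: int) -> bool:
    """True if n has no prime factors other than 2, 3, 5."""
    for p in (2, 3, 5):
        while n % p == 0:
            n //= p
    return n == 1

def next235even_noncritical(n: int, unit_stride_bytes: int) -> int:
    """Smallest even 2-3-5-smooth m >= n with m * unit_stride_bytes
    not a multiple of 4096 (i.e. a non-critical stride)."""
    m = n + (n % 2)  # smallest even candidate
    while not (is_235(m) and (m * unit_stride_bytes) % 4096 != 0):
        m += 2
    return m

print(next235even_noncritical(100, 16))   # 100: smooth and non-critical already
print(next235even_noncritical(4096, 16))  # 4096 is critical -> 4320 (2^5*3^3*5)
```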
