Multithreading #19
Nice, it's fun seeing this stuff landing in this package. I worked on the original multithreading code in Julia (JuliaLang/julia#6741). Regarding FFTW: you just have to call FFTW.set_num_threads.
I didn't realize that multithreading in Julia was so old! Would it make sense to add FFTW.set_num_threads(Threads.nthreads()) in NFFT?
Yes, this sounds good, although it probably needs a more general solution, since it's always a little bit bad if we change global state in a user library. Maybe something that only sets it when Julia itself runs with more than one thread is a little safer, since we would not set it to 1 if a user runs Julia in serial mode but has set FFTW to be parallel.
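A minimal sketch of such a guard (assuming the FFTW.jl package; where exactly this would live in NFFT.jl is left open):

```julia
using FFTW

# Only touch FFTW's global thread count when Julia itself runs multithreaded,
# so a user who runs Julia serially but configured FFTW in parallel is left alone.
if Threads.nthreads() > 1
    FFTW.set_num_threads(Threads.nthreads())
end
```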
Good point about the global state. There's not much documentation for ...
No, that's not possible; it's a global value in FFTW. But I would not care too much. More interesting is how good your speedups are. A parallel NFFT is quite nice and not that often seen in the wild... ;-)
I'm currently on an old computer with only 2 threads and here it's about 30% faster for a 2D example I'm testing.
More timing suggests that the threading constructs take quite some time, e.g. for the 2D case in ...
Hey, that's not that bad. With multithreading it's not that simple to achieve full speedups, so this is not necessarily an issue with the Julia implementation of threads. The issue in multithreading is usually memory accesses to nearby regions from different threads, which leads to threads fighting over CPU caches. Do we already sort the k-space nodes? Might be a good idea...
Are the k-nodes part of the plan?
I meant https://github.com/tknopp/NFFT.jl/blob/master/src/NFFT.jl#L56. I usually use NFFT for MRI reconstruction, where ...
We don't currently sort the nodes.
One problem with sorting in the plan creation is that ...
Yes, indeed. One could work with an indirect index. It might make sense to look at how they solved this in the NFFT C library.
I have merged your multithreaded convolve into master.
@tknopp It seems to be more difficult to utilize multithreading with the adjoint convolution.
Yes, this is a little trickier and I don't have a direct solution. However, look at this from the manual:
It means that you have access to the thread ID. We might need some temporary array within the loop where intermediate results from the sum are stored, and then a final sequential step to combine the results. An alternative would be a lock within the loop, but I don't know if locks are exposed within the threading interface.
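For illustration, a sketch of the temporary-array idea with hypothetical names (the contribution function stands in for whatever the adjoint convolution computes per node): each thread accumulates into its own buffer and a sequential step combines them at the end.

```julia
# Each thread writes only to its own buffer, so no two threads touch the same
# memory; a final serial reduction combines the per-thread results.
function threaded_sum_into(contribution, n_out::Int, n_nodes::Int)
    buffers = [zeros(ComplexF64, n_out) for _ in 1:Threads.nthreads()]
    Threads.@threads for m in 1:n_nodes
        buf = buffers[Threads.threadid()]
        idx, val = contribution(m)   # hypothetical: output index and value for node m
        buf[idx] += val
    end
    out = zeros(ComplexF64, n_out)
    for buf in buffers               # sequential combine step
        out .+= buf
    end
    return out
end
```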
There is a lock in Base.
A possibility to speed up calculations is to use FFTW planning flags such as FFTW.MEASURE. As a side note, I wouldn't force the number of FFTW threads to the number of Julia threads; the user should set them explicitly.
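For example (a usage sketch of the FFTW.jl planning interface; the array size is arbitrary):

```julia
using FFTW

A = Array{ComplexF64}(undef, 512, 512)
# FFTW.MEASURE benchmarks candidate algorithms while planning; the extra planning
# time usually pays off when the same plan is applied many times. Note that
# MEASURE-planning may overwrite the contents of A, so fill the array afterwards.
p = plan_fft(A; flags = FFTW.MEASURE)
A .= rand(ComplexF64, 512, 512)
B = p * A                       # apply the measured plan
```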
I agree that it would be nice to expose the features that FFTW exposes. We may do this by simply forwarding the flags. Why would one use a different number of threads for the Julia multithreading part than for the FFTW multithreading part?
Well, they're different things. I guess this pattern isn't used in NFFT.jl:

    Threads.@threads for i in 1:16
        fft(rand(10))
    end

One can start Julia with just one thread and set the FFTW threads afterwards; this is what I usually do.
Yes. This is indeed a shortcoming of the current threading system. Hopefully we get some load/work balancing at some point.
I think I can try and do this next weekend. I already did something similar in ...
In Julia 1.2 I was able to use threaded fft within threads on an old Mac with 8 threads just fine:
So I hope to revisit a threaded version of NFFT when time permits...
Watch out for Julia 1.3. It features a very powerful new threading architecture where it will basically be possible to nest parallel loops without needing to care about whether an inner loop might be parallelized. If I understand correctly, this is even integrated with C libraries such as FFTW: JuliaMath/FFTW.jl#105. My most performant NFFT implementation (precompute = FULL) uses an ordinary sparse matrix for the "convolution" matrix. This will require the matrix-vector operation to be multithreaded, which is discussed here: JuliaLang/julia#29525
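Just to make the sparse part concrete: for a CSC matrix, the adjoint product B' * x threads naturally over columns, because column j only ever writes to entry j of the result. A generic sketch (not the code from the linked PR):

```julia
using SparseArrays

# y = B' * x for a SparseMatrixCSC B, threaded over columns.
# Column j of B contributes only to y[j], so there are no write conflicts.
function threaded_adjoint_mul(B::SparseMatrixCSC, x::AbstractVector)
    y = zeros(promote_type(eltype(B), eltype(x)), size(B, 2))
    rows = rowvals(B)
    vals = nonzeros(B)
    Threads.@threads for j in 1:size(B, 2)
        acc = zero(eltype(y))
        for k in nzrange(B, j)
            acc += conj(vals[k]) * x[rows[k]]
        end
        y[j] = acc
    end
    return y
end
```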
Hi Tobias, Thanks for your efforts, much appreciated!
Are you using precompute = FULL? If yes, it basically boils down to multithreading sparse matrix operations: a sparse matrix-vector multiplication should be straightforward. There was also a PR to Julia base for this: JuliaLang/julia#29525
I do not. I use each NFFT only once to calculate a Toeplitz kernel (cf. PR #57), so I don't think it is worth computing a sparse matrix, or is it? Yet, I create about 1000 Toeplitz kernels (each 512^3 incl. oversampling), so we are still talking about ~2 h of computation. The creation of the LUT is quick, as is the FFT (on a 40-core machine); the biggest bottlenecks are the convolution and the apodization, which are single-threaded:
So I would looovvvvveee a multi-threaded version of these functions. At least the apodization should be rather trivially parallel, shouldn't it?
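To make the "trivially parallel" point concrete, a sketch of a threaded 3D apodization with hypothetical separable correction vectors wx, wy, wz (the real routine in NFFT.jl additionally copies between the original and the oversampled grid):

```julia
# Scale every voxel by the separable window correction, threading over the
# outermost dimension so each thread owns a contiguous chunk of memory.
function threaded_apodize!(f::Array{ComplexF64,3},
                           wx::Vector{Float64}, wy::Vector{Float64}, wz::Vector{Float64})
    Nx, Ny, Nz = size(f)
    Threads.@threads for iz in 1:Nz
        for iy in 1:Ny, ix in 1:Nx
            f[ix, iy, iz] *= wx[ix] * wy[iy] * wz[iz]
        end
    end
    return f
end
```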
Hey Jacob, could you provide me the exact parameters with which you ended up with those numbers? I cannot get an apodization step that is so long for 512^3. In contrast, ... In any case: I created the branch tk/multithreading and also added the multithreaded ...
I cleaned up the performance stats code a little bit, which will be useful for pushing this forward. There is also a test ...
Oh, ok. I realize that I use a somewhat exotic, i.e. highly undersampled, case. With an oversampled matrix size of 512^3, I have only 46k data points. I reconstruct, in the widest sense, a time series where I use correlations between the different time frames, hence the strong undersampling. But I guess this is pretty common in dynamic imaging, so a speedup at the points relevant for me could be of interest to a broader community. I took a look at your ...
Yes, highly undersampled data is of course a use case! So we need to speed up the apodization as well. We can do this by multithreading, but probably there are some other low-hanging fruits. I never investigated that operation because it was never slow enough. And yes, I know that you want the adjoint to be fast. I just wanted to get started with something that is easy :-)
I changed the second performance test so that we can directly test your setup. I now get ... for N = 256 (*2 oversampling) and M = 46000. This is the single-threaded performance, so I am completely FFT dominated. And in particular, my apodizations are much faster than in your setup (I am on a 7-year-old MacBook).
FYI: I incorporated multithreading in the pre-computation of the LUT in #57. So the FFT becomes really fast when a) having many cores (e.g. 40) and b) using the MEASURE flag. The latter allows, in my use case, a 100x speedup compared to the ESTIMATE flag on the same cores. For this reason, I also created an in-place constructor of the NFFTPlan in #57, which allows me to measure the FFT plan once and then replace the LUT for each time frame.
Ok, I see, the MEASURE flag was what I was missing. That indeed makes things drastically faster. I will have a look at the multithreaded adjoint convolution, and we can also multithread the apodizations. It would be great if you could give me a performance test code similar to ... I will grant you commit rights so that you can also work directly on branches here.
I added an example to the branch, but it is hard to emulate, as my use case with an oversampled matrix size of 512^3 a) needs more memory than is available on some systems (19 GB), and b) is dominated by the FFT time unless you have many CPUs.
Ok, thanks, we need to tackle this step by step. In the meantime I implemented a "straightforward" multithreading implementation here: https://github.com/tknopp/NFFT.jl/blob/master/src/NFFT.jl follow the link https://github.com/tknopp/NFFT.jl/blob/tk/multithreading/src/multidimensional.jl#L85. Obviously the allocating part needs to be moved into the planning phase, so this is just experimenting with it. My conclusions are:
So I fear that this is not completely trivial. 1. might be fixable by sorting the nodes; 2. depends a little bit on your use case. Are you using the same plan several times and just exchanging the nodes? Then my implementation could work already. We just need to pull the allocating part into the plan. Another idea would be to do some research on how others have tackled this. There are several GPU implementations out there, and I think the NFFT C library is also multithreaded.
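As a reminder of the indirect-index idea from above, the sorting could be done on a permutation instead of on the user's data. A sketch, assuming the plan stores the nodes as a d×M matrix x:

```julia
# Build an indirect index: sort the nodes by their last coordinate so that
# nearby loop iterations touch nearby grid regions, without reordering the
# data the user sees. `x` is the d×M matrix of k-space nodes.
function sorted_node_order(x::AbstractMatrix)
    perm = sortperm(x[end, :])
    return x[:, perm], perm      # perm maps sorted position -> original node index
end
```

The convolution loops would then run over the sorted copy, and perm would be used to gather/scatter the sample values so that the user-facing ordering never changes.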
I should have read that before trying to tackle this... We need the algorithm sketched on pages 13/14, but it doesn't look so easy that one can implement it before breakfast ;-)
Yeah, those zero fills can be nasty. That is exactly what I fixed with my first tiny test-balloon PR in the constructor. Uff, yeah, have a big breakfast first! I see that there is a Julia interface to the NFFT3 package and that you are named as a contributor. Do you know the advantages/shortcomings of this C library for MR image reconstruction? I have never used it...
There is no real downside to NFFT3, and there is even a Julia wrapper for it: https://github.com/NFFT/NFFT3.jl. I started NFFT.jl as an experiment to test the language and its claim to be as fast as C; apparently the experiment was somewhat successful. The two libraries don't have the same features, but IMHO it is a huge advantage that the Julia community has its own pure NFFT implementation that remains hackable for people who don't know the NFFT3 code base by heart. But back to topic: I implemented on the tk/multithreading branch two new implementations for the adjoint convolution.
Update: I finally managed to also multithread the apodization function, which will give you an additional speedup. I only have two cores right now, so I cannot test it at larger scale. But I hope that with this the apodization is no longer the bottleneck in your benchmarks.
Note to myself: We will use the multi-threading strategy for the adjoint convolve described in this paper: https://arxiv.org/abs/1808.06736
Done! We now use a blocking strategy similar to the one outlined in FINUFFT, which is actually quite similar to NFFT3. I did not go into the complete details, but if I understand it correctly, the difference is just that NFFT3 does the blocking on the linearized memory, while FINUFFT blocks in both directions. The nice thing is that blocking even helps in the non-threaded case, so we can use this as the default path. Here are some preliminary benchmarks. They still have some smaller problems, but I don't expect larger deviations anymore. For NFFT3 I use PRE_PSI instead of PRE_LIN_PSI, which gives a faster nfft at the cost of slightly longer precomputation.
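Very roughly, the blocking idea looks like the following 1D sketch (hypothetical window function win with support K; the real implementation works on the d-dimensional oversampled grid). Nodes are binned into blocks of width B >= K; writes from one block spill at most K-1 entries into the next block, so blocks of the same parity never overlap and can be processed concurrently in two passes.

```julia
# 1D sketch of a blocked adjoint spreading ("convolve_adjoint") step.
# g: oversampled grid, x: nodes in [0,1), f: sample values, win: window function.
function blocked_spread!(g::Vector{ComplexF64}, x::Vector{Float64},
                         f::Vector{ComplexF64}, win, K::Int, B::Int)
    n = length(g)
    nblocks = cld(n, B)
    @assert B >= K && iseven(nblocks)        # keeps the parity argument below valid
    bins = [Int[] for _ in 1:nblocks]
    for m in eachindex(x)                    # bin each node by the block it starts in
        push!(bins[floor(Int, x[m] * n) ÷ B + 1], m)
    end
    for parity in (0, 1)                     # even pass, then odd pass
        Threads.@threads for b in (1 + parity):2:nblocks
            for m in bins[b]
                idx = floor(Int, x[m] * n)
                for l in 0:K-1               # spread node m onto K neighbouring grid points
                    g[mod(idx + l, n) + 1] += win(idx + l - x[m] * n) * f[m]
                end
            end
        end
    end
    return g
end
```

A nice side effect is that each block owns a contiguous set of grid entries, so the loop is also cache-friendlier, which is why blocking can pay off even single-threaded.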
@tknopp: I have finally gotten around to parallelizing parts of NFFT. With the new Threads.@threads the convolve! function uses multithreading -- check my branch. Compare a profiling of the old code (the purple part on the same level as the highlighted convolve! is fft) with the new code (with export JULIA_NUM_THREADS=2): the (regular) FFT call now consumes a significant proportion of the time.
Recalling stevengj's comments in #17, FFTW 3.3.5 is needed to multithread fft, and this version is used in Julia v0.6. But I can't find info on how to use multithreading for fft.
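For what it's worth, in current Julia with the FFTW.jl package the FFTW thread count is set globally before planning (a usage sketch; whether the FFTW build shipped with a given Julia version has threading enabled is a separate question):

```julia
using FFTW

FFTW.set_num_threads(4)          # subsequent FFTW plans may use up to 4 threads
A = rand(ComplexF64, 1024, 1024)
fft(A)                           # this transform is now planned and run with threading enabled
```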