Simple cache alignment for serial FFT #242
Conversation
Note, the PR doesn't include the update for the IFFT, since it's not clear to me how one should go about cache-aligning it. You start with only needing a small subset of roots, and continue needing more every iteration. The main solutions I can think of are either building the new roots as you go through the iterations, or using 50% extra RAM and storing essentially two copies of the root list.
I confirmed that our stylistic changes above didn't affect performance.
Another way to improve cache locality might be to unroll the butterfly loops by a constant factor (2, 4, etc.), which can help by encouraging prefetching of cache lines. I'm not sure if this applies, since group elements generally already take up a single cache line, but it could still help prefetching, perhaps. By the way, the celo-org mega merge also contains some code for prefetching over generic data types; I think I will move that code to
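Purely as an illustration of the unrolling shape mentioned above (this is not the crate's butterfly code, and `F` just stands in for a group/field element type), a loop unrolled by a factor of two might look like:

```rust
// Illustrative only: unroll a multiply loop by a constant factor of 2 so two
// independent updates are issued per iteration, which can encourage the
// hardware prefetcher to pull in adjacent cache lines.
fn unrolled_mul<F: Copy + core::ops::MulAssign>(values: &mut [F], roots: &[F]) {
    assert_eq!(values.len(), roots.len());
    for (v, r) in values.chunks_exact_mut(2).zip(roots.chunks_exact(2)) {
        v[0] *= r[0];
        v[1] *= r[1];
    }
    // A real butterfly loop runs over power-of-two lengths, so no remainder
    // handling is shown here.
}
```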
I think that for the IFFT, since we are using the roots multiple times, it still makes sense to spend the initial data copy cost. So one just allocates the array/vector with the necessary capacity up front. By the way, what's with the funny name? Actually, what is the out-of-order ordering? How is it possible to perform an FFT if it is out of order?
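A hedged sketch of the "pay the copy cost up front" option being discussed here, assuming the IFFT rounds need 1, 2, 4, …, n/2 roots and that `full_roots` holds ω^0 … ω^{n/2−1}; the function name and layout are illustrative, not the crate's API:

```rust
// Concatenate a contiguous copy of the roots each IFFT round will read, so
// every round scans its own slice with unit stride. Total size is
// 1 + 2 + ... + n/2 = n - 1, i.e. roughly two copies of the base root table.
fn flattened_ifft_roots<F: Copy>(full_roots: &[F]) -> Vec<F> {
    let half_n = full_roots.len(); // full_roots = [w^0, w^1, ..., w^{n/2 - 1}]
    debug_assert!(half_n.is_power_of_two());
    let mut flat = Vec::with_capacity(2 * half_n - 1);
    let mut count = 1;
    while count <= half_n {
        // The round that needs `count` roots reads every (half_n / count)-th
        // entry of the full table; copy those into a contiguous run.
        flat.extend(full_roots.iter().step_by(half_n / count).copied());
        count *= 2;
    }
    flat
}
```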
Just a quick question: why, in the non-parallel case, does it use:
But in the original code, it seems that it would use
@weikengchen It's because in the serial case we re-arrange the roots so that the relevant roots are at
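For illustration of the re-arrangement described here (a sketch, not the PR's exact code; `F` stands in for the field type):

```rust
// Copy every `stride`-th root into a fresh contiguous vector, so the butterfly
// loop can index it as `roots[j]` (unit stride) instead of `roots[j * stride]`.
fn compact_roots<F: Copy>(roots: &[F], stride: usize) -> Vec<F> {
    roots.iter().step_by(stride).copied().collect()
}
```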
@weikengchen I think it's because in the parallel case we don't do the cache alignment step. By the way @ValarDragon, don't you think we should cache-align in the parallel case too, since it helps when the gap is small? Further, I think I could also post a PR where the chunk parallelisation itself is chunked (instead of using par_iter) when the gap is large, so that it would help in all cases. I suppose one could query the threadpool there to saturate core utilisation at the maximum chunk size? By the way, have you already started on the IFFTs? In addition, we could also do loop unrolling; that could wait till later.
OK, got it now!
Description
This PR adds simple cache alignment for sequential FFTs. See #177 for more context; essentially, in the current FFT the roots are accessed in a cache-unfriendly manner, which causes a lot of latency.
It takes the approach of ensuring, in the sequential case, that the roots are laid out in memory in exactly the order in which they are used in every iteration.
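As a rough sketch of the intended access pattern (not the PR's actual code: this assumes a radix-2, decimation-in-frequency layout with natural-order input, and `F` is a stand-in for the field type):

```rust
use core::ops::{AddAssign, MulAssign, SubAssign};

// Sketch: before each round, compact the roots that round needs down to
// indices 0..gap, so the inner butterfly loop reads them with unit stride.
fn serial_fft_sketch<F>(a: &mut [F], roots: &mut Vec<F>)
where
    F: Copy + AddAssign + SubAssign + MulAssign,
{
    let n = a.len();
    debug_assert!(n.is_power_of_two());
    debug_assert_eq!(roots.len(), n / 2); // roots = [w^0, w^1, ..., w^{n/2 - 1}]

    let mut gap = n / 2;
    while gap > 0 {
        // Keep only the roots this round uses; they now sit at indices 0..gap.
        if roots.len() > gap {
            let stride = roots.len() / gap;
            let compacted: Vec<F> = roots.iter().step_by(stride).copied().collect();
            *roots = compacted;
        }
        for chunk in a.chunks_mut(2 * gap) {
            let (lo, hi) = chunk.split_at_mut(gap);
            for j in 0..gap {
                // Radix-2 butterfly: (lo, hi) -> (lo + hi, (lo - hi) * w^j),
                // with the root read at the contiguous index j.
                let mut diff = lo[j];
                diff -= hi[j];
                diff *= roots[j];
                lo[j] += hi[j];
                hi[j] = diff;
            }
        }
        gap /= 2;
    }
}
```

In this sketch the output comes out in bit-reversed order, which is presumably the "out of order" ordering asked about in the conversation above; the roots vector shrinks by half each round, so later rounds touch only a small contiguous prefix of memory.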
This provides significant speedups to the algorithm. (The gains are super-linear in polynomial size, as the cache effects become increasingly notable in the later rounds of the FFT.)
Here is a list of speedup percentages at varying sizes, as measured on our benchmark server.
Implementing this for the IFFT loop, and implementing it for the parallel case, is left as a TODO. The benefits at smaller instance sizes are likely to be more pronounced on commodity hardware, which has smaller caches than our benchmark server.
Before we can merge this PR, please make sure that all the following items have been
checked off. If any of the checklist items are not applicable, please leave them but
write a little note why.
Pending section in CHANGELOG.md
Files changed in the GitHub PR explorer