Cache align FFT core loop code #177
Conversation
poly/src/domain/radix2/fft.rs
const MAX_ROOT_STRIDE: usize = 2;

// Once the size of the cache aligned roots is below this number, stop re-aligning it.
// TODO: Figure out how we can make this depend on field size & system cache size.
We can make it depend on field size by figuring out the field size as F::characteristic() * F::degree()
But it's a constant, so it would have to go in the `FftField` params, right?
Or can you have constants depend on a template parameter?
Ah, you're right, they're not `const fn`.
...I'm not sure we need it to be actually `const`; the compiler should be able to perform constant folding.
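As a sketch of that idea (a non-authoritative illustration: the function name, the cache-size figure, and the plain `elem_bytes` parameter are all our assumptions, with `elem_bytes` standing in for what `F::characteristic() * F::degree()` would give in ark-ff terms):

```rust
// Hedged sketch: derive the "stop re-aligning" threshold from the field
// element size and an assumed L1 data-cache size. Because the element size
// is fixed per concrete field, the compiler should constant-fold this call
// even though it is not a `const fn`.
const ASSUMED_L1_CACHE_BYTES: usize = 32 * 1024;

fn min_aligned_roots_size(elem_bytes: usize) -> usize {
    // Stop re-aligning once the whole roots table fits in (half of) L1.
    (ASSUMED_L1_CACHE_BYTES / elem_bytes).next_power_of_two() / 2
}

fn main() {
    // e.g. a 32-byte (256-bit) base-field element
    println!("{}", min_aligned_roots_size(32)); // 512 with these assumptions
}
```

The half-of-L1 fudge factor is arbitrary; tuning it is exactly the parameter sweep discussed below in this thread.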
Prior to the "seperate cache align levels" commit, I was seeing a 5% slowdown to parallel FFTs until a certain size. Maybe this suggests that we need a parameter for when to start cache aligning as well?

(Also, I think we should just brute-force over all the parameters to set the final version. Do we have scripts for brute-forcing over constants, with benchmarks?)
We don't have scripts to brute-force over the parameters. One way could be to make an inner function that accepts the constants as parameters, and then run standard criterion benchmarks over the range of parameters? (Also, it would be nice if this code were cache-oblivious, so we didn't have to worry about these things lol)
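A minimal sketch of that suggestion (function and parameter names are ours and the body is a placeholder, not the PR's actual FFT):

```rust
// Sketch: hoist the tuning constants out of `const` items and into
// parameters of an inner function, so a criterion benchmark can sweep them
// over a grid while the production entry point pins the chosen values.
fn fft_inner(data: &mut [u64], max_root_stride: usize, min_aligned_size: usize) {
    // ...the real core loop would consult `max_root_stride` /
    // `min_aligned_size` instead of module-level constants...
    let _ = (max_root_stride, min_aligned_size);
    data.reverse(); // placeholder so the sketch compiles and does something
}

fn fft(data: &mut [u64]) {
    // Production path: fixed constants, same code path the benchmarks use.
    fft_inner(data, 2, 512);
}

fn main() {
    let mut v: Vec<u64> = (0..8).collect();
    fft(&mut v);
    println!("{:?}", v);
}
```

A criterion bench could then call `fft_inner` directly inside a loop over candidate `(max_root_stride, min_aligned_size)` pairs, with one bench ID per pair.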
Hrmm, another option could be a python script setup: regex the file to swap out the constants, run cargo bench, collect the output, and emit a CSV of all the results at the end.

Agreed that making this cache-oblivious would be ideal. In the serial case, we're already doing that with an early exit once it's all in cache, but removing that early exit should be perfectly fine for bigger FFTs (since the work becomes increasingly negligible; maybe a bit of a slowdown for small FFTs). Not sure how that works out for the parallel case, with choosing between whether to do that in parallel or not. It's not looking like that suffices so far for the parallel case =/
Should I just convert this to something that is faster in the serial case over most sizes, and leave investigating the parallel case as a TODO? I don't think I'll have time to converge on speeding up the parallel FFT for a couple of weeks. (However, I'll be very surprised if the parallel case doesn't show a similar speed improvement with the right parameterization / method of generating the cache-aligned vector.)
So is that ^ the state of things atm? If so, sure, let's just merge it for the serial case and leave a TODO for the parallel case.
Yeah, that is the state of things atm.
Let's update this only for the serial case and then let's merge?
ping @ValarDragon?
Closing this in favor of #242. I will write up learnings from this code into an issue for later optimizations to the parallel case & IFFT.
Description
This PR changes the core loop for the FFT to access the roots of unity in a cache aligned way, similar to what is implemented in libiop. Currently its set to minimize computation, which is actually pretty bad for performance given that its memory bottlenecked.
In libiop we saw that doing this was a huge speedup. So far, I've implemented this in this PR with an arbitrary set of parameters, and this is already a 10% speedup in the serial case; I anticipate it will perform better once the parameters are tuned. (The code is not yet benchmarked for the parallel case.)

Before we can merge this PR, please make sure that all the following items have been checked off. If any of the checklist items are not applicable, please leave them but write a little note why.

- [ ] Added a changelog entry to the `Pending` section in `CHANGELOG.md`
- [ ] Re-reviewed `Files changed` in the Github PR explorer