
Cache align FFT core loop code #177

Closed
wants to merge 11 commits

Conversation

@ValarDragon (Member)

Description

This PR changes the FFT core loop to access the roots of unity in a cache-aligned way, similar to what is implemented in libiop. The loop is currently set up to minimize computation, which is actually pretty bad for performance, given that it is memory-bottlenecked.

In libiop we saw that doing this was a huge speedup. So far I've implemented this in this PR with an arbitrary set of parameters, and it is already a 10% speedup in the serial case; I anticipate it will perform better once the parameters are tuned. (The code is not yet benchmarked for the parallel case.) A minimal sketch of the idea appears below.
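A hedged sketch of the access-pattern change (all names here are hypothetical, not this PR's actual code): instead of striding through the full roots table inside the butterfly loop, first compact the roots needed at the current level into a small contiguous vector, so the inner loop reads them sequentially from cache.

```rust
// Gather every `stride`-th root into a contiguous vector so the
// butterfly loop below accesses roots with unit stride.
fn compact_roots<F: Copy>(roots: &[F], stride: usize) -> Vec<F> {
    roots.iter().step_by(stride).copied().collect()
}

// One radix-2 butterfly level; assumes roots.len() / stride >= half.
fn butterfly_level<F: ark_ff::Field>(xs: &mut [F], half: usize, roots: &[F], stride: usize) {
    let level_roots = compact_roots(roots, stride);
    for chunk in xs.chunks_mut(2 * half) {
        let (lo, hi) = chunk.split_at_mut(half);
        for i in 0..half {
            // Sequential read of `level_roots[i]` instead of the
            // cache-unfriendly strided read `roots[i * stride]`.
            let t = hi[i] * level_roots[i];
            hi[i] = lo[i] - t;
            lo[i] = lo[i] + t;
        }
    }
}
```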


Before we can merge this PR, please make sure that all the following items have been
checked off. If any of the checklist items are not applicable, please leave them but
write a little note why.

  • Targeted PR against correct branch (main)
  • Linked to GitHub issue with discussion and accepted design OR have an explanation in the PR that describes this work.
  • Wrote unit tests - n/a, covered by existing tests
  • Updated relevant documentation in the code
  • Added a relevant changelog entry to the Pending section in CHANGELOG.md
  • Re-reviewed Files changed in the GitHub PR explorer

ValarDragon requested a review from Pratyush on January 14, 2021 19:23
ValarDragon marked this pull request as draft on January 14, 2021 19:23
```rust
const MAX_ROOT_STRIDE: usize = 2;

// Once the size of the cache aligned roots is below this number, stop re-aligning it.
// TODO: Figure out how we can make this depend on field size & system cache size.
```
@Pratyush (Member) commented on the diff above:

We can make it depend on field size by figuring out the field size as F::characteristic() * F::degree()
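A hedged sketch of that suggestion; since the exact accessor names vary across ark-ff versions, this stand-in approximates characteristic bits times extension degree with the element's in-memory size:

```rust
use core::mem::size_of;

// Not this PR's code: approximate the byte size of one field element,
// which is roughly characteristic_bits * extension_degree / 8, so a
// cache threshold can scale with the field.
fn approx_element_bytes<F: ark_ff::Field>() -> usize {
    size_of::<F>()
}
```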

@ValarDragon (Member Author)

But it's a constant, so it would have to go in the FftField params, right?

@ValarDragon (Member Author)

Or can you have constants depend on a template parameter?

@Pratyush (Member)

Ah, you're right, they're not const fn... I'm not sure we need it to be actually const; the compiler should be able to perform constant folding.
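To illustrate the constant-folding point, a hedged sketch (the names and the cache-size constant are assumptions, not this PR's code): the helper is not a `const fn`, but once `F` is monomorphized every input is compile-time known, so the optimizer can typically fold each call to a constant.

```rust
use core::mem::size_of;

// Assumed L1 data-cache budget; per the TODO in the diff, a real
// version would derive this from field size and system cache size.
const CACHE_BUDGET_BYTES: usize = 32 * 1024;

// Not a `const fn`, yet with a concrete `F` the whole body is a
// compile-time-known expression that should fold at each call site.
#[inline(always)]
fn min_realignment_len<F: ark_ff::Field>() -> usize {
    CACHE_BUDGET_BYTES / size_of::<F>()
}
```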

@ValarDragon (Member Author)

Prior to the "seperate cache align levels" commit, I was seeing a 5% slowdown on parallel FFTs until a size 2 << 21 FFT, at which point it became a 7.5% speedup.

Maybe this suggests that we need a parameter for when to start cache-aligning as well? (Also, I think we should just brute-force over all the parameters to set the final version. Do we have scripts for brute-forcing over constants, with benchmarks?)

@Pratyush (Member)

We don't have scripts to brute-force over the parameters. One way could be to just make an inner function that accepts the constants as runtime parameters, and then run standard criterion benchmarks over the range of parameters?

(also it would be nice if this code was cache-oblivious, so we didn't have to worry about these things lol)
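A hedged sketch of that approach, assuming a hypothetical `fft_with_params` inner kernel (the real one would be this PR's core loop with the constants lifted into arguments):

```rust
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

// Hypothetical inner kernel: the core loop with the tuning constants
// taken as runtime arguments so one binary can sweep them.
fn fft_with_params(data: &mut [u64], max_root_stride: usize, min_aligned_len: usize) {
    let _ = (max_root_stride, min_aligned_len);
    data.reverse(); // placeholder for the real butterfly loops
}

fn bench_param_sweep(c: &mut Criterion) {
    let mut group = c.benchmark_group("fft_cache_align");
    for stride in [1usize, 2, 4] {
        for min_len in [1usize << 10, 1 << 12, 1 << 14] {
            let id = BenchmarkId::new("params", format!("stride{}_min{}", stride, min_len));
            group.bench_with_input(id, &(stride, min_len), |b, &(s, m)| {
                b.iter(|| {
                    let mut data = vec![0u64; 1 << 16];
                    fft_with_params(&mut data, s, m);
                });
            });
        }
    }
    group.finish();
}

criterion_group!(benches, bench_param_sweep);
criterion_main!(benches);
```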

@ValarDragon (Member Author) commented Jan 14, 2021

Hrmm, another option could be a Python script setup: regex the file to swap out the constants, run cargo bench, capture the output, and emit a CSV of all the results at the end. (A sketch of such a driver appears after this comment.)

Agreed that making this cache-oblivious would be ideal. In the serial case we're already doing that, via an early exit once everything fits in cache; removing that early exit should be perfectly fine for bigger FFTs, since the extra work becomes increasingly negligible (maybe a bit of a slowdown for small FFTs). I'm not sure how that works out in the parallel case, where we also have to choose whether or not to run that step in parallel.

It's not looking like that suffices for the parallel case so far =/
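A hedged sketch of that driver idea, written in Rust to keep the examples in one language (the comment suggests Python, and either would work). The file paths, placeholder tokens, and bench name are assumptions:

```rust
use std::{fs, process::Command};

fn main() -> std::io::Result<()> {
    // Assumed layout: a template with `{MAX_ROOT_STRIDE}` and
    // `{MIN_ALIGNED_LEN}` placeholders, written over the real constants
    // file before each `cargo bench` run.
    let template = fs::read_to_string("benches/fft_consts.rs.tmpl")?;
    println!("stride,min_len,bench_tail");
    for stride in [1usize, 2, 4] {
        for min_len in [1usize << 10, 1 << 12, 1 << 14] {
            let patched = template
                .replace("{MAX_ROOT_STRIDE}", &stride.to_string())
                .replace("{MIN_ALIGNED_LEN}", &min_len.to_string());
            fs::write("src/fft_consts.rs", &patched)?;
            let out = Command::new("cargo")
                .args(["bench", "--bench", "fft"]) // assumed bench target
                .output()?;
            let stdout = String::from_utf8_lossy(&out.stdout);
            // Record only the last line of bench output per run as CSV.
            println!("{},{},{}", stride, min_len, stdout.lines().last().unwrap_or(""));
        }
    }
    Ok(())
}
```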

@ValarDragon (Member Author) commented Feb 4, 2021

Should I just convert this to something that is faster in the serial case over most sizes, and leave investigating the parallel case as a TODO? I don't think I'll have time to work on speeding up the parallel FFT for a couple of weeks. (However, I'll be very surprised if the parallel case doesn't show a similar speedup with the right parameterization / method of generating the cache-aligned vector.)

@Pratyush (Member) commented Feb 4, 2021

"Prior to the "seperate cache align levels" commit, I was seeing a 5% slowdown on parallel FFTs until a size 2 << 21 FFT, at which point it became a 7.5% speedup."

So is that ^ the state of things atm? If so, sure, let's just merge it for the serial case, and leave a TODO for the parallel case.

@ValarDragon (Member Author)

Yeah, that is the state of things atm.

@Pratyush (Member)

Let's update this only for the serial case, and then merge?

@Pratyush (Member)

ping @ValarDragon?

@ValarDragon (Member Author)

Closing this in favor of #242. I will write up the learnings from this code in an issue for later optimizations to the parallel case & the IFFT.
