Cache align FFT core loop code #177
Conversation
poly/src/domain/radix2/fft.rs
const MAX_ROOT_STRIDE: usize = 2;

// Once the size of the cache aligned roots is below this number, stop re-aligning it.
// TODO: Figure out how we can make this depend on field size & system cache size.
We can make it depend on field size by figuring out the field size as F::characteristic() * F::degree()
But it's a constant, so it would have to go in the `FftField` params, right?
Or can you have constants depend on a template parameter?
Ah, you're right, they're not `const fn`.
...I'm not sure we need it to be actually `const`; the compiler should be able to perform constant folding.
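As a sketch of that idea (a non-authoritative illustration: the function name, the cache-size figure, and the plain `elem_bytes` parameter are all our assumptions, with `elem_bytes` standing in for what `F::characteristic() * F::degree()` would give in ark-ff terms):

```rust
// Hedged sketch: derive the "stop re-aligning" threshold from the field
// element size and an assumed L1 data-cache size. Because the element size
// is fixed per concrete field, the compiler should constant-fold this call
// even though it is not a `const fn`.
const ASSUMED_L1_CACHE_BYTES: usize = 32 * 1024;

fn min_aligned_roots_size(elem_bytes: usize) -> usize {
    // Stop re-aligning once the whole roots table fits in (half of) L1.
    (ASSUMED_L1_CACHE_BYTES / elem_bytes).next_power_of_two() / 2
}

fn main() {
    // e.g. a 32-byte (256-bit) base-field element
    println!("{}", min_aligned_roots_size(32)); // 512 with these assumptions
}
```

The half-of-L1 fudge factor is arbitrary; tuning it is exactly the parameter sweep discussed below in this thread.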
Prior to the "seperate cache align levels" commit, I was seeing a 5% slowdown to parallel FFTs until a certain size. Maybe this suggests that we need a parameter for when to start cache aligning as well?

(Also, I think we should just brute-force over all the parameters to set the final version. Do we have scripts for brute-forcing over constants, with benchmarks?)
We don't have scripts to brute-force over the parameters. One way could be to make an inner function that accepts the constants as parameters, and then run standard criterion benchmarks over the range of parameters? (Also, it would be nice if this code were cache-oblivious, so we didn't have to worry about these things lol)
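A minimal sketch of that suggestion (function and parameter names are ours and the body is a placeholder, not the PR's actual FFT):

```rust
// Sketch: hoist the tuning constants out of `const` items and into
// parameters of an inner function, so a criterion benchmark can sweep them
// over a grid while the production entry point pins the chosen values.
fn fft_inner(data: &mut [u64], max_root_stride: usize, min_aligned_size: usize) {
    // ...the real core loop would consult `max_root_stride` /
    // `min_aligned_size` instead of module-level constants...
    let _ = (max_root_stride, min_aligned_size);
    data.reverse(); // placeholder so the sketch compiles and does something
}

fn fft(data: &mut [u64]) {
    // Production path: fixed constants, same code path the benchmarks use.
    fft_inner(data, 2, 512);
}

fn main() {
    let mut v: Vec<u64> = (0..8).collect();
    fft(&mut v);
    println!("{:?}", v);
}
```

A criterion bench could then call `fft_inner` directly inside a loop over candidate `(max_root_stride, min_aligned_size)` pairs, with one bench ID per pair.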
Hrmm, another option could be a python script setup: regex the file to swap out the constants, run cargo bench, collect the output, and emit a CSV of all the results at the end.

Agreed that making this cache-oblivious would be ideal. In the serial case, we're already doing that with an early exit once it's all in cache, but removing that early exit should be perfectly fine for bigger FFTs (since the work becomes increasingly negligible; maybe a bit of a slowdown for small FFTs). Not sure how that works out for the parallel case, with choosing between whether to do that in parallel or not. It's not looking like that suffices so far for the parallel case =/
Should I just convert this to something that is faster in the serial case over most sizes, and leave investigating the parallel case as a TODO? I don't think I'll have time to converge on speeding up the parallel FFT for a couple of weeks. (However, I'll be very surprised if the parallel case doesn't show a similar speed improvement with the right parameterization / method of generating the cache-aligned vector.)
So is that ^ the state of things atm? If so, sure, let's just merge it for the serial case and leave a TODO for the parallel case.
Yeah, that is the state of things atm.
Let's update this only for the serial case and then let's merge?
ping @ValarDragon?
Closing this in favor of #242. I will write up learnings from this code into an issue for later optimizations to the parallel case & IFFT.
Description
This PR changes the core loop for the FFT to access the roots of unity in a cache aligned way, similar to what is implemented in libiop. Currently its set to minimize computation, which is actually pretty bad for performance given that its memory bottlenecked.
In libiop we saw that doing this was a huge speedup. So far, I've implemented this in this PR with an arbitrary set of parameters, and this is already a 10% speedup in the serial case; I anticipate it will perform better once the parameters are tuned. (The code is not yet benchmarked for the parallel case.)

Before we can merge this PR, please make sure that all the following items have been checked off. If any of the checklist items are not applicable, please leave them but write a little note why.

- [ ] Added a changelog entry to the `Pending` section in `CHANGELOG.md`
- [ ] Re-reviewed `Files changed` in the Github PR explorer