Simple cache alignment for serial FFT #242
Conversation
Note, the PR doesn't include the update for the IFFT, since it's not clear to me how one should go about cache-aligning it. You start with only needing a small subset of roots, and continue needing more every iteration. The main solutions I can think of are either building the new roots as you go through the iterations, or using 50% extra RAM and storing essentially two copies of the root list.
I confirmed that our stylistic changes above didn't affect performance.
Another way to improve cache locality might be to unroll the butterfly loops by a constant factor (2, 4, etc.), which can help by encouraging prefetching of cache lines. I'm not sure if this applies, since group elements generally already take up a single cache line, but it could still help prefetching, perhaps. By the way, the celo-org mega merge also contains some code for prefetching over generic data types; I think I will move that code to
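Purely as an illustration of the unrolling shape mentioned above (this is not the crate's butterfly code, and `F` just stands in for a group/field element type), a loop unrolled by a factor of two might look like:

```rust
// Illustrative only: unroll a multiply loop by a constant factor of 2 so two
// independent updates are issued per iteration, which can encourage the
// hardware prefetcher to pull in adjacent cache lines.
fn unrolled_mul<F: Copy + core::ops::MulAssign>(values: &mut [F], roots: &[F]) {
    assert_eq!(values.len(), roots.len());
    for (v, r) in values.chunks_exact_mut(2).zip(roots.chunks_exact(2)) {
        v[0] *= r[0];
        v[1] *= r[1];
    }
    // A real butterfly loop runs over power-of-two lengths, so no remainder
    // handling is shown here.
}
```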
I think that for the IFFT, since we are using the roots multiple times, it still makes sense to spend the initial data copy cost. So one just allocates the array/vector with the necessary capacity up front. By the way, what's with the funny name? Actually, what is the out-of-order ordering? How is it possible to perform an FFT if it is out of order?
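A hedged sketch of the "pay the copy cost up front" option being discussed here, assuming the IFFT rounds need 1, 2, 4, …, n/2 roots and that `full_roots` holds ω^0 … ω^{n/2−1}; the function name and layout are illustrative, not the crate's API:

```rust
// Concatenate a contiguous copy of the roots each IFFT round will read, so
// every round scans its own slice with unit stride. Total size is
// 1 + 2 + ... + n/2 = n - 1, i.e. roughly two copies of the base root table.
fn flattened_ifft_roots<F: Copy>(full_roots: &[F]) -> Vec<F> {
    let half_n = full_roots.len(); // full_roots = [w^0, w^1, ..., w^{n/2 - 1}]
    debug_assert!(half_n.is_power_of_two());
    let mut flat = Vec::with_capacity(2 * half_n - 1);
    let mut count = 1;
    while count <= half_n {
        // The round that needs `count` roots reads every (half_n / count)-th
        // entry of the full table; copy those into a contiguous run.
        flat.extend(full_roots.iter().step_by(half_n / count).copied());
        count *= 2;
    }
    flat
}
```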
Just a quick question: why, in the non-parallel case, does it use:
But in the original code, it seems that it would use
@weikengchen It's because in the serial case we re-arrange the roots so that the relevant roots are at
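For illustration of the re-arrangement described here (a sketch, not the PR's exact code; `F` stands in for the field type):

```rust
// Copy every `stride`-th root into a fresh contiguous vector, so the butterfly
// loop can index it as `roots[j]` (unit stride) instead of `roots[j * stride]`.
fn compact_roots<F: Copy>(roots: &[F], stride: usize) -> Vec<F> {
    roots.iter().step_by(stride).copied().collect()
}
```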
@weikengchen I think it's because in the parallel case we don't do the cache alignment step. By the way @ValarDragon, don't you think we should cache-align in the parallel case too, since it helps when the gap is small? Further, I think I could also post a PR where the chunk parallelisation itself is chunked (instead of using par_iter) when the gap is large, so that it would help in all cases. I suppose one could query the threadpool there to saturate core utilisation at the maximum chunk size? By the way, have you already started on the IFFTs? In addition, we could also do loop unrolling; that could wait till later.
OK, got it now!
Description
This PR adds simple cache alignment for sequential FFTs. See #177 for more context; essentially, in the current FFT the roots are accessed in a cache-unfriendly manner, which causes a lot of latency.
It takes the approach of ensuring, in the sequential case, that the roots are laid out in memory in exactly the order in which they are used in every iteration.
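As a rough sketch of the intended access pattern (not the PR's actual code: this assumes a radix-2, decimation-in-frequency layout with natural-order input, and `F` is a stand-in for the field type):

```rust
use core::ops::{AddAssign, MulAssign, SubAssign};

// Sketch: before each round, compact the roots that round needs down to
// indices 0..gap, so the inner butterfly loop reads them with unit stride.
fn serial_fft_sketch<F>(a: &mut [F], roots: &mut Vec<F>)
where
    F: Copy + AddAssign + SubAssign + MulAssign,
{
    let n = a.len();
    debug_assert!(n.is_power_of_two());
    debug_assert_eq!(roots.len(), n / 2); // roots = [w^0, w^1, ..., w^{n/2 - 1}]

    let mut gap = n / 2;
    while gap > 0 {
        // Keep only the roots this round uses; they now sit at indices 0..gap.
        if roots.len() > gap {
            let stride = roots.len() / gap;
            let compacted: Vec<F> = roots.iter().step_by(stride).copied().collect();
            *roots = compacted;
        }
        for chunk in a.chunks_mut(2 * gap) {
            let (lo, hi) = chunk.split_at_mut(gap);
            for j in 0..gap {
                // Radix-2 butterfly: (lo, hi) -> (lo + hi, (lo - hi) * w^j),
                // with the root read at the contiguous index j.
                let mut diff = lo[j];
                diff -= hi[j];
                diff *= roots[j];
                lo[j] += hi[j];
                hi[j] = diff;
            }
        }
        gap /= 2;
    }
}
```

In this sketch the output comes out in bit-reversed order, which is presumably the "out of order" ordering asked about in the conversation above; the roots vector shrinks by half each round, so later rounds touch only a small contiguous prefix of memory.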
This provides significant speedups to the algorithm. (The gains are super-linear in polynomial size, as the cache effects become increasingly notable in the later rounds of the FFT.)
Here is a list of speedup percentages at varying sizes, as measured on our benchmark server.
Implementing this for the IFFT loop, and implementing it for the parallel case, is left as a TODO. The benefits at smaller instance sizes are likely to be more pronounced on commodity hardware, which has smaller caches than our benchmark server.
Before we can merge this PR, please make sure that all the following items have been
checked off. If any of the checklist items are not applicable, please leave them but
write a little note why.
Pending section in CHANGELOG.md
Files changed in the GitHub PR explorer