
Simple cache alignment for serial FFT #242

Merged
merged 10 commits into from
Mar 23, 2021

Conversation

ValarDragon
Copy link
Member

@ValarDragon ValarDragon commented Mar 23, 2021

Description

This PR adds a simple cache alignment for sequential FFTs. See #177 for more context, but essentially in the FFT at the moment, the roots are accessed in a manner that is cache unfriendly, which causes lots of latency.

It takes the approach of ensuring in the sequential case that the roots are laid out in memory in the exact same form that they are used in, in every iteration.

This provides pretty significant speedups to the algorithm. (The gains are super-linear in polynomial size, since the cache-unfriendly accesses have increasingly notable effects in the later rounds of the FFT.)
Here are the speedup percentages at varying sizes, as measured on our benchmark server.

  • 2^15 - 3.8%
  • 2^16 - 5.8%
  • 2^17 - 9.6%
  • 2^18 - 11.4%
  • 2^19 - 12.8%
  • 2^20 - 15.74%
  • 2^21 - 19.7%

It is left as a TODO to implement this for the IFFT loop, and to implement it in the parallel case. The benefits at smaller instance sizes are likely to be more pronounced on commodity hardware, which has smaller caches than our benchmark server.


Before we can merge this PR, please make sure that all the following items have been
checked off. If any of the checklist items are not applicable, please leave them but
write a little note why.

  • Targeted PR against correct branch (master)
  • Linked to Github issue with discussion and accepted design OR have an explanation in the PR that describes this work.
  • Wrote unit tests - covered by existing tests
  • Updated relevant documentation in the code
  • Added a relevant changelog entry to the Pending section in CHANGELOG.md
  • Re-reviewed Files changed in the Github PR explorer

@ValarDragon ValarDragon requested a review from Pratyush March 23, 2021 18:55
@ValarDragon ValarDragon mentioned this pull request Mar 23, 2021
@ValarDragon
Copy link
Member Author

ValarDragon commented Mar 23, 2021

Note, the PR doesn't include the update for the IFFT, since it's not clear to me how one should go about cache-aligning it.

You start with only needing a small subset of the roots, and need more with every iteration. The main solutions I can think of are either building the new roots as you go through the iterations, or using 50% extra RAM and storing essentially two copies of the root list.

@ValarDragon
Copy link
Member Author

I confirmed that the stylistic changes above didn't affect performance.

@Pratyush Pratyush merged commit 5bd69f4 into master Mar 23, 2021
@Pratyush Pratyush deleted the cache_align_fft_v2 branch March 23, 2021 21:37
@jon-chuang
Copy link
Contributor

jon-chuang commented Mar 24, 2021

Another way to improve cache locality might be to unroll the butterfly loops by a constant factor (2, 4, etc.), which can help by encouraging prefetching of cache lines. Not sure how much this applies here, since group elements generally already take up a whole cache line, but it could help prefetching, perhaps.

By the way, the celo-org mega merge also contains some code for prefetching over generic data types, I think I will move that code to ark_std so other code areas can benefit.

@jon-chuang
Copy link
Contributor

jon-chuang commented Mar 24, 2021

I think that for the IFFT, since we are using the roots multiple times, it still makes sense to pay the initial data-copy cost.

So one just allocates the array/vector with capacity len / 2 and then writes to it, just as is being done for io_helper. So the extra memory cost is 50% of the roots-of-unity memory cost.

Btw, what's with the funny name? Actually, what's the out-of-order ordering...? How is it possible to perform an FFT if it is out of order?

@weikengchen
Copy link
Member

weikengchen commented Mar 24, 2021

Just a quick question: why, in the non-parallel case, is it using:

#[cfg(not(feature = "parallel"))]
let index = chunk_index;

But in the original code, it seems that it would use nchunks * chunk_index? Note that in both cases, the iteration is done over different chunks:
ark_std::cfg_chunks_mut!(xi, chunk_size).for_each(|cxi| {

@Pratyush
Copy link
Member

@weikengchen It's because in the serial case we re-arrange the roots so that the relevant roots are at chunk_index, instead of nchunks * chunk_index.

@jon-chuang
Copy link
Contributor

jon-chuang commented Mar 24, 2021

@weikengchen I think it's because in the parallel case we don't do the cache-alignment step.

Btw @ValarDragon, don't you think we should cache-align in the parallel case too, since it helps when the gap is small? Further, I could also post a PR where the chunk parallelisation itself is chunked (instead of using par_iter) when the gap is large, so that it would help in all cases. I suppose one could query the threadpool there to saturate core utilisation at the maximum chunk size?

Btw have you already started on the IFFTs?

In addition, we also do loop unrolling. That could wait till later.

@weikengchen
Copy link
Member

Ok. Got it now!
