SOLVED: 20% and up to 32% faster fp32 FFT in ASM for ESP32 and ESP32-S3 (AUD-5911) #1325

Open
f4lc0n-asm opened this issue Dec 9, 2024 · 0 comments

f4lc0n-asm commented Dec 9, 2024

Hello,

here are ASM drop-in replacements for dsps_fft2r_fc32_ae32_.S and dsps_fft2r_fc32_aes3_.S from the ESP-DSP repository that are 20% and up to 32% faster.

I achieved this by:

  • Replacing the slli / add.n pair with addx8 in 3 places.
  • Carefully reordering instructions to get far fewer pipeline stalls and to keep the assembler from automatically inserting a NOP in front of a hw loop. The assembler adds that NOP so that the first instruction in the hw loop fits entirely within a naturally aligned four-byte region. Consult the Xtensa ISA Summary (LOOPNEZ description) for more info; this alignment is vital for achieving the maximum loop speed!
  • Removing some superfluous instructions (forgotten debugging stuff).
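
The first two points can be illustrated with a small host-side C model. This is only a sketch, not code from the attached files: the function names are hypothetical, and the post does not say which values the addx8 instructions scale (an index into 8-byte elements is one plausible use). It shows that addx8 computes the same result as a shift-left-by-3 followed by an add, and how to check whether an instruction of a given size fits inside one naturally aligned four-byte region.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the two-instruction sequence:
 *   slli  tmp, idx, 3
 *   add.n out, tmp, base                                      */
uint32_t slli_add_model(uint32_t idx, uint32_t base)
{
    uint32_t tmp = idx << 3;   /* slli: shift left by 3 (multiply by 8) */
    return tmp + base;         /* add.n: 32-bit wrapping add */
}

/* Model of the single instruction:
 *   addx8 out, idx, base                                      */
uint32_t addx8_model(uint32_t idx, uint32_t base)
{
    return (idx << 3) + base;
}

/* Returns 1 if an instruction of `size` bytes starting at `addr`
 * lies entirely inside one naturally aligned 4-byte region, else 0.
 * Xtensa core instructions are 2 bytes (narrow) or 3 bytes long. */
int fits_in_aligned_word(uint32_t addr, uint32_t size)
{
    assert(size >= 1 && size <= 4);
    return (addr & 3u) + size <= 4u;
}
```

For example, a 3-byte instruction at an address ending in binary 10 straddles two aligned regions, which is the situation the assembler would otherwise fix by padding in front of the loop.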

The following table shows the number of cycles for various FFT routines (the "O" columns are the optimized versions):

   N |   ANSI |  ESP32 | ESP32 O|ESP32-S3|ESP32-S3 O
  16 |   1277 |    977 |    817 |    930 |    784
  32 |   2972 |   2318 |   1934 |   2207 |   1853
  64 |   6835 |   5379 |   4483 |   5124 |   4290
 128 |  15514 |  12264 |  10216 |  11689 |   9767
 256 |  34785 |  27565 |  22957 |  26286 |  21932
 512 |  77160 |  61234 |  50994 |  58419 |  48689
1024 | 169583 | 134711 | 112183 | 128568 | 107062
2048 | 369782 | 293948 | 244796 | 280637 | 233531
4096 | 800893 | 636993 | 530497 | 608322 | 505920

Note 1: Passing the 3rd argument float* dsps_fft_w_table_fc32 to the FFT functions is actually a good idea, because when you compute, say, 2 FFTs with different Ns in parallel, you need 2 different table pointers for them. The table pointer therefore cannot be hardcoded in the FFT functions, as was attempted.
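
To make Note 1 concrete, here is a minimal host-side sketch of a radix-2 FFT that receives its twiddle table as a third argument. It is a generic textbook routine, not the ESP-DSP implementation: fft2r_sketch, w_table_4 and w_table_8 are hypothetical names, and the table layout (n/2 interleaved cos/sin pairs) is an assumption chosen for the example, not the ESP-DSP table format.

```c
#include <assert.h>

#define SQRT_HALF 0.70710678f

/* One twiddle table per transform size: n/2 interleaved (cos, sin)
 * pairs of the angle 2*pi*k/n. Hypothetical layout for this sketch. */
const float w_table_4[4] = { 1.0f, 0.0f,  0.0f, 1.0f };
const float w_table_8[8] = { 1.0f, 0.0f,  SQRT_HALF, SQRT_HALF,
                             0.0f, 1.0f, -SQRT_HALF, SQRT_HALF };

/* In-place forward radix-2 decimation-in-time FFT.
 * data: n complex values stored as interleaved (re, im) floats.
 * w:    twiddle table generated for exactly this n.            */
void fft2r_sketch(float *data, int n, const float *w)
{
    assert(n >= 2 && (n & (n - 1)) == 0);   /* n must be a power of two */

    /* Bit-reversal permutation of the complex samples. */
    for (int i = 1, j = 0; i < n; i++) {
        int bit = n >> 1;
        for (; j & bit; bit >>= 1) {
            j ^= bit;
        }
        j ^= bit;
        if (i < j) {
            float re = data[2 * i], im = data[2 * i + 1];
            data[2 * i] = data[2 * j];
            data[2 * i + 1] = data[2 * j + 1];
            data[2 * j] = re;
            data[2 * j + 1] = im;
        }
    }

    /* Butterfly stages; `step` strides through the table for this stage. */
    for (int len = 2; len <= n; len <<= 1) {
        int step = n / len;
        for (int i = 0; i < n; i += len) {
            for (int k = 0; k < len / 2; k++) {
                float wr = w[2 * k * step];
                float wi = -w[2 * k * step + 1];   /* forward transform: e^(-j*angle) */
                float *a = &data[2 * (i + k)];
                float *b = &data[2 * (i + k + len / 2)];
                float tr = b[0] * wr - b[1] * wi;
                float ti = b[0] * wi + b[1] * wr;
                b[0] = a[0] - tr;
                b[1] = a[1] - ti;
                a[0] += tr;
                a[1] += ti;
            }
        }
    }
}
```

A caller running two transforms side by side would pass fft2r_sketch(buf4, 4, w_table_4) for one and fft2r_sketch(buf8, 8, w_table_8) for the other; with a single table pointer baked into the routine, that would not be possible in this layout.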

Note 2: The optimized ESP32-S3 ASM version is just 11-15% faster than the optimized ESP32 ASM version because of some pipeline stalls that cannot be avoided.

Cheers!

f4lc0n

fft2r_fp32_xtensa_v1.2.zip (7-Zip)

Improved: Further speed optimizations of dsps_fft2r_fc32_aes3_.S. It is now up to 32% faster than the original ESP32-S3 ASM version (N=4096)! The new performance table (see the rightmost column):

   N |   ANSI |  ESP32 | ESP32 O|ESP32-S3|ESP32-S3 O
  16 |   1277 |    977 |    817 |    930 |    735
  32 |   2972 |   2318 |   1934 |   2207 |   1724
  64 |   6835 |   5379 |   4483 |   5124 |   3969
 128 |  15514 |  12264 |  10216 |  11689 |   8998
 256 |  34785 |  27565 |  22957 |  26286 |  20139
 512 |  77160 |  61234 |  50994 |  58419 |  44592
1024 | 169583 | 134711 | 112183 | 128568 |  97845
2048 | 369782 | 293948 | 244796 | 280637 | 213050
4096 | 800893 | 636993 | 530497 | 608322 | 460863
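
The percentages quoted in this post can be reproduced from the cycle counts in the table above. The helper below is a small sketch; its name is hypothetical, and it assumes that "X% faster" means the ratio of baseline cycles to optimized cycles, minus one.

```c
#include <assert.h>

/* Percent speedup of an optimized routine over a baseline, both given
 * in CPU cycles: 100 * (baseline / optimized - 1).                   */
double speedup_percent(unsigned baseline_cycles, unsigned optimized_cycles)
{
    assert(optimized_cycles > 0);   /* avoid division by zero */
    return 100.0 * ((double)baseline_cycles / (double)optimized_cycles - 1.0);
}
```

With the table values, speedup_percent(977, 817) is about 19.6 and speedup_percent(636993, 530497) about 20.1 (ESP32, N=16 and N=4096), speedup_percent(608322, 460863) is about 32.0 (ESP32-S3, N=4096), and speedup_percent(817, 735) and speedup_percent(530497, 460863) give about 11.2 and 15.1, which matches the 11-15% range in Note 2.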

github-actions bot changed the title from "SOLVED: 20% faster fp32 FFT in ASM for ESP32 and ESP32-S3" to "SOLVED: 20% faster fp32 FFT in ASM for ESP32 and ESP32-S3 (AUD-5911)" on Dec 9, 2024
f4lc0n-asm changed the title from "SOLVED: 20% faster fp32 FFT in ASM for ESP32 and ESP32-S3 (AUD-5911)" to "SOLVED: 20% and up to 32% faster fp32 FFT in ASM for ESP32 and ESP32-S3 (AUD-5911)" on Dec 13, 2024