github-actions (bot) changed the title from "SOLVED: 20% faster fp32 FFT in ASM for ESP32 and ESP32-S3" to "SOLVED: 20% faster fp32 FFT in ASM for ESP32 and ESP32-S3 (AUD-5911)" on Dec 9, 2024
f4lc0n-asm changed the title from "SOLVED: 20% faster fp32 FFT in ASM for ESP32 and ESP32-S3 (AUD-5911)" to "SOLVED: 20% and up to 32% faster fp32 FFT in ASM for ESP32 and ESP32-S3 (AUD-5911)" on Dec 13, 2024
Hello,
here are ASM drop-in replacements for dsps_fft2r_fc32_ae32_.S and dsps_fft2r_fc32_aes3_.S from the ESP-DSP repository that are 20% and up to 32% faster.
I achieved this by:
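- Replacing the `slli` / `add.n` pair with `addx8` on 3 occasions.
- Carefully reordering instructions to get far fewer pipeline stalls, and preventing the assembler from automatically inserting a NOP in front of a hw loop so that the first instruction in the hw loop fits entirely within a naturally aligned four-byte region. Consult the Xtensa ISA Summary (`LOOPNEZ` description) for more info - it's vital for achieving the maximum loop speed!
- Removing some superfluous instructions (forgotten debugging stuff).

The following table shows the number of cycles for various FFT routines:

Note 1: Passing the 3rd argument `float* dsps_fft_w_table_fc32` to the FFT functions is actually a good idea, because when you do parallel computations of, say, 2 FFTs with different Ns, you need 2 different table pointers for them. The table pointer thus cannot be hardcoded in the FFT functions, as was attempted.

Note 2: The optimized ESP32-S3 ASM version is just 11-15% faster than the optimized ESP32 ASM version because of some pipeline stalls which cannot be avoided.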
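
To illustrate Note 1, here is a minimal C sketch of a radix-2 FFT that receives its twiddle table as an argument. It is not the ESP-DSP code: the name `fft2r_sketch`, the two constant tables, and the table layout (n/2 pairs of cos, sin) are assumptions made only for this example.

```c
#include <assert.h>

/* Hypothetical twiddle tables, one per FFT size.
 * Assumed layout: n/2 pairs (cos, sin) of the angle 2*pi*k/n. */
const float w_table_n4[4] = { 1.0f, 0.0f,  0.0f, 1.0f };
const float w_table_n8[8] = { 1.0f, 0.0f,  0.70710678f, 0.70710678f,
                              0.0f, 1.0f, -0.70710678f, 0.70710678f };

/* In-place radix-2 FFT on n interleaved complex floats (re, im).
 * The table pointer w is a parameter, so calls with different n
 * can each use their own table. */
void fft2r_sketch(float *data, int n, const float *w)
{
    assert(n >= 2 && (n & (n - 1)) == 0);   /* n must be a power of two */

    /* Bit-reversal permutation of the complex samples. */
    for (int i = 1, j = 0; i < n; i++) {
        int bit = n >> 1;
        for (; j & bit; bit >>= 1) {
            j ^= bit;
        }
        j ^= bit;
        if (i < j) {
            float re = data[2 * i], im = data[2 * i + 1];
            data[2 * i] = data[2 * j];
            data[2 * i + 1] = data[2 * j + 1];
            data[2 * j] = re;
            data[2 * j + 1] = im;
        }
    }

    /* Butterfly stages; the twiddle is exp(-j*2*pi*k/len). */
    for (int len = 2; len <= n; len <<= 1) {
        int half = len / 2;
        int step = n / len;                 /* stride into the size-n table */
        for (int base = 0; base < n; base += len) {
            for (int k = 0; k < half; k++) {
                float wr = w[2 * k * step];
                float wi = -w[2 * k * step + 1];
                float *a = &data[2 * (base + k)];
                float *b = &data[2 * (base + k + half)];
                float tr = b[0] * wr - b[1] * wi;
                float ti = b[0] * wi + b[1] * wr;
                b[0] = a[0] - tr;
                b[1] = a[1] - ti;
                a[0] += tr;
                a[1] += ti;
            }
        }
    }
}
```

With this shape, a 4-point and an 8-point transform can run side by side, e.g. `fft2r_sketch(x4, 4, w_table_n4)` and `fft2r_sketch(x8, 8, w_table_n8)`, and neither call depends on a table pointer fixed inside the function.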
Cheers!
f4lc0n
fft2r_fp32_xtensa_v1.2.zip (7-Zip)
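
For readers who don't have the Xtensa instruction set in mind, the first two changes in the list can be modeled in plain C. This is only a sketch: the function names are hypothetical, and the alignment check simply restates the "first instruction fits entirely within a naturally aligned four-byte region" condition, assuming instruction lengths of 2 bytes (narrow) or 3 bytes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of the two-instruction sequence: slli (shift left by 3),
 * followed by add.n. */
uint32_t shift_then_add(uint32_t index, uint32_t base)
{
    uint32_t scaled = index << 3;   /* slli   scaled, index, 3   */
    return scaled + base;           /* add.n  result, scaled, base */
}

/* Model of the single instruction addx8: result = (index << 3) + base. */
uint32_t addx8_model(uint32_t index, uint32_t base)
{
    return (index << 3) + base;
}

/* True if an instruction of `len` bytes starting at `addr` lies entirely
 * inside one naturally aligned four-byte region. */
bool fits_in_aligned_word(uint32_t addr, uint32_t len)
{
    return (addr & 3u) + len <= 4u;
}
```

Both arithmetic functions return the same value for every input, which is why the pair can be collapsed into one instruction. Under the stated assumption, a 3-byte instruction satisfies the alignment condition only at offsets 0 or 1 within a four-byte word.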
Improved: Further speed optimizations of dsps_fft2r_fc32_aes3_.S. Now it is up to 32% faster (N=4096)! The new performance table (see the rightmost column):