SOLVED: An ESP32(-S3) fast fp32 division using a reciprocal inline assembly with arg/result passsed in fp32 regs for even more speed! (AUD-5741) #1284

f4lc0n-asm · 2024-09-29T17:58:17Z

ESP32(-S3) fp32 division is notoriously slow. It can be made faster several times by using a reciprocal asm sequence, which is accurate to 1 ULP - precise enough for most cases.
ESP32(-S3) ABI specifies passing both func's input args and output value in general-purpose regs (A2-A15) - even for floats, but for inline assembly in C that may not be the case - tested various scenarios and both input and output are passed in fp32 regs (F0-F15) where possible, which surely speeds things up :)
This code was inspired by https://blog.llandsmeer.com/tech/2021/04/08/esp32-s2-fpu.html, which I significantly enhanced:

2 asm instructions less (no wfr/rfr)
no fixed fp32 regs (gcc can freely choose which ones - allows more optimizations)
no static keyword for recipsf2() (is visible outside its source file)

Here it is with Public Domain License:

__attribute__((always_inline)) inline
float recipsf2(float input) {
    float result, temp;
    asm(
        "recip0.s %0, %2\n"
        "const.s %1, 1\n"
        "msub.s %1, %2, %0\n"
        "madd.s %0, %0, %1\n"
        "const.s %1, 1\n"
        "msub.s %1, %2, %0\n"
        "maddn.s %0, %0, %1\n"
        :"=&f"(result),"=&f"(temp):"f"(input)
    );
    return result;
}

#define DIV(a, b) (a)*recipsf2(b)

Cheers,

f4lc0n

Fixed: Added & for the temp var so that it is mapped to a unique fp32 reg (in some recipsf2() usage cases it wasn't).
Changed: Removed volatile after asm.
Changed: The 1st maddn.s to madd.s so that it corresponds to the canonical reciprocal sequence in Xtensa ISA Summary on p. 113.

The text was updated successfully, but these errors were encountered:

f4lc0n-asm · 2024-10-02T18:22:11Z

Ideally, this could be incorporated into Xtensa GCC. Activating the fast float division using a reciprocal with something like #pragma fast_float_div.

f4lc0n-asm · 2024-10-04T17:11:13Z

To test DIV(a,b), I replaced q=(b1-b2)/(a1-a2); with q=DIV(b1-b2,a1-a2); and q=b1+(d-a1)*((d-a2)*((b2-b3)/(a2-a3)-q)/(a3-a1)+q); with q=b1+(d-a1)*((d-a2)*DIV(DIV(b2-b3,a2-a3)-q,a3-a1)+q); - the resulting assembly was MUCH more optimized, NO automatic vars on the stack used and func execution cycles dropped from 397 to 277, i.e. 1.43× faster! See float plinc(float d) in plinc.c in https://github.com/espressif/esp-dsp/files/14334535/dcomp_v1.2.zip at espressif/esp-dsp#76. Also measured just the 2 expressions: 202 vs. 80 cycles and 144 vs. 37 instructions - i.e. 2½× faster and 3.9× smaller! Quite a surprise! :) If you know the Xtensa GCC folks, please, let them know! Thank you!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SOLVED: An ESP32(-S3) fast fp32 division using a reciprocal inline assembly with arg/result passsed in fp32 regs for even more speed! (AUD-5741) #1284

SOLVED: An ESP32(-S3) fast fp32 division using a reciprocal inline assembly with arg/result passsed in fp32 regs for even more speed! (AUD-5741) #1284

f4lc0n-asm commented Sep 29, 2024 •

edited

Loading

f4lc0n-asm commented Oct 2, 2024

f4lc0n-asm commented Oct 4, 2024 •

edited

Loading

SOLVED: An ESP32(-S3) fast fp32 division using a reciprocal inline assembly with arg/result passsed in fp32 regs for even more speed! (AUD-5741) #1284

SOLVED: An ESP32(-S3) fast fp32 division using a reciprocal inline assembly with arg/result passsed in fp32 regs for even more speed! (AUD-5741) #1284

Comments

f4lc0n-asm commented Sep 29, 2024 • edited Loading

f4lc0n-asm commented Oct 2, 2024

f4lc0n-asm commented Oct 4, 2024 • edited Loading

f4lc0n-asm commented Sep 29, 2024 •

edited

Loading

f4lc0n-asm commented Oct 4, 2024 •

edited

Loading