
Discuss: -ffast-math and other math optimization flags #782

Open
rcombs opened this issue Jun 4, 2024 · 10 comments

@rcombs
Member

rcombs commented Jun 4, 2024

This was brought up in #780 (comment).

-ffast-math is shorthand for several flags:

  • -fno-math-errno (we decided in build: add -fno-math-errno to allow inlining of math functions #780 that we definitely want this)
  • -funsafe-math-optimizations, which is shorthand for:
    • -fno-signed-zeros (probably fine?)
    • -fno-trapping-math (unclear what this would mean for us?)
    • -fassociative-math (unsure)
    • -freciprocal-math (unsure)
  • -ffinite-math-only (we have a couple isnans; not sure if this is safe wrt those?)
  • -fno-rounding-math (this is actually the default anyway?)
  • -fno-signaling-nans (this is also the default)
  • -fcx-limited-range (we don't use complex numbers)
  • -fexcess-precision=fast (this is the default)

So this leaves -fno-signed-zeros, -fno-trapping-math, -fassociative-math, -freciprocal-math, and -ffinite-math-only up for discussion. One simple test would be to build with and without each flag, and see what code changes and what perf impact they have.

I think if we have cases where -fassociative-math or -freciprocal-math help (and are safe), we should probably be explicitly adjusting our floating-point code to allow the compiler to produce optimal output even without those flags.

@kasper93
Contributor

kasper93 commented Jun 4, 2024

I think most if not all of these flags would be safe to use. -ffinite-math-only mostly means we have to be careful about dividing by zero, which is the most common way of producing NaNs. Either way, a library like libass hardly needs IEEE-compliant floats. Even -ffast-math will probably work in most cases, though from experience it is generally better to avoid it unless we can say with confidence that the libass code is ready for it. Better to start little by little and enable only the safe(r) flags. -ffast-math can be deceptive: it gives a nice performance boost and things look fine at first sight, but after a while you start hitting corner cases.

-fno-signed-zeros is safe to enable.
-fno-trapping-math is probably also safe; I'm not familiar with the libass codebase, but I imagine trapping math is not needed.

-fassociative-math and -freciprocal-math will increase error and start producing floats that are a bit off in some cases. The question is whether that is a problem. I would be careful with those, but probably nothing catastrophic will happen.

@MrSmile
Member

MrSmile commented Jun 4, 2024

-funsafe-math-optimizations looks safe to me. We don't use float hacks; we only treat floats as approximations of real numbers.

-ffinite-math-only: for isnan() in dtoi32() it's possible to rearrange condition without it:

if (!(val > INT32_MIN && val < INT32_MAX + 1LL))

And I think that isnan() in ass_outline_update_min_transformed_x() can be triggered only for a broken matrix with NaNs. In that case the exact value of min_x doesn't matter due to the subsequent failure in quantize_transform(), so we can replace FFMINMAX with some code that ensures there's no UB in the lrint() call.

@rcombs
Member Author

rcombs commented Jun 4, 2024

I think both of our uses of isnan are impossible to hit in practice (ass_strtod can't return NaN); however, ass_strtod can probably return infinity, and we want to catch that.

The docs on -ffinite-math-only are somewhat unclear, but I've just compared clang's output with it vs without when building libass for arm64:

The most common change is that fmax is replaced with fmaxnm (which is identical apart from NaN behavior); same with fmin/fminnm. Not sure why the compiler's so eager to make that change, but sure.

The largest difference is surprising: the compiler seems to be much less aggressive about inlining floating-point math routines, and inserts calls to isnan where they'd otherwise have been optimized out. This seems bad! For instance:

<_change_alpha>:
                ucvtf d1, w1
                ldr b2, [x0]
                ucvtf d2, d2
                fmov  d3, #1.00000000
                fsub  d3, d3, d0
                fmul  d0, d1, d0
                fmadd d0, d3, d2, d0
                mov x8, #-0x3e20000000000000
                fmov  d1, x8
                fcmp  d0, d1
                fccmp d0, d0, #0x1, hi
                mov x8, #0x41e0000000000000
                fmov  d1, x8
                fccmp d0, d1, #0x0, vc
                fcvtzs  w8, d0
                csel  w8, wzr, w8, ge
                strb  w8, [x0]
                ret

->

<_change_alpha>:
                stp d9, d8, [sp, #-0x30]!
                stp x20, x19, [sp, #0x10]
                stp x29, x30, [sp, #0x20]
                add x29, sp, #0x20
                mov x19, x0
                ldr w8, [x0]
                and w20, w8, #0xffffff00
                and w8, w8, #0xff
                ucvtf d1, w1
                ucvtf d2, w8
                fmov  d3, #1.00000000
                fsub  d3, d3, d0
                fmul  d0, d1, d0
                fmadd d8, d3, d2, d0
                fmov  d0, d8
                bl  0x2a09c [rcombs note: symbol stub for isnand]
                mov x8, #0x41e0000000000000
                fmov  d0, x8
                fcmp  d8, d0
                mov x8, #-0x3e20000000000000
                fmov  d0, x8
                fccmp d8, d0, #0x4, lt
                ccmp  w0, #0x0, #0x0, gt
                fcvtzs  w8, d8
                and w8, w8, #0xff
                csel  w8, wzr, w8, ne
                orr w8, w8, w20
                str w8, [x19]
                ldp x29, x30, [sp, #0x20]
                ldp x20, x19, [sp, #0x10]
                ldp d9, d8, [sp], #0x30
                ret

I also saw a few changes like this, which seems like an actual positive improvement (on Firestorm, fcmp+fcsel is 7 cycles, vs 2 cycles for fmax[nm]):

-                       fcmp    d0, d6
-                       fcsel   d0, d0, d6, gt
+                       fmaxnm  d0, d0, d6

However, this same improvement can be had with a change like this:

diff --git a/libass/ass_utils.h b/libass/ass_utils.h
index a1beb769..ebc44e78 100644
--- a/libass/ass_utils.h
+++ b/libass/ass_utils.h
@@ -43,8 +43,17 @@
 #define MSGL_V 6
 #define MSGL_DBG2 7
 
-#define FFMAX(a,b) ((a) > (b) ? (a) : (b))
-#define FFMIN(a,b) ((a) > (b) ? (b) : (a))
+#define FFMAX(a, b) _Generic(((a)+(b)), \
+     double: fmax((a), (b)), \
+      float: fmaxf((a), (b)),  \
+    default: ((a) > (b) ? (a) : (b)) \
+)
+#define FFMIN(a, b) _Generic(((a)+(b)), \
+     double: fmin((a), (b)), \
+      float: fminf((a), (b)),  \
+    default: ((a) > (b) ? (b) : (a)) \
+)
+
 #define FFMINMAX(c,a,b) FFMIN(FFMAX(c, a), b)
 
 #define ASS_PI 3.14159265358979323846

@TheOneric
Member

re -ffinite-math-only: we have e.g. #447 (comment), where NaNs are apparently relied upon to detect bogus values, and there may have been more such cases (-fsanitize=float-divide-by-zero is intentionally omitted from CI at the moment because of this). This doesn't seem safe to enable right now.

Apart from errno and traps, whether or not we rely on standard-compliant behaviour isn't that easy to search for. I think there was also some discussion on whether or not to rely on IEEE float behaviour to optimise something in #483.

@kasper93
Contributor

kasper93 commented Jun 5, 2024

The largest difference is surprising: the compiler seems to be much less aggressive about inlining floating-point math routines, and inserts calls to isnan where they'd otherwise have been optimized out. This seems bad! For instance:

Are you sure you didn't flip the results?

-fno-finite-math-only

change_alpha:
        ucvtf   d31, w1
        ldr     w1, [x0]
        fmov    d30, 1.0e+0
        and     w2, w1, 255
        and     w1, w1, -256
        fsub    d30, d30, d0
        fmul    d0, d31, d0
        scvtf   d31, w2
        fmadd   d0, d31, d30, d0
        fcmp    d0, d0
        bvs     .L2
        mov     x2, -4476578029606273024
        fmov    d31, x2
        fcmpe   d0, d31
        bls     .L2
        mov     x2, 4746794007248502784
        fmov    d31, x2
        fcmpe   d0, d31
        bge     .L2
        fcvtzs  w2, d0
        and     w2, w2, 255
        orr     w1, w1, w2
.L2:
        str     w1, [x0]
        ret

-ffinite-math-only

change_alpha:
        ucvtf   d31, w1
        mov     x2, -4476578029606273024
        ldr     w1, [x0]
        fmov    d29, x2
        mov     x2, 4746794007248502784
        fmov    d28, 1.0e+0
        fmov    d30, x2
        and     w2, w1, 255
        fsub    d28, d28, d0
        fmul    d0, d31, d0
        scvtf   d31, w2
        and     w1, w1, -256
        mov     w3, w1
        fmadd   d0, d31, d28, d0
        fcmpe   d0, d29
        fcvtzs  w2, d0
        fccmpe  d0, d30, 0, hi
        and     w2, w2, 255
        orr     w1, w1, w2
        csel    w1, w3, w1, ge
        str     w1, [x0]
        ret

However, this same improvement can be had with a change like this:

One thing is to show the intent to the compiler, another is to allow it to do the job. Indeed, for floating-point values it is generally better to use fmax, but in practice the compiler should recognize the pattern.

@rcombs
Member Author

rcombs commented Jun 5, 2024

Are you sure you not flipped the results?

I thought I might've, but I double-checked!

in practice compiler should recognize the pattern.

The pattern as written (((a) > (b) ? (a) : (b))) is not permitted to compile to fmax if a or b may be NaN: the existing macro must return b if a is NaN, but the fmax instruction returns NaN. It can't use fmaxnm either: if b is NaN, the macro must return b, but fmaxnm would return a whenever a is non-NaN.

@kasper93
Contributor

kasper93 commented Jun 5, 2024

On x86 at least it is compiled directly to

double foo(double a, double b) {
    return ((a) > (b) ? (a) : (b));
}
foo:                                    # @foo
        vmaxsd  xmm0, xmm0, xmm1
        ret

If this behavior indeed isn't universal across platforms, then using fmax directly would help.

@rcombs
Member Author

rcombs commented Jun 5, 2024

vmaxsd is more amenable to this optimization:

If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN of either source operand be returned, the action of MAXSD can be emulated using a sequence of instructions, such as, a comparison followed by AND, ANDN, and OR.

This is exactly the behavior of ((a) > (b) ? (a) : (b)), so we get the single instruction. ARM doesn't have a max instruction that prefers the second operand like that, so we're better off using fmax there. However, looking deeper, I see that fmax doesn't always get optimized down to maxsd on x86, so we're better off using the conditional there; if we add the _Generic, we should make the fmax paths platform-conditional.

@astiob
Member

astiob commented Aug 2, 2024

Seeing as this has been brought up again in #806, FWIW I continue to think that even NaNs don’t actually matter to libass and we could/should just be using -ffast-math etc., as I remember MrSmile saying for years he’s been doing this on his own machine. Where we do have NaN checks in our code, they’re purely/mainly defensive. Where NaN semantics have been brought up in earlier discussions, the context has been exceptional situations that shouldn’t happen to begin with and NaNs merely didn’t make matters even worse.

Of course, an abundance of caution is always good and a careful audit of the relevant code before the switch is flipped won’t hurt. But I don’t think “libass depends on NaNs for correctness” is an accurate statement: I don’t think we ever intended to do this, and we should have no reason to do so.

@kasper93
Contributor

kasper93 commented Aug 2, 2024

But I don’t think “libass depends on NaNs for correctness” is an accurate statement

I agree. I only mentioned it before because there are isnan checks, which would become no-ops in "fast" mode, so for completeness that part would need to change, that's all.
