
Discuss: -ffast-math and other math optimization flags #782

Open
rcombs opened this issue Jun 4, 2024 · 10 comments

@rcombs
Member

rcombs commented Jun 4, 2024

This was brought up in #780 (comment).

-ffast-math is shorthand for several flags:

  • -fno-math-errno (we decided in build: add -fno-math-errno to allow inlining of math functions #780 that we definitely want this)
  • -funsafe-math-optimizations, which is shorthand for:
    • -fno-signed-zeros (probably fine?)
    • -fno-trapping-math (unclear what this would mean for us?)
    • -fassociative-math (unsure)
    • -freciprocal-math (unsure)
  • -ffinite-math-only (we have a couple isnans; not sure if this is safe wrt those?)
  • -fno-rounding-math (this is actually the default anyway?)
  • -fno-signaling-nans (this is also the default)
  • -fcx-limited-range (we don't use complex numbers)
  • -fexcess-precision=fast (this is the default)

So this leaves -fno-signed-zeros, -fno-trapping-math, -fassociative-math, -freciprocal-math, and -ffinite-math-only up for discussion. One simple test would be to build with and without each flag, and see what code changes and what perf impact they have.

I think if we have cases where -fassociative-math or -freciprocal-math help (and are safe), we should probably be explicitly adjusting our floating-point code to allow the compiler to produce optimal output even without those flags.

@kasper93
Contributor

kasper93 commented Jun 4, 2024

I think most if not all of these flags would be safe to use. -ffinite-math-only mostly means we have to be careful about dividing by zero, which is the most common way of producing NaNs. Either way, a library like libass hardly needs IEEE-compliant floats. Even -ffast-math will probably work in most cases, though from experience it is generally better to avoid it unless we can say with confidence that the libass code is ready for it. Better to start little by little and enable only the safe(r) flags. -ffast-math can be deceptive: it gives a nice performance boost and things look fine at first sight, but after a while you start hitting corner cases.

-fno-signed-zeros is safe to enable.
-fno-trapping-math is probably also safe; I'm not familiar with the libass codebase, but I imagine trapping math is not needed.

-fassociative-math and -freciprocal-math will increase error and start producing floats that are a bit off in some cases. The question is whether that is a problem. I would be careful with those, but probably nothing catastrophic will happen.

@MrSmile
Member

MrSmile commented Jun 4, 2024

-funsafe-math-optimizations looks safe to me. We don't use float hacks; we only treat floats as approximations of real numbers.

-ffinite-math-only: for isnan() in dtoi32() it's possible to rearrange condition without it:

if (!(val > INT32_MIN && val < INT32_MAX + 1LL))

And I think that isnan() in ass_outline_update_min_transformed_x() can be triggered only for a broken matrix with NaNs. In that case the exact value of min_x doesn't matter due to the subsequent failure in quantize_transform(), so we can replace FFMINMAX with some code that ensures there's no UB in the lrint() call.

@rcombs
Member Author

rcombs commented Jun 4, 2024

I think both of our uses of isnan are impossible to hit in practice (ass_strtod can't return NaN); however, ass_strtod can probably return infinity, and we want to catch that.

The docs on -ffinite-math-only are somewhat unclear, but I've just compared clang's output with it vs without when building libass for arm64:

The most common change is that fmax is replaced with fmaxnm (which is identical apart from NaN behavior); same with fmin/fminnm. Not sure why the compiler's so eager to make that change, but sure.

The largest difference is surprising: the compiler seems to be much less aggressive about inlining floating-point math routines, and inserts calls to isnan where they'd otherwise have been optimized out. This seems bad! For instance:

<_change_alpha>:
                ucvtf d1, w1
                ldr b2, [x0]
                ucvtf d2, d2
                fmov  d3, #1.00000000
                fsub  d3, d3, d0
                fmul  d0, d1, d0
                fmadd d0, d3, d2, d0
                mov x8, #-0x3e20000000000000
                fmov  d1, x8
                fcmp  d0, d1
                fccmp d0, d0, #0x1, hi
                mov x8, #0x41e0000000000000
                fmov  d1, x8
                fccmp d0, d1, #0x0, vc
                fcvtzs  w8, d0
                csel  w8, wzr, w8, ge
                strb  w8, [x0]
                ret

->

<_change_alpha>:
                stp d9, d8, [sp, #-0x30]!
                stp x20, x19, [sp, #0x10]
                stp x29, x30, [sp, #0x20]
                add x29, sp, #0x20
                mov x19, x0
                ldr w8, [x0]
                and w20, w8, #0xffffff00
                and w8, w8, #0xff
                ucvtf d1, w1
                ucvtf d2, w8
                fmov  d3, #1.00000000
                fsub  d3, d3, d0
                fmul  d0, d1, d0
                fmadd d8, d3, d2, d0
                fmov  d0, d8
                bl  0x2a09c [rcombs note: symbol stub for isnand]
                mov x8, #0x41e0000000000000
                fmov  d0, x8
                fcmp  d8, d0
                mov x8, #-0x3e20000000000000
                fmov  d0, x8
                fccmp d8, d0, #0x4, lt
                ccmp  w0, #0x0, #0x0, gt
                fcvtzs  w8, d8
                and w8, w8, #0xff
                csel  w8, wzr, w8, ne
                orr w8, w8, w20
                str w8, [x19]
                ldp x29, x30, [sp, #0x20]
                ldp x20, x19, [sp, #0x10]
                ldp d9, d8, [sp], #0x30
                ret

I also saw a few changes like this, which seems like an actual positive improvement (on Firestorm, fcmp+fcsel is 7 cycles, vs 2 cycles for fmax[nm]):

-                       fcmp    d0, d6
-                       fcsel   d0, d0, d6, gt
+                       fmaxnm  d0, d0, d6

However, this same improvement can be had with a change like this:

diff --git a/libass/ass_utils.h b/libass/ass_utils.h
index a1beb769..ebc44e78 100644
--- a/libass/ass_utils.h
+++ b/libass/ass_utils.h
@@ -43,8 +43,17 @@
 #define MSGL_V 6
 #define MSGL_DBG2 7
 
-#define FFMAX(a,b) ((a) > (b) ? (a) : (b))
-#define FFMIN(a,b) ((a) > (b) ? (b) : (a))
+#define FFMAX(a, b) _Generic(((a)+(b)), \
+     double: fmax((a), (b)), \
+      float: fmaxf((a), (b)),  \
+    default: ((a) > (b) ? (a) : (b)) \
+)
+#define FFMIN(a, b) _Generic(((a)+(b)), \
+     double: fmin((a), (b)), \
+      float: fminf((a), (b)),  \
+    default: ((a) > (b) ? (b) : (a)) \
+)
+
 #define FFMINMAX(c,a,b) FFMIN(FFMAX(c, a), b)
 
 #define ASS_PI 3.14159265358979323846

@TheOneric
Member

re -ffinite-math-only: we have e.g. #447 (comment), where NaNs are apparently relied upon to detect bogus values, and there may have been more such cases (-fsanitize=float-divide-by-zero is intentionally omitted from CI at the moment because of this). This doesn't seem safe to enable right now.

Apart from errno and traps, whether or not we rely on standard-compliant behaviour isn't that easy to search for. I think there was also some discussion on whether or not to rely on IEEE float behaviour to optimise something in #483.

@kasper93
Contributor

kasper93 commented Jun 5, 2024

The largest difference is surprising: the compiler seems to be much less aggressive about inlining floating-point math routines, and inserts calls to isnan where they'd otherwise have been optimized out. This seems bad! For instance:

Are you sure you didn't flip the results?

-fno-finite-math-only

change_alpha:
        ucvtf   d31, w1
        ldr     w1, [x0]
        fmov    d30, 1.0e+0
        and     w2, w1, 255
        and     w1, w1, -256
        fsub    d30, d30, d0
        fmul    d0, d31, d0
        scvtf   d31, w2
        fmadd   d0, d31, d30, d0
        fcmp    d0, d0
        bvs     .L2
        mov     x2, -4476578029606273024
        fmov    d31, x2
        fcmpe   d0, d31
        bls     .L2
        mov     x2, 4746794007248502784
        fmov    d31, x2
        fcmpe   d0, d31
        bge     .L2
        fcvtzs  w2, d0
        and     w2, w2, 255
        orr     w1, w1, w2
.L2:
        str     w1, [x0]
        ret

-ffinite-math-only

change_alpha:
        ucvtf   d31, w1
        mov     x2, -4476578029606273024
        ldr     w1, [x0]
        fmov    d29, x2
        mov     x2, 4746794007248502784
        fmov    d28, 1.0e+0
        fmov    d30, x2
        and     w2, w1, 255
        fsub    d28, d28, d0
        fmul    d0, d31, d0
        scvtf   d31, w2
        and     w1, w1, -256
        mov     w3, w1
        fmadd   d0, d31, d28, d0
        fcmpe   d0, d29
        fcvtzs  w2, d0
        fccmpe  d0, d30, 0, hi
        and     w2, w2, 255
        orr     w1, w1, w2
        csel    w1, w3, w1, ge
        str     w1, [x0]
        ret

However, this same improvement can be had with a change like this:

One thing is to show the intent to the compiler, another is to allow it to do the job. Indeed, for floating-point values it is generally better to use fmax, but in practice the compiler should recognize the pattern.

@rcombs
Member Author

rcombs commented Jun 5, 2024

Are you sure you not flipped the results?

I thought I might've, but I double-checked!

in practice compiler should recognize the pattern.

The pattern as written (((a) > (b) ? (a) : (b))) is not permitted to compile to fmax if a or b may be NaN: the existing macro must return b if a is NaN, but the fmax instruction returns NaN. It can't use fmaxnm either: if b is NaN, the macro must return b, but fmaxnm would return a whenever a is non-NaN.

@kasper93
Contributor

kasper93 commented Jun 5, 2024

On x86 at least it is compiled directly to

double foo(double a, double b) {
    return ((a) > (b) ? (a) : (b));
}
foo:                                    # @foo
        vmaxsd  xmm0, xmm0, xmm1
        ret

If this behavior indeed isn't universal across platforms, then using fmax directly would help.

@rcombs
Member Author

rcombs commented Jun 5, 2024

vmaxsd is more amenable to this optimization:

If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN of either source operand be returned, the action of MAXSD can be emulated using a sequence of instructions, such as, a comparison followed by AND, ANDN, and OR.

This is exactly the behavior of ((a) > (b) ? (a) : (b)), so we get the single instruction. ARM doesn't have a max instruction that prefers the second operand like that, so we're better off using fmax there. However, looking deeper, I see that fmax doesn't always get optimized down to maxsd on x86, so we're better off using the conditional there; if we add the _Generic, we should make the fmax paths platform-conditional.

@astiob
Member

astiob commented Aug 2, 2024

Seeing as this has been brought up again in #806, FWIW I continue to think that even NaNs don’t actually matter to libass and we could/should just be using -ffast-math etc., as I remember MrSmile saying for years he’s been doing this on his own machine. Where we do have NaN checks in our code, they’re purely/mainly defensive. Where NaN semantics have been brought up in earlier discussions, the context has been exceptional situations that shouldn’t happen to begin with and NaNs merely didn’t make matters even worse.

Of course, an abundance of caution is always good and a careful audit of the relevant code before the switch is flipped won’t hurt. But I don’t think “libass depends on NaNs for correctness” is an accurate statement: I don’t think we ever intended to do this, and we should have no reason to do so.

@kasper93
Contributor

kasper93 commented Aug 2, 2024

But I don’t think “libass depends on NaNs for correctness” is an accurate statement

I agree. I only mentioned it before because there are isnan checks, which would become no-ops in "fast" mode, so for completeness that part would need to change, that's all.
