Split suboptimal_flops into smaller lints #6867

Open
HalfVoxel opened this issue Mar 8, 2021 · 13 comments
Labels
A-lint Area: New lints S-needs-discussion Status: Needs further discussion before merging or work can be started

Comments

@HalfVoxel

suboptimal_flops is very useful. Most of its suggestions are wins for both readability and performance; however, some significantly reduce readability in my opinion.

E.g.

a * 2.0 + 3.0

is a lot more readable than

a.mul_add(2.0, 3.0)

in particular for new developers that may not be familiar with mul_add.

We'd love to enable suboptimal_flops for our code-base, but there is definitely some friction where it makes suggestions that just make the code less readable, in code that doesn't need the absolute maximum performance.

I propose splitting suboptimal_flops into two smaller lint groups such that one group contains all lints that improve performance and readability, and the other contains those that improve performance, but may impact readability negatively.

Drawbacks

More lint types

@camsteffen
Contributor

Anything other than mul_add that you would split out? Do you include f32::exp? I'm concerned that drawing that line is too subjective.

@camsteffen camsteffen added the S-needs-discussion Status: Needs further discussion before merging or work can be started label Mar 8, 2021
@aron-granberg

@camsteffen You mean e.g. changing

(2f32).powf(a);
E.powf(a);

to

a.exp2();
a.exp();

Those also seem like readability wins to me, or at least not making readability worse.

In any case I think that mul_add is quite distinct among the lints in suboptimal_flops in that a * C1 + C2 very often shows up in normal code, while code that uses exponential functions or logarithms is much rarer.

It is true though that it's a bit subjective to single out mul_add, but I really do think it hinders adoption of this lint rule.

@camsteffen
Contributor

mul_add just doesn't seem any less intuitive than the other functions to me. It's easy to check the Rust docs if one doesn't know, which should be a very common practice anyway. I think we at least need to receive more feedback on this. But maybe we should promote the lint out of the nursery.

@fu5ha

fu5ha commented Mar 12, 2021

Besides readability, mul_add is a very dubious "performance win". It will always be more accurate (no intermediate rounding), but depending on the architecture and compile options it can actually decrease performance significantly. For one, it can compile to a call into libc in many situations: the fma feature must be specifically enabled (or implied by a specific target arch), and even when it is, a libc call can remain when LTO is disabled. And even if it does get properly inlined and converted to a single supported instruction in the binary, the performance wins are modest, and heavy FMA usage relative to regular mul/add ops can overload the FMA units and actually cause performance losses.
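A minimal sketch of the codegen difference described above; the function names are illustrative, and the exact lowering depends on target and flags:

#[inline(never)]
pub fn fused(x: f32, y: f32, z: f32) -> f32 {
    // Without -C target-feature=+fma on typical x86_64 targets, this tends
    // to lower to a library call (fmaf) rather than a single instruction.
    x.mul_add(y, z)
}

#[inline(never)]
pub fn separate(x: f32, y: f32, z: f32) -> f32 {
    // This compiles to a plain multiply followed by an add regardless of
    // whether the fma feature is enabled.
    x * y + z
}

Comparing the assembly for both functions with and without -C target-feature=+fma makes the difference easy to see.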

@aron-granberg

mul_add just doesn't seem any less intuitive than the other functions to me. It's easy to check the rust docs if one doesn't know,

Sure. I'm not arguing that it's hard to understand what it does. That's pretty easy to find, and just from the name it's pretty obvious. But there's a reason people usually prefer infix operators over function calls: they are usually more readable, especially if you start nesting them.

@llogiq
Contributor

llogiq commented Mar 12, 2021

I'd argue that .mul_add(_, _) having worse perf than _ * _ + _ should likely be considered a bug. Can you show me a Rust-supported architecture where this is the case?

@aron-granberg

aron-granberg commented Mar 12, 2021

I took the liberty of writing a small test case. Running this with --release on my local machine (Intel® Core™ i9-9900K CPU @ 3.60GHz × 16) shows that the test without mul_add is about 2x faster than with mul_add.

use std::time::Instant;


#[allow(clippy::float_cmp)]
fn main() {
    let mut data = vec![];
    for i in 0..1_000_000 {
        // Just some arbitrary data
        data.push(((i ^ 243423) as f32, i as f32 * 242.0, (i ^ 123953) as f32));
    }

    for _ in 0..10 {
        let t0 = Instant::now();
        let a = with_mul_add(&data);
        let t1 = Instant::now();
        let b = without_mul_add(&data);
        let t2 = Instant::now();

        println!("Time with mul_add: {:.1}ms", (t1 - t0).as_secs_f64() * 1000.0);
        println!("Time without mul_add: {:.1}ms", (t2 - t1).as_secs_f64() * 1000.0);
        
        println!("mul_add result: {}, without mul_add: {}", a, b);
    }
}

#[inline(never)]
pub fn with_mul_add(data: &[(f32, f32, f32)]) -> f32 {
    let mut s = 0.0;
    for &(x,y,z) in data {
        s += x.mul_add(y, z);
    }
    s
}

#[inline(never)]
pub fn without_mul_add(data: &[(f32, f32, f32)]) -> f32 {
    let mut s = 0.0;
    for &(x,y,z) in data {
        s += x * y + z;
    }
    s
}
$ cargo run --release 
Time with mul_add: 2.0ms
Time without mul_add: 0.9ms
mul_add result: 80296886000000000000, without mul_add: 80296886000000000000
...same thing repeated 10 times

However, when running with the fma feature explicitly enabled, mul_add is about the same speed.

$ env  RUSTFLAGS="-C target-feature=+fma" cargo run --release 
Time with mul_add: 0.9ms
Time without mul_add: 0.9ms
mul_add result: 80296886000000000000, without mul_add: 80296886000000000000
...same thing repeated 10 times

I think (@termhn correct me if I'm wrong) mul_add is in practice not slower on modern architectures; however, by default the Rust compiler is conservative and doesn't assume that the target architecture has an fma instruction.
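For anyone checking their own build, here is a small sketch (assuming an x86_64 target) that distinguishes runtime CPU support from the compile-time target-feature setting described above:

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        // Runtime check: does the CPU we are actually running on support FMA?
        if is_x86_feature_detected!("fma") {
            println!("CPU supports FMA at runtime");
        }
        // Compile-time check: was the binary built with +fma?
        if cfg!(target_feature = "fma") {
            println!("compiled with +fma: mul_add can inline to a single instruction");
        } else {
            println!("compiled without +fma (the conservative default)");
        }
    }
}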

@fu5ha

fu5ha commented Mar 12, 2021

Yep, you're correct. FMAs are generally not inherently slower (in fact, in most cases they are faster). However, like you said, the Rust compiler is conservative and doesn't assume the feature is available.

There definitely are cases where you can get in trouble by only using FMAs, even when you do turn on fma. For example, I switched from using all FMAs to no explicit FMAs in ultraviolet and gained a fair bit of performance on the mathbench "practical" benchmarks. This is because there are fewer FMA units, and when your workload is very FMA-heavy, you can stall the pipeline while it waits for FMA units to complete their work. Switching to regular mul/add gets rid of this pipeline stall. So even though the instruction itself can be faster in a one-off case, in a hot loop with heavy FMA usage (even a real-world workload, like optimized ray-sphere intersection) you can lose performance because of it.
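As an illustration of that trade-off (not the actual ultraviolet code), here are the two styles side by side; which one wins in a hot loop depends on how saturated the FMA execution units are, so measure on your own workload:

#[inline(never)]
pub fn dot3_fma(a: [f32; 3], b: [f32; 3]) -> f32 {
    // FMA-heavy form: the chained mul_adds all compete for the FMA units.
    a[0].mul_add(b[0], a[1].mul_add(b[1], a[2] * b[2]))
}

#[inline(never)]
pub fn dot3_separate(a: [f32; 3], b: [f32; 3]) -> f32 {
    // Separate mul/add form: the work can spread across more execution ports.
    a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
}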

@llogiq
Contributor

llogiq commented Mar 12, 2021

Interesting analysis. Thank you! I believe we may want to add a note to the lint description that says people should measure carefully when applying this lint for speed.

@camsteffen
Contributor

camsteffen commented Mar 12, 2021

Yes thanks for the additional analysis!

Here is a comparison of all the functions with godbolt: https://godbolt.org/z/9qvads
You can uncomment corresponding lines to see the assembly diff. I added black_box where the number is considered variable rather than constant.

Different instructions are used for sqrt, exp, ln, log2, and log10. The 2f32.powf(x) case is optimized to use the exp2 instruction either way, which seems inconsistent. All of these instructions could have varying, platform-dependent performance implications. If we knew the specialized instruction were always faster, I assume the Rust compiler would take care of the rewrite itself (maybe that's the case for exp2).
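A reconstruction of the kind of comparison in that godbolt link, using std::hint::black_box (stable in current Rust; the original link predates its stabilization and presumably used an equivalent):

use std::hint::black_box;

pub fn specialized(x: f32) -> f32 {
    x.log2()
}

pub fn general(x: f32) -> f32 {
    // black_box keeps the base "variable" so the optimizer cannot
    // constant-fold or canonicalize the call into the specialized form.
    x.log(black_box(2.0))
}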

So I think it is out of scope for Clippy to sub-categorize these cases. Maybe we could have a config like allowed_float_ops.
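Until something like that hypothetical allowed_float_ops option exists, the tool available today is a scoped allow attribute. A minimal sketch (lerp is an illustrative example, not from this thread):

// suboptimal_flops would likely suggest rewriting this body with mul_add;
// the attribute keeps the infix form without silencing the lint crate-wide.
#[allow(clippy::suboptimal_flops)]
fn lerp(a: f32, b: f32, t: f32) -> f32 {
    a * (1.0 - t) + b * t
}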

@HalfVoxel
Author

HalfVoxel commented Mar 13, 2021

I ran some additional tests on the other functions:

  • sqrt vs powf(0.5)
    Same assembly output, except that powf adds some additional instructions afterward which do not seem to affect the result (possibly an LLVM bug?).
    I have verified that sqrt and powf(0.5) generate the exact same output for all f32 inputs (see the sketch after this list), though the bits of their NaN outputs actually differ. I don't think any particular NaN bits are guaranteed, though.
    sqrt is always faster since it executes fewer instructions (about 8% faster in my tests).

  • exp vs E.powf
    These are not strictly speaking mathematically identical.
    They also do not generate identical output for all floating point values (I verified this).
    This is because the floating point constant E is only an approximation of the real mathematical constant.
    The exp function can probably use a built-in approximation function that is a bit more accurate.

    Since they do not generate identical output, the optimizer is not allowed to replace one with the other.
    I'd be very surprised if exp were slower than powf on any architecture; powf is often implemented in terms of exp (I think).
    My tests indicate that exp is about 1.7x as fast as powf.

  • log2 vs log(2.0)
    The documentation for log says:
    The result may not be correctly rounded owing to implementation details; self.log2() can produce more accurate results for base 2, and self.log10() can produce more accurate results for base 10.

    Since they may produce different results, the optimizer is not allowed to replace one with the other.
    Tests indicate that log2 is slightly faster than log(2.0).

  • log10 vs log(10.0)
    Same reasoning as for log2

    However, an interesting result is that in this case log10 is slower than log(10.0), almost twice as slow in fact.

  • ln vs log(E)
    Same reasoning as for log2.

    Tests indicate that ln is slightly faster than log(E).

  • powf(N) vs powi(N)
    These also do not produce identical output; in fact, they rarely seem to output the exact same value (though they are very close).
    Therefore, the optimizer cannot replace one with the other.

    The documentation for powi states:
    Using this function is generally faster than using powf

    This is also verified by experiment for small exponents (e.g. N=3), where powi is about 6 times as fast.
    However, for very large exponents powi is actually slower: for N=100000000 it is about 40% slower. Raising anything to such a ridiculously large power is rare, though, since for most inputs you will just get infinity or zero. So I think it's safe to say that powi is faster for all reasonable constants.

So while most of these operations are faster (some by a lot, some by a tiny amount), the optimizer can usually not do anything about it because the result will not be exactly the same due to rounding and approximations.

One thing that stands out is how slow log10 is. It is definitely not faster than log(10.0).
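A sketch of the exhaustive sqrt-vs-powf(0.5) check referenced in the list above (a reconstruction of the method, not the code actually used); 2^32 iterations is slow but entirely feasible for f32:

fn main() {
    let mut mismatches: u64 = 0;
    for bits in 0..=u32::MAX {
        let x = f32::from_bits(bits);
        let a = x.sqrt();
        let b = x.powf(0.5);
        // Compare bit patterns, but treat NaN-vs-NaN as a match, since
        // NaN payload bits are not guaranteed to agree.
        if a.to_bits() != b.to_bits() && !(a.is_nan() && b.is_nan()) {
            mismatches += 1;
        }
    }
    println!("mismatching non-NaN results: {}", mismatches);
}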

emilk added a commit to EmbarkStudios/rust-ecosystem that referenced this issue Mar 23, 2021
suboptimal_flops suggest rewriting code in a style that
is less readable (e.g. with mul_add) for very thin benefit
(better rounding error *on some platforms*).

See for instance EmbarkStudios/puffin@9bd18e6

I find it a net loss, so I make this PR to solicit other opinions.

Related: rust-lang/rust-clippy#6867
@TomFryersMidsummer

I'm in agreement that mul_add is a bit of an odd one out here. To me, sqrt, log2, log10, ln, powi(2) and abs always read better than their replacements; exp is about the same, and exp2 is a bit worse, whereas mul_add is often a pretty significant readability hit.

If you're used to mathematical notation, then

a * x.powi(3) + b * x.powi(2) + c * x + d

is tolerably close to ax³ + bx² + cx + d, but

a.mul_add(x.powi(3), b.mul_add(x.powi(2), c.mul_add(x, d)))

isn't, really.
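For concreteness, the two spellings as compilable functions (the names are mine):

pub fn cubic_infix(a: f32, b: f32, c: f32, d: f32, x: f32) -> f32 {
    a * x.powi(3) + b * x.powi(2) + c * x + d
}

pub fn cubic_mul_add(a: f32, b: f32, c: f32, d: f32, x: f32) -> f32 {
    a.mul_add(x.powi(3), b.mul_add(x.powi(2), c.mul_add(x, d)))
}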

@adam-azarchs

FWIW mul_add is what is blocking our team from adopting this lint, mostly over readability concerns. Given that it has at best a marginal performance benefit on specific architectures, and a not-insignificant performance cost1 on architectures that do not have an FMA instruction2, I think there's an argument for moving it to imprecise_flops.

It's interesting that log10 is slow. Not terribly surprising, but interesting. That suggests that it, too, should be moved to imprecise_flops.

But really, given the age-old advice to always benchmark your use case rather than rely on micro-benchmarks, I think it's inappropriate for this lint to make blanket statements about the relative performance of these functions; the right thing to do is to have configuration for them. That would also allow breaking up both suboptimal_flops and imprecise_flops into more logical groupings: one for log/exp-related functions (including ln_1p/exp_m1), one for pow-related functions (powi, sqrt, cbrt), and one for mul_add, which is an obvious odd-ball when you rephrase the groups that way.

Footnotes

  1. because it has to emulate round-off-avoidance behavior of FMA

  2. before x86-64-v3. Worth noting the pushback that various distros have gotten trying to move to even x86-64-v2. But also x64 isn't the only architecture worth considering here.
