
Imprecise floating point operations (fast-math) #21690

Closed
Tracked by #2
mpdn opened this issue Jan 27, 2015 · 89 comments
Labels
  • A-floating-point — Area: Floating point numbers and arithmetic
  • A-LLVM — Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
  • C-enhancement — Category: An issue proposing an enhancement or a PR with one.
  • C-feature-request — Category: A feature request, i.e: not implemented / a PR.
  • I-slow — Issue: Problems and improvements with respect to performance of generated code.
  • T-lang — Relevant to the language team, which will review and decide on the PR/issue.

Comments

mpdn (Contributor) commented Jan 27, 2015

There should be a way to use imprecise floating point operations like GCC's and Clang's -ffast-math. The simplest approach would be a command-line flag, as GCC and Clang do it, but I think a better way would be to create f32fast and f64fast types that call the fast LLVM math functions. This way you could easily mix fast and "slow" floating point operations.

I think this could be implemented as a library if LLVM assembly could be used in the asm macro.
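The newtype idea proposed above might look roughly like the following minimal sketch. The `FastF32` name is hypothetical, and since the `fadd_fast`/`fmul_fast` intrinsics are nightly-only, plain `+` and `*` stand in where the intrinsic calls would go:

```rust
use std::ops::{Add, Mul};

/// Hypothetical fast-math wrapper type. On nightly, the operator
/// bodies below would call core::intrinsics::fadd_fast / fmul_fast
/// instead of the ordinary IEEE operations used here as placeholders.
#[derive(Clone, Copy, Debug, PartialEq)]
struct FastF32(f32);

impl Add for FastF32 {
    type Output = FastF32;
    fn add(self, rhs: FastF32) -> FastF32 {
        // Placeholder for fadd_fast(self.0, rhs.0)
        FastF32(self.0 + rhs.0)
    }
}

impl Mul for FastF32 {
    type Output = FastF32;
    fn mul(self, rhs: FastF32) -> FastF32 {
        // Placeholder for fmul_fast(self.0, rhs.0)
        FastF32(self.0 * rhs.0)
    }
}

fn main() {
    let (a, b) = (FastF32(2.0), FastF32(3.0));
    println!("{:?}", a * b + a); // FastF32(8.0)
}
```

The appeal is exactly the mixing described above: only values wrapped in the fast type get the relaxed semantics, while ordinary `f32` keeps IEEE behavior.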

@kmcallister kmcallister added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label Jan 28, 2015
@kmcallister (Contributor)

Inline IR was discussed on #15180. Another option is extern "llvm-intrinsic" { ... } which I vaguely think we had at some point. If we added more intrinsics to std::intrinsics would that be sufficient?

@huonw huonw added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Jan 28, 2015
mpdn (Contributor, Author) commented Jan 28, 2015

Yeah, adding it as a function in std::intrinsics could definitely work as well.

There are a few different fast math flags, but the fast flag is probably the most important as it implies all the other flags. Adding all of them would be unreasonable if using intrinsic functions, but I don't think all of them are necessary.

@emberian emberian self-assigned this Mar 25, 2015
bluss (Member) commented Aug 17, 2015

This forum thread has examples of loops that LLVM can vectorize well for integers, but not for floats (a dot product).

@bluss bluss changed the title Imprecise floating point operations Imprecise floating point operations (fast-math) Dec 20, 2015
@emberian emberian removed their assignment Jan 5, 2016
kornelski (Contributor) commented Jun 8, 2017

I've prototyped it using a newtype: https://gitlab.com/kornelski/ffast-math (https://play.rust-lang.org/?gist=d516771d1d002f740cc9bf6eb5cacdf0&version=nightly&backtrace=0)

It works in simple cases, but the newtype solution is insufficient:

  • It doesn't work with floating-point literals, which is a huge pain when converting programs to the newtype.
  • It doesn't work with the as operator, and a trait to make that possible has been rejected before.
  • The wrapper type and extra level of indirection affect inlining of code using it. I've found some large functions where the newtype was slower than regular floats, not because of the float math itself, but because the structs and calls around it weren't optimized as well. I couldn't reproduce this in simple cases, so I'm not sure exactly what is going on.

So I'm very keen on seeing it supported natively in Rust.

bluss (Member) commented Jun 8, 2017

@pornel The issue #24963 had a test case where a newtype impacted vectorization. That example was fixed (great!), but it sounds like the bug is probably still visible in similar code.

pedrocr commented Jun 8, 2017

I've tried -ffast-math in my C vs Rust benchmark of some graphics code:

https://github.com/pedrocr/rustc-math-bench

In the C code it's a ~20% improvement with Clang but no benefit with GCC. In both cases it returns a wrong result, and the math is extremely simple (multiplying a vector by a matrix). According to this:

https://stackoverflow.com/questions/38978951/can-ffast-math-be-safely-used-on-a-typical-project#38981307

-ffast-math is generally too unsafe for normal usage, as it implies some strange things (e.g., NaN checks always return false). So it seems sensible to have a way to opt in to only the more benign optimizations.
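The "NaN checks always return false" hazard is the classic failure mode: with -ffast-math (specifically its finite-math-only part) the compiler may assume NaN never occurs and fold `is_nan()` to false, silently deleting a guard. A small hypothetical Rust illustration of the pattern at risk (this compiles and runs with normal IEEE semantics; the comment marks what fast-math semantics would break):

```rust
/// Hypothetical guard: mean of a slice, rejecting NaN input.
fn checked_mean(xs: &[f32]) -> Option<f32> {
    let sum: f32 = xs.iter().sum();
    // Under fast-math, a compiler is allowed to assume NaN cannot
    // occur and remove this branch entirely, so the function would
    // return Some(NaN) instead of None for bad input.
    if sum.is_nan() {
        return None;
    }
    Some(sum / xs.len() as f32)
}

fn main() {
    assert_eq!(checked_mean(&[1.0, 2.0, 3.0]), Some(2.0));
    assert_eq!(checked_mean(&[1.0, f32::NAN]), None);
    println!("ok");
}
```

Code like this is exactly why applying fast-math blindly to a whole program is risky: the author of `checked_mean` relied on IEEE NaN propagation for correctness.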

kornelski (Contributor) commented Jun 8, 2017

@pedrocr Your benchmark has a loss of precision in the sum regardless of fast-math mode. Both the slow and fast versions give a wrong result compared to summation using a double sum.

With a double for the sum you'll get the correct result, even with -ffast-math.

You get a significantly different sum with a float accumulator because fast-math introduces a small systematic rounding error, which accumulates over 100 million additions.

All values from the matrix multiplication are the same to at least 6 digits (I've diffed printf("%f", out[i]) of all values and they're all identical).
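The accumulation effect described above is easy to reproduce even without fast-math: summing 0.1 many times in an f32 accumulator drifts far from the true value, while an f64 accumulator stays close. A small self-contained demonstration (iteration count reduced from the benchmark's 100 million to 10 million to keep it quick):

```rust
fn main() {
    const N: usize = 10_000_000;
    let mut sum32 = 0.0f32;
    let mut sum64 = 0.0f64;
    for _ in 0..N {
        sum32 += 0.1f32; // per-step rounding error accumulates
        sum64 += 0.1f64;
    }
    // The exact answer is 1_000_000. The f32 accumulator drifts by a
    // large margin; the f64 accumulator stays within a tiny fraction.
    println!("f32 sum: {sum32}, f64 sum: {sum64}");
    assert!((sum64 - 1.0e6).abs() < 1.0);
    assert!((sum32 as f64 - 1.0e6).abs() > 100.0);
}
```

This is the same mechanism as in the benchmark: once the running sum is large, each 0.1 added in f32 rounds to a nearby representable value, and the per-step bias compounds.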

pedrocr commented Jun 8, 2017

@pornel thanks, fixed here:

pedrocr/rustc-math-bench@8169fa3

The benchmark results are fine though; the sum is only used as a checksum. Here are the averages of three runs, in ms/megapixel:

Compiler                   | -O3 -march=native | -O3 -march=native -ffast-math
clang 3.8.0-2ubuntu4       | 6.91              | 5.40 (-22%)
gcc 5.4.0-6ubuntu1~16.04.4 | 5.71              | 5.85 (+2%)

So, as I mentioned before, clang/LLVM gets a good benefit from -ffast-math but GCC does not. I'd say making sure things like is_normal() still work is very important, but at least on LLVM it clearly helps to be able to enable -ffast-math.

pedrocr commented Jun 8, 2017

I've suggested it would make sense to expose -ffast-math using the target-feature mechanisms:

https://internals.rust-lang.org/t/pre-rfc-stabilization-of-target-feature/5176/23

@kornelski (Contributor)

Rust has fast math intrinsics, so the fast math behavior could be limited to a specific type or selected functions, without forcing the whole program into it.

pedrocr commented Jun 9, 2017

A usable solution for my use cases would probably be to make the vector types in the simd crate the types that opt in to ffast-math. That way there's only one type I need to consciously convert the code to for speedups. As a general solution, though, having to swap types in normal code seems cumbersome. But maybe just doing return val as f32 when val is an f32fast type isn't that bad.

@Mark-Simulacrum Mark-Simulacrum added the C-feature-request Category: A feature request, i.e: not implemented / a PR. label Jul 22, 2017
@Mark-Simulacrum Mark-Simulacrum added C-enhancement Category: An issue proposing an enhancement or a PR with one. and removed C-enhancement Category: An issue proposing an enhancement or a PR with one. labels Jul 26, 2017
pedrocr commented Aug 10, 2017

Created a pre-RFC discussion on internals to try and get a discussion on the best way to do this:

https://internals.rust-lang.org/t/pre-rfc-whats-the-best-way-to-implement-ffast-math/5740

@robsmith11

Is there a currently recommended approach to using fast-math optimizations in Rust nightly?

jeffvandyke commented Oct 24, 2019

If it helps, a good benchmark comparison between C++ and Rust floating point optimizations inside loops (link) was written recently (Oct 19), with a good Hacker News discussion exploring this topic.

Personally, I think the key point is that without specifying any (EDIT: floating-point-specific) flags (and after using iterators), clang and gcc by default do more optimizations on float math than Rust currently does.

(EDIT: It seems that -fvectorize -Ofast was specified for clang to get gcc-comparable results; see the following comment.)

Any discussion of optimized float math should keep in mind that vectorization isn't always less precise: a commenter pointed out that a vectorized floating point sum is actually more accurate than the un-vectorized version. Also see Stack Overflow: https://stackoverflow.com/a/7455442

I'm curious what criteria for vectorization clang (or gcc) uses for figuring out floating point optimization. I'm not enough of an expert in these areas to know specifics though. I'm also not sure what precision guarantees Rust makes for floating point math.
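The "vectorized sums can be more accurate" point above can be checked directly: splitting the work across several independent accumulators (as a vectorizing compiler does with SIMD lanes) keeps each partial sum smaller, so its rounding error is smaller than that of one long sequential f32 sum. A small sketch, with eight lanes chosen arbitrarily to mimic an 8-wide vector register:

```rust
/// One sequential f32 accumulator, as the naive scalar loop would run.
fn naive_sum(xs: &[f32]) -> f32 {
    let mut s = 0.0f32;
    for &x in xs {
        s += x;
    }
    s
}

/// Eight partial sums, mimicking an 8-lane vector accumulator that is
/// reduced once at the end.
fn lane_sum(xs: &[f32]) -> f32 {
    let mut lanes = [0.0f32; 8];
    for (i, &x) in xs.iter().enumerate() {
        lanes[i % 8] += x;
    }
    lanes.iter().sum()
}

fn main() {
    let xs = vec![0.1f32; 1_000_000];
    let exact = 100_000.0f64; // 1_000_000 * 0.1
    let naive_err = (naive_sum(&xs) as f64 - exact).abs();
    let lane_err = (lane_sum(&xs) as f64 - exact).abs();
    println!("naive error: {naive_err}, lane error: {lane_err}");
    // Each lane's partial sum stays ~8x smaller than the full sum,
    // so its rounding steps are finer and the total error shrinks.
    assert!(lane_err < naive_err);
}
```

This is why a reassociating compiler can produce results that are both faster and closer to the exact answer than the strictly sequential IEEE evaluation order.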

pedrocr commented Oct 24, 2019

Personally, I think the key is that without specifying any flags (and after using iterators), by default clang and gcc do more optimizations on float math than Rust currently does.

That's not the case in the article. The clang compilation was using -Ofast, which apparently enables -ffast-math.

@Centril Centril added the T-lang Relevant to the language team, which will review and decide on the PR/issue. label Oct 24, 2019
StHagel commented Jun 14, 2024

While I do think that adding an option to enable fast-math in Rust is definitely desirable, I don't like the idea of introducing a new type for it.

I would rather make it an optional compiler flag that is not set by default in --release. That way I can run my existing code with fast-math enabled if I want to, and without it if I don't. Adding a new type would require me either to change every f64 to f64fast across my entire codebase, or to go through every function, think about whether f64fast makes sense there, and add var as f64 and var as f64fast all over the place.

@NobodyXu (Contributor)

Putting it in the profile, allowing each crate to set it, and allowing the binary crate to override it per-crate seems to make sense.

You could then enable it for your library crate if you know it is safe, and binary crates can disable it if it turns out not to work, or enable it if they know what they are doing.

RalfJung (Member) commented Jun 14, 2024

Making it a compile flag that applies to other crates sounds like a terrible idea. When you download a crate from the internet, you can't know whether it was written in a way that is compatible with fast-math semantics. It is very important not to apply fast-math semantics to code that assumes IEEE semantics.

We could have a crate-level attribute meaning "all floating-point ops in this crate are fast-math", but under no circumstances should you be able to force fast-math on other people's code. That would ultimately even undermine Rust's safety promise.

StHagel commented Jun 14, 2024

We could have a crate-level attribute meaning "all floating-point ops in this crate are fast-math", but under no circumstances should you be able to force fast-math on other people's code.

That sounds like a good path forward in my eyes: being able to set an attribute in the Cargo.toml which basically means "this crate is fast-math-safe". Compiling your code with fast-math enabled would then check every dependency for whether it is fast-math-safe and compile it accordingly.

usamoi (Contributor) commented Jun 14, 2024

I use core::intrinsics::f*_fast or core::intrinsics::f*_algebraic to hint the compiler toward auto-vectorization, and it works well. The only thing I care about is that these functions are gated behind core_intrinsics, which seems quite awkward.

calder commented Feb 2, 2025

What's preventing us from stabilizing core::intrinsics::f*_algebraic today? Those are probably sufficient for 90% of cases where you're optimizing an inner loop and are fine with individual ops being reassociated.

Here's a simple example where stable Rust is 8x slower than C++ (a dot product of two 100,000-element f32 vectors) because there's no way to tell the compiler that it's OK to reorder ops to enable vectorization:

  • C++ with #pragma clang fp reassociate(on): 10us
  • Rust nightly with core::intrinsics::f[add|mul]_fast: 10us
  • Rust stable: 84us

https://github.com/calder/dot-bench

EDIT: Confirmed this would work great and close the performance gap with nightly: calder@c3c7fab
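For comparison, the usual stable-Rust workaround today is to do the reassociation by hand with several independent accumulators, which gives the optimizer the instruction-level parallelism it needs without any fast-math semantics. A sketch (the lane count of 8 is an arbitrary choice for illustration, not taken from the benchmark):

```rust
/// Dot product with eight manual accumulators. The reassociation is
/// written explicitly in source, so no fast-math flags are required
/// for the compiler to keep the lanes independent.
fn dot_unrolled(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut acc = [0.0f32; 8];
    let chunks = a.len() / 8;
    for c in 0..chunks {
        for l in 0..8 {
            let i = c * 8 + l;
            acc[l] += a[i] * b[i];
        }
    }
    let mut sum: f32 = acc.iter().sum();
    for i in chunks * 8..a.len() {
        sum += a[i] * b[i]; // handle the remainder elements
    }
    sum
}

fn main() {
    let a: Vec<f32> = (0..1003).map(|i| i as f32).collect();
    let b = vec![2.0f32; 1003];
    // Sum of 2*i for i in 0..1003 is 1002 * 1003 = 1_005_006,
    // exactly representable in f32, so the result is exact.
    assert_eq!(dot_unrolled(&a, &b), 1_005_006.0);
    println!("ok");
}
```

The downside, of course, is exactly what this thread is about: the reordering has to be spelled out manually instead of letting the compiler pick the best width for the target.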

StHagel commented Feb 3, 2025

EDIT: Confirmed this would work great and close the performance gap with nightly: calder@c3c7fab

That looks awesome, thanks for putting in the work!
How would one use algebraic operations in practice, then? Only with a.add_algebraic(b), or would there be an option to override the usual operators (+ - * / %) with algebraic ones, so one doesn't have to rewrite lots of code?

calder commented Feb 4, 2025

Only a.algebraic_*(b) functions for now, to unblock 90% of use cases with as little controversy as possible (this issue has been open for 10 years and there's still no way to tell stable Rust to allow reordering); other people / library authors can follow up with more.

@RReverser (Contributor)

We could have a crate-level attribute meaning "all floating-point ops in this crate are fast-math", but under no circumstances should you be able to force fast-math on other people's code. That would ultimately even undermine Rust's safety promise.

It sounds like an even better fit could be a custom target_feature, although I'm not sure if we are allowed to extend it with something that is not, in fact, a "target" CPU feature.

The syntax seems like a very good fit though:

#[target_feature(enable = "fast_math")]
unsafe fn i_promise_its_ok_to_reorder_math_in_me() { ... }

jgarvin commented Feb 17, 2025

Making it a compile flag that applies to other crates sounds like a terrible idea. When you download a crate from the internet, you can't know whether it was written in a way that is compatible with fast-math semantics. It is very important not to apply fast-math semantics to code that assumes IEEE semantics.

Most float code doesn't assume or even consider IEEE semantics, though. Most users of floating point don't know anything about the rounding guarantees, numerical stability, or how the encoding works. There is likely a lot of code that works just fine with -ffast-math or equivalent and gets a significant performance benefit. I don't know the history, but I suspect this is one of the reasons the flag exists in GCC. I have definitely seen it make auto-vectorization work when it otherwise didn't.

We could have a crate-level attribute meaning "all floating-point ops in this crate are fast-math", but under no circumstances should you be able to force fast-math on other people's code. That would ultimately even undermine Rust's safety promise.

I think it would be consistent with the rest of Rust if you were allowed to, but it required using unsafe in Cargo.toml and when invoking an associated rustc flag. I don't think there's a concept of unsafe compile options currently, but I can think of others that would be interesting, like disabling bounds checks (even if you intend to keep them, it's useful for measuring, to make the case that they're not a big perf impact).

RalfJung (Member) commented Feb 17, 2025

No, a target feature is definitely wrong as that can be set via -Ctarget-feature globally, but you must never be able to apply this flag to other people's code without them opting-in. This is definitely true for the UB-inducing -ffast-math; Rust will not compromise its memory safety over floating-point performance concerns. "Most code would still be sound" is not good enough. The chances of Rust adopting an approach that can break soundness are 0, no matter how many flags GCC has (which are equally unsound but being C, that's fine for them).

A blanket global explicitly unsafe flag is also not going to fly; what would the safety comment for that even look like? "I audited all the code in my entire dependency tree"? The point of unsafe is to make things locally auditable; this flag cannot achieve that.

IMO the same goes even for the "just produces wrong results" variant (not sure if that has a standard name, it corresponds to the operations described in rust-lang/libs-team#532). We don't just go and alter the semantics of other people's code. In particular, this would be a breaking change as we'd have to take back what we say in https://doc.rust-lang.org/nightly/std/primitive.f32.html.

So, I would recommend the discussion to focus on ways to opt-in to these semantics locally, in a way controlled by the author of the respective code. Everything else is either an outright no-go (if it can break soundness) or at least highly unlikely to lead anywhere (if it breaks stable semantic promises).

@RReverser (Contributor)

No, a target feature is definitely wrong as that can be set via -Ctarget-feature globally, but you must never be able to apply this flag to other people's code without them opting-in.

Ah, true enough; I forgot about the global switch. To me, the primary appeal of this syntax is the locality: the ability to turn it on on a per-function basis, and that sounds like something we agree on. I'm equally happy if it's a different attribute that can still be applied at per-function granularity.

@kornelski (Contributor)

Per-function attributes have a hard-to-resolve issue of how "deep" they apply. If the attribute applied to all code called or inlined in the function body, it could affect code from other crates, which is a no-no. If it applied only in a shallow way, not through method calls, it could be frustrating that methods on f32 won't be affected (since they're in core), and there would be a visible difference between f32 + f32 lowered to a dedicated MIR instruction and a call to f32::add. Attributes on closures aren't fully supported either. Specifying how to control the exact scope of such attributes may turn out to be a big task.

Previously I've proposed having a fast float type built-in into Rust, like r32, but there are many different guarantees/optimizations that users could theoretically want (finite, non-NaN, no subnormals, rounding), so this proposal died in bikeshedding.

Therefore, stabilization of fast-math intrinsics seems like the most realistic path forward: #21690 (comment)

@RReverser (Contributor)

If the attributes applied to all code called or inlined in the function body, then they could affect code of other crates, which is a no-no

I'm not sure I agree with that. I agree with the sentiment above that a global toggle affecting arbitrary crates is a no-no, but I'd say anything invoked from a function marked with such an attribute is fair game: same as when you mark a function with target_feature, the autovectorizer can roam free over both the direct function body and any code indirectly inlined into it. (This is also the reason why, as with target_feature, I think this should require unsafe.)

If anything, not being able to get the same optimisations for std and existing third-party APIs that work on floats, despite an explicit attribute on my function, would make this near-useless except for very niche micro-optimisations. At that point, if I have to do those micro-optimisations manually anyway, I might as well reorder the code and use third-party crates myself instead of using new intrinsics.

It's specifically the "automatic optimisation" that makes fast-math so valuable.

@hanna-kruppe (Contributor)

Target feature is only unsafe because you have to make sure the CPU you’re running on will recognize the instructions that the compiler may emit in that function. It doesn’t otherwise affect the semantics of the code and especially not of any code that happens to get inlined into it. If it did, we couldn’t have it apply to inlined code because it would face the same problem as fast math flags: the caller is generally not able to judge whether the callee could handle the change in semantics gracefully or not. Yes, this makes it very hard to use any third party code in loops that you want to see auto-vectorized. The proper solution is to get the authors of the third party loop to opt into it (this is already possible today), or to declare that they’re fine with either semantics (needs language design).

RalfJung (Member) commented Feb 17, 2025 via email

@tgross35 (Contributor)

Agreeing with Ralf, I am going to close this. The reassociation part of fast math is unstably available at #136469. If anybody still has a need for FTZ/DAZ variants or unsafe variants that poison NaN/inf, feel free to open a fresh issue or propose directly as an ACP.

jgarvin commented Feb 17, 2025

Rust will not compromise its memory safety over floating-point performance concerns. "Most code would still be sound" is not good enough.

Rust compromises its memory safety all the time when you use unsafe; that's why I only suggested putting it behind a use of unsafe.

A blanket global explicitly unsafe flag is also not going to fly; what would the safety comment for that even look like? "I audited all the code in my entire dependency tree"? The point of unsafe is to make things locally auditable; this flag cannot achieve that.

No, you would attach it to specific dependencies. It's a similar procedure to any time you have optional unsafe code: when you suspect unsafety is causing a problem, you turn off the unsafe flag everywhere, and if that fixes the issue you start bisecting your dependencies.

Everything else is either an outright no-go (if it can break soundness) or at least highly unlikely to lead anywhere (if it breaks stable semantic promises).

There is no promise being broken unless a library author has specifically advertised their library as being safe to use with it (and some probably would).

Zalathar added a commit to Zalathar/rust that referenced this issue Feb 18, 2025
Expose algebraic floating point intrinsics

# Problem

A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization.

See https://github.com/calder/dot-bench for benchmarks. Measurements below were performed on an i7-10875H.

### C++: 10us ✅

With Clang 18.1.3 and `-O2 -march=haswell`:
<table>
<tr>
    <th>C++</th>
    <th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="cc">
float dot(float *a, float *b, size_t len) {
    #pragma clang fp reassociate(on)
    float sum = 0.0;
    for (size_t i = 0; i < len; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/739573c0-380a-4d84-9fd9-141343ce7e68" />
</td>
</tr>
</table>

### Nightly Rust: 10us ✅

With rustc 1.86.0-nightly (8239a37) and `-C opt-level=3 -C target-feature=+avx2,+fma`:
<table>
<tr>
    <th>Rust</th>
    <th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="rust">
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i]));
    }
    sum
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/9dcf953a-2cd7-42f3-bc34-7117de4c5fb9" />
</td>
</tr>
</table>

### Stable Rust: 84us ❌

With rustc 1.84.1 (e71f9a9) and `-C opt-level=3 -C target-feature=+avx2,+fma`:
<table>
<tr>
    <th>Rust</th>
    <th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="rust">
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum += a[i] * b[i];
    }
    sum
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/936a1f7e-33e4-4ff8-a732-c3cdfe068dca" />
</td>
</tr>
</table>

# Proposed Change

Add `core::intrinsics::f*_algebraic` wrappers to `f16`, `f32`, `f64`, and `f128` gated on a new `float_algebraic` feature.

# Alternatives Considered

rust-lang#21690 has a lot of good discussion of various options for supporting fast math in Rust, but is still open a decade later because any choice that opts in more than individual operations is ultimately contrary to Rust's design principles.

In the meantime, processors have evolved, and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit.

# References

* rust-lang#21690
* rust-lang/libs-team#532
* rust-lang#136469
* https://github.com/calder/dot-bench
* https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps
RalfJung (Member) commented Feb 18, 2025

No, you would attach it to specific dependencies. It's a similar procedure to any time you have optional unsafe code.

No, this is not even remotely similar to anything we have currently. Nothing in Rust currently lets you unilaterally alter the behavior of code in your dependencies. Even if you did audit that entire crate, it's totally permitted under semver for a minor version bump of the crate to start using floating-point operations in a new way, voiding your audit.

Rust is generally carefully designed so that the only thing you ever can or have to know about another crate is its public API surface and the associated documentation. This would completely undermine that.

Everything about this flag fundamentally clashes with the idea of robust, compositional system design. It was a terrible mistake to ever add it to C compilers, and Rust should not repeat that mistake. Rust should instead explore alternative, less fragile ways of exposing those semantics. (That's just my personal opinion, I am not speaking for any team here. But I consider it quite likely that many t-lang and t-opsem people will agree with this sentiment.) A first step has been done and is tracked in #136469. Maybe a second step is a per-function / per-module / per-crate attribute, though that will have to explain convincingly which issue is solved by this that the algebraic operations do not solve yet. A sort of pre-RFC on IRLO would likely be a good starting point here; please do not try to design something from scratch in this issue.

Also please stop rehashing points that have already been made many times above. I get that not everyone agrees with the Rust design philosophy, but it's not going to change without demonstrating that all conceivable alternatives have been explored and they are all clearly inferior.

bors added a commit to rust-lang-ci/rust that referenced this issue Feb 19, 2025
Expose algebraic floating point intrinsics

bors added a commit to rust-lang-ci/rust that referenced this issue Feb 26, 2025
Expose algebraic floating point intrinsics

fmease added a commit to fmease/rust that referenced this issue Feb 26, 2025
Expose algebraic floating point intrinsics

bors added a commit to rust-lang-ci/rust that referenced this issue Feb 26, 2025
Expose algebraic floating point intrinsics
