64-bit wide 32 bit floating-point packed SIMD vector arithmetic produces incorrect results on Windows GNU targets #53254

gnzlbg · 2018-08-10T17:38:02Z

The following operations of the f32x2 vector type produce incorrect results on gnu windows targets ({i686, x86_64}-pc-windows-gnu) in debug builds: sin, cos, fma, scalar (+,-,*,/, unary -):

64::f32x2_math_cos::cos' panicked at 'assertion failed: `(left == right)`
  left: `f32x2(-0.00000004371139, -0.00000004371139)`,
 right: `f32x2(NaN, -0.00000004371139)`', src\v64.rs:43:1
---- v64::f32x2_math_fma::fma stdout ----
thread 'v64::f32x2_math_fma::fma' panicked at 'assertion failed: `(left == right)`
  left: `f32x2(1.0, 1.0)`,
 right: `f32x2(NaN, NaN)`', src\v64.rs:43:1
---- v64::f32x2_math_sin::sin stdout ----
thread 'v64::f32x2_math_sin::sin' panicked at 'assertion failed: `(left == right)`
  left: `f32x2(1.0, 1.0)`,
 right: `f32x2(NaN, 1.0)`', src\v64.rs:43:1
---- v64::f32x2_ops_scalar_arith::ops_scalar_arithmetic stdout ----
thread 'v64::f32x2_ops_scalar_arith::ops_scalar_arithmetic' panicked at 'assertion failed: `(left == right)`
  left: `f32x2(NaN, 0.0)`,
 right: `f32x2(0.0, 0.0)`', src\v64.rs:43:1
---- v64::f32x2_ops_vector_arith::ops_vector_arithmetic stdout ----
thread 'v64::f32x2_ops_vector_arith::ops_vector_arithmetic' panicked at 'assertion failed: `(left == right)`
  left: `f32x2(NaN, 0.0)`,
 right: `f32x2(0.0, 0.0)`', src\v64.rs:43:1

I can't reproduce locally but the following could help somebody to try to reproduce this on Windows, and maybe figure out what's going on.

This playground example might be used to reproduce this though (I can't know since I don't have access to the platform - it would be good to know if that reproduces the issue or not to narrow down the root cause):

#![feature(
    repr_simd,
    simd_ffi,
    link_llvm_intrinsics,
    platform_intrinsics
)]

#[repr(simd)]
#[derive(Copy, Clone, PartialEq, Debug)]
#[allow(non_camel_case_types)]
struct f32x2(f32, f32);

extern "platform-intrinsic" {
    fn simd_fma<T>(a: T, b: T, c: T) -> T;
}

#[allow(improper_ctypes)]
extern "C" {
    #[link_name = "llvm.cos.v2f32"]
    fn cos_v2f32(x: f32x2) -> f32x2;
    #[link_name = "llvm.sin.v2f32"]
    fn sin_v2f32(x: f32x2) -> f32x2;
}

fn main() {
    let a = f32x2(1.0, 1.0);
    let b = f32x2(0.0, 0.0);
    let c = f32x2(2.0, 2.0);

    unsafe {
        assert_eq!(simd_fma(a, c, b), c);
        assert_eq!(sin_v2f32(b), b);
        assert_eq!(cos_v2f32(b), a);
    }
}

The text was updated successfully, but these errors were encountered:

gnzlbg · 2018-08-10T17:41:27Z

I am also seeing a similar failure on apple darwin: rust-lang/packed_simd#25 but I don't know if its the same issue.

nagisa · 2018-08-11T08:37:13Z

The SIMD calculations themseves are constevaluated down to {2, 2}, {0, 0} and {1, 1}, but only if assert_eq!s are removed. Changing assert_eq!s to if op != result { std::intrinsics::abort() } optimises the function down to a ret void as well, indicating that there are no issues specific to how the intrinsics are called going on.

Only taking a reference of the result and panicking with it seems to begin causing issues (I’d expect the function to still be optimised out, but it is not):

    unsafe {
    let b = f32x2(0.0, 0.0);

    match (&sin_v2f32(b), &b) {
        (a, b) => {
            if a != b {
                panic!("not equal {:?} {:?}", a, b); // not optimised out unless this panic is replaced with something that does not use `a` or `b`.
            }
        }
    }
    }

because of that, the sine in the code snippet above is calculated at runtime. For the x64 windows target the following assembly gets generated:

banana:
	subq	$200, %rsp
	movaps	%xmm8, 176(%rsp)
	movaps	%xmm7, 160(%rsp)
	movaps	%xmm6, 144(%rsp)
	movq	$0, 32(%rsp)
	movq	32(%rsp), %xmm6 ; xmm6? Only xmm0-3 are used in x64 calling convention!
	callq	sinf
	movdqa	%xmm0, %xmm8
	movdqa	%xmm6, %xmm0 
	callq	sinf
	punpckldq	%xmm0, %xmm8 
	movdqa	%xmm6, %xmm0
	callq	sinf
	movdqa	%xmm0, %xmm7 
	pshufd	$229, %xmm6, %xmm0 
	callq	sinf
	punpckldq	%xmm0, %xmm7
        ; <snip>

The sinf is called 4 times because for the SIMD operation to be legal it has to operate on at least v4f32 (?), but then the backend ends up calling sinf 4 times instead, at least once for undef input.

I have no other insight for now.

I will fill an LLVM bug for 4 sinfs as that seems to be an obvious oversight.

nagisa · 2018-08-11T09:09:10Z

The LLVM bug for 4 sinfs can be found at https://bugs.llvm.org/show_bug.cgi?id=38527.

mati865 · 2018-10-25T21:21:26Z

Fix for that LLVM bug is already available in Rust's LLVM but packed_simd tests still fail with NaN comparison.

EDIT: Example from the first post works fine.

mati865 · 2019-03-10T21:14:30Z

Can this issue get LLVM label?

Example reduced from packed_simd code, it works on Linux and fails on Windows and Wine using windows-gnu target:

#![feature(
    crate_visibility_modifier,
    link_llvm_intrinsics,
    platform_intrinsics,
    repr_simd,
    stdsimd
)]
#![allow(non_camel_case_types)]
extern crate core;
use core::arch;
use core::mem;

extern "platform-intrinsic" {
    crate fn simd_ne<d, e>(f: d, h: d) -> e;
}

extern "C" {
    #[link_name = "llvm.pow.f32"]
    fn pow_f32(f: f32, h: f32) -> f32;
}

#[repr(simd)]
#[derive(Copy, Clone)]
struct f32x2(f32, f32);
#[repr(simd)]
#[derive(Copy, Clone)]
struct i32x2(i32, i32);

impl f32x2 {
    pub fn ne(self, aq: Self) -> bool {
        let z: i32x2 = unsafe { simd_ne(self, aq) };
        use arch::x86_64::_mm_movemask_pi8;
        unsafe { _mm_movemask_pi8(mem::transmute(z)) != 0 }
    }
}

macro_rules! assert_eq_ {
    ($a:expr, $b:expr) => {
        if $a.ne($b) {
            panic!()
        }
    };
}
pub fn main() {
    let o = f32x2(1.0, 1.0);
    assert_eq_!(o, o);
    assert_eq_!(2.0, unsafe {&pow_f32(2.0, 1.0)});
    //println!("{}", unsafe {&pow_f32(2.0, 1.0)});
}

Godbolt link:
https://godbolt.org/z/bqZn4-

mati865 · 2019-10-21T22:54:13Z

Reduced as much as I could:

#![feature(core_intrinsics, link_llvm_intrinsics, repr_simd, simd_ffi, stdsimd)]
#![allow(non_camel_case_types)]
extern crate core;
use core::mem::transmute;

#[repr(simd)]
struct __m64(i64);

#[allow(improper_ctypes)]
extern "C" {
    #[link_name = "llvm.pow.f32"]
    fn pow_f32(f: f32, h: f32) -> f32;
    #[link_name = "llvm.x86.mmx.pmovmskb"]
    fn pmovmskb(a: __m64) -> i32;
}

#[repr(simd)]
#[derive(Copy, Clone)]
struct f32x2(f32, f32);

pub fn main() {
    unsafe {
        let _ = pmovmskb(transmute(0_i64));         
        //println!("{}", pow_f32(2.0, 1.0));
        if 2.0.ne(&pow_f32(2.0, 1.0)) {
            core::intrinsics::abort();
        };
    }
}

Running:

# windows-gnu
$ rustc simd.rs && ./simd.exe; echo $?
Illegal instruction
132

# linux-gnu
$ rustc simd.rs && ./simd; echo $?
0

Godbolt:
https://godbolt.org/z/ChpV1i

hanna-kruppe · 2019-10-22T11:22:43Z

My best guess is that behavior of this last program is rooted in the MMX instruction trampling all over the x87 FPU stack, making it impossible to use x87 instructions correctly until the emms instruction is executed (cc #57831). If correct, that would imply:

the libm used in the windows-gnu target (mingw glibc?) uses x87 instructions as part of its powf implementation, while the windows-msvc target does not. It seems very annoying to check this, but it might be a nice exercise in disassembly.
Inserting emms before the powf() call should fix the weird behavior. I am currently waiting for the target to install in the background to try this out, but if someone is quicker, here's the code.

If this theory is correct, this is pretty depressing. Honestly, I think the best fix would be to not use MMX intrinsics ever. Inserting emms only helps when compiling without optimization -- since LLVM assumes math intrinsics such as llvm.pow to have no side effects, it won't think twice about reordered it relative to the emms. A fix that lets legacy code live would be #58168 but that kinda seems like more trouble than it's worth.

I am not sure whether this is the same problem as the original issue or something separate. I guess if the packed_simd tests include some usage of MMX instructions, it might be related?

mati865 · 2019-10-22T11:35:16Z

Inserting emms before the powf() call should fix the weird behavior.

It does but also replacing transmute(0_i64) with __m64(0_i64) fixes it.

There is open thread on wg-llvm zulip thread if you want to talk about it.

EDIT:

the libm used in the windows-gnu target (mingw glibc?) uses x87 instructions as part of its powf implementation

Mingw-w64 powf calls pow from msvcrt, I think msvc is using ucrt.

novacrazy · 2020-04-03T17:00:11Z

Has there been any update to this? It's been years since I've been able to use the Windows GNU target.

hanna-kruppe · 2020-04-03T17:26:45Z

Are you hitting this specific issue? That's surprising, as far as we know this bug only occurs in programs that deliberately use MMX instructions (which have been obsolete for many years, and not accessible at all from stable Rust). If you observe this bug (or another bug that looks like it) in other contexts, please tell us more.

mati865 · 2020-04-03T18:12:31Z

@hanna-kruppe packed_simd is using MMX and novacrazy depends on it: #65658 (comment)

hanna-kruppe · 2020-04-03T18:25:33Z

Are you sure? In current packed_simd tree, I can only find one mention of __m64, in a macro that generates FromBits/IntoBits conversions. Operations on other repr(simd) types should not lower to MMX instructions, even if the type matches MMX capabilities (like f32x2 or i16x4).

hanna-kruppe · 2020-04-03T18:38:49Z

After some spelunking, I see that there used to be some MMX code in packed_simd for mask types, but it was removed over a year ago in rust-lang/packed_simd#207.

novacrazy · 2020-04-03T19:06:34Z

Can confirm this is still a crippling issue as of rustc 1.44.0-nightly (f509b26a7 2020-03-18) and the latest packed_simd master branch.

NaNs everywhere.

hanna-kruppe · 2020-04-03T19:12:10Z

Please be more clear: which of the many programs posted in this and related threads are you referring to?

novacrazy · 2020-04-03T19:21:23Z

My apologies. I've never been able to properly minimally reproduce this with a simple example, it only occurs in my 100k LoC work project near trig operations.

I believe your theory at #53254 (comment) is on the right track, because in addition to never being able to minimally reproduce it, I've never been able to find the source of issues by printing input variables. It appears out of nowhere, randomly. Sometimes different places between multiple runs.

hanna-kruppe · 2020-04-03T19:32:27Z

One relatively simple way to validate my theory would be to disassemble your application and check if there's any mention of mm[0-7] (which should be a good proxy for the use of MMX instructions). That won't necessarily make clear where that instruction comes from, but it might be a good sanity check.

novacrazy · 2020-04-03T19:50:01Z

Indeed, there are 4 MMX register uses in release mode and 10 MMX uses in debug mode, though my debug has opt-level=3.

For debug:

    vcmpunordps   .LCPI187_26(%rip), %xmm3, %xmm4
.Ltmp9279:
    .loc       70 14 69
    movdq2q    %xmm4, %mm0
.Ltmp9280:
    .loc       52 2377 5
    pmovmskb   %mm0, %eax

All occurrences follow vcmpunordps.

No idea how to relate those .loc back to actual code in what file.

For release:

    vcmpeqps    %xmm0, %xmm7, %xmm0
    movdq2q     %xmm0, %mm0
    vxorps      %xmm0, %xmm0, %xmm0
    pmovmskb    %mm0, %eax

All occurrences follow vcmpeqps

nikic · 2020-04-03T20:08:17Z

It looks like packed_simd is still using _mm_movemask_pi8 in: https://github.com/rust-lang/packed_simd/blob/28b5e161ceaf0096a5e9f387f46441ed6494290e/src/codegen/reductions/mask/x86/sse.rs#L51

While it is now under an SSE guard, this is still an MMX intrinsic.

novacrazy · 2020-04-03T20:11:32Z

I do use .all() fairly often. What would the alternative be? I happen to have a fork of packed_simd I can try out alternatives with now.

mati865 · 2020-04-03T20:15:28Z

@hanna-kruppe example in this thread has been minimized from packed_simd.
packed_simd calls _mm_movemask_pi8 from stdarch which calls pmovmskb.

nikic · 2020-04-03T20:18:27Z

@novacrazy Maybe just try removing this line? https://github.com/rust-lang/packed_simd/blob/28b5e161ceaf0096a5e9f387f46441ed6494290e/src/codegen/reductions/mask/x86.rs#L143

I looks like the fallback implementation is to do a bitcast and compare, which I would expect LLVM to already optimize well.

novacrazy · 2020-04-03T20:49:13Z

Aha! @nikic, commenting out all of the 64-bit masks (not just 8x8) did indeed solve all my issues.

No more MMX register uses in the assembly and no more NaNs!

mati865 · 2020-09-02T18:19:36Z

I think we can close this issue, there is nothing to fix on Rust side. Also MMX may get removed from LLVM: https://www.phoronix.com/scan.php?page=news_item&px=LLVM-Goodbye-MMX-Maybe

RalfJung · 2020-09-10T10:46:33Z

So, is this fixed by rust-lang/stdarch#890? I can see neither _mm_movemask_pi8 nor pmovmskb being removed there so I wonder how it is related.

EDIT: Ah, I had to unfold some diffs. I guess there's no way for packed_simd to call the bad intrinsics now, but will that not make packed_simd fail to build?

mati865 · 2020-09-10T11:15:04Z

I guess there's no way for packed_simd to call the bad intrinsics now, but will that not make packed_simd fail to build?

@RalfJung that's true but those are unstable features and the fix for packed_simd is rather simple. Though packed_simd doesn't look like maintained any more.

RalfJung · 2020-09-10T11:18:33Z

Yeah, this is definitely an okay breaking change, I was just trying to understand the situation. Maybe an issue should be opened in packed_simd?

Also I'll close this issue based on @mati865's comment above.

mati865 · 2020-09-10T11:37:10Z

Opened rust-lang/packed_simd#287

gnzlbg mentioned this issue Aug 10, 2018

{i686,x86_64}-pc-windows-gnu Appveyor builds failing rust-lang/packed_simd#72

Open

gnzlbg changed the title ~~64-bit floating-point packed SIMD vector arithmetic produces incorrect results on Windows GNU targets~~ 64-bit wide 32 bit floating-point packed SIMD vector arithmetic produces incorrect results on Windows GNU targets Aug 10, 2018

sanxiyn added the A-SIMD Area: SIMD (Single Instruction Multiple Data) label Aug 12, 2018

sfackler added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label Mar 10, 2019

mati865 mentioned this issue Oct 21, 2019

Spurious NaNs produced by trig functions with valid inputs on Windows GNU toolchains #65658

Closed

This comment has been minimized.

Sign in to view

rustbot added C-bug Category: This is a bug. O-windows-gnu Toolchain: GNU, Operating system: Windows labels Oct 21, 2019

hanna-kruppe mentioned this issue Oct 22, 2019

Remove the MMX intrinsics rust-lang/stdarch#823

Closed

mati865 mentioned this issue Sep 3, 2020

Remove MMX from Rust #76295

Merged

RalfJung closed this as completed Sep 10, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

64-bit wide 32 bit floating-point packed SIMD vector arithmetic produces incorrect results on Windows GNU targets #53254

64-bit wide 32 bit floating-point packed SIMD vector arithmetic produces incorrect results on Windows GNU targets #53254

gnzlbg commented Aug 10, 2018

gnzlbg commented Aug 10, 2018

nagisa commented Aug 11, 2018 •

edited

Loading

nagisa commented Aug 11, 2018

mati865 commented Oct 25, 2018 •

edited

Loading

mati865 commented Mar 10, 2019

This comment has been minimized.

mati865 commented Oct 21, 2019 •

edited

Loading

hanna-kruppe commented Oct 22, 2019

mati865 commented Oct 22, 2019 •

edited

Loading

novacrazy commented Apr 3, 2020

hanna-kruppe commented Apr 3, 2020

mati865 commented Apr 3, 2020

hanna-kruppe commented Apr 3, 2020 •

edited

Loading

hanna-kruppe commented Apr 3, 2020

novacrazy commented Apr 3, 2020 •

edited

Loading

hanna-kruppe commented Apr 3, 2020

novacrazy commented Apr 3, 2020

hanna-kruppe commented Apr 3, 2020

novacrazy commented Apr 3, 2020 •

edited

Loading

nikic commented Apr 3, 2020

novacrazy commented Apr 3, 2020

mati865 commented Apr 3, 2020

nikic commented Apr 3, 2020

novacrazy commented Apr 3, 2020

mati865 commented Sep 2, 2020

RalfJung commented Sep 10, 2020 •

edited

Loading

mati865 commented Sep 10, 2020

RalfJung commented Sep 10, 2020 •

edited

Loading

mati865 commented Sep 10, 2020

64-bit wide 32 bit floating-point packed SIMD vector arithmetic produces incorrect results on Windows GNU targets #53254

64-bit wide 32 bit floating-point packed SIMD vector arithmetic produces incorrect results on Windows GNU targets #53254

Comments

gnzlbg commented Aug 10, 2018

gnzlbg commented Aug 10, 2018

nagisa commented Aug 11, 2018 • edited Loading

nagisa commented Aug 11, 2018

mati865 commented Oct 25, 2018 • edited Loading

mati865 commented Mar 10, 2019

This comment has been minimized.

mati865 commented Oct 21, 2019 • edited Loading

hanna-kruppe commented Oct 22, 2019

mati865 commented Oct 22, 2019 • edited Loading

novacrazy commented Apr 3, 2020

hanna-kruppe commented Apr 3, 2020

mati865 commented Apr 3, 2020

hanna-kruppe commented Apr 3, 2020 • edited Loading

hanna-kruppe commented Apr 3, 2020

novacrazy commented Apr 3, 2020 • edited Loading

hanna-kruppe commented Apr 3, 2020

novacrazy commented Apr 3, 2020

hanna-kruppe commented Apr 3, 2020

novacrazy commented Apr 3, 2020 • edited Loading

nikic commented Apr 3, 2020

novacrazy commented Apr 3, 2020

mati865 commented Apr 3, 2020

nikic commented Apr 3, 2020

novacrazy commented Apr 3, 2020

mati865 commented Sep 2, 2020

RalfJung commented Sep 10, 2020 • edited Loading

mati865 commented Sep 10, 2020

RalfJung commented Sep 10, 2020 • edited Loading

mati865 commented Sep 10, 2020

nagisa commented Aug 11, 2018 •

edited

Loading

mati865 commented Oct 25, 2018 •

edited

Loading

mati865 commented Oct 21, 2019 •

edited

Loading

mati865 commented Oct 22, 2019 •

edited

Loading

hanna-kruppe commented Apr 3, 2020 •

edited

Loading

novacrazy commented Apr 3, 2020 •

edited

Loading

novacrazy commented Apr 3, 2020 •

edited

Loading

RalfJung commented Sep 10, 2020 •

edited

Loading

RalfJung commented Sep 10, 2020 •

edited

Loading