Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

64-bit wide 32 bit floating-point packed SIMD vector arithmetic produces incorrect results on Windows GNU targets #53254

Closed
gnzlbg opened this issue Aug 10, 2018 · 29 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-SIMD Area: SIMD (Single Instruction Multiple Data) C-bug Category: This is a bug. O-windows-gnu Toolchain: GNU, Operating system: Windows

Comments

@gnzlbg
Copy link
Contributor

gnzlbg commented Aug 10, 2018

crossref: rust-lang/packed_simd#72

The following operations of the f32x2 vector type produce incorrect results on gnu windows targets ({i686, x86_64}-pc-windows-gnu) in debug builds: sin, cos, fma, scalar (+,-,*,/, unary -):

64::f32x2_math_cos::cos' panicked at 'assertion failed: `(left == right)`
  left: `f32x2(-0.00000004371139, -0.00000004371139)`,
 right: `f32x2(NaN, -0.00000004371139)`', src\v64.rs:43:1
---- v64::f32x2_math_fma::fma stdout ----
thread 'v64::f32x2_math_fma::fma' panicked at 'assertion failed: `(left == right)`
  left: `f32x2(1.0, 1.0)`,
 right: `f32x2(NaN, NaN)`', src\v64.rs:43:1
---- v64::f32x2_math_sin::sin stdout ----
thread 'v64::f32x2_math_sin::sin' panicked at 'assertion failed: `(left == right)`
  left: `f32x2(1.0, 1.0)`,
 right: `f32x2(NaN, 1.0)`', src\v64.rs:43:1
---- v64::f32x2_ops_scalar_arith::ops_scalar_arithmetic stdout ----
thread 'v64::f32x2_ops_scalar_arith::ops_scalar_arithmetic' panicked at 'assertion failed: `(left == right)`
  left: `f32x2(NaN, 0.0)`,
 right: `f32x2(0.0, 0.0)`', src\v64.rs:43:1
---- v64::f32x2_ops_vector_arith::ops_vector_arithmetic stdout ----
thread 'v64::f32x2_ops_vector_arith::ops_vector_arithmetic' panicked at 'assertion failed: `(left == right)`
  left: `f32x2(NaN, 0.0)`,
 right: `f32x2(0.0, 0.0)`', src\v64.rs:43:1

I can't reproduce locally but the following could help somebody to try to reproduce this on Windows, and maybe figure out what's going on.

This playground example might be used to reproduce this though (I can't know since I don't have access to the platform - it would be good to know if that reproduces the issue or not to narrow down the root cause):

#![feature(
    repr_simd,
    simd_ffi,
    link_llvm_intrinsics,
    platform_intrinsics
)]

#[repr(simd)]
#[derive(Copy, Clone, PartialEq, Debug)]
#[allow(non_camel_case_types)]
struct f32x2(f32, f32);

extern "platform-intrinsic" {
    fn simd_fma<T>(a: T, b: T, c: T) -> T;
}

#[allow(improper_ctypes)]
extern "C" {
    #[link_name = "llvm.cos.v2f32"]
    fn cos_v2f32(x: f32x2) -> f32x2;
    #[link_name = "llvm.sin.v2f32"]
    fn sin_v2f32(x: f32x2) -> f32x2;
}

fn main() {
    let a = f32x2(1.0, 1.0);
    let b = f32x2(0.0, 0.0);
    let c = f32x2(2.0, 2.0);

    unsafe {
        assert_eq!(simd_fma(a, c, b), c);
        assert_eq!(sin_v2f32(b), b);
        assert_eq!(cos_v2f32(b), a);
    }
}
@gnzlbg
Copy link
Contributor Author

gnzlbg commented Aug 10, 2018

I am also seeing a similar failure on apple darwin: rust-lang/packed_simd#25 but I don't know if its the same issue.

@gnzlbg gnzlbg changed the title 64-bit floating-point packed SIMD vector arithmetic produces incorrect results on Windows GNU targets 64-bit wide 32 bit floating-point packed SIMD vector arithmetic produces incorrect results on Windows GNU targets Aug 10, 2018
@nagisa
Copy link
Member

nagisa commented Aug 11, 2018

The SIMD calculations themseves are constevaluated down to {2, 2}, {0, 0} and {1, 1}, but only if assert_eq!s are removed. Changing assert_eq!s to if op != result { std::intrinsics::abort() } optimises the function down to a ret void as well, indicating that there are no issues specific to how the intrinsics are called going on.

Only taking a reference of the result and panicking with it seems to begin causing issues (I’d expect the function to still be optimised out, but it is not):

    unsafe {
    let b = f32x2(0.0, 0.0);

    match (&sin_v2f32(b), &b) {
        (a, b) => {
            if a != b {
                panic!("not equal {:?} {:?}", a, b); // not optimised out unless this panic is replaced with something that does not use `a` or `b`.
            }
        }
    }
    }

because of that, the sine in the code snippet above is calculated at runtime. For the x64 windows target the following assembly gets generated:

banana:
	subq	$200, %rsp
	movaps	%xmm8, 176(%rsp)
	movaps	%xmm7, 160(%rsp)
	movaps	%xmm6, 144(%rsp)
	movq	$0, 32(%rsp)
	movq	32(%rsp), %xmm6 ; xmm6? Only xmm0-3 are used in x64 calling convention!
	callq	sinf
	movdqa	%xmm0, %xmm8
	movdqa	%xmm6, %xmm0 
	callq	sinf
	punpckldq	%xmm0, %xmm8 
	movdqa	%xmm6, %xmm0
	callq	sinf
	movdqa	%xmm0, %xmm7 
	pshufd	$229, %xmm6, %xmm0 
	callq	sinf
	punpckldq	%xmm0, %xmm7
        ; <snip>

The sinf is called 4 times because for the SIMD operation to be legal it has to operate on at least v4f32 (?), but then the backend ends up calling sinf 4 times instead, at least once for undef input.

I have no other insight for now.

I will fill an LLVM bug for 4 sinfs as that seems to be an obvious oversight.

@nagisa
Copy link
Member

nagisa commented Aug 11, 2018

The LLVM bug for 4 sinfs can be found at https://bugs.llvm.org/show_bug.cgi?id=38527.

@sanxiyn sanxiyn added the A-SIMD Area: SIMD (Single Instruction Multiple Data) label Aug 12, 2018
@mati865
Copy link
Contributor

mati865 commented Oct 25, 2018

Fix for that LLVM bug is already available in Rust's LLVM but packed_simd tests still fail with NaN comparison.

EDIT: Example from the first post works fine.

@mati865
Copy link
Contributor

mati865 commented Mar 10, 2019

Can this issue get LLVM label?

Example reduced from packed_simd code, it works on Linux and fails on Windows and Wine using windows-gnu target:

#![feature(
    crate_visibility_modifier,
    link_llvm_intrinsics,
    platform_intrinsics,
    repr_simd,
    stdsimd
)]
#![allow(non_camel_case_types)]
extern crate core;
use core::arch;
use core::mem;

extern "platform-intrinsic" {
    crate fn simd_ne<d, e>(f: d, h: d) -> e;
}

extern "C" {
    #[link_name = "llvm.pow.f32"]
    fn pow_f32(f: f32, h: f32) -> f32;
}

#[repr(simd)]
#[derive(Copy, Clone)]
struct f32x2(f32, f32);
#[repr(simd)]
#[derive(Copy, Clone)]
struct i32x2(i32, i32);

impl f32x2 {
    pub fn ne(self, aq: Self) -> bool {
        let z: i32x2 = unsafe { simd_ne(self, aq) };
        use arch::x86_64::_mm_movemask_pi8;
        unsafe { _mm_movemask_pi8(mem::transmute(z)) != 0 }
    }
}

macro_rules! assert_eq_ {
    ($a:expr, $b:expr) => {
        if $a.ne($b) {
            panic!()
        }
    };
}
pub fn main() {
    let o = f32x2(1.0, 1.0);
    assert_eq_!(o, o);
    assert_eq_!(2.0, unsafe {&pow_f32(2.0, 1.0)});
    //println!("{}", unsafe {&pow_f32(2.0, 1.0)});
}

Godbolt link:
https://godbolt.org/z/bqZn4-

@sfackler sfackler added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label Mar 10, 2019
@mati865

This comment has been minimized.

@rustbot rustbot added C-bug Category: This is a bug. O-windows-gnu Toolchain: GNU, Operating system: Windows labels Oct 21, 2019
@mati865
Copy link
Contributor

mati865 commented Oct 21, 2019

Reduced as much as I could:

#![feature(core_intrinsics, link_llvm_intrinsics, repr_simd, simd_ffi, stdsimd)]
#![allow(non_camel_case_types)]
extern crate core;
use core::mem::transmute;

#[repr(simd)]
struct __m64(i64);

#[allow(improper_ctypes)]
extern "C" {
    #[link_name = "llvm.pow.f32"]
    fn pow_f32(f: f32, h: f32) -> f32;
    #[link_name = "llvm.x86.mmx.pmovmskb"]
    fn pmovmskb(a: __m64) -> i32;
}

#[repr(simd)]
#[derive(Copy, Clone)]
struct f32x2(f32, f32);

pub fn main() {
    unsafe {
        let _ = pmovmskb(transmute(0_i64));         
        //println!("{}", pow_f32(2.0, 1.0));
        if 2.0.ne(&pow_f32(2.0, 1.0)) {
            core::intrinsics::abort();
        };
    }
}

Running:

# windows-gnu
$ rustc simd.rs && ./simd.exe; echo $?
Illegal instruction
132

# linux-gnu
$ rustc simd.rs && ./simd; echo $?
0

Godbolt:
https://godbolt.org/z/ChpV1i

@hanna-kruppe
Copy link
Contributor

My best guess is that behavior of this last program is rooted in the MMX instruction trampling all over the x87 FPU stack, making it impossible to use x87 instructions correctly until the emms instruction is executed (cc #57831). If correct, that would imply:

  • the libm used in the windows-gnu target (mingw glibc?) uses x87 instructions as part of its powf implementation, while the windows-msvc target does not. It seems very annoying to check this, but it might be a nice exercise in disassembly.
  • Inserting emms before the powf() call should fix the weird behavior. I am currently waiting for the target to install in the background to try this out, but if someone is quicker, here's the code.

If this theory is correct, this is pretty depressing. Honestly, I think the best fix would be to not use MMX intrinsics ever. Inserting emms only helps when compiling without optimization -- since LLVM assumes math intrinsics such as llvm.pow to have no side effects, it won't think twice about reordered it relative to the emms. A fix that lets legacy code live would be #58168 but that kinda seems like more trouble than it's worth.

I am not sure whether this is the same problem as the original issue or something separate. I guess if the packed_simd tests include some usage of MMX instructions, it might be related?

@mati865
Copy link
Contributor

mati865 commented Oct 22, 2019

Inserting emms before the powf() call should fix the weird behavior.

It does but also replacing transmute(0_i64) with __m64(0_i64) fixes it.

There is open thread on wg-llvm zulip thread if you want to talk about it.

EDIT:

the libm used in the windows-gnu target (mingw glibc?) uses x87 instructions as part of its powf implementation

Mingw-w64 powf calls pow from msvcrt, I think msvc is using ucrt.

@novacrazy
Copy link

Has there been any update to this? It's been years since I've been able to use the Windows GNU target.

@hanna-kruppe
Copy link
Contributor

Are you hitting this specific issue? That's surprising, as far as we know this bug only occurs in programs that deliberately use MMX instructions (which have been obsolete for many years, and not accessible at all from stable Rust). If you observe this bug (or another bug that looks like it) in other contexts, please tell us more.

@mati865
Copy link
Contributor

mati865 commented Apr 3, 2020

@hanna-kruppe packed_simd is using MMX and novacrazy depends on it: #65658 (comment)

@hanna-kruppe
Copy link
Contributor

hanna-kruppe commented Apr 3, 2020

Are you sure? In current packed_simd tree, I can only find one mention of __m64, in a macro that generates FromBits/IntoBits conversions. Operations on other repr(simd) types should not lower to MMX instructions, even if the type matches MMX capabilities (like f32x2 or i16x4).

@hanna-kruppe
Copy link
Contributor

After some spelunking, I see that there used to be some MMX code in packed_simd for mask types, but it was removed over a year ago in rust-lang/packed_simd#207.

@novacrazy
Copy link

novacrazy commented Apr 3, 2020

Can confirm this is still a crippling issue as of rustc 1.44.0-nightly (f509b26a7 2020-03-18) and the latest packed_simd master branch.

NaNs everywhere.

@hanna-kruppe
Copy link
Contributor

Please be more clear: which of the many programs posted in this and related threads are you referring to?

@novacrazy
Copy link

My apologies. I've never been able to properly minimally reproduce this with a simple example, it only occurs in my 100k LoC work project near trig operations.

I believe your theory at #53254 (comment) is on the right track, because in addition to never being able to minimally reproduce it, I've never been able to find the source of issues by printing input variables. It appears out of nowhere, randomly. Sometimes different places between multiple runs.

@hanna-kruppe
Copy link
Contributor

One relatively simple way to validate my theory would be to disassemble your application and check if there's any mention of mm[0-7] (which should be a good proxy for the use of MMX instructions). That won't necessarily make clear where that instruction comes from, but it might be a good sanity check.

@novacrazy
Copy link

novacrazy commented Apr 3, 2020

Indeed, there are 4 MMX register uses in release mode and 10 MMX uses in debug mode, though my debug has opt-level=3.

For debug:

    vcmpunordps   .LCPI187_26(%rip), %xmm3, %xmm4
.Ltmp9279:
    .loc       70 14 69
    movdq2q    %xmm4, %mm0
.Ltmp9280:
    .loc       52 2377 5
    pmovmskb   %mm0, %eax

All occurrences follow vcmpunordps.

No idea how to relate those .loc back to actual code in what file.

For release:

    vcmpeqps    %xmm0, %xmm7, %xmm0
    movdq2q     %xmm0, %mm0
    vxorps      %xmm0, %xmm0, %xmm0
    pmovmskb    %mm0, %eax

All occurrences follow vcmpeqps

@nikic
Copy link
Contributor

nikic commented Apr 3, 2020

It looks like packed_simd is still using _mm_movemask_pi8 in: https://github.com/rust-lang/packed_simd/blob/28b5e161ceaf0096a5e9f387f46441ed6494290e/src/codegen/reductions/mask/x86/sse.rs#L51

While it is now under an SSE guard, this is still an MMX intrinsic.

@novacrazy
Copy link

I do use .all() fairly often. What would the alternative be? I happen to have a fork of packed_simd I can try out alternatives with now.

@mati865
Copy link
Contributor

mati865 commented Apr 3, 2020

@hanna-kruppe example in this thread has been minimized from packed_simd.
packed_simd calls _mm_movemask_pi8 from stdarch which calls pmovmskb.

@nikic
Copy link
Contributor

nikic commented Apr 3, 2020

@novacrazy Maybe just try removing this line? https://github.com/rust-lang/packed_simd/blob/28b5e161ceaf0096a5e9f387f46441ed6494290e/src/codegen/reductions/mask/x86.rs#L143

I looks like the fallback implementation is to do a bitcast and compare, which I would expect LLVM to already optimize well.

@novacrazy
Copy link

Aha! @nikic, commenting out all of the 64-bit masks (not just 8x8) did indeed solve all my issues.

No more MMX register uses in the assembly and no more NaNs!

@mati865
Copy link
Contributor

mati865 commented Sep 2, 2020

I think we can close this issue, there is nothing to fix on Rust side. Also MMX may get removed from LLVM: https://www.phoronix.com/scan.php?page=news_item&px=LLVM-Goodbye-MMX-Maybe

@RalfJung
Copy link
Member

RalfJung commented Sep 10, 2020

So, is this fixed by rust-lang/stdarch#890? I can see neither _mm_movemask_pi8 nor pmovmskb being removed there so I wonder how it is related.

EDIT: Ah, I had to unfold some diffs. I guess there's no way for packed_simd to call the bad intrinsics now, but will that not make packed_simd fail to build?

@mati865
Copy link
Contributor

mati865 commented Sep 10, 2020

I guess there's no way for packed_simd to call the bad intrinsics now, but will that not make packed_simd fail to build?

@RalfJung that's true but those are unstable features and the fix for packed_simd is rather simple. Though packed_simd doesn't look like maintained any more.

@RalfJung
Copy link
Member

RalfJung commented Sep 10, 2020

Yeah, this is definitely an okay breaking change, I was just trying to understand the situation. Maybe an issue should be opened in packed_simd?

Also I'll close this issue based on @mati865's comment above.

@mati865
Copy link
Contributor

mati865 commented Sep 10, 2020

Opened rust-lang/packed_simd#287

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-SIMD Area: SIMD (Single Instruction Multiple Data) C-bug Category: This is a bug. O-windows-gnu Toolchain: GNU, Operating system: Windows
Projects
None yet
Development

No branches or pull requests

10 participants