Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfaults with 128-bit integers on Windows #45

Open
tahina-pro opened this issue Mar 18, 2020 · 17 comments
Open

Segfaults with 128-bit integers on Windows #45

tahina-pro opened this issue Mar 18, 2020 · 17 comments

Comments

@tahina-pro
Copy link

With my colleague @msprotz, we observed segfaults related to 128-bit integers in some Windows OCaml programs using ocaml-stdint.

It turns out that even the test executable produced by make test/stdint_test segfaults, with no test running, as we show below.

We are working on Windows 10 version 1909 (10.0.18363.720) with Cygwin 3.1.4-1, mingw64-x86_64-gcc 9.2.0, and OCaml 4.09.0 installed via OCaml and OPaM for Windows. If you want to quickly reproduce our config, you can pull a complete Docker image pulling the latest ocaml-stdint master (0785788) and building ocaml-stdint and its test suite. You can also rebuild the Docker image yourself using this Dockerfile.txt.

Then, below is what we are observing:

ContainerAdministrator@cff3016f7f7e /cygdrive/c/test/ocaml-stdint
$ gdb tests/stdint_test.exe
GNU gdb (GDB) (Cygwin 8.2.1-1) 8.2.1

(gdb) run
Starting program: /cygdrive/c/test/ocaml-stdint/tests/stdint_test.exe
[New Thread 5000.0x33c4]
[New Thread 5000.0x32b0]
[New Thread 5000.0x5110]
[New Thread 5000.0x5710]

Thread 1 received signal SIGSEGV, Segmentation fault.
copy_uint128 (i=<optimized out>) at uint128_stubs.c:188
188       Uint128_val(res) = i;
(gdb) backtrace
#0  copy_uint128 (i=<optimized out>) at uint128_stubs.c:188
#1  0x0000000000494e65 in uint128_of_int (v=<optimized out>) at uint128_conv.c:34
#2  0x000000000042f2e6 in camlStdint__entry ()
#3  0x0000000000000001 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

As a workaround, my colleague @msprotz proposed to recompile ocaml-stdint with gcc optimizations disabled, which we did in tahina-pro/ocaml-stdint@99ac1fa, and there we no longer observe these segfaults.

@rixed
Copy link
Contributor

rixed commented Mar 23, 2020

Do you know if your version of mingw supports the __int128_t type?
Windows docker image cannot run on Linux so I cannot repro your exact issue, but I also have a segfault when running the tests without HAVE_INT128, so maybe that's related?

@tahina-pro
Copy link
Author

tahina-pro commented Mar 23, 2020

Yes, in a container running from the Docker image, I can successfully compile and run a __int128_t function that computes 2^(2^n):

docker run -i -t tahina/20200318ocamlstdint

ContainerAdministrator@52e7a61f5bdc /cygdrive/c/test/ocaml-stdint
$ cat <<EOF >a.c
#include <stdint.h>
#include <assert.h>
__int128_t two_pow_two_pow_minus_one(int count) {
  __int128_t res = 2;
  for (int i = 0; i < count; ++i) {
    res *= res;
  }
  assert (res > 0);
  return res - 1;
}
EOF

$ x86_64-w64-mingw32-gcc.exe -Wall -c a.c

$ cat <<EOF >test.c
#include <inttypes.h>
#include <stdio.h>
__int128_t two_pow_two_pow_minus_one(int count);
int main(void) {
  uint64_t res = (uint64_t) (two_pow_two_pow_minus_one(6));
  printf("%" PRIu64 "\n", res);
  return 0;
}
EOF

$ x86_64-w64-mingw32-gcc.exe -Wall -c test.c

$ x86_64-w64-mingw32-gcc.exe -Wall -o test.exe a.o test.o

$ ./test.exe
18446744073709551615

$ echo $?
0

@hcarty
Copy link

hcarty commented May 21, 2020

If it helps - using an older 4.08.0 installation, also using fdopen's opam distribution, I don't get a segfault. On 4.10.0 I do get it.

@rseymour
Copy link

Tried compiling the above test C progs with -O3 and they compiled/ran without error. My prior test w/ an ifdef check also worked.

@msprotz
Copy link

msprotz commented May 21, 2020

I should mention that the error started happening after we upgraded GCC (specifically: mingw) to version 9. We were previously on version 7 and there was no trouble.

@rseymour
Copy link

rseymour commented May 21, 2020

Thread 1 received signal SIGSEGV, Segmentation fault.
0x000000000044e87d in copy_int128 (i=<optimized out>) at int128_stubs.c:146
146       Int128_val(res) = i;

This gdb output seems to be saying that GCC optimized out the variable i in the copy_int128() function. I tried throwing a volatile next to it in the args to the function, but it didn't seem to stop it. Feels like gcc at -O2 shouldn't optimize out the right hand side of an assignment, but who knows.

@rseymour
Copy link

I found perl folks wrap code with a lot of guards to protect it from new GCC... perl11/cperl@c0b59c8 (this was a failed(?) commit to add even more -O0 style guards to the existing ones)

@rgrinberg
Copy link
Collaborator

@rseymour do you think we should such guards as well?

@rseymour
Copy link

rseymour commented Oct 7, 2020

We didn't end up going with them as the other fix worked, but I feel like it's the "right" way to do it until GCC fixes itself.

@msprotz
Copy link

msprotz commented Oct 26, 2020

Just coming back to this issue after a long while...

Are we positive that this is a GCC bug? I can't tell from the previous discussion what the "misbehavior" might be. The fact that we observe the issue at O2 and above doesn't necessarily reflect a GCC issue; it could be that the bindings code does something illegal per the C standard, and that GCC only becomes "smart" at O2.

Maybe the issue is this: when HAVE_INT128 is absent, some unspecified issue observable only at -O2 and above causes a segfault.

Then there are two fixes:

  • make HAVE_INT128 enabled on windows/mingw
  • fix the underlying issue.

Is this an accurate summary?

Would you be open to the first fix?

FYI, we're still running our entire infrastructure with the workaround outlined in this bug: we compile the test program, see if it segfaults, and if it does, recompile ocaml-stdint with -O0 to prevent the error from happening.

Thanks!

Jonathan

@dra27
Copy link
Contributor

dra27 commented Oct 26, 2020

I just discovered I was being bitten by this too - making HAVE_INT128 is definitely not helping with gcc 9.2.0 on Cygwin 3.1.7 in a Docker container on Windows 10 1910 (10.0.18363.1139).

A brief glance suggests this should work (a very brief glance suggests mingw-w64 doesn't provide the macro at all):

diff --git a/lib/int128.h b/lib/int128.h
index 194e1ae..1db352d 100644
--- a/lib/int128.h
+++ b/lib/int128.h
@@ -1,7 +1,7 @@
 #ifndef OCAML_INT128_H
 #define OCAML_INT128_H

-#if defined(__SIZEOF_INT128__)
+#if defined(__SIZEOF_INT128__) || defined(__MINGW32__)
 #define HAVE_INT128
 typedef __int128_t int128;
 #else
diff --git a/lib/uint128.h b/lib/uint128.h
index 3f97dab..864ea40 100644
--- a/lib/uint128.h
+++ b/lib/uint128.h
@@ -1,7 +1,7 @@
 #ifndef OCAML_UINT128_H
 #define OCAML_UINT128_H

-#if defined(__SIZEOF_INT128__)
+#if defined(__SIZEOF_INT128__) || defined(__MINGW32__)
 #define HAVE_UINT128
 typedef __uint128_t uint128;
 #else

@dra27
Copy link
Contributor

dra27 commented Oct 26, 2020

This and forcing -O0 is working fine, for an unoptimized value of "fine"!

@msprotz
Copy link

msprotz commented Oct 26, 2020

Your solution seems fine to me @dra27 -- do I understand you correctly that we need both HAVE_INT128 and -O0? Definitely something fishy going on here but I agree that as long as the issue is mitigated it's "fine" for us too.

Thanks for chiming in! Delightful to see you jump in on a Windows-only obscure optimization-related bug ;-).

@dra27
Copy link
Contributor

dra27 commented Oct 26, 2020

I haven't tested if -O0 on its own is sufficient, no (it looks like it probably will be). I was going on the assumption that intrinsic support for __int128_t is better, so it seemed better to have that by default, but then I also discovered that it was still segfaulting...

A glorious coincidence to have jumped in within hours of you - I just happen to be hacking cap'n proto on mingw-w64 at the moment! 🙂

@rixed
Copy link
Contributor

rixed commented Oct 27, 2020

May I suggest adding a reference to this issue in a comment in the patch itself? I think actual comments are better than relying on the commit message in instances like these.

@dra27
Copy link
Contributor

dra27 commented Oct 27, 2020

I should have had a more in-depth glance, __SIZEOF_INT128__, is provided.

@dra27
Copy link
Contributor

dra27 commented Oct 27, 2020

OK, I've dug a bit further. The problem is alignment, and it's not Windows-specific, I expect it's just not been seen on Linux or elsewhere yet. Here's the relevant assembly snippet for copy_int128:

       leaq    int128_ops(%rip), %rcx
       movq    280(%rax), %rsi
       call    caml_alloc_custom
       movaps  %xmm6, 8(%rax)

however, this assumes that 8(%rax) is 16-byte aligned and there's absolutely no guarantee of that, as OCaml only guarantees 8-byte alignment on a 64-bit platform. If you "do something" with the 128 bit int, then gcc stops using xmm6 to hold it (I found that by coincidence, since a printf of the two 64-bit components of it "fixes" it!).

Unfortunately, this cast is invalid:

#define Int128_val(v) (*((int128 *)Data_custom_val(v)))

For whatever reason, in 4.07.1 it appears that the allocator much more readily allocated 16-byte aligned custom values (I think this is simply an image layout coincidence).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants