
SIGILL executing AVX code on processor that doesn't support it on thread tests #500

Closed
derekbruening opened this issue Nov 28, 2014 · 13 comments

Comments

@derekbruening
Contributor

From [email protected] on June 20, 2011 16:08:44

I'm seeing weird behavior in the thread tests.
I have three records of this, one of which was filed in the past.
They all involve AVX code being executed on a processor that doesn't support it.

I believe this is a native bug and has nothing to do with DR.
xref http://sourceware.org/bugzilla/show_bug.cgi?id=11214 and http://lists.mandriva.com/bugs/2011-04/msg00814.php

--------------------------------------
this was PR 538130: linux.threadexit test fails with LD_BIND_NOW unset

in the past I always had LD_BIND_NOW set for easier
debugging (gdb has issues w/ lazy binding)

when I unset it, the 64-bit linux.threadexit test fails:

/work/dr/build_suite/build_debug-internal-64/bin/runstats -s 60 -silent -env LD_LIBRARY_PATH /work/dr/build_suite/install/lib64/debug:/work/dr/build_suite/install/lib32/debug:/usr/local/lib -env LD_PRELOAD libdrpreload.so\ libdynamorio.so -env DYNAMORIO_OPTIONS -stderr_mask\ 0xf\ -dumpcore_mask\ 0\ \ -code_api\ /work/dr/build_suite/build_debug-internal-64/suite/tests/bin/linux.threadexit
=>

in the logs:

interp: start_pc = 0x0000003566a14950
0x0000003566a14950 83 3d f9 a5 20 00 00 cmp 0x0000003566c1ef50 $0x00000000
0x0000003566a14957 78 51 js $0x0000003566a149aa
end_pc = 0x0000003566a14959

exit_branch_type=0x0 bb->exit_target=0x0000003566a14959
exit_branch_type=0x11 target=0x0000003566a149aa l->flags=0x8011
exit_branch_type=0x0 target=0x0000003566a14959 l->flags=0x9001
final exit sharing prev exit's stub @ 0x000000004194dedd
Fragment 199, tag 0x0000003566a14950, flags 0x9000630, shared, size 46:

Entry into F199(0x0000003566a14950).0x000000004191f024 (shared)

Exit from F199(0x0000003566a14950).0x000000004191f04d (shared)
(target 0x0000003566a14959 not in cache)

dispatch: target = 0x0000003566a14959

interp: start_pc = 0x0000003566a14959
SYSLOG_WARNING: Invalid opcode encountered
Error decoding 0x0000003566a14959 == 0x64 0xc5 0xfd
decode: invalid instr at 0x0000003566a14959
0x0000003566a14959 64 c5 fd...??
0x0000003566a1495c 6f outs %ds:(%rsi) %dx %rsi -> %rsi
0x0000003566a1495d 04 25 add $0x25 %al -> %al
0x0000003566a1495f 80 00 00 add $0x00 (%rax) -> (%rax)
0x0000003566a14962 00 64 c5 fd add %ah 0xfffffffd(%rbp,%rax,8) -> 0xfffffffd(%rbp,%rax,8)


xref issue #396, though this seems unrelated:

in linux.threadexit2 64-bit test, non-deterministically, I see SIGILL on a fedora12 box:

master_signal_handler: sig=4, retaddr=0x000000007119b799
record_pending_signal(4) from cache pc 0x000000007362f434

(gdb) x/10i 0x0000003633014959
0x3633014959 <_dl_x86_64_restore_sse+9>: vmovdqa %fs:0x80,%ymm0
0x3633014963 <_dl_x86_64_restore_sse+19>: vmovdqa %fs:0xa0,%ymm1
0x363301496d <_dl_x86_64_restore_sse+29>: vmovdqa %fs:0xc0,%ymm2
0x3633014977 <_dl_x86_64_restore_sse+39>: vmovdqa %fs:0xe0,%ymm3
0x3633014981 <_dl_x86_64_restore_sse+49>: vmovdqa %fs:0x100,%ymm4
0x363301498b <_dl_x86_64_restore_sse+59>: vmovdqa %fs:0x120,%ymm5
0x3633014995 <_dl_x86_64_restore_sse+69>: vmovdqa %fs:0x140,%ymm6
0x363301499f <_dl_x86_64_restore_sse+79>: vmovdqa %fs:0x160,%ymm7
0x36330149a9 <_dl_x86_64_restore_sse+89>: retq
(gdb) x/10i 0x000000007362f434
0x7362f434: vmovdqa %fs:0x80,%ymm0
0x7362f43e: vmovdqa %fs:0xa0,%ymm1
0x7362f448: vmovdqa %fs:0xc0,%ymm2
0x7362f452: vmovdqa %fs:0x5eb02e0,%ymm3
0x7362f45c: jmpq 0x7365dd28
0x7362f461: jmpq 0x7365dd28
(gdb) x/10bx 0x363301496d
0x363301496d <_dl_x86_64_restore_sse+29>: 0x64 0xc5 0xfd 0x6f 0x14 0x25 0xc0 0x00
0x3633014975 <_dl_x86_64_restore_sse+37>: 0x00 0x00
(gdb) x/10bx 0x3633014977
0x3633014977 <_dl_x86_64_restore_sse+39>: 0x64 0xc5 0xfd 0x6f 0x1c 0x25 0xe0 0x00
0x363301497f <_dl_x86_64_restore_sse+47>: 0x00 0x00
(gdb) x/10bx 0x7362f452
0x7362f452: 0x64 0xc5 0xfd 0x6f 0x1c 0x25 0xe0 0x02
0x7362f45a: 0xeb 0x05


now I'm seeing AVX instructions on x64 Lucid, causing failures in the thread tests:

bin64/drrun -debug suite/tests/bin/linux.thread

Illegal instruction

interp: start_pc = 0x00007fbb9f8c0999
check_thread_vm_area: pc = 0x00007fbb9f8c0999
SYSLOG_WARNING: Invalid opcode encountered
Error decoding 0x00007fbb9f8c0999 == 0x64 0xc5 0xfd
decode: invalid instr at 0x00007fbb9f8c0999
0x00007fbb9f8c0999 64 c5 fd...??
0x00007fbb9f8c099c 6f outs %ds:(%rsi) %dx %rsi -> %rsi
0x00007fbb9f8c099d 04 25 add $0x25 %al -> %al
wrote all 6 flags now!
0x00007fbb9f8c099f 80 00 00 add $0x00 (%rax) -> (%rax)
0x00007fbb9f8c09a2 00 64 c5 fd add %ah 0xfffffffd(%rbp,%rax,8) -> 0xfffffffd(%rbp,%rax,8)

module ld-linux-x86-64.so.2 [0x00007fbb9f8ab000,0x00007fbb9facd000] added

(gdb) x/4i 0x00007fbb9f8c0999 - 0x00007fbb9f8ab000 + 0x00007f30fccde000
0x7f30fccf3999 <_dl_x86_64_restore_sse+9>: vmovdqa %fs:0x80,%ymm0
0x7f30fccf39a3 <_dl_x86_64_restore_sse+19>: vmovdqa %fs:0xa0,%ymm1
0x7f30fccf39ad <_dl_x86_64_restore_sse+29>: vmovdqa %fs:0xc0,%ymm2
(gdb) x/8bx 0x00007fbb9f8c0999 - 0x00007fbb9f8ab000 + 0x00007f30fccde000
0x7f30fccf3999 <_dl_x86_64_restore_sse+9>: 0x64 0xc5 0xfd 0x6f 0x04 0x25 0x80 0x00

Original issue: http://code.google.com/p/dynamorio/issues/detail?id=500

@derekbruening
Contributor Author

From [email protected] on August 22, 2011 18:27:53

This affects my desktop Linux box at home, so I'd like to at least find a workaround for this. I'll try to describe the underlying race here.

Inside dl_fixup, there are these RTLD*_FOREIGN_CALL macros. The idea is that if they ever need to make a "foreign call" (one that requires saving SSE regs), they save SSE regs in TLS and set a flag so that RTLD_FINALIZE_FOREIGN_CALL will restore them on the way back out. The problem is that using clone with CLONE_VM without setting up a fresh TLS space makes the flag thread-shared. You get the following sequence of events:

  • CLONE_VM, two threads using same TLS
  • simultaneously enter _dl_fixup
  • A enters RTLD_ENABLE_FOREIGN_CALL
  • A loads rtld_must_xmm_save 0
  • A stores rtld_must_xmm_save 1
  • B enters RTLD_ENABLE_FOREIGN_CALL
  • B loads rtld_must_xmm_save 1
  • B stores rtld_must_xmm_save 1
  • A enters RTLD_FINALIZE_FOREIGN_CALL
  • B loads rtld_must_xmm_save 1, does no restore
  • A stores rtld_must_xmm_save 0
  • B enters RTLD_FINALIZE_FOREIGN_CALL
  • B loads rtld_must_xmm_save 0, attempts restore
  • B loads L(have_avx) 0, since uninitialized
  • B js falls through to ymm instrs, because have_avx is non-negative

The L(have_avx) global is supposed to be tri-valued, 1 if AVX is enabled, 0 if uninitialized, and -1 if AVX is disabled. The problem is that with the race condition, the first call of the pair does not happen, have_avx is zero, and the branch goes the wrong way into the AVX instructions.

Perhaps the solution is to find all of our uses of the raw clone syscall, pass along TLS space, and set the CLONE_SETTLS flag. Is this reasonable? The manpage says it has been available since Linux 2.5.32, which is pretty old.

@derekbruening
Contributor Author

From [email protected] on August 22, 2011 18:34:43

For tests that are just trying to create threads w/o using pthreads and were just written so long ago that all this support wasn't there yet (hmmm now I feel old), that's reasonable. But, some of the tests are deliberately testing weird clone parameters: though maybe none of those are hitting this race. Can you enumerate the tests you've seen this on?

@derekbruening
Contributor Author

From [email protected] on August 22, 2011 18:41:34

The following tests fail for me:
10 - code_api|linux.clone (Failed)
42 - code_api|linux.thread (Failed)
43 - code_api|linux.threadexit (Failed)
44 - code_api|linux.threadexit2 (Failed)

I only set out to debug this because much more of the test suite was failing for me, but now it's just these 4 that fail. On the other hand, it is a race condition, so it could be affecting other tests and we just don't know it.

I'm looking into fixing linux.clone now.

Status: Started
Owner: [email protected]

@derekbruening
Contributor Author

From [email protected] on August 23, 2011 13:37:25

The underlying problem is that if we do the clone syscall ourselves without going through nptl, lots of state in glibc becomes inconsistent.

My first attempt at this was to take one page from the far end of the stack and use that for TLS. This broke down because core/linux/loader.c assumes that the TLS pointer passed to clone is nptl's TCB, which the private loader then copies into its own TLS area so it can call functions in libc.

I "fixed" that problem by copying the parent thread's TLS space into the child's TLS space, and adjusting the child's %fs:0 slot to point to the child's TCB. That made linux.clone pass under DR 90% of the time. The failures are likely caused by races on inadvertently shared structures hanging off of the TLS areas.

I don't think we can avoid the use of libc in the application to solve this problem because DynamoRIO uses libc at arbitrary points anyway. Before copying the parent's TLS data into the child's TLS area, I was crashing inside our_vsscanf, which uses libc vsscanf.

It might be possible to make this test work reliably by finding a way to call into the internal nptl TLS preparation functions from the application before calling clone.

Alternatively, we could use pthreads. As far as thread.c is concerned, we would either use pthreads or delete it since we already have a pthreads test. If we really want to test clone with CLONE_VM without CLONE_THREAD, we could create a test using pthreads that modifies __clone to mask out CLONE_THREAD from flags.

I'm going to unassign since I'm not sure how to proceed.

Owner: ---

@derekbruening
Contributor Author

From [email protected] on October 07, 2011 11:24:05

I am running into this issue on my 64-bit Ubuntu box. I just managed to compile DynamoRIO, and even a hello world program fails with the following error:

bin64/drrun -debug suite/tests/bin/linux.thread

I don't see the illegal instruction issue, though. Any inputs?

@derekbruening
Contributor Author

From [email protected] on October 07, 2011 11:29:16

This should be a native bug that only occurs if you use clone the way we do in our test suite. If you use regular pthreads, you should never see this problem. I'm working on cleaning up our test suite at the moment so flaky tests like this one aren't run in the default set, and eventually I plan to fix it.

We're probably just going to rewrite that test to use pthreads. It was written when thread support on Linux was extremely different from platform to platform and wasn't really reliable. We'd still like to keep coverage of this particular style of clone call in suite/tests/linux/clone.c, though, and I need to figure out how to make that test pass reliably.

@derekbruening
Contributor Author

From [email protected] on October 08, 2011 02:26:43

I think I was not very clear in my earlier comment.

I see the issue even when I run a single-threaded hello world program. I commented here instead of opening a new bug since the failures looked superficially identical.

Should I file a new bug?

@derekbruening
Contributor Author

From [email protected] on October 08, 2011 08:06:29

Those messages are unrelated syslog warnings that I always see. They indicate that DR failed to allocate memory in the space that it requested, so there might be problems. It isn't generally an issue, though.

Does the application continue as expected after that? You can silence them with -stderr_mask 0 I believe.

@derekbruening
Contributor Author

From [email protected] on October 08, 2011 10:25:58

Here is a sample run. The application does not run; this seems to be happening before main() gets called.

$ ./bin64/drrun -debug -64 -logdir /tmp/d -client ./api/samples/bin/libbbsize.so 0x1 "" ./hello.exe
<Starting application perf.exe (11960)>
<Application perf.exe (11960). Internal Error Internal DynamoRIO Error: /opt/SVM/dynamorio-read-only/core/linux/os.c:5576 ok
(Error occurred @0 frags)
version 3.0.1028, custom build
-client_lib '/opt/SVM/dynamorio-read-only/api/samples/bin/libbbsize.so;0x1;' -code_api -max_elide_jmp 0 -max_elide_call 0 -no_inline_ignored_syscalls -no_native_exec -no_indcall2direct
0x00007fff38880310 0x00000000710ff180
0x00007fff38880470 0x000000007128479a
0x00007fff388804a0 0x00000000711aa3d9
0x00007fff388804d0 0x000000007118877f
0x00007fff38880500 0x000000007118b8b1
0x00007fff38880540 0x00000000712a133b
0x00007fff38880570 0x00000000711de4b8
0x00007fff388805a0 0x00000000711de83a
0x00007fff388805e0 0x00000000711dd7fd
0x00007fff388806d0 0x00000000710819f5>

@derekbruening
Contributor Author

From [email protected] on October 12, 2011 09:01:45

Split comment 9 into issue #579, as it is unrelated to this issue.

@derekbruening
Contributor Author

From [email protected] on October 24, 2011 12:08:34

My proposal for actually fixing this issue is instead of setting up TLS, to completely avoid using libc in the child thread. I've implemented this for linux.clone, and it works perfectly. Does it seem reasonable to fix the rest of the Linux threading tests to follow suit? I'm putting the non-libc syscall wrappers in tools.c to share them between tests.

@derekbruening
Contributor Author

From [email protected] on October 24, 2011 12:08:55

Owner: [email protected]

@derekbruening
Contributor Author

From [email protected] on October 25, 2011 12:30:07

This issue was closed by revision r1094.

Status: Fixed
