Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

APP CRASH post-detach after re-attach #2175

Closed
derekbruening opened this issue Feb 14, 2017 · 2 comments
Closed

APP CRASH post-detach after re-attach #2175

derekbruening opened this issue Feb 14, 2017 · 2 comments

Comments

@derekbruening
Copy link
Contributor

My #2157 re-attach tree tries to add a re-attach loop to the burst_threads
test, but it sometimes fails for 32-bit:

Output:
----------------------------------------------------------
Running |/work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads|
CMake Error at /work/dr/git/src/suite/tests/runmulti.cmake:98 (message):
  *** cmd failed (1): pre-DR init
  pre-DR start
  pre-DR detach
  pre-DR start
  pre-DR detach
  <ERROR: master_signal_handler with no siginfo (i#26?): tid=12885, sig=11>
  ***
Call Stack (most recent call first):
  /work/dr/git/src/suite/tests/runmulti.cmake:107 (process_cmdline)
<end of output>
Test time =   3.39 sec

I can repro this manually every couple of attempts:

> clients/bin32/tool.drcacheoff.burst_threads
pre-DR init
<Starting application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (21653)>
<unable to determine lib path for cross-arch execve>
<Initial options = -client_lib ';;-offline -max_trace_size 256K' -msgbox_mask 12 -pause_via_loop -stack_size 56K -max_elide_jmp 0 -max_elide_call 0 -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct >
pre-DR start
<Attached to 9 threads in application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (21653)>
pre-DR detach
<Detaching from application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (21653)>
<Detaching from process, entering final cleanup>
<Starting application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (21653)>
<Initial options = -client_lib ';;-offline -max_trace_size 256K;;;-offline -max_trace_size 256K' -msgbox_mask 12 -pause_via_loop -stack_size 56K -max_elide_jmp 0 -max_elide_call 0 -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct >
pre-DR start
<Attached to 9 threads in application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (21653)>
pre-DR detach
<Detaching from application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (21653)>
<Detaching from process, entering final cleanup>
<Starting application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (21653)>
<Initial options = -client_lib ';;-offline -max_trace_size 256K;;;-offline -max_trace_size 256K;;;-offline -max_trace_size 256K' -msgbox_mask 12 -pause_via_loop -stack_size 56K -max_elide_jmp 0 -max_elide_call 0 -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct >
pre-DR start
<Attached to 9 threads in application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (21653)>
pre-DR detach
<Detaching from application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (21653)>
<ERROR: master_signal_handler with no siginfo (i#26?): tid=21659, sig=11>
<Detaching from process, entering final cleanup>

There are only 2 threads (there's no synch stopping the other app threads
from exiting while I attach the debugger, so it's expected):

(gdb) thread apply all bt

Thread 2 (Thread 0xf4b56b40 (LWP 21659)):
#0  syscall_0args () at /work/dr/git/src/core/arch/x86/x86_shared.asm:176
#1  0xf4b56b40 in ?? ()
#2  0x0812dd7b in notify (priority=SYSLOG_ERROR, internal=true, synch=true, substitution_num=0, prefix=0x8370d2d "SYSLOG_ERROR", 
    fmt=0x8370ce8 "ERROR: master_signal_handler with no siginfo (i#26?): tid=%d, sig=%d") at /work/dr/git/src/core/utils.c:1983
#3  0x082b53ed in master_signal_handler_C (xsp=0xf4b55d3c "\320\vv\367\v") at /work/dr/git/src/core/unix/signal.c:4431
#4  0x08283c67 in master_signal_handler () at /work/dr/git/src/core/arch/x86/x86.asm:1451
#5  0xf4b55d3c in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 1 (Thread 0xf735d700 (LWP 21653)):
#0  0xf7760be9 in __kernel_vsyscall ()
#1  0xf771e20d in pthread_join () from /lib/libpthread.so.0
#2  0x08057061 in main ()

(gdb) p /x *sc
$1 = {
  gs = 0x63, 
  __gsh = 0x0, 
  fs = 0x73, 
  __fsh = 0x0, 
  es = 0x2b, 
  __esh = 0x0, 
  ds = 0x2b, 
  __dsh = 0x0, 
  edi = 0xf4b56b40, 
  esi = 0x0, 
  ebp = 0x84370e1b, 
  esp = 0xf4b56238, 
  ebx = 0xf7731000, 
  edx = 0x0, 
  ecx = 0xf3325930, 
  eax = 0x0, 
  trapno = 0xe, 
  err = 0x14, 
  eip = 0x3d3424d0, 
  cs = 0x23, 
  __csh = 0x0, 
  eflags = 0x10203, 
  esp_at_signal = 0xf4b56238, 
  ss = 0x2b, 
  __ssh = 0x0, 
  fpstate = 0xf4b55e50, 
  oldmask = 0x0, 
  cr2 = 0x3d3424d0
}
(gdb) x/4i sc->eip
   0x3d3424d0:  Cannot access memory at address 0x3d3424d0
(gdb) p heapmgmt
No symbol "heapmgmt" in current context.
(gdb) p heapmgt
$2 = (heap_management_t *) 0x83ea700 <temp_heapmgt>

(gdb) dps sc->esp sc->esp+180
0xf4b56238  0x00000000  No symbol matches (void *)$retaddr.
0xf4b5623c  0x3ff00000  No symbol matches (void *)$retaddr.
0xf4b56240  0x77feab6d  No symbol matches (void *)$retaddr.
0xf4b56244  0x39db8cf4  No symbol matches (void *)$retaddr.
0xf4b56248  0x00000000  No symbol matches (void *)$retaddr.
0xf4b5624c  0x00000000  No symbol matches (void *)$retaddr.
0xf4b56250  0x88000000  No symbol matches (void *)$retaddr.
0xf4b56254  0x3d3424d0  No symbol matches (void *)$retaddr.
0xf4b56258  0x00000000  No symbol matches (void *)$retaddr.
0xf4b5625c  0x00000000  No symbol matches (void *)$retaddr.
0xf4b56260  0x02000000  No symbol matches (void *)$retaddr.
0xf4b56264  0x41a00000  No symbol matches (void *)$retaddr.
0xf4b56268  0xf7597000  _GLOBAL_OFFSET_TABLE_ in section .got.plt of /lib/libm.so.6
0xf4b5626c  0xf755675c  sin + 6156 in section .text of /lib/libm.so.6

(gdb) x/4i 0xf755675c-5
   0xf7556757 <sin+6151>:       call   0xf757b930 <__dubsin>
   0xf755675c <sin+6156>:       add    $0x20,%esp
   0xf755675f <sin+6159>:       fldl   0x10(%esp)

So the app thread is in do_some_work() calling sin() presumably in the code
cache.

Yet here's another one -- it's at a native PC but eax seems to be messed up:

(gdb) up 3
#3  0x082b5501 in master_signal_handler_C (xsp=0xf732cd3c "\320+s\367\v") at /work/dr/git/src/core/unix/signal.c:4431
4431                SYSLOG_INTERNAL_ERROR("ERROR: master_signal_handler with no siginfo "
(gdb) x/4i sc->eip
   0xf754dc3c <__dubsin+780>:   fldl   -0xe5e0(%eax)
   0xf754dc42 <__dubsin+786>:   fldl   -0x13ab8(%eax)
   0xf754dc48 <__dubsin+792>:   fmul   %st(1),%st
   0xf754dc4a <__dubsin+794>:   fld    %st(1)
(gdb) x/4wx sc->eax
0x2000c006:     Cannot access memory at address 0x2000c006
(gdb) x/4wx sc->eax-0xe5e0
0x1fffda26:     Cannot access memory at address 0x1fffda26
(gdb) p /x *sc
$2 = {
  edi = 0xf732db40, 
  esi = 0xf732d2d0, 
  ebp = 0xf732d338, 
  esp = 0xf732d208, 
  ebx = 0xf732d2d0, 
  edx = 0x0, 
  ecx = 0xe31de8e8, 
  eax = 0x2000c006, 
  eip = 0xf754dc3c, 
}

translating thread 29600 to 0xf754dc08
thread 29600 will resume as native
thread 29600 detaching to PC 0xf754dc08
<...>
waiting for thread 29601
<ERROR: master_signal_handler with no siginfo (i#26?): tid=29600, sig=11>

c3c is not far from c08: one cbr in between, reasonable that it got that
far and then crashed.
All floats: no eax ref in between.

Another one has a bad ebp:

(gdb) x/1i sc->eip
   0x8056e3e <do_some_work(int)+51>:    fstpl  -0x10(%ebp)
(gdb) x/4wx sc->ebp-0x10
0xf75f2a60 <s7>:        0x5816c78d      0xbf2a019f      0xda488820      0xbc21ac06

f73c9000-f758f000 r-xp 00000000 09:00 1355064                            /usr/lib/libc-2.21.so

OK I got the bad PC again with more info now:

translating thread 31961 to 0xf7542993 eax=0xf755e000 ecx=0xf330d270 ebx=0xf4b1d2d0 edx=0x00000000 edi=0xf4b1db40 esi=0xf4b1d2d0 ebp=0xf4b1d240 esp=0xf4b1d208
thread 31961 will resume as native
thread 31961 detaching to PC 0xf7542993 eax=0xf755e000 ecx=0xf330d270 ebx=0xf4b1d2d0 edx=0x00000000 edi=0xf4b1db40 esi=0xf4b1d2d0 ebp=0xf4b1d240 esp=0xf4b1d208

<ERROR: master_signal_handler with no siginfo (i#26?): tid=31961, sig=11>

(gdb) x/2i sc->eip
   0xb9c97f20:  Cannot access memory at address 0xb9c97f20
(gdb) x/10i 0xf7542993
   0xf7542993 <__dubsin+99>:    fmul   %st(2),%st
   0xf7542995 <__dubsin+101>:   fxch   %st(2)
   0xf7542997 <__dubsin+103>:   fmul   %st(1),%st
(gdb) p /x *sc
  edi = 0xf4b1db40, 
  esi = 0x0, 
  ebp = 0x60004b1d, 
  esp = 0xf4b1d248, 
  ebx = 0xf76f8000, 
  edx = 0x0, 
  ecx = 0xf330d270, 
  eax = 0x0, 
  eip = 0xb9c97f20, 
  eflags = 0x10242, 
(gdb) dps 0xf4b1d208 0xf4b1d248+16
0xf4b1d208  0x00000000  No symbol matches (void *)$retaddr.
0xf4b1d20c  0x00000000  No symbol matches (void *)$retaddr.
0xf4b1d210  0x00000000  No symbol matches (void *)$retaddr.
0xf4b1d214  0x00000000  No symbol matches (void *)$retaddr.
0xf4b1d218  0x00000000  No symbol matches (void *)$retaddr.
0xf4b1d21c  0x00000000  No symbol matches (void *)$retaddr.
0xf4b1d220  0x00000000  No symbol matches (void *)$retaddr.
0xf4b1d224  0x00000000  No symbol matches (void *)$retaddr.
0xf4b1d228  0x00000000  No symbol matches (void *)$retaddr.
0xf4b1d22c  0x00000000  No symbol matches (void *)$retaddr.
0xf4b1d230  0x00000000  No symbol matches (void *)$retaddr.
0xf4b1d234  0x00000000  No symbol matches (void *)$retaddr.
0xf4b1d238  0x00000000  No symbol matches (void *)$retaddr.
0xf4b1d23c  0x00000000  No symbol matches (void *)$retaddr.
0xf4b1d240  0x46505845  No symbol matches (void *)$retaddr.
0xf4b1d244  0xb9c97f20  No symbol matches (void *)$retaddr.
0xf4b1d248  0x00000000  No symbol matches (void *)$retaddr.

So it did a ret to get to the bad PC.
Except the stack frame is bigger than 0x3c! It's 0x64, as shown below. ?!?!

(gdb) disas 0xf7542993
Dump of assembler code for function __dubsin:
   0xf7542930 <+0>:     push   %ebx
   0xf7542931 <+1>:     call   0xf7517162 <__x86.get_pc_thunk.ax>
   0xf7542936 <+6>:     add    $0x1b6ca,%eax
   0xf754293b <+11>:    sub    $0x60,%esp
   <...>
   0xf7542fd6 <+1702>:  add    $0x60,%esp
   0xf7542fd9 <+1705>:  pop    %ebx
   0xf7542fda <+1706>:  ret    

Running in DR -loglevel 3 in a single thread: the next OP_ret after
entering __subsin is that 0x60-frame ret.

Another one, this one with an even smaller frame shift:

translating thread 2376 to 0xf7512a4e eax=0xf752e000 ecx=0xf32db220 ebx=0xf72f22d0 edx=0x00000000 edi=0xf72f2b40 esi=0xf72f22d0 ebp=0xf72f2218 esp=0xf72f2208
thread 2376 will resume as native
thread 2376 detaching to PC 0xf7512a4e eax=0xf752e000 ecx=0xf32db220 ebx=0xf72f22d0 edx=0x00000000 edi=0xf72f2b40 esi=0xf72f22d0 ebp=0xf72f2218 esp=0xf72f2208
<ERROR: master_signal_handler with no siginfo (i#26?): tid=2376, sig=11>
(gdb) x/2i sc->eip
   0x36d18f7a:  Cannot access memory at address 0x36d18f7a
(gdb) x/4i 0xf7512a4e
   0xf7512a4e <__dubsin+286>:   faddp  %st,%st(2)
   0xf7512a50 <__dubsin+288>:   fld    %st(0)
   0xf7512a52 <__dubsin+290>:   fadd   %st(2),%st
   0xf7512a54 <__dubsin+292>:   fsubr  %st,%st(1)
(gdb) p /x *sc
  edi = 0xf72f2b40, 
  esi = 0x0, 
  ebp = 0x2bd4f354, 
  esp = 0xf72f2220, 
  ebx = 0xf76c8000, 
  edx = 0x0, 
  ecx = 0xf32db220, 
  eax = 0x1, 
  eip = 0x36d18f7a, 
  eflags = 0x10202, 
(gdb) dps 0xf72f2208 0xf72f2220+8
0xf72f2208  0x40cfb0ee  No symbol matches (void *)$retaddr.
0xf72f220c  0x3d14f217  No symbol matches (void *)$retaddr.
0xf72f2210  0x7fffd024  No symbol matches (void *)$retaddr.
0xf72f2214  0x39a8d3eb  No symbol matches (void *)$retaddr.
0xf72f2218  0x2bd4f354  No symbol matches (void *)$retaddr.
0xf72f221c  0x36d18f7a  No symbol matches (void *)$retaddr.
0xf72f2220  0x00000000  No symbol matches (void *)$retaddr.

Look at ebp as well, for both: it's very different from esp, when __dubsin
does not write or even read it.

Here's another bad-ebp one, once again pointing into a library code
section, but this one has the bad ebp value in the xl8:

thread 4857 detaching to PC 0xf7556a6a eax=0xf7572000 ecx=0xf771fee0 ebx=0xf5b332d0 edx=0x00000000 edi=0xf5b33b40 esi=0xf5b332d0 ebp=0xf7563a70 esp=0xf5b33208

(gdb) x/2i sc->eip
   0x8056e3e <do_some_work(int)+51>:    fstpl  -0x10(%ebp)
   0x8056e41 <do_some_work(int)+54>:    addl   $0x1,-0x14(%ebp)
(gdb) p /x *sc
  edi = 0xf5b33b40, 
  esi = 0x0, 
  ebp = 0xf7563a70, 
  esp = 0xf5b33320, 
  ebx = 0xf770c000, 
  edx = 0x0, 
  ecx = 0xf7564180, 
  eax = 0x1, 
  eip = 0x8056e3e, 

Removing the re-attach loop: I cannot repro any crashes in 200+ runs.
With re-attach: I can get a crash just about every other time.
Yet the detach context, etc. should all be per-thread: there's no global
state here, right?

Is it corruption from some other thread?
Or is it some sigreturn thing?
1 thread instead of 8: 200 iters w/ no crashes.

Check out this one: I added an extra re-attach, and on the final detach, we
only have 3 other threads, and 2 of them both end up crashing, one with a
bad PC and one with a bad eax which it sets itself:

<Attached to 8 threads in application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (6365)>
pre-DR detach
<Detaching from application /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads (6365)>
synched with 4 threads
translating thread 6365 to 0xf7787be9 eax=0xfffffffc ecx=0x00000000 ebx=0xf6b81ba8 edx=0x000018df edi=0xf6b81b40 esi=0x00000000 ebp=0x000000f0 esp=0xffed2380
thread 6365 will resume as native
thread 6365 detaching sc=0xe32feba0 to PC 0xf7787be9 eax=0xfffffffc ecx=0x00000000 ebx=0xf6b81ba8 edx=0x000018df edi=0xf6b81b40 esi=0x00000000 ebp=0x000000f0 esp=0xffed2380
translating thread 6367 to 0xf75a2dff eax=0x2000c006 ecx=0xf335bad0 ebx=0xf6b812d0 edx=0x00000000 edi=0xf6b81b40 esi=0xf6b812d0 ebp=0xf6b81338 esp=0xf6b81208
thread 6367 will resume as native
thread 6367 detaching sc=0xe3396ba0 to PC 0xf75a2dff eax=0x2000c006 ecx=0xf335bad0 ebx=0xf6b812d0 edx=0x00000000 edi=0xf6b81b40 esi=0xf6b812d0 ebp=0xf6b81338 esp=0xf6b81208
translating thread 6373 to 0x082dc670 eax=0x00000000 ecx=0x00000000 ebx=0x083a43ec edx=0x00000049 edi=0xf3b7bb40 esi=0x00000000 ebp=0xf3b7a490 esp=0xf3b7a470
thread 6373 will resume as native
waiting for thread 6365
waiting for thread 6367
thread 6373 detaching sc=0xe34c6ba0 to PC 0x082dc670 eax=0x00000000 ecx=0x00000000 ebx=0x083a43ec edx=0x00000049 edi=0xf3b7bb40 esi=0x00000000 ebp=0xf3b7a490 esp=0xf3b7a470
waiting for thread 6373
<ERROR: master_signal_handler with no siginfo (i#26?): tid=6367, sig=11>
cleaning up thread 6365
cleaning up thread 6367
cleaning up thread 6373
<Detaching from process, entering final cleanup>

(gdb) thread 2
[Switching to thread 2 (Thread 0xf3b7bb40 (LWP 6373))]
#0  syscall_0args () at /work/dr/git/src/core/arch/x86/x86_shared.asm:176
176             pop      REG_XDI
(gdb) up 3
#3  0x082b559e in master_signal_handler_C (xsp=0xf3b7ad3c "\320{x\367\v") at /work/dr/git/src/core/unix/signal.c:4431
4431                SYSLOG_INTERNAL_ERROR("ERROR: master_signal_handler with no siginfo "
(gdb) x/2i sc->eip
   0x3d2466dc:  Cannot access memory at address 0x3d2466dc
(gdb) thread 3
[Switching to thread 3 (Thread 0xf6b81b40 (LWP 6367))]
#0  syscall_0args () at /work/dr/git/src/core/arch/x86/x86_shared.asm:176
176             pop      REG_XDI
(gdb) up 3
#3  0x082b559e in master_signal_handler_C (xsp=0xf6b80d3c "\320{x\367\v") at /work/dr/git/src/core/unix/signal.c:4431
4431                SYSLOG_INTERNAL_ERROR("ERROR: master_signal_handler with no siginfo "
(gdb) x/2i sc->eip
   0xf75a3024 <__dubsin+1780>:  faddl  -0xe5b8(%eax)
   0xf75a302a <__dubsin+1786>:  jmp    0xf75a2e23 <__dubsin+1267>
(gdb) p /x $eax
$1 = 0x0
(gdb) x/2i 0xf75a2dff
   0xf75a2dff <__dubsin+1231>:  fucomi %st(4),%st
   0xf75a2e01 <__dubsin+1233>:  fstp   %st(4)
(gdb) x/2i 0xf7787be9 
   0xf7787be9 <__kernel_vsyscall+9>:    pop    %ebp
   0xf7787bea <__kernel_vsyscall+10>:   pop    %edx
(gdb) x/2i 0x082dc670 
   0x82dc670 <syscall_0args+6>: pop    %edi
   0x82dc671 <syscall_0args+7>: pop    %esi

The bad PC one was sent to a DR address?!?

(gdb) x/10i 0x082dc670 -2
   0x82dc66e <syscall_0args+4>: int    $0x80
   0x82dc670 <syscall_0args+6>: pop    %edi
   0x82dc671 <syscall_0args+7>: pop    %esi
   0x82dc672 <syscall_0args+8>: pop    %ebp
   0x82dc673 <syscall_0args+9>: pop    %ebx
   0x82dc674 <syscall_0args+10>:        ret    
(gdb) dps 0xf3b7a470 0xf3b7a470+20
0xf3b7a470  0xf3b7bb40  No symbol matches (void *)$retaddr.
0xf3b7a474  0x00000000  No symbol matches (void *)$retaddr.
0xf3b7a478  0xf3b7a490  No symbol matches (void *)$retaddr.
0xf3b7a47c  0x083a43ec  _GLOBAL_OFFSET_TABLE_ in section .got.plt of /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads
0xf3b7a480  0x0829b311  os_thread_yield + 28 in section .text of /work/dr/build_suite/build_debug-internal-32/clients/bin32/tool.drcacheoff.burst_threads

Xref self-interp #2161. Doesn't that require an app timer signal, though?
Would SIGCHLD trigger it? Isn't SIGCHLD only going to the parent?

At this point I'm adding the re-attach best-effort code I have and
disabling this test, to make progress, and filing this as a bug to fix
separately.

@derekbruening
Copy link
Contributor Author

I can no longer reproduce this. @Carrotman42 put in some re-attach fixes over the past few months which perhaps fixed this: not going to take the time to do a binary search. Closing.

@derekbruening
Copy link
Contributor Author

To add more details: I ran burst_threads 500x on 2 different machines including the one where before it reproduced every couple of runs.

derekbruening added a commit that referenced this issue Feb 21, 2018
Instate 4 re-attach iters in the burst_threads and burst_thread*filter
tests.
Add "firstglob" support to runmulti to post-process only one of the 4
re-attach traces to keep the tests simpler and faster.
Add testing of the thread filter callback being reset across re-attach by
passing in a specific user param.

Issue: #2175, #2820
derekbruening added a commit that referenced this issue Feb 22, 2018
Resets callback functions at exit to better support re-attach.

Adds re-attach testing to the burst_threads and burst_thread*filter tests.

Adds "firstglob" support to runmulti to post-process only one of the 4
re-attach traces to keep the tests simpler and faster.

Adds testing of the thread filter callback being reset across re-attach by
passing in a specific user param.

Issue: #2175, #2820
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants