
Flaky test pthreads_exit on ARM #2075 (Closed)

zhaoqin opened this issue Nov 25, 2016 · 10 comments

zhaoqin (Contributor, Author) commented Nov 25, 2016

<Starting application /home/zhaoqin/Workspace/DynamoRIO/nightly/run/build_debug-internal-32/suite/tests/bin/pthreads.pthreads_exit (17431)>
<Paste into GDB to debug DynamoRIO clients:
set confirm off
add-symbol-file '/home/zhaoqin/Workspace/DynamoRIO/nightly/run/build_debug-internal-32/lib32/debug/libdynamorio.so' 0x710228d8
>
<Initial options = -no_dynamic_options -code_api -msgbox_mask 12 -stack_size 56K -max_elide_jmp 0 -max_elide_call 0 -early_inject -emulate_brk -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct >
<(1+x) Handling our fault in a TRY at 0x71215b3c>
<Stopping application /home/zhaoqin/Workspace/DynamoRIO/nightly/run/build_debug-internal-32/suite/tests/bin/pthreads.pthreads_exit (17431)>
<Application /home/zhaoqin/Workspace/DynamoRIO/nightly/run/build_debug-internal-32/suite/tests/bin/pthreads.pthreads_exit (17431).  Internal Error: DynamoRIO debug check failure: /home/zhaoqin/Workspace/DynamoRIO/nightly/src/core/heap.c:1410 IF_WINDOWS(doing_detach || ) heapmgt->vmheap.num_free_blocks == heapmgt->vmheap.num_blocks - unfreed_blocks || (ever_beyond_vmm && heapmgt->vmheap.num_free_blocks >= heapmgt->vmheap.num_blocks - unfreed_blocks)
(Error occurred @1796 frags)
version 6.2.17127, custom build
-no_dynamic_options -code_api -msgbox_mask 12 -stack_size 56K -max_elide_jmp 0 -max_elide_call 0 -early_inject -emulate_brk -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct 
0x710d8cc1 0x78af04b0>
zhaoqin (Contributor, Author) commented Nov 25, 2016

(gdb) p heapmgt->vmheap.num_free_blocks
$1 = 8180
(gdb) p heapmgt->vmheap.num_blocks     
$2 = 8192
(gdb) p/x heapmgt->vmheap.start_addr
$6 = 0x4e97c000
(gdb) p/x heapmgt->vmheap.blocks    
$7 = {0xfc3fc3ff, 0xffffffff, 0xffffffff, 0x3fffffff, 0xfffffffc, 0xffffffff <repeats 251 times>, 0x0 <repeats 768 times>}
(gdb) p/x heapmgt->vmheap           
$5 = {start_addr = 0x4e97c000, end_addr = 0x5697c000, alloc_start = 0x0, alloc_size = 0x8000000, num_blocks = 0x2000, lock = {lock_requests = 0xffffffff, 
    contended_event = 0xffffffff, name = 0x7128ddd0, rank = 0x4a, owner = 0x0, owning_dcontext = 0x0, prev_owned_lock = 0x0, count_times_acquired = 0xbb, 
    count_times_contended = 0x0, count_times_spin_pause = 0x0, max_contended_requests = 0x0, count_times_spin_only = 0x0, prev_process_lock = 0x7132d654, 
    next_process_lock = 0x7132d17c, callstack = {0x0, 0x0, 0x0, 0x0}, app_lock = 0x0, deleted = 0x0}, num_free_blocks = 0x1ff4, blocks = {0xfc3fc3ff, 0xffffffff, 
    0xffffffff, 0x3fffffff, 0xfffffffc, 0xffffffff <repeats 251 times>, 0x0 <repeats 768 times>}}

It is clear that the block starting @ 0x4eb74000 is not freed.

vmm_heap_reserve_blocks: size=65536 blocks=4 p=0x4eb74000
vmm_heap_reserve: size=65536 p=0x4eb74000
os_heap_commit: 57344 bytes @ 0x4eb75000
heap_mmap: 8192 bytes [/ 8192] @ 0x4eb51000
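
For reference, the leaked range can be derived from the bitmap dump above: alloc_size 0x8000000 / num_blocks 0x2000 gives a 16KB block. A minimal standalone sketch (not DynamoRIO code) using the dumped values, assuming a set bit means "free" and LSB-first bit order within each 32-bit word, which makes the arithmetic line up with the reported address:

#include <stdio.h>
#include <stdint.h>

/* Standalone sketch: decode the vmheap bitmap dumped above.
 * Block size = alloc_size / num_blocks = 0x8000000 / 0x2000 = 0x4000 (16KB).
 * Zero bits mark in-use blocks under the assumptions stated above. */
int main(void)
{
    const unsigned long start_addr = 0x4e97c000UL;
    const unsigned long block_size = 0x4000UL;
    /* First five words from the gdb dump; the remaining in-range words
     * are all 0xffffffff (fully free). */
    const uint32_t blocks[] = { 0xfc3fc3ff, 0xffffffff, 0xffffffff,
                                0x3fffffff, 0xfffffffc };
    for (unsigned w = 0; w < sizeof(blocks) / sizeof(blocks[0]); w++) {
        for (unsigned b = 0; b < 32; b++) {
            if (!(blocks[w] & (1u << b))) {
                unsigned idx = w * 32 + b;
                printf("in-use block %u @ 0x%lx\n", idx,
                       start_addr + idx * block_size);
            }
        }
    }
    return 0;
}

Under those assumptions this prints twelve in-use blocks (matching num_blocks - num_free_blocks = 0x2000 - 0x1ff4 = 12), the last four being indices 126-129: exactly the four-block, 64KB reservation at 0x4eb74000 shown in the log above.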

zhaoqin (Contributor, Author) commented Nov 25, 2016

vmm_heap_reserve_blocks: size=65536 blocks=4 p=0x4e9a4000
vmm_heap_reserve: size=65536 p=0x4e9a4000
os_heap_commit: 57344 bytes @ 0x4e9a5000
...
vmm_heap_reserve_blocks: size=65536 blocks=4 p=0x4eb74000
vmm_heap_reserve: size=65536 p=0x4eb74000
os_heap_commit: 57344 bytes @ 0x4eb75000
...
new_thread_setup: thread 17799, dstack 0x4eb83000 clone record 0x4eb82bf0
(gdb) info threads
  Id   Target Id         Frame 
* 1    process 17788 "pthreads.pthrea" syscall_0_or_1arg () at /home/zhaoqin/Workspace/DynamoRIO/nightly/src/core/arch/arm/arm_shared.asm:87

So it is the memory for thread 17799's dstack that is not freed.
I saw this line, which I did not see in a passing run:

Thread 17799 reached initialization point while dynamo exiting, waiting for app to exit

So basically, if the app is exiting while a newly created thread is still in its initialization phase, the new thread simply waits without freeing its own dstack.
Xref code at: https://github.com/DynamoRIO/dynamorio/blob/master/core/dynamo.c#L2108
It seems that we can only relax the assertion check.
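
A self-contained model of what a relaxed check might look like, assuming the other 8 in-use blocks from the dump are what unfreed_blocks already accounts for (all names here are illustrative, not DynamoRIO source):

#include <assert.h>
#include <stdbool.h>

/* Toy model of relaxing the heap.c:1410 check: additionally tolerate the
 * dstack blocks of threads that reached init while the process was exiting.
 * vmheap_model_t, BLOCKS_PER_DSTACK, and uninit_threads_waiting are all
 * illustrative. */
typedef struct {
    unsigned num_blocks;
    unsigned num_free_blocks;
} vmheap_model_t;

/* One dstack reservation = 64KB = 4 x 16KB blocks, per the log above. */
#define BLOCKS_PER_DSTACK 4

static bool
exit_heap_check(const vmheap_model_t *vmh, unsigned unfreed_blocks,
                unsigned uninit_threads_waiting)
{
    return vmh->num_free_blocks >=
           vmh->num_blocks - unfreed_blocks -
               uninit_threads_waiting * BLOCKS_PER_DSTACK;
}

int
main(void)
{
    /* Numbers from the gdb dump: 8192 blocks, 8180 free; assume the other
     * 8 in-use blocks are covered by unfreed_blocks. */
    vmheap_model_t vmh = { 8192, 8180 };
    assert(!exit_heap_check(&vmh, 8, 0)); /* strict form fails, as seen */
    assert(exit_heap_check(&vmh, 8, 1));  /* relaxed form tolerates the race */
    return 0;
}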

derekbruening (Contributor) commented

So this is a regression? When the thread init code was first written, the dstack was not allocated yet at this exit-check point. We changed it to allocate the dstack at the parent clone point so we could store data on it.
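
For context, here is a rough sketch of that scheme (illustrative types and sizes, not the actual clone_record_t or heap_mmap): the parent allocates the child's dstack before clone and stashes per-thread data near its top, which is why a dstack can already exist, and leak, before the child ever reaches dynamo_thread_init. The clone record @ 0x4eb82bf0 sitting just below dstack 0x4eb83000 in the log above matches this layout.

#include <stdint.h>
#include <stdlib.h>

/* Sketch only: parent-side dstack allocation with the clone record stored
 * at the top of the new stack. Types and sizes are illustrative. */
typedef struct {
    uint64_t app_mc[34]; /* saved app machine context (illustrative) */
    void *dstack_base;   /* lets the child locate and free its own stack */
} clone_record_t;

#define DSTACK_SIZE (56 * 1024) /* matches -stack_size 56K; guard pages omitted */

static void *
parent_make_dstack(clone_record_t **record_out)
{
    char *base = malloc(DSTACK_SIZE); /* stand-in for heap_mmap; no error check */
    char *top = base + DSTACK_SIZE;   /* stacks grow down */
    /* Place the clone record just below the top of the new stack. */
    clone_record_t *rec = (clone_record_t *)(top - sizeof(*rec));
    rec->dstack_base = base;
    *record_out = rec;
    return rec; /* child's initial stack pointer sits at/below the record */
}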

egrimley (Contributor) commented Mar 1, 2017

I can get this test to fail on AArch64, too. With a debug build of 38920e6 on Fedora 24:

while bin64/drrun -code_api -- suite/tests/bin/pthreads.pthreads_exit ; do echo ; done

After a while:

... Internal Error: DynamoRIO debug check failure: .../dynamorio/core/utils.c:968 lock->lock_requests == LOCK_FREE_STATE
(Error occurred @1771 frags)

It's not the same error every time.

egrimley (Contributor) commented Mar 2, 2017

On some AArch64 systems, at least, all of these are flaky:

  Test  #50: code_api|linux.threadexit
  Test  #51: code_api|linux.threadexit2
  Test  #67: code_api|pthreads.pthreads
  Test  #68: code_api|pthreads.pthreads_exit

egrimley changed the title from "Flaky test pthread_exit on ARM" to "Flaky test pthreads_exit on ARM" on Mar 2, 2017
egrimley (Contributor) commented

Here's the stack when the ASSERT in mutex_delete() fails:

#0  mutex_delete (lock=0x557a9e7260) at /.../dynamorio/core/utils.c:969
#1  0x000000557a63c8f8 in dynamo_shared_exit (toexit=0x0)
    at /.../dynamorio/core/dynamo.c:1086
#2  0x000000557a63cca8 in dynamo_process_exit_cleanup ()
    at /.../dynamorio/core/dynamo.c:1274
#3  0x000000557a63cea0 in dynamo_process_exit () at /.../dynamorio/core/dynamo.c:1330
#4  0x000000557a8dbc20 in cat_done_saving_dstack ()
    at /.../dynamorio/core/arch/aarch64/aarch64.asm:322

(gdb) p *lock
$8 = {lock_requests = 0, contended_event = -1, 
  name = 0x557a92beb8 "thread_initexit_lock(mutex)@/.../dynamorio/core/dynamo.c:260", 
  rank = 4, owner = 28134, owning_dcontext = 0x0, prev_owned_lock = 0x557a9e7818, 
  count_times_acquired = 166464950, count_times_contended = 55585188, 
  count_times_spin_pause = 0, max_contended_requests = 2, count_times_spin_only = 21, 
  prev_process_lock = 0x557a9e7890, next_process_lock = 0x557a9e7890, callstack = {0x0, 0x0, 
    0x0, 0x0}, app_lock = false, deleted = true}

derekbruening (Contributor) commented

So thread 28134 still holds the thread_initexit_lock -- what is the callstack of that thread?

egrimley (Contributor) commented

I can't seem to reproduce this now: the failure is the same, but by the time I've managed to connect a debugger, owner is zero. However, I can still give you a backtrace for the other thread when that happens:

#0  dynamorio_syscall () at /.../dynamorio/core/arch/aarch64/aarch64_shared.asm:67
#1  0x0000005581c0de5c in get_sys_thread_id () at /.../dynamorio/core/unix/os.c:2605
#2  0x0000005581c0de9c in get_thread_id () at /.../dynamorio/core/unix/os.c:2620
#3  0x00000055819fe790 in deadlock_avoidance_lock (lock=0x5581d15260, acquired=true, ownable=true)
    at /.../dynamorio/core/utils.c:577
#4  0x00000055819ff3dc in mutex_lock (lock=0x5581d15260)
    at /.../dynamorio/core/utils.c:888
#5  0x000000558196c450 in dynamo_thread_init (dstack_in=0x51b03000 "", mc=0x51b025c0, 
    client_thread=false) at /.../dynamorio/core/dynamo.c:2161
#6  0x0000005581c09828 in new_thread_setup (mc=0x51b025c0)
    at /.../dynamorio/core/arch/x86_code.c:305
#7  0x0000007fa3bb3c3c in ?? ()
#8  0x0000007fa3d03c58 in ?? ()

derekbruening (Contributor) commented

This should have been fixed by #2600's counter, at least when a synchall is done at exit, which includes debug build: thus this assert should never fire anymore, and we can remove the flag added in 45dd931.
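
A rough sketch of that counter idea (illustrative names; not the actual #2600 implementation):

#include <sched.h>
#include <stdatomic.h>

/* Sketch of the #2600-style counter: the parent increments before clone,
 * the child decrements once fully initialized, and the exit-time synchall
 * drains the count before tearing down heap and lock state, so no thread
 * is left mid-initialization holding a dstack. */
static atomic_int uninit_thread_count;

static void parent_pre_clone(void) { atomic_fetch_add(&uninit_thread_count, 1); }
static void child_init_done(void)  { atomic_fetch_sub(&uninit_thread_count, 1); }

static void
exit_synchall_wait(void)
{
    /* A real implementation would also handle threads that exit before
     * initializing; this simply yields until the count drains. */
    while (atomic_load(&uninit_thread_count) > 0)
        sched_yield();
}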

egrimley (Contributor) commented Nov 2, 2017

Both on AArch64 and on x86_64, without DEBUG, the pthreads_exit test sometimes runs for about a second and sometimes for about a minute. The driver script reports the test as having passed, but is it really working as intended?

derekbruening added a commit that referenced this issue Nov 2, 2017
started_detach for the few checks that need pre-synch querying.

Reminder to change to #2601 and to add an xref to #2075: we've now removed
the dynamo_thread_init_during_process_exit flag that was added by 45dd931,
as #2600's counter fixed that race.
derekbruening added a commit that referenced this issue Nov 6, 2017
To handle a thread exiting on attach, adds a timeout to wait_event() and
its corresponding implementations: os_wait_event() on Windows and
ksynch_wait() on UNIX.  Uses the timeout to check whether a thread exited,
and if so, to move on.

Augments the test from #2600 to test this race.

Abandons the api.detach_spawn test on Windows: i#2611 covers fixing the
tricky problems on Windows.  Leaves in place some fixes toward #2611:
+ Fixes a race where we put interception_code hooks in place before marking
  them +x
+ Increases MAX_THREADS_WAITING_FOR_DR_INIT

Fixes clang 32-bit missing __moddi3 by adding it to third_party/libgcc and linking
that into x86 and arm.

To enable adding race checks, moves doing_detach inside the synchall and adds
started_detach for the few checks that need pre-synch querying.

Removes the dynamo_thread_init_during_process_exit flag that was added in
45dd931 for #2075, as the UNIX uninit_thread_count solution from #2600 solves
that problem on its own.

Fixes #2601
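
A minimal sketch of such a timed wait on POSIX (event_t here is illustrative; the real wait_event()/os_wait_event()/ksynch_wait() signatures differ):

#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Sketch only, not DynamoRIO code: an event wait with a timeout so the
 * waiter can notice that the awaited thread has exited. Initialize with
 * { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false }. */
typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t cv;
    bool signaled;
} event_t;

/* Returns true if the event was signaled, false on timeout. */
static bool
event_wait_timeout(event_t *e, int timeout_ms)
{
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }
    pthread_mutex_lock(&e->mu);
    while (!e->signaled) {
        if (pthread_cond_timedwait(&e->cv, &e->mu, &deadline) != 0)
            break; /* ETIMEDOUT: give the caller a chance to re-check */
    }
    bool ok = e->signaled;
    pthread_mutex_unlock(&e->mu);
    return ok;
}

On a false return the caller can check whether the awaited thread still exists and, if it has exited, move on instead of blocking forever.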
fhahn pushed a commit that referenced this issue Dec 4, 2017 (same commit message as above)