HANG delivering signal that interrupted DR at "safe spot" #2805

derekbruening · 2018-01-17T04:10:28Z

As part of 35a6c54 (#2660) for #2659 I added immediate delivery of a signal at a safe spot, in particular THREAD_SYNCH_NO_LOCKS.

However, this includes signals interrupting DR code. A signal can interrupt a safe spot at its tail end during its call to something like dr_mark_safe_to_suspend() which then calls set_synch_state() which grabs the tsd->synch_lock spinlock before setting the tsd->synch_perm back to an unsafe value. At that interruption point, it is unsafe to deliver a signal, because translate_sigcontext() calls translate_mcontext() which tries to grab tsd->synch_lock and then hangs.

On the one hand, it seems unsafe to deliver a signal when DR was interrupted, period.
OTOH, this THREAD_SYNCH_NO_LOCKS is supposed to mean that no locks are held. Is this synch_lock problem limited to a self-interruption? No, it seems like a SIGUSR2 could interrupt another thread and "suspend" it at a similar point where it holds synch_lock, and the synch_with_all_threads code would then hang when it tried to acquire synch_lock.

So the ideal solution is to eliminate this lock being separate from clearing synch_perm, and we can then keep this signal delivery?

The text was updated successfully, but these errors were encountered:

derekbruening · 2018-01-17T15:44:40Z

My proposal is to fix the missing synchronization in thread_synch_state_no_xfer() and thread_synch_check_state(). It's the latter that is used to check whether immediate signal
delivery is ok. By having both of those use a trylock and fail if they can't get it (just like
waiting_at_safe_spot() does today), this should be solved.

    /* We have a wart in the settings here (i#2805): a caller can set
     * THREAD_SYNCH_NO_LOCKS, yet here we're acquiring locks.  In fact if this thread
     * is suspended in between the lock and the unset of synch_perm from
     * THREAD_SYNCH_NO_LOCKS back to THREAD_SYNCH_NONE, it can cause problems.  We
     * have everyone who might query in such a state use a trylock and assume
     * synch_perm is THREAD_SYNCH_NONE if the lock cannot be acquired.
     */

We need to ensure that another thread doing a synch will work here when
suspending this thread while it claims THREAD_SYNCH_NO_LOCKS yet it holds
synch_lock: synch_with_thread() calls os_thread_suspend(),
thread_get_mcontext() (just copies), and then at_safe_spot().
at_safe_spot() calls waiting_at_safe_spot() which uses a trylock, so that
part will fail. It will then call recreate_app_state(), which b/c the pc
is not in fcache or gencode, will fail. So it looks ok.

derekbruening · 2018-01-17T20:57:12Z

The problem with the trylock solution is that synch_lock has a high rank and we hit rank order violations when checking whether we can deliver a signal that interrupts DR while DR holds some other lock:

 <rank order violation synch_lock(mutex)@/home/bruening/dr/git/src/core/synch.c:149 acquired after change_linking_lock(recursive)@/home/bruening/dr/git/src/core/link.c:108 in tid:d24a>

<rank order violation synch_lock(mutex)@/home/bruening/dr/git/src/core/synch.c:149 acquired after HTLOCK_RANK(readwrite)@/home/bruening/dr/git/src/core/hashtablex.h:556 in tid:d2ae>

<rank order violation synch_lock(mutex)@/home/bruening/dr/git/src/core/synch.c:149 acquired after memory_info_buf_lock(mutex)@/home/bruening/dr/git/src/core/unix/memquery_linux.c:71 in tid:d344>

Possible solutions:

***** TODO split up synch_lock? => doesn't seem possible

Uses:

read/write synch_perm
check_wait_at_safe_spot() holds it across exiting and re-entering DR.
this is the key blocker => can't split the lock
write pending_synch_count
read/write set_{,m}context

***** TODO deliver signal only if pc is in client lib, not DR?

What if client is using dr_file_write()?

***** TODO try-lock-no-rank-check?

***** TODO check perm first w/ double-checked locking?

If perm is > THREAD_SYNCH_NO_LOCKS, then other locks shouldn't be held,
only synch_lock.

Remember to put in memory barriers for double-checked locking.

Adds missing synchronization on reads of tsd->synch_perm in thread_synch_state_no_xfer() and thread_synch_check_state(), using a trylock to avoid hanging on a suspended thread that holds tsd->synch_lock. On trylock failure, both routines fail, which then prevents unsafe signal delivery. thread_synch_check_state() further uses double-checked locking to avoid acquiring the lock when interrupting arbitrary DR locks. Adds a test case to client.timer which reproduces the original issue nearly every time for me, and works 1000x in a row with the fix. Fixes #2805

Fixes a Mac build issue in the new client.timer test code. Issue: #2805

derekbruening added OpSys-Linux Priority-High Bug-Hang labels Jan 17, 2018

derekbruening self-assigned this Jan 17, 2018

derekbruening mentioned this issue Jan 17, 2018

i#2805: use trylock to avoid hang #2807

Merged

derekbruening closed this as completed in #2807 Jan 17, 2018

derekbruening added a commit that referenced this issue Jan 18, 2018

i#2805 synch lock hang: fix Mac test build

cedb255

Fixes a Mac build issue in the new client.timer test code. Issue: #2805

derekbruening mentioned this issue Jan 18, 2018

i#2805 synch lock hang: fix Mac test build #2809

Merged

derekbruening added a commit that referenced this issue Jan 18, 2018

i#2805 synch lock hang: fix Mac test build (#2809)

5a64668

Fixes a Mac build issue in the new client.timer test code. Issue: #2805

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HANG delivering signal that interrupted DR at "safe spot" #2805

HANG delivering signal that interrupted DR at "safe spot" #2805

derekbruening commented Jan 17, 2018

derekbruening commented Jan 17, 2018

derekbruening commented Jan 17, 2018

HANG delivering signal that interrupted DR at "safe spot" #2805

HANG delivering signal that interrupted DR at "safe spot" #2805

Comments

derekbruening commented Jan 17, 2018

derekbruening commented Jan 17, 2018

derekbruening commented Jan 17, 2018