Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HANG delivering signal that interrupted DR at "safe spot" #2805

Closed
derekbruening opened this issue Jan 17, 2018 · 2 comments
Closed

HANG delivering signal that interrupted DR at "safe spot" #2805

derekbruening opened this issue Jan 17, 2018 · 2 comments

Comments

@derekbruening
Copy link
Contributor

As part of 35a6c54 (#2660) for #2659 I added immediate delivery of a signal at a safe spot, in particular THREAD_SYNCH_NO_LOCKS.

However, this includes signals interrupting DR code. A signal can interrupt a safe spot at its tail end during its call to something like dr_mark_safe_to_suspend() which then calls set_synch_state() which grabs the tsd->synch_lock spinlock before setting the tsd->synch_perm back to an unsafe value. At that interruption point, it is unsafe to deliver a signal, because translate_sigcontext() calls translate_mcontext() which tries to grab tsd->synch_lock and then hangs.

On the one hand, it seems unsafe to deliver a signal when DR was interrupted, period.
OTOH, this THREAD_SYNCH_NO_LOCKS is supposed to mean that no locks are held. Is this synch_lock problem limited to a self-interruption? No, it seems like a SIGUSR2 could interrupt another thread and "suspend" it at a similar point where it holds synch_lock, and the synch_with_all_threads code would then hang when it tried to acquire synch_lock.

So the ideal solution is to eliminate this lock being separate from clearing synch_perm, and we can then keep this signal delivery?

@derekbruening
Copy link
Contributor Author

My proposal is to fix the missing synchronization in thread_synch_state_no_xfer() and thread_synch_check_state(). It's the latter that is used to check whether immediate signal
delivery is ok. By having both of those use a trylock and fail if they can't get it (just like
waiting_at_safe_spot() does today), this should be solved.

    /* We have a wart in the settings here (i#2805): a caller can set
     * THREAD_SYNCH_NO_LOCKS, yet here we're acquiring locks.  In fact if this thread
     * is suspended in between the lock and the unset of synch_perm from
     * THREAD_SYNCH_NO_LOCKS back to THREAD_SYNCH_NONE, it can cause problems.  We
     * have everyone who might query in such a state use a trylock and assume
     * synch_perm is THREAD_SYNCH_NONE if the lock cannot be acquired.
     */

We need to ensure that another thread doing a synch will work here when
suspending this thread while it claims THREAD_SYNCH_NO_LOCKS yet it holds
synch_lock: synch_with_thread() calls os_thread_suspend(),
thread_get_mcontext() (just copies), and then at_safe_spot().
at_safe_spot() calls waiting_at_safe_spot() which uses a trylock, so that
part will fail. It will then call recreate_app_state(), which b/c the pc
is not in fcache or gencode, will fail. So it looks ok.

@derekbruening
Copy link
Contributor Author

The problem with the trylock solution is that synch_lock has a high rank and we hit rank order violations when checking whether we can deliver a signal that interrupts DR while DR holds some other lock:

 <rank order violation synch_lock(mutex)@/home/bruening/dr/git/src/core/synch.c:149 acquired after change_linking_lock(recursive)@/home/bruening/dr/git/src/core/link.c:108 in tid:d24a>

<rank order violation synch_lock(mutex)@/home/bruening/dr/git/src/core/synch.c:149 acquired after HTLOCK_RANK(readwrite)@/home/bruening/dr/git/src/core/hashtablex.h:556 in tid:d2ae>

<rank order violation synch_lock(mutex)@/home/bruening/dr/git/src/core/synch.c:149 acquired after memory_info_buf_lock(mutex)@/home/bruening/dr/git/src/core/unix/memquery_linux.c:71 in tid:d344>

Possible solutions:

***** TODO split up synch_lock? => doesn't seem possible

Uses:

  • read/write synch_perm
  • check_wait_at_safe_spot() holds it across exiting and re-entering DR.
    this is the key blocker => can't split the lock
  • write pending_synch_count
  • read/write set_{,m}context

***** TODO deliver signal only if pc is in client lib, not DR?

What if client is using dr_file_write()?

***** TODO try-lock-no-rank-check?

***** TODO check perm first w/ double-checked locking?

If perm is > THREAD_SYNCH_NO_LOCKS, then other locks shouldn't be held,
only synch_lock.

Remember to put in memory barriers for double-checked locking.

derekbruening added a commit that referenced this issue Jan 17, 2018
Adds missing synchronization on reads of tsd->synch_perm in
thread_synch_state_no_xfer() and thread_synch_check_state(), using a
trylock to avoid hanging on a suspended thread that holds tsd->synch_lock.
On trylock failure, both routines fail, which then prevents unsafe signal
delivery.  thread_synch_check_state() further uses double-checked locking
to avoid acquiring the lock when interrupting arbitrary DR locks.

Adds a test case to client.timer which reproduces the original issue nearly
every time for me, and works 1000x in a row with the fix.

Fixes #2805
derekbruening added a commit that referenced this issue Jan 18, 2018
Fixes a Mac build issue in the new client.timer test code.

Issue: #2805
derekbruening added a commit that referenced this issue Jan 18, 2018
Fixes a Mac build issue in the new client.timer test code.

Issue: #2805
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant