-
Notifications
You must be signed in to change notification settings - Fork 571
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HANG delivering signal that interrupted DR at "safe spot" #2805
Comments
My proposal is to fix the missing synchronization in thread_synch_state_no_xfer() and thread_synch_check_state(). It's the latter that is used to check whether immediate signal
We need to ensure that another thread doing a synch will work here when |
The problem with the trylock solution is that synch_lock has a high rank and we hit rank order violations when checking whether we can deliver a signal that interrupts DR while DR holds some other lock:
Possible solutions: ***** TODO split up synch_lock? => doesn't seem possible Uses:
***** TODO deliver signal only if pc is in client lib, not DR? What if client is using dr_file_write()? ***** TODO try-lock-no-rank-check? ***** TODO check perm first w/ double-checked locking? If perm is > THREAD_SYNCH_NO_LOCKS, then other locks shouldn't be held, Remember to put in memory barriers for double-checked locking. |
Adds missing synchronization on reads of tsd->synch_perm in thread_synch_state_no_xfer() and thread_synch_check_state(), using a trylock to avoid hanging on a suspended thread that holds tsd->synch_lock. On trylock failure, both routines fail, which then prevents unsafe signal delivery. thread_synch_check_state() further uses double-checked locking to avoid acquiring the lock when interrupting arbitrary DR locks. Adds a test case to client.timer which reproduces the original issue nearly every time for me, and works 1000x in a row with the fix. Fixes #2805
Fixes a Mac build issue in the new client.timer test code. Issue: #2805
Fixes a Mac build issue in the new client.timer test code. Issue: #2805
As part of 35a6c54 (#2660) for #2659 I added immediate delivery of a signal at a safe spot, in particular THREAD_SYNCH_NO_LOCKS.
However, this includes signals interrupting DR code. A signal can interrupt a safe spot at its tail end during its call to something like dr_mark_safe_to_suspend() which then calls set_synch_state() which grabs the tsd->synch_lock spinlock before setting the tsd->synch_perm back to an unsafe value. At that interruption point, it is unsafe to deliver a signal, because translate_sigcontext() calls translate_mcontext() which tries to grab tsd->synch_lock and then hangs.
On the one hand, it seems unsafe to deliver a signal when DR was interrupted, period.
OTOH, this THREAD_SYNCH_NO_LOCKS is supposed to mean that no locks are held. Is this synch_lock problem limited to a self-interruption? No, it seems like a SIGUSR2 could interrupt another thread and "suspend" it at a similar point where it holds synch_lock, and the synch_with_all_threads code would then hang when it tried to acquire synch_lock.
So the ideal solution is to eliminate this lock being separate from clearing synch_perm, and we can then keep this signal delivery?
The text was updated successfully, but these errors were encountered: