Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

trace signal placement in rseq region is wrong due to abort handler redirect #4041

Open
derekbruening opened this issue Jan 22, 2020 · 5 comments · Fixed by #4047 or #5970
Open

trace signal placement in rseq region is wrong due to abort handler redirect #4041

derekbruening opened this issue Jan 22, 2020 · 5 comments · Fixed by #4047 or #5970

Comments

@derekbruening
Copy link
Contributor

Gathering an offline drcachesim trace on the rseq test, for #4019, shows a signal marker too late -- it should be right after the ud2a:

$ bin64/drrun -t drcachesim -offline -- suite/tests/bin/linux.rseq
$ bin64/drrun -t drcachesim -simulator_type view -indir drmemtrace.linux.rseq.123467.3754.dir/
=>
  0x00007f7a6c9e78f3  0f 0b                ud2a
  0x00007f7a6c9e78f5  83 05 4c 38 00 00 01 addl   $0x01, <rel> 0x00007f7a6c9eb148
<marker: kernel xfer to handler>
<marker: timestamp 13223619762485105>
<marker: tid 123467 on core 0>
  0x00007f7a6c9e785c  55                   push   %rbp
<...>
<marker: syscall xfer>
<marker: timestamp 13223619762492787>
<marker: tid 123467 on core 0>
  0x00007f7a6c9e7901  83 05 44 38 00 00 01 addl   $0x01, <rel> 0x00007f7a6c9eb14c
  0x00007f7a6c9e7908  c6 05 41 38 00 00 00 movb   $0x00, <rel> 0x00007f7a6c9eb150

This is caught by the invariants test, which I was hoping to add as a regression test for #4019:

$ bin64/drrun -t drcachesim -offline -- suite/tests/bin/linux.rseq
$ bin64/drrun -t drcachesim -test_mode -indir drmemtrace.linux.rseq.123467.3754.dir/
drcachesim: /home/bruening/dr/git/src/clients/drcachesim/tests/trace_invariants.cpp:143: virtual bool trace_invariants_t::process_memref(const memref_t&): Assertion `memref.instr.tid != pre_signal_instr.instr.tid || memref.instr.addr == pre_signal_instr.instr.addr || memref.instr.addr == pre_signal_instr.instr.addr + pre_signal_instr.instr.size || memref.instr.addr == app_handler_pc || type_is_instr_branch(pre_signal_instr.instr.type) || pre_signal_instr.instr.type == TRACE_TYPE_INSTR_SYSENTER' failed.
Aborted

Looks like the recorded offset is wrong:

[drmemtrace]: Appending 2 instrs in bb 0x7fcb228a48f3 in mod 5 +0x18f3 = /home/bruening/dr/git/build_x64_dbg_tests/suite/tests/bin/linux.rseq
  0x00007fdb501da8f3 ud2a
[drmemtrace]: Checking whether reached signal/exception +0x1901 vs cur +0x18f3
  0x00007fdb501da8f5 add    $0x00000001 <rel> 0x00007fdb501de198[4byte] -> <rel> 0x00007fdb501de198[4byte]
[drmemtrace]: Checking whether reached signal/exception +0x1901 vs cur +0x18f5
[drmemtrace]: Filling in elided rip-rel addr with 0x7fdb501de198
[drmemtrace]: Appended memref type 0 size 4 to 0x7fdb501de198
[drmemtrace]: Checking whether reached signal/exception +0x1901 vs cur +0x18fc
[drmemtrace]: Filling in elided rip-rel addr with 0x7fdb501de198
[drmemtrace]: Appended memref type 1 size 4 to 0x7fdb501de198
[drmemtrace]: Checking whether reached signal/exception +0x1901 vs cur +0x18fc
[drmemtrace]: Signal/exception between bbs
[drmemtrace]: Appended marker type 0 value 0x1901

Ah here we go:

recreate_app : looking for 0x00007fdb103cf90b in frag @ 0x00007fdb103cf8d4 (tag 0x00007fdb501da8f3)
recreate_app -- found valid state pc 0x00007fdb501da8f3
        translation 0x00007fdb501da8f3 is post-walk 0x0000000000000000 so not fixing xsp
recreate_app: moving 0x00007fdb501da8f3 inside rseq region to handler 0x00007fdb501da901
recreate_app -- found ok pc 0x00007fdb501da901

Hmm, how can we solve that? The kernel does not record the actual signal
PC (unlike rseq v1), so DR does not either, and thus the tracer does not
know when the signal was triggered.

@derekbruening
Copy link
Contributor Author

Update: the invariant test is actually not complaining about the ud2a: it has no logic where it knows that instruction raises a signal. Rather, it is complaining about the signal handler returning to the abort handler instead of the interruption point:

::123467.123467::  @0x7f7a6f37fd87 instr x2
::123467:123467::  @0x7f7a6f37fd87 instr x2
[file_reader] Thread #0 timestamp 0x002efacfe696777b
[file_reader] Next thread in timestamp order is #0 @0x002efacfe696777b
::123467.123467:: marker type 2 value 13223619762485115
::123467.123467:: marker type 3 value 0
::123467.123467:: marker type 1 value 224649
::123467:123467:: marker type 1 value 224649
[file_reader] Thread #0 timestamp 0x002efacfe6969573
[file_reader] Next thread in timestamp order is #0 @0x002efacfe6969573
::123467.123467:: marker type 2 value 13223619762492787
::123467.123467:: marker type 3 value 0
::123467.123467::  @0x7f7a6c9e7901 instr x7
::123467:123467::  @0x7f7a6c9e7901 instr x7
drcachesim: /home/bruening/dr/git/src/clients/drcachesim/tests/trace_invariants.cpp:143: virtual bool trace_invariants_t::process_memref(const memref_t&): Assertion `memref.instr.tid != pre_signal_instr.instr.tid || memref.instr.addr == pre_signal_instr.instr.addr || memref.instr.addr == pre_signal_instr.instr.addr + pre_signal_instr.instr.size || memref.instr.addr == app_handler_pc || type_is_instr_branch(pre_signal_instr.instr.type) || pre_signal_instr.instr.type == TRACE_TYPE_INSTR_SYSENTER' failed.

That raises another question: should DR raise yet another kernel xfer event for that, since it is surprising? Though the signal handler could longjmp anywhere (this invariant test assumes an app that doesn't do that).

derekbruening added a commit that referenced this issue Jan 22, 2020
When a migration or context switch happens during rseq native
execution, we now raise a kernel xfer event.  The event is of a new
type DR_XFER_RSEQ_ABORT.

To implement this, the native abort handler cannot be linked and must
return to dispatch.  The special-exit-reason feature is used for this
purpose.

Adds a test.  To force a migration we use a system call, which we do
not normally allow inside an rseq region.  I added a debug-build
exception for this particular test by executable name, along with a
syscall discovery workaround for the attach test.

Adds a client via static DR to api.rseq to test that the event is
raised.

Adds handling to drmemtrace in the tracer and raw2trace.  For
raw2trace we walk backward to undo the committing store that was
recorded, since a real rseq abort would happen before the final store.
I would like to add on offline trace rseq regression test, but it hits

Issue; #2350, #4019, #4041
Fixes #4019
@derekbruening
Copy link
Contributor Author

My proposal is to keep the 1st translation around somewhere and pass it to the kernel xfer event, but not the signal event (since that should match the kernel). Then memtrace can use the kernel xfer event source to place the signal marker.

I would also add logic to the trace_invariants that ud2a should be immediately followed by a signal marker.

This issue covers including a regression test for #4019, since it's blocking doing so and they both involve an rseq memtrace. I don't think that will require any extra work: once an rseq abort is present, the existing marker checks should cover it.

derekbruening added a commit that referenced this issue Jan 22, 2020
When a migration or context switch happens during rseq native
execution, we now raise a kernel xfer event.  The event is of a new
type DR_XFER_RSEQ_ABORT.

To implement this, the native abort handler cannot be linked and must
return to dispatch.  The special-exit-reason feature is used for this
purpose.

Adds a test.  To force a migration we use a system call, which we do
not normally allow inside an rseq region.  I added a debug-build
exception for this particular test by executable name, along with a
syscall discovery workaround for the attach test.

Adds a client via static DR to api.rseq to test that the event is
raised.

Adds handling to drmemtrace in the tracer and raw2trace.  For
raw2trace we walk backward to undo the committing store that was
recorded, since a real rseq abort would happen before the final store.
I would like to add on offline trace rseq regression test, but it hits
#4041 and so the test will be added as part of that issue.

Issue; #2350, #4019, #4041
Fixes #4019
derekbruening added a commit that referenced this issue Jan 23, 2020
When a signal occurs inside an rseq region, DR now provides the actual
interruption PC to the kernel xfer event (but not the signal event, to
match kernel behavior).  This is required to accurately place the
signal in a drmemtrace offline trace.

What this looks like:
Before:
      0x00007fe36b1ea8f1  74 02                jz     $0x00007fe36b1ea8f5
      0x00007fe36b1ea8f3  0f 0b                ud2a
      0x00007fe36b1ea8f5  83 05 74 38 00 00 01 addl   $0x01, <rel> 0x00007fe36b1ee170
    <marker: kernel xfer to handler>
    <marker: timestamp 13224196441528495>
    <marker: tid 74400 on core 0>
      0x00007fe36b1ea85c  55                   push   %rbp
After:
      0x00007f15e71d78f1  74 02                jz     $0x00007f15e71d78f5
      0x00007f15e71d78f3  0f 0b                ud2a
    <marker: kernel xfer to handler>
    <marker: timestamp 13224196672834337>
    <marker: tid 78727 on core 1>
      0x00007f15e71d785c  55                   push   %rbp

Adds a trace_invariants check for a signal immediately after UD2A.  I
first went through and added module_mapper and decoding to
trace_invariants: but then we have to link in static DR (even if we do
our own raw decoding as module_mapper uses DR) and thus we can't run
the test anymore on AArch64 (#2007) or Mac.  Instead, I went with a
simpler solution of putting the annotations that signal_invariants.c
uses into rseq.c.

I had to fix 2 bugs to get the marker in the right place and to get
this invariants check to fire:

1) raw2trace was adding to cur_modoffs before memrefs and then having
  an extra check for where to put the the signal, which doesn't make
  sense.  It ended up putting the marker too early.  After fixing, the
  other tests still pass, so it is not clear why I had it that way.

2) op_offline was off inside trace_invariants, so the assert on a
  signal after ud2a was not firing.  I now have analyzer_multi setting
  op_offline up front as a synthetic option for post-processing usage.

I also added a direct jump as an annotation as the start of the abort
handler in rseq.c to avoid trace_invariants complaining about a signal not
returning to the interruption point.

I added a drcachesim offline regression test tracing the rseq test's
code.  This same test will be separately used for #4038.

Issue: #4038, #4041
Fixes #4041
derekbruening added a commit that referenced this issue Jan 23, 2020
When a signal occurs inside an rseq region, DR now provides the actual
interruption PC to the kernel xfer event (but not the signal event, to
match kernel behavior).  This is required to accurately place the
signal in a drmemtrace offline trace.

What this looks like:
Before:
      0x00007fe36b1ea8f1  74 02                jz     $0x00007fe36b1ea8f5
      0x00007fe36b1ea8f3  0f 0b                ud2a
      0x00007fe36b1ea8f5  83 05 74 38 00 00 01 addl   $0x01, <rel> 0x00007fe36b1ee170
    <marker: kernel xfer to handler>
    <marker: timestamp 13224196441528495>
    <marker: tid 74400 on core 0>
      0x00007fe36b1ea85c  55                   push   %rbp
After:
      0x00007f15e71d78f1  74 02                jz     $0x00007f15e71d78f5
      0x00007f15e71d78f3  0f 0b                ud2a
    <marker: kernel xfer to handler>
    <marker: timestamp 13224196672834337>
    <marker: tid 78727 on core 1>
      0x00007f15e71d785c  55                   push   %rbp

Adds a trace_invariants check for a signal immediately after UD2A.  I
first went through and added module_mapper and decoding to
trace_invariants: but then we have to link in static DR (even if we do
our own raw decoding as module_mapper uses DR) and thus we can't run
the test anymore on AArch64 (#2007) or Mac.  Instead, I went with a
simpler solution of putting the annotations that signal_invariants.c
uses into rseq.c.

I had to fix 2 bugs to get the marker in the right place and to get
this invariants check to fire:

1) raw2trace was adding to cur_modoffs before memrefs and then having
  an extra check for where to put the the signal, which doesn't make
  sense.  It ended up putting the marker too early.  After fixing, the
  other tests still pass, so it is not clear why I had it that way.

2) op_offline was off inside trace_invariants, so the assert on a
  signal after ud2a was not firing.  I now have analyzer_multi setting
  op_offline up front as a synthetic option for post-processing usage.

I also added a direct jump as an annotation as the start of the abort
handler in rseq.c to avoid trace_invariants complaining about a signal not
returning to the interruption point.

I added a drcachesim offline regression test tracing the rseq test's
code.  This same test will be separately used for #4038.

Issue: #4038, #4041
Fixes #4041
derekbruening added a commit that referenced this issue Jan 23, 2020
Makes the following changes to ensure the drcachesim rseq test from

+ Confirms that the gap is found by trace_invariants.  It wasn't in
  HEAD: here I fix a bug to only honor pref_xfer_marker if it was not
  cleared.

+ Adds ending of a bb right after an rseq committing store, to
  simplify the native-rseq-mangled bb and drmemtrace's handling:
  raw2trace wants to include the subsequent instructions in the bb;
  they are removed by #4038's rollback but it seems cleaner to just
  not have them there in the bb in the first place.

Issue: #4038, #4041
derekbruening added a commit that referenced this issue Jan 24, 2020
When a signal occurs inside an rseq region, DR now provides the actual
interruption PC to the kernel xfer event (but not the signal event, to
match kernel behavior).  This is required to accurately place the
signal in a drmemtrace offline trace.

What this looks like:
Before:
      0x00007fe36b1ea8f1  74 02                jz     $0x00007fe36b1ea8f5
      0x00007fe36b1ea8f3  0f 0b                ud2a
      0x00007fe36b1ea8f5  83 05 74 38 00 00 01 addl   $0x01, <rel> 0x00007fe36b1ee170
    <marker: kernel xfer to handler>
    <marker: timestamp 13224196441528495>
    <marker: tid 74400 on core 0>
      0x00007fe36b1ea85c  55                   push   %rbp
After:
      0x00007f15e71d78f1  74 02                jz     $0x00007f15e71d78f5
      0x00007f15e71d78f3  0f 0b                ud2a
    <marker: kernel xfer to handler>
    <marker: timestamp 13224196672834337>
    <marker: tid 78727 on core 1>
      0x00007f15e71d785c  55                   push   %rbp

Adds a trace_invariants check for a signal immediately after UD2A.  I
first went through and added module_mapper and decoding to
trace_invariants: but then we have to link in static DR (even if we do
our own raw decoding as module_mapper uses DR) and thus we can't run
the test anymore on AArch64 (#2007) or Mac.  Instead, I went with a
simpler solution of putting the annotations that signal_invariants.c
uses into rseq.c.

I had to fix 2 bugs to get the marker in the right place and to get
this invariants check to fire:

1) raw2trace was adding to cur_modoffs before memrefs and then having
  an extra check for where to put the the signal, which doesn't make
  sense.  It ended up putting the marker too early.  After fixing, the
  other tests still pass, so it is not clear why I had it that way.

2) op_offline was off inside trace_invariants, so the assert on a
  signal after ud2a was not firing.  I now have analyzer_multi setting
  op_offline up front as a synthetic option for post-processing usage.

I also added a direct jump as an annotation as the start of the abort
handler in rseq.c to avoid trace_invariants complaining about a signal not
returning to the interruption point.

I added a drcachesim offline regression test tracing the rseq test's
code.  This same test will be separately used for #4019.

Issue: #4019, #4041
Fixes #4041
derekbruening added a commit that referenced this issue Jan 25, 2020
Makes the following changes to ensure the drcachesim rseq test from

+ Confirms that the gap is found by trace_invariants.  It wasn't in
  HEAD: here I fix a bug to only honor pref_xfer_marker if it was not
  cleared.

+ Adds ending of a bb right after an rseq committing store, to
  simplify the native-rseq-mangled bb and drmemtrace's handling:
  raw2trace wants to include the subsequent instructions in the bb;
  they are removed by #4019's rollback but it seems cleaner to just
  not have them there in the bb in the first place.

Issue: #4019, #4041
@derekbruening
Copy link
Contributor Author

Note that the work here solved this problem for offline traces only, not for online: probably we should reopen for online.

@derekbruening
Copy link
Contributor Author

Re-opening as the solution here is not ideal:

  1. The mid-region PC used is not computed "honestly": it's the closest app translation to the kernel's passed PC which is the native rseq handler PC in the fragment (I'm surprised it results in a valid translation: isn't it past the committing store?)

  2. The signal marker's indicated interruption PC should match the handler resumption PC for a sigreturn that doesn't mess with the context.

So the new proposal is to revert what was done here, and then:

@derekbruening derekbruening reopened this Apr 10, 2023
derekbruening added a commit that referenced this issue Apr 11, 2023
With our run-twice rseq solution, we first run an instrumented path
through the rseq region which goes into the recorded drmemtrace trace.
But the native rseq execution might take a side exit or abort.  We
already had code to try to roll back the committing store on an abort,
but it was fragile as a timestamp in between thwarted it.  We had no
code to make a side exit match and as a result we'd see PC
discontinuities.

Here, we solve both problems with the following approach:

First, we have DR add a new DR_NOTE_RSEQ_ENTRY label to the top of the
first block in an rseq region.  This label's data stores the end PC
and the abort handler PC (we do not currently use the latter in
drmemtrace: once we revert #4041 we may need to do so).

Next, the tracer turns that label into a new
TRACE_MARKER_TYPE_RSEQ_ENTRY marker with the end PC as the marker
value.

Finally, rawtrace uses the new label to fix up the trace.  When it
sees the entry marker, it starts buffering instead of writing out
records.  While buffering, it remembers the decode PC of each
insruction (for encoding entries), plus the PC + target of each
branch.  When it sees the rseq end PC, it sets a flag, and on the next
instruction it has the rseq exit PC.  If it sees an rseq abort marker
or a signal marker with the flag set, it uses the interrupted PC as
the rseq exit PC (this avoids having to wait for the other side of the
signal handler).

With this rseq exit PC, raw2trace takes the following actions
depending on the PC value:
+ End PC fall-through: completed; emit entire buffer.
+ Committing store PC: roll back final instr in buffer.
  (Once #4041 reverts this to the handler PC we'll use the handler PC here
  and will add a new marker to hold that data.)
+ Target of a branch in the buffer: roll back to that branch.
  If multiple: take most recent.
+ None of the above: there cannot be an indirect branch we observed so
  we assume this is an inverted cbr.  If there is a forward direct cbr
  which was taken and stayed in the region, we synthesize a direct
  jump in the gap it jumped over.  This is not perfect but is the
  best we can readily do.

We leave the old rollback method (which didn't handle a timestamp in
between) in place but it is only enabled if we see an rseq abort
marker without the new rseq entry marker first.  We decided to not
bump the trace version nor add a new filetype to indicate the presence
of rseq entry markers as we can support legacy traces this way (at
least as well as they were supported before).

We give up on identifying which of multiple indirect rseq side exits were taken.
We give up on distinguishing a side exit whose target is the end PC fall-through from a
completion.

For emitting the accumulated rseq buffer, the delayed branch code
added for #5941 will handle chunk boundaries.

A side improvements along the way: we add a branch target pc to
instr_summary_t and avoided increasing its space by removing the
previously-stored next_pc and computing it from the length.

Adds a number of unit tests for rseq rollback and side exits, along
with a more-real instance in the linux.rseq test.  This was also
tested on a large proprietary application with multiple real instances
of rseq aborts and side exits.

Issue: #5953, #5953, #4041
Fixes #5953
Fixes #5954
derekbruening added a commit that referenced this issue Apr 12, 2023
The Linux kernel presents the rseq abort handler PC as the interrupted
(really, the resume point) for a signal generated while inside an rseq
region.  Yet DR was modifying that to instead show clients the rseq
committing store PC.  Similarly, DR presented that commit PC for its
rseq abort event if an rseq region aborted through some other
non-signal means like a thread migration.  Since this is a little
misleading and doesn't match what the app sees, and results in the
signal interrutped PC not matching the later resume PC, we change that
behavior here to have both signal and rseq events use the handler PC.
There is no way to obtain the precise interruption PC from the kernel
so we follow that lead and do not try to present it from DR.

Furthemore, DR was not previously generating an rseq abort event if a
signal was the cause of the abort.  We change that here, raising the
event prior to the signal event.

Adds a new drmemtrace label TRACE_MARKER_TYPE_RSEQ_HANDLER to
communicate the handler PC to raw2trace.  Updates raw2trace to use
this to detect an rseq abort.

Adds a new test case to raw2trace_unit_tests.

Updates the api.rseq test.

Fixes #4041
derekbruening added a commit that referenced this issue Apr 12, 2023
With our run-twice rseq solution, we first run an instrumented path
through the rseq region which goes into the recorded drmemtrace trace.
But the native rseq execution might take a side exit or abort. We
already had code to try to roll back the committing store on an abort,
but it was fragile as a timestamp in between thwarted it. We had no code
to make a side exit match and as a result we'd see PC discontinuities.

Here, we solve both problems with the following approach:

First, we have DR add a new DR_NOTE_RSEQ_ENTRY label to the top of the
first block in an rseq region. This label's data stores the end PC and
the abort handler PC (we do not currently use the latter in drmemtrace:
once we revert #4041 we may need to do so).

Next, the tracer turns that label into a new
TRACE_MARKER_TYPE_RSEQ_ENTRY marker with the end PC as the marker value.

Finally, rawtrace uses the new label to fix up the trace. When it sees
the entry marker, it starts buffering instead of writing out records.
While buffering, it remembers the decode PC of each insruction (for
encoding entries), plus the PC + target of each branch. When it sees the
rseq end PC, it sets a flag, and on the next instruction it has the rseq
exit PC. If it sees an rseq abort marker or a signal marker with the
flag set, it uses the interrupted PC as the rseq exit PC (this avoids
having to wait for the other side of the signal handler).

With this rseq exit PC, raw2trace takes the following actions depending
on the PC value:
+ End PC fall-through: completed; emit entire buffer.
+ Committing store PC: roll back final instr in buffer. (Once #4041
reverts this to the handler PC we'll use the handler PC here and will
add a new marker to hold that data.)
+ Target of a branch in the buffer: roll back to that branch. If
multiple: take most recent.
+ None of the above: there cannot be an indirect branch we observed so
we assume this is an inverted cbr. If there is a forward direct cbr
which was taken and stayed in the region, we synthesize a direct jump in
the gap it jumped over. This is not perfect but is the best we can
readily do.

We leave the old rollback method (which didn't handle a timestamp in
between) in place but it is only enabled if we see an rseq abort marker
without the new rseq entry marker first. We decided to not bump the
trace version nor add a new filetype to indicate the presence of rseq
entry markers as we can support legacy traces this way (at least as well
as they were supported before).

We give up on identifying which of multiple indirect rseq side exits
were taken. We give up on distinguishing a side exit whose target is the
end PC fall-through from a completion.

For emitting the accumulated rseq buffer, the delayed branch code added
for #5941 will handle chunk boundaries.

A side improvements along the way: we add a branch target pc to
instr_summary_t and avoided increasing its space by removing the
previously-stored next_pc and computing it from the length.

Adds a number of unit tests for rseq rollback and side exits, along with
a more-real instance in the linux.rseq test. This was also tested on a
large proprietary application with multiple real instances of rseq
aborts and side exits.

Issue: #5953, #5953, #4041
Fixes #5953
Fixes #5954
derekbruening added a commit that referenced this issue Apr 13, 2023
The Linux kernel presents the rseq abort handler PC as the interrupted
(really, the resume point) for a signal generated while inside an rseq
region. Yet DR was modifying that to instead show clients the rseq
committing store PC. Similarly, DR presented that commit PC for its rseq
abort event if an rseq region aborted through some other non-signal
means like a thread migration. Since this is a little misleading and
doesn't match what the app sees, and results in the signal interrupted
PC not matching the later resume PC, we change that behavior here to
have both signal and rseq events use the handler PC. There is no way to
obtain the precise interruption PC from the kernel so we follow that
lead and do not try to present it from DR.

Furthemore, DR was not previously generating an rseq abort event if a
signal was the cause of the abort. We change that here, raising the
event prior to the signal event.

Updates raw2trace to use the handler PC that is now in the rseq
abort marker to detect an rseq abort, rather than looking for the
commit PC.

Adds a new test case to raw2trace_unit_tests and updates others.

Updates the api.rseq test.

Fixes #4041
@derekbruening derekbruening reopened this Apr 13, 2023
@derekbruening
Copy link
Contributor Author

We have a problem: a fault in the instrumented rseq run is no longer placed correctly!

Trace invariant failure in T1724466 at ref # 94702: Interruption marker mis-placed

       94701       92472:     1724466 ifetch       4 byte(s) @ 0x0000ffffb58eea58 00000000   udf    $0x0000
       94702       92473:     1724466 ifetch       4 byte(s) @ 0x0000ffffb58eea5c f94000c1   ldr    (%x6)[8byte] -> %x1
       94703       92474:     1724466 ifetch       4 byte(s) @ 0x0000ffffb58eea60 91000421   add    %x1 $0x0001 lsl $0x00 -> %x1
       94704       92474:     1724466 <marker: rseq abort xfer to handler 0xffffb58eea70

The rseq test raises SIGILL the first time through that code (then in the rseq abort
handler it changes a conditional and avoids the udf the 2nd time). But raw2trace put
all instrs from the block there and can't figure out where the abort or signal should go
b/c their PC is the handler. This is a SIGILL in the instrumented execution so before
i#4041 we actually did have the precise interrupted PC here! (Passed on x86 b/c there
was just one instr after the sigill and we rolled that one back; arm needs more instrs
so 1-instr rollback not enough.)

But why did this pass in PR #5970 and is only failing now on the filter PR #5971? Unclear.

The proposed solution is to have the rseq abort event contain the precise interrupted PC
for a signal in the instrumented execution (but keep the signal event with the handler).

derekbruening added a commit that referenced this issue Apr 13, 2023
Fixes problems with the new rseq adjustment code with filtered traces:
+ Load and update the buffer pointer when emitting the rseq entry label
+ Do not try to adjust for i-filtered

Adds 2 new tests for filtering the linux.rseq app.

The new tests revealed a problem with the instr counts in the schedule
file: raw2trace was counting the PC-only entries as instructions; we fix that
here.

Reverts the rseq abort event and marker value back to the interrupted
PC for a signal in the instrumented execution.  Leaves all other cases
(all aborts in the native execution) as the handler.  This lets us
properly place the abort+signal for the instrumented interruption
case.
    
Adjusts the invariant checker abort check to use the new end PC for a
more accurate test to avoid false positives and negatives.

Fixes #5953, #4041
derekbruening added a commit that referenced this issue Apr 15, 2023
Fixes 3 more problems with rseq in DR and in drmemtraces:

1) Side exit in instrumented run: avoids unbounded buffer growth by
detecting by looking at the region bounds.  Adds a buffer large size
sanity check.  Adds a test to linux.rseq.

2) Try to handle double-abort: couldn't repro in test though so the
fix of disabling a later signal from reporting an abort when we report
an abort fragment exit is not fully vetted.

3) Asynch signal in instrumented run: this was not being adjusted so
the interruption/continuation PC is the abort handler.  Adds a
test to linux.rseq.

Adds more invariant checks around rseq region exits.

Issue: #5953, #4041, #5954
derekbruening added a commit that referenced this issue Apr 17, 2023
Fixes 3 more problems with rseq in DR and in drmemtraces related to
exits and signals:

1) Side exit in instrumented run: avoids unbounded buffer growth by
detecting by looking at the region bounds. Adds a buffer large size
sanity check. Adds a test to linux.rseq.

2) Try to handle double-abort: couldn't repro in test though so the fix
of disabling a later signal from reporting an abort when we report an
abort fragment exit is not fully vetted.

3) Asynch signal in instrumented run: this was not being adjusted so the
interruption/continuation PC is the abort handler. Adds a test to
linux.rseq.

Adds more invariant checks around rseq region exits.

Issue: #5953, #4041, #5954, #5980
Fixes #5980
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment