support restartable regions for per-cpu critical regions #2350

Closed
derekbruening opened this issue Apr 14, 2017 · 8 comments

Comments

@derekbruening
Contributor

This issue covers support for Linux kernel extensions for restartable regions for per-cpu critical regions. Xref https://lwn.net/Articles/649288/

@derekbruening
Contributor Author

This is finally coming to the official kernel: torvalds/linux@d82991a

@derekbruening
Contributor Author

The best approach that we've come up with to support instrumentation of restartable regions is to run them twice. First, run them instrumented in the code cache, with their memory stores removed (the removed stores are still instrumented). Then, at the committing store point, invoke the sequence from its start point natively. This requires some assumptions for identifying the committing store (this actually gets easier with the implementation being pushed to mainline Linux), and mixing in native execution is not ideal. It is, however, much faster than the second-best approach: serializing, with a global lock, all restartable sequences that operate on the same data structures (which requires app knowledge, or else lumping all sequences together). That approach also assumes that the current cpu is not read twice during the sequence.
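
To make the run-twice idea concrete, here is a toy model in Python. It is only a sketch: the `Instr` type, the function names, and the dictionary standing in for memory are all invented for illustration and do not correspond to DynamoRIO code.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    kind: str          # "load", "store", or "alu"
    addr: str = ""     # memory location name, for loads and stores
    value: int = 0     # value written by a store

def run_instrumented(seq, memory, trace):
    """First pass: every instruction is observed, but no store is performed."""
    for ins in seq:
        # Instrumentation still sees the store instruction...
        trace.append((ins.kind, ins.addr))
        # ...but the store itself is skipped, so memory stays untouched.

def run_native(seq, memory):
    """Second pass: execute the sequence for real, stores included."""
    for ins in seq:
        if ins.kind == "store":
            memory[ins.addr] = ins.value

def run_twice(seq, memory):
    trace = []
    run_instrumented(seq, memory, trace)
    # At the committing store point, restart from the sequence's start point.
    run_native(seq, memory)
    return trace

memory = {"percpu_counter": 0}
seq = [Instr("load", "percpu_counter"), Instr("alu"),
       Instr("store", "percpu_counter", 1)]
trace = run_twice(seq, memory)
```

The tool observes all three instructions exactly once, while the only stores that reach memory come from the native pass, which the kernel can still abort and restart as usual.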

@derekbruening
Contributor Author

Xref the #1698 load-store exclusive problem, which at a high level is a similar type of platform issue: it complicates instrumentation by restricting what can happen within a PC range.

@Carrotman42
Contributor

I wrote a doc as a summary of the current state of the world: https://github.com/DynamoRIO/dynamorio/wiki/Restartable-Sequences

@compudj

compudj commented Jun 9, 2019

Please refer to the two following upstream Linux selftests commits, which are relevant for DR. They replace the __rseq_table with a more complete alternative:

commit 4fe2088e164d2ec44530fe2840f6be5906fbc650
Author: Mathieu Desnoyers [email protected]
Date: Mon Apr 29 11:27:53 2019 -0400

rseq/selftests: Add __rseq_exit_point_array section for debuggers

Knowing all exit points is useful to assist debuggers stepping over the
rseq critical sections without requiring them to disassemble the content
of the critical section to figure out the exit points.

commit a3e3131f94aa1daeb978ed66d0b4e61156ef2c2a
Author: Mathieu Desnoyers [email protected]
Date: Mon Apr 29 11:27:54 2019 -0400

rseq/selftests: Introduce __rseq_cs_ptr_array, rename __rseq_table to __rseq_cs

The entries within __rseq_table are aligned on 32 bytes due to
linux/rseq.h struct rseq_cs uapi requirements, but the start of the
__rseq_table section is not guaranteed to be 32-byte aligned. It can
cause padding to be added at the start of the section, which makes it
hard to use as an array of items by debuggers.

Considering that __rseq_table does not really consist of a table due to
the presence of padding, rename this section to __rseq_cs.

Create a new __rseq_cs_ptr_array section which contains 64-bit packed
pointers to entries within the __rseq_cs section.
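
For readers who want to see how a tool might consume these sections, the following Python sketch walks a `__rseq_cs_ptr_array` and decodes each `struct rseq_cs` it points to. The field layout follows the uapi header `linux/rseq.h`; the helper names, the toy image, and the `image_base` value are assumptions made up for the example.

```python
import struct

# struct rseq_cs from linux/rseq.h: u32 version, u32 flags,
# u64 start_ip, u64 post_commit_offset, u64 abort_ip (32 bytes in total).
RSEQ_CS = struct.Struct("<IIQQQ")

def parse_rseq_cs(image, offset):
    """Decode one rseq_cs entry located at the given offset in the image."""
    version, flags, start_ip, post_commit_offset, abort_ip = \
        RSEQ_CS.unpack_from(image, offset)
    return {"version": version, "flags": flags, "start_ip": start_ip,
            "post_commit_offset": post_commit_offset, "abort_ip": abort_ip}

def walk_ptr_array(ptr_section, image, image_base):
    """Follow each packed 64-bit pointer to its rseq_cs entry."""
    entries = []
    for (ptr,) in struct.iter_unpack("<Q", ptr_section):
        entries.append(parse_rseq_cs(image, ptr - image_base))
    return entries

# Toy image: 32 bytes of leading padding, then two descriptors.
image_base = 0x1000
image = bytearray(96)
RSEQ_CS.pack_into(image, 32, 0, 0, 0x2000, 0x18, 0x2040)
RSEQ_CS.pack_into(image, 64, 0, 0, 0x3000, 0x20, 0x3080)
ptr_section = struct.pack("<QQ", image_base + 32, image_base + 64)
entries = walk_ptr_array(ptr_section, image, image_base)
```

Because the pointer array is densely packed, the reader never has to guess how much padding precedes the first descriptor, which is the problem the commit message describes for the old `__rseq_table` section.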

derekbruening added a commit that referenced this issue Jul 16, 2019
For the application module, we use the application path obtained from
early injection or /proc/self/exe on Linux, rather than
/proc/self/maps comments.  The maps comments can be unreliable in the
face of anonymous or deleted-file mremaps used for hugepage backing
and other features.

Adds a test case to the existing "burst_maps" test.

Having the right module full_path helps many cases, including the
forthcoming restartable sequences ("rseq") support for #2350.

Issue: #2566, #2350
derekbruening added a commit that referenced this issue Jul 19, 2019
Adds initial handling for the restartable sequence ("rseq") feature
that is now in the mainline Linux kernel.

We identify rseq regions by looking for ELF sections with established
names according to upstream conventions.  Unfortunately this requires
going to disk for most libraries, so we avoid this for
full-control-mode if we have never seen an rseq system call, and for
attach if no thread has registered for rseq.

For blocks inside rseq regions, mangling removes all memory stores.
For the final commit instruction, we append a native call back to the
abort handler.  We assume this extra frame is ok, and we require the
rseq sequence to end in a return.  Future work will improve these
assumptions.

Updates the 3 Linux syscall lists up through SYS_rseq.

Adds 3 RSTATS for rseq operation.

Documents the current limitations on rseq region support:
- The application must store an rseq_cs struct for each rseq region in a
  section of its binary with an established name.
- Each rseq region's code must never be also executed as a non-restartable sequence.
- Each rseq region must make forward progress if its abort handler is always
  called the first time it is executed.
- Each memory store instruction inside an rseq region must have no other side
  effects.
- Each rseq region must end with a return instruction, and each abort handler
  plus rseq code must combine into a callee following normal call-return
  semantics.
- Any helper function called from within an rseq region must have no side effects.

Adds two tests for x86_64 Linux, one for full control and one for
attach.  However, these require a 4.18+ kernel and so are not
exercised by Travis.  The Fedora CDash machine does have 4.18 so we do
have some automated coverage.

Once this is in place, the old and now obsolete rseq support will be removed.

Issue: #2350
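
As a rough illustration of the mangling rule described in this commit message, the Python sketch below classifies the instructions of a block against one rseq region. It assumes the committing store is the last instruction of the region (the one ending at `start_ip + post_commit_offset`); the `Region` type, the `plan_mangling` helper, and the action strings are hypothetical and are not DynamoRIO APIs.

```python
from typing import NamedTuple

class Region(NamedTuple):
    start_ip: int
    post_commit_offset: int
    abort_ip: int

    @property
    def end_ip(self):
        return self.start_ip + self.post_commit_offset

    def contains(self, pc):
        return self.start_ip <= pc < self.end_ip

def plan_mangling(instrs, region):
    """instrs: list of (pc, length, is_store). Returns one action per instruction."""
    actions = []
    for pc, length, is_store in instrs:
        if not region.contains(pc):
            actions.append("unchanged")
        elif pc + length == region.end_ip:
            # Final instruction of the region: the committing store.
            actions.append("commit: run native pass")
        elif is_store:
            actions.append("drop store")
        else:
            actions.append("unchanged")
    return actions

region = Region(start_ip=0x100, post_commit_offset=0x10, abort_ip=0x200)
instrs = [(0xF8, 8, True), (0x100, 4, False), (0x104, 4, True),
          (0x108, 8, True), (0x110, 4, True)]
actions = plan_mangling(instrs, region)
```

Stores outside the region are left alone, which is why the listed limitations require that region code never also runs as an ordinary, non-restartable sequence.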
derekbruening added a commit that referenced this issue Jul 19, 2019
Reverts the now-obsolete run-native approach for an older version of
the restartable sequence ("rseq") feature. That version never made it
to the mainline kernel, and the run-native approach failed to allow
tools to see rseq code.  Reverts most of commits cda88be and 0935136.

Issue: #2350
derekbruening added a commit that referenced this issue Jul 20, 2019
Fixes the lazy rseq support to handle code cache pre-population.
Previously rseq code blocks could be created without rseq handling because
the lazy checks were not triggered until after pre-population.

Issue: #2350
derekbruening added a commit that referenced this issue Jul 20, 2019
Fixes the lazy rseq support to handle code cache pre-population between
setup and start.  Previously rseq code blocks could be created without rseq
handling due to the lazy checks not triggering until after taking over the app.

Issue: #2350
derekbruening added a commit that referenced this issue Jul 25, 2019
The __rseq_cs_ptr_array will be relocated, so we should not add the
load offset.

Adds an array to the suite test (previously arrays were only tested
manually using a librseq app).  Creates 2 separate tests to test all 3
section types.

Issue: #2350
derekbruening added a commit that referenced this issue Jul 25, 2019
The __rseq_cs_ptr_array will be relocated, so we should not add the
load_offset but rather the entry_offset.

Documents that we require these rseq sections to be located in loaded
segments.  Adds release-build fatal errors if this is not the case.

Adds an array to the suite test (previously arrays were only tested
manually using a librseq app).  Creates 2 separate tests to test all 3
section types.

Issue: #2350
derekbruening added a commit that referenced this issue Jul 26, 2019
On any translation, and in particular on detach, we translate from
inside an rseq region to the abort handler.  This is necessary to
avoid problems with a cpu migration earlier in the region while
running the instrumented version.

Augments the api.rseq test with a thread that sits in a loop in an
rseq region to test translation on detach: without the translation, it
loops forever.

Issue: #2350
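
The translation rule can be pictured with a small Python sketch. This is a simplification under assumed names: `translate_app_pc` and the tuple layout are invented here, and real translation also has to restore register state.

```python
def translate_app_pc(pc, regions):
    """regions: list of (start_ip, post_commit_offset, abort_ip) tuples.

    A PC inside any rseq region is translated to that region's abort handler,
    so the thread restarts the sequence rather than resuming in its middle.
    """
    for start_ip, post_commit_offset, abort_ip in regions:
        if start_ip <= pc < start_ip + post_commit_offset:
            return abort_ip
    return pc

regions = [(0x100, 0x10, 0x200)]
inside = translate_app_pc(0x108, regions)    # mid-region
outside = translate_app_pc(0x110, regions)   # first PC past the region
```

Resuming natively in the middle of the region would be unsafe here, since the kernel was not tracking the instrumented copy as an rseq region and a migration could have gone unnoticed.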
derekbruening added a commit that referenced this issue Aug 12, 2019
Adds a new option -disable_rseq, which returns -ENOSYS on any SYS_rseq
system call.  This is intended as a workaround for applications that
do not satisfy DR's limitations for full rseq support.

Adds a test that fails unless -disable_rseq is passed.

Moves the rseq limitations list to a new section on rseq in the
documentation.

Issue: #2350
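
The effect of `-disable_rseq` can be sketched as a pre-syscall filter. The Python below is an illustration only: `pre_syscall` and its return convention are hypothetical, and the only external fact it relies on is that `SYS_rseq` is syscall number 334 on x86_64 Linux.

```python
import errno

SYS_rseq = 334  # x86_64 Linux syscall number for rseq

def pre_syscall(sysnum, disable_rseq):
    """Return (execute, result).

    execute says whether the call is passed on to the kernel; result is the
    value handed back to the application when the call is skipped.
    """
    if sysnum == SYS_rseq and disable_rseq:
        return (False, -errno.ENOSYS)
    return (True, None)

blocked = pre_syscall(SYS_rseq, disable_rseq=True)
allowed = pre_syscall(SYS_rseq, disable_rseq=False)
```

From the application's point of view this looks like a kernel without rseq support, so code that checks the registration result would be expected to fall back to its non-rseq path.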
derekbruening added a commit that referenced this issue Sep 10, 2019
Eliminates the call-return reliance for the native execution step of
rseq support.  Makes a local copy of the sequence right inside the
sequence-ending block and executes it.  The sequence is inserted as
additional instructions and is then mangled normally (mangling changes
are assumed to be restartable), but it is not passed to clients.  Any
exits are regular block exits, resulting in a block with many exits.

The prior call-return scheme is left under a temporary option
-rseq_assume_call, as a failsafe in case there are stability problems
discovered with this native execution implementation.  Once we are
happy with the new scheme we can remove the option.

To make the local copy an rseq region, the per-thread rseq_cs address
is identified by watching system calls.  For attach, it is identified
by searching the possible static TLS offsets.  The assumption of a
constant offset is documented and verified.

The rseq_cs's abort handler is a new exit added with the app's
signature as data just before it, hidden in the operands of a nop
instruction to avoid problems with decoding the fragment.  A local
jump skips over the data and exit.

A new rseq_cs structure is allocated for each sequence-ending
fragment.  It is stored in a hashtable in the rseq module, to avoid
complexities and overhead of adding an additional fragment_t or
"subclass" field.  A new flag is set to trigger calling into the rseq
module on fragment deletion.

The rseq_cs fields are filled in via a new post-emit control point,
using information stored in labels during mangling.  The pointer to
the rseq_cs is inserted with a dummy value and patched in this new
control point using a new utility routine patch_mov_immed_ptrsz().

To avoid crashing due to invalid rseq bounds after freeing the rseq_cs
structure, the rseq pointer is cleared explicitly on completion, and
on midpoint exit by the fragment deletion hook along with a hook on
the shared fragment flushtime update, to ensure all threads are
covered.

The rseq test is augmented and expanded.  An invalid instruction is
added to properly test the abort handler, under a conditional to allow
testing each sequence both to completion and on abort.

Future work is properly handling a midpoint exit during the
instrumentation execution: we need to invoke the native version as
well.

Adding aarchxx support is also future work: the
patch_mov_immed_ptrsz(), the writes to the rseq struct in TLS, and the
rseq tests are currently x86-only.

Issue: #2350
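
The "insert a dummy value, then patch it after emit" step can be illustrated in Python. This sketch does not show the internals of `patch_mov_immed_ptrsz()`; the helper names are invented, and it assumes the pointer is loaded with the x86-64 `mov rax, imm64` encoding (`48 B8` followed by an 8-byte little-endian immediate).

```python
import struct

MOV_RAX_IMM64 = bytes([0x48, 0xB8])   # opcode bytes of "mov rax, imm64"
DUMMY = 0                             # placeholder until the address is known

def emit_mov_placeholder(code):
    """Append 'mov rax, <dummy>' and return the offset of its immediate."""
    code.extend(MOV_RAX_IMM64)
    imm_offset = len(code)
    code.extend(struct.pack("<Q", DUMMY))
    return imm_offset

def patch_imm64(code, imm_offset, value):
    """Overwrite the 8-byte immediate in place with the final pointer value."""
    struct.pack_into("<Q", code, imm_offset, value)

# Hypothetical per-fragment table, mirroring the hashtable in the rseq module.
rseq_cs_by_fragment = {}

code = bytearray(b"\x90")             # some earlier instruction (a nop)
imm_offset = emit_mov_placeholder(code)
rseq_cs_addr = 0x7F0011223340         # made-up address of the new rseq_cs
rseq_cs_by_fragment["fragment-1"] = rseq_cs_addr
patch_imm64(code, imm_offset, rseq_cs_addr)
```

Emitting a fixed-width placeholder keeps the fragment's size and layout stable, so the final address can be written once it is known without re-encoding anything around it.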
@derekbruening
Contributor Author

https://github.com/DynamoRIO/dynamorio/wiki/Restartable-Sequences contains a writeup of the implementation details in the series of commits above for the run-twice solution.

There are a number of corner cases left to cover, but they are lower priority. These are things like:

  • Invoking the 2nd execution on a sequence midpoint exit (maybe leveraging __rseq_exit_point_array)
  • Handling indirect branch exits out of a sequence
  • Removing the -rseq_assume_call option once we're satisfied we'll never go back to it
  • Adding more sanity checks for existing requirements/assumptions

@derekbruening
Contributor Author

We are currently ignoring the flags which indicate which triggers (out of preempt, signal, and migrate) should cause the abort handler to be called. We blindly run a second time even if the preempt and migrate bits are not set, which the application may not expect without a signal arriving, or may expect to happen only in a fatal error condition.
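
For reference, honoring the flags would start with decoding them. The Python sketch below assumes the flag names and bit positions of `enum rseq_cs_flags` in the uapi header `linux/rseq.h`, where each bit is phrased as "no restart on" an event; the `abort_triggers` helper is hypothetical.

```python
# Bit values as defined by enum rseq_cs_flags in linux/rseq.h.
RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT = 1 << 0
RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL = 1 << 1
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE = 1 << 2

def abort_triggers(flags):
    """Return the set of events that still cause the abort handler to run."""
    events = {
        "preempt": RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT,
        "signal": RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL,
        "migrate": RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE,
    }
    return {name for name, bit in events.items() if not flags & bit}

default_triggers = abort_triggers(0)
signal_only = abort_triggers(RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT |
                             RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
```

A region whose trigger set contains only "signal" is the case described above: the application would not expect a restart unless a signal arrived.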

derekbruening added a commit that referenced this issue Jan 17, 2020
For i#731 with automatic re-relativization of absolute PC's, in
d6f5fca we simply kept the hardcoded offset for intra-region branch
targets in our native rseq copy.  However, with subsequent mangling
that offset can become incorrect and target the middle of an
instruction, leading to a crash.  We instead take the time to convert
these PC targets to instr_t* targets.

We also tweak the disassembly output to show the instr_t pointer value
for level 3 instructions too, since jumps can target them as well as
synthetic instructions.  This helped with verifying and debugging this
change.

Tested on an inserted system call for locally forcing rseq restarts,
which leads to system call mangling and crashes without this fix.

Issue: #731, #2350
derekbruening added a commit that referenced this issue Jan 17, 2020
Adds translation support for the register restores used in rseq mangling.

Adds a test of a fault/signal in native rseq code by taking advantage
of the lack of xmm support to have different behavior in the
instrumented vs native executions.

I hit this while trying to force a restart for i#4019 in a custom
test, but it could happen in regular execution with an asynchronous
signal.

Issue: #2350
derekbruening added a commit that referenced this issue Jan 17, 2020
Updates a now-stale detail in the rseq limitation docs: we no longer
try to analyze read-write sequences for restoring state for the second
rseq execution.  We do still limit our checkpointing to
general-purpose registers.

Issue: #2350
derekbruening added a commit that referenced this issue Jan 22, 2020
When a migration or context switch happens during rseq native
execution, we now raise a kernel xfer event.  The event is of a new
type DR_XFER_RSEQ_ABORT.

To implement this, the native abort handler cannot be linked and must
return to dispatch.  The special-exit-reason feature is used for this
purpose.

Adds a test.  To force a migration we use a system call, which we do
not normally allow inside an rseq region.  I added a debug-build
exception for this particular test by executable name, along with a
syscall discovery workaround for the attach test.

Adds a client via static DR to api.rseq to test that the event is
raised.

Adds handling to drmemtrace in the tracer and raw2trace.  For
raw2trace we walk backward to undo the committing store that was
recorded, since a real rseq abort would happen before the final store.
I would like to add an offline trace rseq regression test, but it hits
#4041 and so the test will be added as part of that issue.

Issue: #2350, #4019, #4041
Fixes #4019
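
The backward walk in raw2trace can be pictured with a toy record list. This Python sketch uses an invented `(kind, payload)` record format and an invented helper name; the real trace format and post-processing logic are more involved.

```python
def undo_committing_store(records):
    """Drop the most recent store record, as if the abort preceded the commit.

    records: list of (kind, payload) tuples seen so far for the rseq region,
    in execution order. Returns a new list; the input is left unmodified.
    """
    out = list(records)
    # Walk backward from the abort point to the last recorded store.
    for i in range(len(out) - 1, -1, -1):
        kind, _payload = out[i]
        if kind == "store":
            del out[i]
            break
    return out

recorded = [("load", "a"), ("store", "b"), ("instr", "x")]
fixed = undo_committing_store(recorded)
```

The trace then matches what a real abort would look like: everything up to the commit was observed, but the final store never took effect.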
@derekbruening
Contributor Author

I'm marking this as completed since it is working well in practice.
I split the corner cases listed in #2350 (comment) off to #4315. I filed #4316 on adding aarch64 support.

derekbruening added a commit that referenced this issue Apr 16, 2021
Removes the option -rseq_assume_call and its code in favor of the
more-general native-copy approach, which has been the default for a
while now and has not shown any stability issues.

Issue: #2350
derekbruening added a commit that referenced this issue Jun 8, 2021
Clarifies when instrumentation will see an rseq abort.
Fixes a typo.

Issue: #2350