
Add interoperability with W^X policies #3556

Open

derekbruening opened this issue Apr 19, 2019 · 9 comments


@derekbruening (Contributor)

This is a feature request to add support in DR for working in the presence of W^X policies. These PaX-like policies typically disallow execution from any page that was mapped +wx or ever had mprotect +x applied to it, which rules out the usual scheme of writing code and later marking it read-only and executable. The proposal here is to use a dual-mapping strategy: two virtual pages backed by the same physical page, one +rw for writing and the other +rx for execution.

This will likely be limited to 64-bit Linux but could be extended to Windows in the future if desired.

With #1132 in place, we can make a new reservation the same size as vmcode and have a constant offset between each pair of pages. We would then keep all variables tracking addresses as scalars, and when we want to write, we would simply add the offset.

It may be best to first extend #774 and #1132 to make the size of vmcode 2G by default, to avoid any issues with out-of-reservation allocations.


derekbruening commented May 10, 2019

I am observing an interesting performance issue: I have to use mmap to incrementally commit instead of mprotect, since most W^X policies do not allow the mprotect route. With mmap, I see a huge (4-5 second!) delay at init time, apparently due to the kernel's IMA (Integrity Measurement Architecture) computing SHA-256 digests of every page in our anonymous mmap. Doing this for an anonymous mmap seems bizarre. It happens with both memfd_create and /dev/shm as the backing file.

We can see the delay scaling with the vm size:

$ for i in 8M 32M 64M 128M 256M 512M 1G; do /usr/bin/time bin64/drrun -vm_size $i -satisfy_w_xor_x -- suite/tests/bin/simple_app 2>&1 | grep elapsed; done
0.01user 0.07system 0:00.10elapsed 83%CPU (0avgtext+0avgdata 4304maxresident)k
0.01user 0.15system 0:00.16elapsed 100%CPU (0avgtext+0avgdata 4316maxresident)k
0.00user 0.30system 0:00.31elapsed 100%CPU (0avgtext+0avgdata 4268maxresident)k
0.01user 0.59system 0:00.60elapsed 99%CPU (0avgtext+0avgdata 4224maxresident)k
0.01user 1.20system 0:01.23elapsed 98%CPU (0avgtext+0avgdata 4212maxresident)k
0.01user 2.36system 0:02.39elapsed 99%CPU (0avgtext+0avgdata 4312maxresident)k
0.01user 4.72system 0:04.75elapsed 99%CPU (0avgtext+0avgdata 4336maxresident)k

Note how it's all kernel time. perf shows things like:

-   98.33%     0.00%  simple_app  [kernel.kallsyms]  [k] ima_file_mmap
     ima_file_mmap
     process_measurement
     ima_collect_measurement
   + ima_calc_file_hash
+   98.33%     0.00%  simple_app  [kernel.kallsyms]  [k] process_measurement
+   98.33%     0.00%  simple_app  [kernel.kallsyms]  [k] ima_collect_measurement
+   98.33%     0.01%  simple_app  [kernel.kallsyms]  [k] ima_calc_file_hash
+   93.76%     0.06%  simple_app  [kernel.kallsyms]  [k] crypto_shash_update
+   93.70%     0.06%  simple_app  [kernel.kallsyms]  [k] crypto_sha256_update
+   93.64%     0.22%  simple_app  [kernel.kallsyms]  [k] sha256_generic_block_fn
+   93.51%    89.91%  simple_app  [kernel.kallsyms]  [k] sha256_transform

On kernels without IMA enabled the delay is not there.

@derekbruening

A simple reproducer shows that the hash is only computed on the first +x mmap, regardless of the mapped size, and covers the full file size. Subsequent mmaps incur no extra overhead, so this is limited to DR init and will not show up later during app execution.

$ gcc -o ima_overhead ima_overhead.c  && /usr/bin/time ./ima_overhead 
                 Created memfd:        5 usec elapsed
                Truncated size:       27 usec elapsed
                    Mapped +rw:       34 usec elapsed
                    Mapped +rx:  9661297 usec elapsed
            Mapped another +rx:  9661339 usec elapsed
                       Success:  9777377 usec elapsed
0.00user 9.77system 0:09.77elapsed 99%CPU (0avgtext+0avgdata 1296maxresident)k

@derekbruening

I messed up the issue number in my commits, so I'm manually adding references here:

@derekbruening

I failed to think about how to handle fork. We have to use MAP_SHARED to mirror the views, so we will not get copy-on-write semantics (and even if we did, a copy would not preserve the mirroring). I think the child has to create a new memfd file and copy all the old content into new views. For a large vmcode that could take a while. Fortunately, most forks happen early in a process and rarely late in a large (especially multi-threaded) app?

@derekbruening

We also have trouble supporting dr_nonheap_alloc() with +wx memory. Since I don't think we can turn this on by default yet, I'm going to just issue a usage error in that case.

derekbruening added a commit that referenced this issue May 16, 2019
Adds proper handling of encoding an instr_t immed to a copy.
Adds a test: client.reachability with -satisfy_w_xor_x.

Issue: #3556
Fixes #3615
@derekbruening

The fork copy solution of course has a race. I'm not sure there's a better solution. For now I may live with the race as the alternative is having the parent wait.

Summarizing where we are:

    /* XXX i#3556: Support for W^X has some current limitations:
     * + It is not implemented for Windows or Mac.
     * + Fork is not perfectly supported: there is overhead and a race.
     * + Pcaches are not supported.
     * + -native_exec_list is not supported.
     * + dr_nonheap_alloc(rwx) is not supported.
     *   Clients using other non-vmcode sources of +wx memory will also not comply.
     */

derekbruening added a commit that referenced this issue May 17, 2019
Adds W^X-aware handling of several generated code cases missed in the
original implementation:
+ mangle_syscall_code()
+ shift_ctis_in_fragment()

Adds best-effort handling of fork (there is still a race, and
potentially noticeable overhead).

Adds a fix for the proper heap type when extending reachable heap
units (which showed up as a trace encoding bug).

Adds a usage error when using dr_nonheap_alloc() with +wx memory,
which we do not support with W^X.

Changes the single -satisfy_w_xor_x test into a cross-cutting option
set, expanding into multiple tests to cover more behavior.  To include
drcachesim tests here, a new test feature _self_serial is added which
adds dependences for copies of the same test run under different
options.

Issue: #3556
@derekbruening

I see two solutions to the fork race:

  1. Parent waits for child to finish copying.
  2. Parent makes a temp copy and continues. When child is done copying from
    the temp it sets some flag so the parent can lazily free the temp copy.

I'm not sure which is better.
As for the copy size: with -no_reachable_heap (which perhaps we should disallow
with -satisfy_w_xor_x) it is just vmcode, but that can exceed 200MB for very
large apps. As mentioned above, hopefully such apps do not fork often.

derekbruening added a commit that referenced this issue May 30, 2019
Changes how fork is handled to avoid races.  Now the parent makes an
anonymous private temporary copy of vmcode pre-fork, freeing it
post-fork.  The child sets up its new mappings from the temporary
copy, avoiding any issues with copying the parent's live data.

Adds a sanity fork regression test.

Issue: #3556
@derekbruening

DR_ALLOC_CACHE_REACHABLE is in the same boat as dr_nonheap_alloc: we'd have to expose the mapping function to support it. It should also have an assert for now.

hgreving2304 pushed a commit that referenced this issue Jul 25, 2019
Fixes a bug introduced in 1beb546, missing the translation to the executable address
view, leading to unreachable 32-bit displacements.

Adds vm_base options to an existing w^x test in order to expose future bugs like above.

Issues: #1312, #3556
@derekbruening

Just clarifying one reason this issue is left open: how to enable clients to operate properly with respect to W^X.
See the comments above on DR_ALLOC_CACHE_REACHABLE, etc. Exposing the mapping function sounds confusing for clients (and it would need a different name for data: s/executable/reachable/).

Better would be to shrink the vmcode mapping and place a second mapping there for
reachable heap, so nobody has to do any mapping translation, but that has sizing issues.
Xref the client lib being in the same space and the default 1G-vs-2G issues.
