This directory contains a dynamic analysis framework built using DynamoRIO. Collectively, these tools and techniques are called YARN.
Please note that this README and the one at tracetools/README.md may be out-of-date.
This is messy research code. Use at your own risk.
- Docker, installed and running.
- Git LFS extension installed.
- Binary Ninja license (see the "Installing" section for information on where it should be copied). To avoid API version mismatch issues, Binary Ninja binaries are included in this repository. A license is not strictly required unless you plan on developing/using anything that relies on signatures/moment-of-recognition -- either tracetools/tools/pt_tracker.py or tracetools/tools/poppler_jpeg.py.
The DynamoRio-based YARN instrumentation tool does not work with macOS, even when run indirectly via docker.
Make sure you have the git LFS extension installed before cloning anything. If it isn't installed, none of the *.zip files in ./parsers will be valid zip files.
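A quick way to check whether a given *.zip is a real archive or an un-fetched LFS pointer (this helper is illustrative and not part of the repository): git LFS pointer files begin with the LFS spec line, while real zip archives begin with the PK\x03\x04 magic bytes.

```python
def classify_blob(data: bytes) -> str:
    """Distinguish a real zip archive from an un-fetched git LFS pointer."""
    if data.startswith(b"version https://git-lfs.github.com/spec/v1"):
        return "lfs-pointer"  # LFS extension was missing at clone time
    if data.startswith(b"PK\x03\x04"):
        return "zip"          # standard zip local-file-header magic
    return "unknown"

if __name__ == "__main__":
    print(classify_blob(b"PK\x03\x04rest-of-archive"))
    print(classify_blob(b"version https://git-lfs.github.com/spec/v1\n"))
```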
If you have one, copy your Binary Ninja license to third-party/binaryninja/license.dat (both headless and regular Binary Ninja packages work).
After you have properly installed docker on your system, build the docker image:
./build.sh
This builds a docker image for the DynamoRIO-based tools by default. The image is named mr_memtrace-analysis-dev.
If you find yourself needing to debug the build.sh file, run it with the --no-cache option to force docker to rebuild the image from scratch.
Starting the YARN docker container:
docker run -it --rm mr_memtrace-analysis-dev:latest
If you plan on doing any tool development, you can mount your local memtrace directory (repository root) to /processor:
docker run -it --rm -v"$(pwd):/processor" mr_memtrace-analysis-dev:latest
This should be run from the root of the memtrace directory in which you will be working on memtrace or memtrace-tool scripts. If you edit any files that relate to the DynamoRIO instrumentation (e.g., mem-trace.c), you will need to run make from your container's /processor directory.
Note: if the filesystem where your docker containers live has limited storage, you may wish to tell docker to store the results/logs generated by memtrace (kept in the container's /results/ directory) elsewhere, using the -v (volume) option to specify a host directory, e.g.,
docker run -it --rm -v"/media/largedisk/results:/results" -v"$(pwd):/processor" mr_memtrace-analysis-dev:latest
To quickly get started, run the following commands:
mkdir results
./build.sh
docker run -it --rm -v "$(pwd)/results:/results" mr_memtrace-analysis-dev:latest
./run_trace.py /pdfs/hello.pdf
Next proceed to the Postprocessing instrumentation results section to learn how to run analyses on the tracer's results. If you are interested in generating and viewing a parse tree for the results continue on to the Parse tree analysis and viewing section.
Use run_trace.py within a mr_memtrace-analysis-dev container to execute an instrumented run of a parser. It supports a small number of parsers/parser families, including poppler's pdftotext and pdftops, as well as mupdf's mutool conversion to ps and text.
run_trace.py wraps the output generated by the DynamoRIO tools in a structured format that all the processing tools in memtrace-tools understand.
run_trace.py still has a lot of hard-coded cruft, so for now it is best to run it inside the provided docker container.
run_trace.py can be run on the included example pdf:
./run_trace.py /pdfs/hello.pdf
You must specify the path to at least one input to be processed as an argument to run_trace.py, e.g.,
./run_trace.py path/to/foo.pdf path/to/bar.pdf
By default, run_trace.py will execute poppler's pdftops. To see what other parser families/binaries are supported, execute
./run_trace.py --list
If you would like to run a non-default parser, specify the parser family using the -p option, the version using -v, and the binary using -b, e.g.,
> ./run_trace.py --list
Parser family mupdf:
input type: pdf:
version: 1.18.0
supported binaries: (name/command)
- mutool: mutool clean -s -ggg {in_file} out.pdf
- mutops: mutool convert -F ps -o out.ps {in_file}
- mutotext: mutool convert -F txt -o out.txt {in_file}
- mutotext-decrypt-user: mutool convert -p user -F txt -o out.txt {in_file}
- mutotext-decrypt-owner: mutool convert -p owner -F txt -o out.txt {in_file}
Parser family poppler:
input type: pdf:
version: 0840
supported binaries: (name/command)
- pdftops: utils/pdftops {in_file} out.ps
- pdf-fullrewrite: test/pdf-fullrewrite {in_file} out.pdf
- pdftocairo: utils/pdftocairo -png {in_file} out
- pdftotext: utils/pdftotext {in_file} out.txt
- pdftotext-decrypt-user: utils/pdftotext -upw user {in_file} out.txt
- pdftotext-decrypt-owner: utils/pdftotext -opw owner {in_file} out.txt
version: eval1_sri
supported binaries: (name/command)
- pdftops: utils/pdftops {in_file} out.ps
> ./run_trace.py -p mupdf -v 1.18.0 -b mutops path/to/foo.pdf path/to/bar.pdf
(Note: if support for a different parser family, version, and/or binary is needed, it must be added to a JSON configuration file in ./parser-settings. The contents/semantics/format of these files are currently undocumented.)
./run_trace.py will create a directory containing the run's results under /results (or the directory specified by the -r option). The subdirectory will be given a randomly generated name that starts with res_. You may use -t <name> to tag the generated results with a more memorable name (this merely creates a symbolic link). Most memtrace postprocessing tools require the path (or symbolic link) to the result directory to be processed.
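The naming and tagging behavior described above can be sketched as follows (a minimal illustration, not run_trace.py's actual code; the helper name is made up):

```python
import os
import tempfile
import uuid

def make_result_dir(results_root, tag=None):
    # run_trace.py-style naming: a randomly generated name prefixed "res_".
    name = "res_" + uuid.uuid4().hex
    path = os.path.join(results_root, name)
    os.makedirs(path)
    if tag is not None:
        # Tagging merely creates a memorable symbolic link to the same dir.
        os.symlink(name, os.path.join(results_root, tag))
    return path

if __name__ == "__main__":
    root = tempfile.mkdtemp()
    path = make_result_dir(root, tag="hello-run")
    print(os.path.basename(path).startswith("res_"))        # True
    print(os.path.islink(os.path.join(root, "hello-run")))  # True
```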
The generated results directory contains information including:
- process's address space layout (address map, in mmap.*.log)
- Binary event log generated by instrumentation (in memcalltrace.*.log, one per thread)
- command's standard output/error content (in subprocess.out)
- command invoked, exit value, runtime, etc. (in info.txt)
- a copy of the input file
All binaries/libraries loaded by the parser will be cached in a /results/bins_* directory (by default) -- this is done once per instrumented parser binary. Each /results/bins_* directory contains all results directories generated by its corresponding parser binary (cached in the /results/bins_*/data directory). The /results/res_* directories are merely symbolic links.
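Because the /results/res_* entries are just symlinks, a script can recover the owning bins_* cache by resolving them. A sketch of that layout and lookup (the hash names below are made up):

```python
import os
import tempfile

# Recreate the layout described above with made-up names:
#   <results>/bins_<hash>/res_<id>  -> real results directory
#   <results>/res_<id>              -> convenience symlink
results = tempfile.mkdtemp()
real = os.path.join(results, "bins_9ad307e", "res_40a7286")
os.makedirs(real)
os.symlink(real, os.path.join(results, "res_40a7286"))

# Resolving the symlink reveals which bins_* directory owns the run.
resolved = os.path.realpath(os.path.join(results, "res_40a7286"))
owner = os.path.basename(os.path.dirname(resolved))
print(owner)  # bins_9ad307e
```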
Tools for postprocessing instrumentation results live in the memtrace subrepo/directory.
After a successful run of run_trace.py in the docker container, a directory named res_* will exist in /results containing the memory tracing log and other artifacts. The path to this directory is passed via the --parse_result/-R option to the postprocessing tools (which live in tracetools/tools).
The memtrace source tree contains a run-analysis.sh script (in the root directory) that is a handy wrapper for running the postprocessing tools within the mr_memtrace-analysis-dev docker container. (It invokes the postprocessing tools within a pypy environment, which is significantly faster than using the default python interpreter.)
For example, suppose you ran an instance of run_trace.py in the docker container that saved its results to /results/res_40a7286ba6614a33ba658115ec8c719c, and you want to print out its memtrace log in a human-readable format. You can use the tracetools/tools/print_log.py tool to do this. Within the docker container, invoke the tool this way:
./run-analysis.sh print_log.py -R /results/res_40a7286ba6614a33ba658115ec8c719c
./run-analysis.sh is merely a wrapper around tracetools/tools/print_log.py here, so you can use it to view the tool's documentation:
See tracetools/README.md for more information on analyzing YARN's instrumentation output with YARN's postprocessing tools.
Although parse tree analysis and viewing normally requires a Binary Ninja license to calculate the addresses of important parsing code, we've included some files containing pre-generated addresses in this repository so that you can bypass the Binary Ninja requirement. Installing these pre-generated files in the proper location isn't entirely straightforward, so the steps are walked through below.
First spin up a docker container with a directory of sample PDFs
mounted at /pdfs
, e.g.,
docker run -it --rm -v"$HOME/pdfs:/pdfs" -v"$(pwd)/results:/results" mr_memtrace-analysis-dev:latest
Then perform an instrumented run, e.g., via
./run_trace.py /pdfs/sample.pdf
(This will trace /pdfs/sample.pdf as it is parsed by poppler's pdftops.)
If this runs successfully, then at the end of stdout you should see something like the following:
Results saved to /results/bins_9ad307edb9ca430e814bef40d09fd232/res_40a7286ba6614a33ba658115ec8c719c
This is where the tracing results were saved. Note that ./run_trace.py also creates a symbolic link to this directory at /results/res_40a7286ba6614a33ba658115ec8c719c.
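If you're scripting around the tracer, the result path can be plucked out of stdout. A small sketch based on the output line shown above (the regex is an assumption about the message's exact shape):

```python
import re

def extract_result_dir(stdout_text):
    """Pull the results path out of run_trace.py-style output."""
    m = re.search(r"Results saved to (\S+)", stdout_text)
    return m.group(1) if m else None

example = (
    "tracing finished\n"
    "Results saved to /results/bins_9ad307edb9ca430e814bef40d09fd232/"
    "res_40a7286ba6614a33ba658115ec8c719c\n"
)
print(extract_result_dir(example))
```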
Next, copy the pre-generated address metadata in parser-metadata to the /results/bins_*/data directory, e.g.,
cp parser-metadata/* /results/bins_9ad307edb9ca430e814bef40d09fd232/data/
Finally, you should be able to use the parse tree postprocessing tool to generate and save a copy of the parse tree:
./run-analysis.sh pt_tracker.py -R /results/res_40a7286ba6614a33ba658115ec8c719c -s
If, when you run this tool, you see output similar to the following, then you haven't copied all of the parser metadata to the proper bins_*/data directory:
running pt_tracker.py
ERROR:root:Cannot import bin_info/binary ninja
ERROR:root:No module named 'binaryninja'
ERROR:root:Note: binja is not supported by pypy3
WARNING:root:Address version cache for these results have not been createed yet for libraries: pdftops, libpoppler.so.94, libc-2.31.so. Try rerunning pt tracker with '-g' option to generate cache
INFO:root:adding tracing for library /results/bins_2d4c6dc5341c4defadd0bdea43b85f60/data/libpoppler.so.94 at /results/bins_2d4c6dc5341c4defadd0bdea43b85f60/data/libpoppler.so.94.otherdb
INFO:root:adding tracing for library /results/bins_2d4c6dc5341c4defadd0bdea43b85f60/data/libjpeg.so.8.2.2 at /results/bins_2d4c6dc5341c4defadd0bdea43b85f60/data/libjpeg.so.8.2.2.otherdb
INFO:root:importing cache of symbol info /results/bins_2d4c6dc5341c4defadd0bdea43b85f60/data/pdftops.otherdb
INFO:root:... done
INFO:root:importing cache of symbol info /results/bins_2d4c6dc5341c4defadd0bdea43b85f60/data/libpoppler.so.94.otherdb
INFO:root:... done
INFO:root:importing cache of symbol info /results/bins_2d4c6dc5341c4defadd0bdea43b85f60/data/libjpeg.so.8.2.2.otherdb
INFO:root:... done
Traceback (most recent call last):
File "/processor/tracetools/tools/pt_tracker.py", line 134, in <module>
run(args)
File "/processor/tracetools/tools/pt_tracker.py", line 111, in run
p = PTTracker(a)
File "/processor/tracetools/tools/pt_tracker.py", line 69, in __init__
print_offset=a.print_offset)
File "/processor/tracetools/tracetools/signatures/versions.py", line 550, in create_tracker
return tracker_cls(ml, unique_only, **kwargs)
File "/processor/tracetools/tracetools/signatures/xpdf_poppler.py", line 842, in __init__
**kwargs)
File "/processor/tracetools/tracetools/signatures/evaluator.py", line 268, in __init__
super(SigPTEval, self).__init__(parse_log, **kwargs)
File "/processor/tracetools/tracetools/signatures/evaluator.py", line 34, in __init__
self.signatures.setup_sig_classes(self, self.ml)
File "/processor/tracetools/tracetools/signatures/signatures.py", line 253, in setup_sig_classes
self.setup_sig_classes(manager, parselog, subcls)
File "/processor/tracetools/tracetools/signatures/signatures.py", line 253, in setup_sig_classes
self.setup_sig_classes(manager, parselog, subcls)
File "/processor/tracetools/tracetools/signatures/signatures.py", line 253, in setup_sig_classes
self.setup_sig_classes(manager, parselog, subcls)
File "/processor/tracetools/tracetools/signatures/signatures.py", line 259, in setup_sig_classes
self.setup_sig_classes(manager, parselog, subcls)
File "/processor/tracetools/tracetools/signatures/signatures.py", line 246, in setup_sig_classes
cls.setup_sig_class(manager, parselog, callback)
File "/processor/tracetools/tracetools/signatures/signatures.py", line 89, in setup_sig_class
cls._setup()
File "/processor/tracetools/tracetools/signatures/signatures.py", line 688, in _setup
super(NewFrameMoment, cls)._setup()
File "/processor/tracetools/tracetools/signatures/signatures.py", line 473, in _setup
cls.setup()
File "/processor/tracetools/tracetools/signatures/xpdf_poppler.py", line 902, in setup
cls.xref_fetch_objs = cls.addrs_of("xref_fetch_obj")
File "/processor/tracetools/tracetools/signatures/signatures.py", line 479, in addrs_of
absolute)
File "/processor/tracetools/tracetools/signatures/versions.py", line 353, in addrs_of
+ str(addrs))
tracetools.signatures.utils.BinaryInfoException: Issue looking up addrs for libpoppler.so.94:xref_fetch_obj, found []
A successful run will save the derived parse tree in res_*/derived-pt.json. You can then view a textual representation of the parse tree via:
./run-analysis.sh pt_tracker.py -R /results/res_40a7286ba6614a33ba658115ec8c719c -j
You can also browse the parse tree via a GUI if you execute docker from the host so that it has access to the display:
xhost +local:
docker run -it --rm -e DISPLAY=$DISPLAY -v /tmp/.X11-unix:/tmp/.X11-unix:ro -v"$(pwd)/results:/results" mr_memtrace-analysis-dev:latest
Inside the docker container:
./run-analysis.sh view_parsetree.py -R /results/res_40a7286ba6614a33ba658115ec8c719c
The left panel of the GUI displays each parse tree. You can browse the parse trees using the up/down arrow keys and expand/collapse a tree by one generation using the right and left arrow keys. Rows are sorted and assigned an ID in parsing order. For each node, you can see the object type and value (if it is a leaf node). The right panel shows details about the object highlighted in the left column, including a unique ID number, the object type, the object value (if it isn't a parent object), and the File Taint -- the byte offset in the input file from which the node was built.
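If you'd rather inspect the saved tree programmatically than in the GUI, a generic recursive walk works. Note that the node schema below ("type", "value", "children" keys) is a guess for illustration only; derived-pt.json's actual layout is not documented here and may differ.

```python
import json

# Hypothetical parse tree node schema, invented for illustration.
example = json.loads("""
{"type": "dict", "children": [
    {"type": "name", "value": "/Type"},
    {"type": "name", "value": "/Catalog"}
]}
""")

def render(node, depth=0):
    """Return an indented textual representation of a (hypothetical) tree."""
    line = "  " * depth + node.get("type", "?")
    value = node.get("value")
    if value is not None:
        line += " = " + str(value)  # leaf nodes carry a value
    lines = [line]
    for child in node.get("children", []):
        lines.extend(render(child, depth + 1))
    return lines

print("\n".join(render(example)))
```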
Use ./test_trace.py directly to apply memtrace instrumentation to arbitrary executables. E.g., to instrument a binary located at ./ls:
./test_trace.py -R -b --parser ./ls --parser-args '' .
This will perform an instrumented run of ./ls (because of --parser ./ls) called with no arguments (--parser-args ''); tracing will include basic block information (because the -b argument is specified). Tracing will begin when main is called and end when it returns (use the -e [fn] argument to override this), and the path to the result directory will be printed because -R is specified. The trailing dot (.) is treated as the binary's input file by the instrumentation. If the binary doesn't process any input files, this final positional argument can be any arbitrary file. If the binary does process an input file, this argument should be the path to the input file -- if the binary needs to take the path as a command-line argument, update the value of --parser-args to reflect this. E.g., if you want ./ls -l /root to be called, then specify the argument using the {in_file} placeholder in --parser-args, i.e.,
./test_trace.py -b -R --parser ./ls --parser-args '-l {in_file}' /root
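The {in_file} placeholder expansion can be sketched with ordinary Python string formatting (the shlex usage below is illustrative, not necessarily how test_trace.py builds its command line):

```python
import shlex

template = "-l {in_file}"  # value passed to --parser-args
in_file = "/root"          # final positional argument

# Substitute the placeholder, then split the result into argv tokens.
argv = ["./ls"] + shlex.split(template.format(in_file=in_file))
print(argv)  # ['./ls', '-l', '/root']
```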
If you get the following error: "tracetools.results_data.ResultsException: Something went wrong and no mmap log exists. Did memory tracker log ever get enabled/populated?"
This means that nothing ever got logged. This is likely because the entrypoint (by default "main", otherwise specified using the -e parameter) was never invoked. Check the spelling of the symbol name and try running the application within gdb to determine what functions do get invoked.
This is left as an exercise for the reader.
The Makefile builds the DynamoRIO-based memory and callstack tracing tools. It also has three "test" targets (test1, test2, test3) that run pdfto{text,html} against PDFs in ../tests. Output is saved in ./build/memcalltrace.pdfto*.log. Be aware that the output generated by these tests can range from several hundred megabytes to several hundred gigabytes (and possibly larger).
tracetools/tools/print_log.py is a standalone python3 tool that simply parses the output generated by the memcalltrace tool and prints out the contents in a human-readable format.
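As a general pattern, reading a binary event log like memcalltrace.*.log means unpacking fixed-size records. The record layout below is invented purely for illustration; memcalltrace's real on-disk format is defined by the C instrumentation tool and is not shown here.

```python
import io
import struct

# Invented record layout: 1-byte event type, 8-byte address, 8-byte size.
REC = struct.Struct("<BQQ")

def read_records(stream):
    """Yield unpacked fixed-size records until the stream is exhausted."""
    while True:
        chunk = stream.read(REC.size)
        if len(chunk) < REC.size:
            break
        yield REC.unpack(chunk)

# Fabricated two-record log.
log = io.BytesIO(REC.pack(1, 0x400000, 8) + REC.pack(2, 0x400008, 16))
for event, addr, size in read_records(log):
    print(f"event={event} addr={addr:#x} size={size}")
```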
Please see tracetools/README.md for more information.
This code is released under the MIT License
The MIT License (MIT)
Copyright (c) 2022 Narf Industries LLC
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.