fn2hash - Function hashing and code similarity
fn2hash [--min-instructions=NUMBER] [--basic-blocks] [--json=FILENAME] [--pretty-json[=INDENT]] [...Pharos options...] EXECUTABLE_FILE
fn2hash --help
fn2hash --rose-version
fn2hash calculates various function hashes for the functions in a program and dumps the data to stdout in the following CSV format:
filemd5, fn_addr, num_basic_blocks, num_basic_blocks_in_cfg, num_instructions, num_bytes, exact_hash, pic_hash, composite_pic_hash, mnemonic_hash, mnemonic_count_hash, mnemonic_category_hash, mnemonic_category_counts_hash, mnemonic_count_string, mnemonic_category_count_string [,opt_basic_block_data, opt_bb_cfg]
where those columns are:
- filemd5
The md5 sum of the input file.
- fn_addr
The address of this function.
- num_basic_blocks
The number of basic blocks that comprise the function.
- num_basic_blocks_in_cfg
The number of those basic blocks that are actually in the control flow graph of the function.
- num_instructions
The number of instructions in the function.
- num_bytes
The number of bytes that make up the instructions in the function.
- exact_hash
The md5 of the bytes of the function concatenated in flow order.
- pic_hash
Essentially the same as the exact_hash, but address references (except local relative ones) are replaced with 0 values before hashing. The goal is to match functions that are identical except for references to locations in memory (other functions, imports, global data addresses, etc.) that may change between occurrences in different programs.
- composite_pic_hash
A variant of the pic_hash that excludes the bytes of control flow related instructions. The md5 of each basic block is computed separately (minus the control flow related bytes), those basic block md5s are ordered and concatenated, and the resulting string is hashed (md5). The goal is to account for minor differences in compiler output, for instance the compiler choosing jz instead of jnz and reordering the otherwise identical basic blocks because of that.
- mnemonic_hash
Like the exact_hash, but the mnemonics of the instructions (without operands) are concatenated and hashed instead of the instruction bytes.
- mnemonic_count_hash
This is a hash of a vector of ordered pairs, each consisting of a mnemonic and the number of occurrences of that mnemonic in the function.
- mnemonic_category_hash
Like the mnemonic_hash but the mnemonics are mapped to a smaller set of categories instead. The categories are:
Data transfer insns (eg: mov, push, xchg).
Arithmetic insns (eg: add, sub, lea).
Bitwise operations (eg: and, or, not, xor, shl, ror).
Comparison insns (eg: test, cmp).
- BR
Branching insns (eg: jmp, jcc, call).
Floating point insns (eg: fadd, fmul, fld).
SIMD (MMX/SSE* related) insns (eg: addps, mulss, psadbw).
Insns to aid in cryptography (AES and SHA) calculations (eg: aesdec, sha256rnds2).
Virtual Machine Monitor (hypervisor) related insns.
Various "system" level and privileged insns (eg: int, sysenter).
String related insns (eg: movsb).
- I/O
Port related insns (eg: in, out, insb, outsb).
Any insns that haven't been assigned to one of the above categories.
- mnemonic_category_counts_hash
Like mnemonic_count_hash but using the mnemonic categories instead of mnemonics.
- mnemonic_count_string
The actual vector used in mnemonic_count_hash.
- mnemonic_category_count_string
The actual vector used in mnemonic_category_counts_hash.
- opt_basic_block_data
Metadata about each basic block in the function. The entry for each block consists of the starting address of the basic block, followed by the number of instructions in it, the PIC hash of just that block, the CPIC hash of that block (the composite PIC variant, so no control flow insns are included), and the mnemonic and mnemonic category of each insn in the block (in order).
- opt_bb_cfg
This describes the edges of the control flow graph of the basic blocks for this function as a series of address pairs.
Each address pair is an edge in the CFG, and the addresses are the starting addresses of the basic blocks in that edge. For example, a pair made of addr1 and addr2 means that the block starting at addr1 can flow into the block starting at addr2. Note that if there is only one basic block in the function, this data will be blank.
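The relationship between exact_hash, pic_hash, composite_pic_hash, and mnemonic_hash described above can be illustrated with a minimal Python sketch. This is not the fn2hash implementation. The `Insn` record, the way address references are marked (a set of byte offsets), the use of hex digests when concatenating the per-block md5s, and the instruction bytes in the sample functions are all assumptions made for illustration, so the digests it produces will not match real fn2hash output.

```python
import hashlib
from typing import FrozenSet, List, NamedTuple


class Insn(NamedTuple):
    """Hypothetical instruction record; not a Pharos data structure."""
    mnemonic: str
    raw: bytes
    addr_offsets: FrozenSet[int] = frozenset()  # offsets of address-reference bytes
    is_control_flow: bool = False


Function = List[List[Insn]]  # basic blocks in flow order (assumed representation)


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def pic_bytes(insn: Insn) -> bytes:
    """Instruction bytes with the address-reference bytes replaced by 0."""
    return bytes(0 if i in insn.addr_offsets else b for i, b in enumerate(insn.raw))


def exact_hash(fn: Function) -> str:
    return md5_hex(b"".join(i.raw for blk in fn for i in blk))


def pic_hash(fn: Function) -> str:
    return md5_hex(b"".join(pic_bytes(i) for blk in fn for i in blk))


def composite_pic_hash(fn: Function) -> str:
    # Hash each block without its control flow insns, order the block
    # hashes, concatenate them, and hash the result. Concatenating hex
    # digests is an assumption of this sketch.
    per_block = sorted(
        md5_hex(b"".join(pic_bytes(i) for i in blk if not i.is_control_flow))
        for blk in fn
    )
    return md5_hex("".join(per_block).encode("ascii"))


def mnemonic_hash(fn: Function) -> str:
    return md5_hex("".join(i.mnemonic for blk in fn for i in blk).encode("ascii"))


# Made-up sample functions, for illustration only.
mov_a = Insn("mov", bytes.fromhex("a100304000"), frozenset({1, 2, 3, 4}))
mov_b = Insn("mov", bytes.fromhex("a100504000"), frozenset({1, 2, 3, 4}))
jz = Insn("jz", bytes.fromhex("7402"), is_control_flow=True)
jnz = Insn("jnz", bytes.fromhex("7502"), is_control_flow=True)
xor = Insn("xor", bytes.fromhex("31c0"))
ret = Insn("ret", bytes.fromhex("c3"), is_control_flow=True)

fn_a = [[mov_a, jz], [xor], [ret]]
fn_b = [[mov_b, jz], [xor], [ret]]    # same code, different global address
fn_c = [[mov_a, jnz], [ret], [xor]]   # inverted branch, blocks reordered

print(exact_hash(fn_a) == exact_hash(fn_b))                  # False
print(pic_hash(fn_a) == pic_hash(fn_b))                      # True
print(pic_hash(fn_a) == pic_hash(fn_c))                      # False
print(composite_pic_hash(fn_a) == composite_pic_hash(fn_c))  # True
```

In this sketch, fn_b differs from fn_a only in a global data address, so the pic_hash values agree while the exact_hash values do not. fn_c uses the inverted branch and a different block order, which only the composite_pic_hash tolerates.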
Note that since the file md5 is the first column in the output, the fn2hash output for multiple files can be combined easily, if desired. This can be convenient when working with data from related sets of files.
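As a rough sketch of working with combined output, the Python below groups rows from several files by pic_hash and reports the hashes that occur in more than one input file. It assumes the output has no header row, that the leading columns appear in the order listed above, and that those leading columns contain no embedded commas. The function name `shared_functions` and the sample rows are hypothetical placeholders, not real fn2hash output.

```python
import csv
from collections import defaultdict

# Column positions taken from the CSV format listed above.
FILEMD5, FN_ADDR, NUM_INSTRUCTIONS, PIC_HASH = 0, 1, 4, 7


def shared_functions(lines, min_instructions=0):
    """Map each pic_hash seen in more than one file to its (filemd5, fn_addr) pairs."""
    by_hash = defaultdict(set)
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) <= PIC_HASH:
            continue  # skip blank or truncated lines
        if int(row[NUM_INSTRUCTIONS]) < min_instructions:
            continue  # ignore very small functions
        by_hash[row[PIC_HASH]].add((row[FILEMD5], row[FN_ADDR]))
    return {
        h: sorted(fns)
        for h, fns in by_hash.items()
        if len({md5 for md5, _ in fns}) > 1
    }


# Placeholder rows standing in for the concatenated output of two files.
sample = [
    "fileA,0x401000,1,1,5,10,E1,P1,C1",
    "fileB,0x402000,1,1,5,10,E2,P1,C1",
    "fileA,0x403000,1,1,2,4,E3,P2,C2",
]

for pic, fns in shared_functions(sample).items():
    print(pic, fns)
```

With real data, `lines` could be an open file holding the concatenated fn2hash output of several executables.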
The following options are specific to the fn2hash program.
- --min-instructions=NUMBER, -m=NUMBER
Minimum number of instructions a function must contain for its data to be output. Functions below this instruction count will not appear in the output.
- --basic-blocks, -B
The -B option adds basic block data in two additional fields: the first holds the data for each individual basic block (opt_basic_block_data), and the second holds the edges of the function control flow graph (opt_bb_cfg), both as described above.
- --json=FILENAME
Output hash information in JSON format to FILENAME. If FILENAME is , JSON will output to stdout.
- --pretty-json[=INDENT], -p[=INDENT]
When outputting JSON, use newlines and indentation, making the output human-readable. INDENT is the indentation level, and defaults to
$ fn2hash --stockpart --no-semantics win7_calc.exe > win7_calc.fn2hash.txt
$ head -1 win7_calc.fn2hash.txt
$ fn2hash --stockpart --no-semantics --basic-blocks win7_notepad.exe > win7_notepad.fn2hash.bb.txt
$ head -1 win7_notepad.fn2hash.bb.txt
Written by the Software Engineering Institute at Carnegie Mellon University. The primary author was Charles Hines.
Copyright 2018 Carnegie Mellon University. All rights reserved. This software is licensed under a "BSD" license. Please see LICENSE.txt for details.
See fse.py and possibly fn2yara.