This is an implementation of the paper in Rust. Nino Stephen made an implementation in Python (https://github.com/ninostephen/fbhash). Part of the following description I have adapted from his description.
The paper argues that it is resistant to an adversary that actively tries to hide similarities to other documents. Similarity checking is a means of finding other/previous versions of documents in forensic practice.
The idea behind the FbHashing
scheme is create a new representation of the document that can then
be used to identify other documents that are 'close' to the representation. For this to work,
the software starts to identify frequently occurring parts in the document and use that as a
pattern for similarity to other documents.
Instead of comparing parts of the document, content of the document is
translated to the a TF/IDF vector representation of the document and then taking
the cosine distance to the vectors of other documents.
The whole process occurs as follows.
- Take all the documents in a directory and compute the TF/IDF vectors on those documents.
- For each document do the following
- First find the chunks in the document. The paper suggests making a slice of 7 bytes. Each new chunk is generated by removing the first character from the preceding chunk and adding the next byte at the end from the document supplied. For example, if our document contained ABCDEFGHIJKLM... then the first chunk will be ABCDEFG, the second will be BCDEFGH and so on.
- The chunks are then passed on to the Rabin-Karp Rolling Hash function, to create a hash value for the chunk.
Where "a" is a constant (255), k is the chunk length (7)), n is a large prime number (801,385,653,117,583,579, taken from Nino's implementation),
C1,C2,..,Ck are the input characters (ASCII Values)
Hnew = ( aH - C0ak + incoming byte ) modulus n
EQN : RollingHash(Ch(i+1)) = a*RollingHash(Chi) - Ciak + C(i+k) modulus n
Max value of H can be an unsigned 64 bit number
All possible ASCII values of Ci (input characters) = 256 (ASCII Value)
All possible values of "a" (constant) = 256 (size of alphabet)
Value of k should satisfy the following
264 - 1 >= 256 * 256(k-1)
Let k = 7
264 > 28 * (28)6 = 256 And n = 801385653117583579 : Prime number larger than 2^56 = 72057594037927940. This huge value is necessary to ensure that there is a small chance of collision while generating hashes.
- Using this list of rolling hashes, filter out the unique chunks and find the chunk frequency. By chunk frequency we mean the number of times a chunk repeats itself within the same document. The hash value is used as the index into a hash table and the values associated being the frequency of chunks having that hash value. The implementation uses the
HashMapFrequency
for this goal. - Once the chunk frequency is calculated, the chunk weight is calculated using it. A weight is assigned to the each chunk based on its frequency. The weight is calculated using the following equation.
Where ChifD is the chunk frequency of that ith chunk
- The value of a chunk is determined by the frequency of the chunk in other documents as well. More frequent the chunk is, less valuable it become. To determine the value of a chunk, the rolling hashes of a chunk are checked with the rolling hashes of other documents in the database and its document frequency is calculated just like the chunk frequency is calculated. Document frequency is represented as dfCh
- Using the document frequency, the document weight can be calculated. The document weight is know as the informativeness of the document. As mentioned above, if the document frequency of a chunk is high, then it is less valuable. The less frequent chunks are more interesting for fingerprinting documents. The document weight can be calculated as follows:
Where N is the number of documents the chunk was checked against for calculating the document frequency.
- Using the values we have obtained so far we can calculate the Chunk Score. The chunk score is the final relevance of a chunk. Its calculated as follows :
Where x is the document number (that is 1 and 2 in D1 and D2 respectively)
- All the chunk scores are collectively used to represent the Digest of the document. The digest of the document is the chunk scores in vector form.
First you have to index the files you want to compare to:
fbhash index --state state.json --database=database.json <The directories and/or files to index>
Then you can query the eight files that are closest by using:
fbhash query -n 8 database.json state.json <The files you want to have compared>
Obviously, you can change the number of documents returned with adjusting the 8
to any other number.
Also, to get help on all available options:
fbhash help
[1] FbHash: A New Similarity Hashing Scheme for Digital Forensics, Donghoon Chang, Mohona Ghosh, Somitra Kumar Sanadhya, Monika, Singh, and Douglas R. White.