FBHash : A New Similarity Hashing Scheme for Digital Forensics

[D. Chang et al. / Digital Investigation 29 (2019) S113–S123]

This is an implementation of the paper mentioned above. The core idea behind the FbHash scheme is to identify frequently occurring chunks in a document and use them to measure the similarity of that document to another. In other words, it fingerprints a document to identify interesting pieces of information that will aid in digital forensic investigations.

The whole process occurs as follows.

  • Take two documents of length N. For ease of demonstration, I've added two simple data objects within the source code itself. Each of them is 52 characters long and includes only uppercase and lowercase ASCII characters.
  • For each document, do the following.
  • First, find the chunks in the document. For that we slice the document into numerous pieces of length 7. Each new chunk is generated by removing the first character of the preceding chunk and appending the next character from the document. For example, if our document were ABCDEFGHIJKLM..., the first chunk would be ABCDEFG, the second BCDEFGH, and so on.
  • The chunks are then passed to the Rabin-Karp rolling hash function, where a hash of each chunk is calculated using the following equation (a short code sketch of this chunking and hashing step appears after the parameter discussion below):

H = (C1*a^(k-1) + C2*a^(k-2) + C3*a^(k-3) + ... + Ck*a^0) mod n

Where "a" is a constant, k is window size (chunk length), n is a large prime number,

C1,C2,..,Ck are the input characters (ASCII Values)

The hash of each subsequent chunk is then computed from the previous one by dropping the outgoing byte and adding the incoming byte:

Hnew = (a*H - C0*a^k + Cin) mod n

EQN: RollingHash(Ch(i+1)) = (a*RollingHash(Ch(i)) - Ci*a^k + C(i+k)) mod n

The maximum value of H should fit in an unsigned 64-bit integer.

Number of possible values of Ci (the input characters) = 256 (extended ASCII)

Value of the constant "a" = 256 (the size of the alphabet)

The value of k should satisfy the following:

2^64 - 1 >= 256 * 256^(k-1)

Let k = 7:

2^64 > 2^8 * (2^8)^6 = 2^56. And n = 801385653117583579, a prime number larger than 2^56 = 72057594037927936. This large modulus is needed to make hash collisions very unlikely while generating hashes.
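To make the chunking and hashing step concrete, here is a minimal Python sketch (the function and variable names are illustrative, not taken from the actual source file). It hashes the first chunk directly and then rolls the window forward with the update equation above, using a = 256, k = 7 and n = 801385653117583579:

    A = 256                      # constant "a": size of the alphabet
    K = 7                        # window size / chunk length k
    N_PRIME = 801385653117583579 # large prime modulus n

    def rolling_hashes(document, a=A, k=K, n=N_PRIME):
        """Return the Rabin-Karp rolling hash of every k-length chunk of `document`."""
        if len(document) < k:
            return []

        # Hash of the first chunk: H = C1*a^(k-1) + ... + Ck*a^0 (mod n)
        h = 0
        for ch in document[:k]:
            h = (h * a + ord(ch)) % n
        hashes = [h]

        a_k = pow(a, k, n)  # a^k mod n, reused by every update
        # Roll the window: Hnew = (a*H - C_out*a^k + C_in) mod n
        for i in range(k, len(document)):
            h = (h * a - ord(document[i - k]) * a_k + ord(document[i])) % n
            hashes.append(h)
        return hashes

    # Example: the chunks of "ABCDEFGHIJ" are ABCDEFG, BCDEFGH, CDEFGHI, DEFGHIJ
    print(rolling_hashes("ABCDEFGHIJ"))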

  • Using this list of rolling hashes, filter out the unique chunks and find each chunk's frequency. By chunk frequency we mean the number of times a chunk repeats itself within the same document. This functions as a hash table whose indexes are the rolling hashes and whose values are the frequencies of the corresponding chunks.
  • Once the chunk frequency is calculated, it is used to compute the chunk weight: a weight is assigned to each chunk based on its frequency, using the following equation (a short code sketch follows below).

Chunk weight of the ith chunk = 1 + log10(ChfDi)

Where ChfDi is the chunk frequency of that ith chunk within the document
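A short sketch of this step, continuing the illustrative code above: the chunk-frequency table is just a dictionary keyed by rolling hash, and each weight follows the 1 + log10 formula.

    from collections import Counter
    from math import log10

    def chunk_frequencies(hashes):
        """Map each rolling hash to how often its chunk occurs within the document."""
        return dict(Counter(hashes))

    def chunk_weights(frequencies):
        """Chunk weight = 1 + log10(chunk frequency) for every chunk hash."""
        return {h: 1 + log10(freq) for h, freq in frequencies.items()}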

  • The value of a chunk is also determined by its frequency in other documents: the more frequently a chunk occurs across documents, the less valuable it becomes. To determine the value of a chunk, its rolling hash is checked against the rolling hashes of the other documents in the database, and its document frequency is calculated just like the chunk frequency. Document frequency is represented as dfCh.
  • Using the document frequency, the document weight is calculated. The document weight measures the informativeness of a chunk: as mentioned above, if the document frequency of a chunk is high, then it is less valuable, and the less frequent chunks are more interesting for fingerprinting documents. The document weight is calculated as follows, with a short code sketch after the equation.

idfChi = log10(N/dfChi)

Where N is the number of documents the chunk was checked against when calculating the document frequency. For ease of demonstration, we generate the N documents within the code itself using the randStr() function at the top of the source file. In real life these documents would be fetched from a database.
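A sketch of the document-frequency and document-weight step, assuming the "database" is simply a list of per-document rolling-hash sets held in memory (which is what the randStr()-generated corpus amounts to):

    from math import log10

    def document_frequencies(chunk_hashes, corpus):
        """df_Ch: for each chunk hash, count how many documents in `corpus` contain it.
        `corpus` is a list of sets, one set of rolling hashes per document."""
        return {h: sum(1 for doc_hashes in corpus if h in doc_hashes) for h in chunk_hashes}

    def document_weights(doc_freqs, n_documents):
        """idf_Ch = log10(N / df_Ch); rarer chunks are more informative."""
        return {h: log10(n_documents / df) for h, df in doc_freqs.items() if df > 0}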

NB: In our case we generate 1000 data objects using this function and use them to calculate the document frequency. This has a few problems. The results may vary between runs for the same input document. A weird-looking edge case also occurs because 3-4 extra documents were manually added to the set for testing purposes, which causes a minor difference in the final similarity percentage. In one of my test runs, D1 and D2 were identical, yet because of the above-mentioned extra documents the similarity score was 100.000000003%. This can be fixed by removing the extra datasets; I've left them in for ease of testing.

  • Using the values we have obtained so far, we can calculate the chunk score. The chunk score is the final relevance of a chunk. It's calculated as follows:

Chunk Score WDxCh = Chwgt * idfCh

Where x is the document number (that is, 1 or 2 for D1 and D2 respectively)

  • All the chunk scores collectively represent the digest of the document. The digest of the document is the chunk scores in vector form (a short code sketch follows the example below).

digest(D1) = (WD1C0, WD1C1, ..., WD1Ch-1)
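Continuing the sketch above (names again illustrative), the digest is simply a mapping from chunk hash to chunk score, i.e. chunk weight times document weight:

    def digest(chunk_wgts, doc_wgts):
        """Digest of a document: chunk score W = Chwgt * idfCh for every chunk hash."""
        return {h: chunk_wgts[h] * doc_wgts.get(h, 0.0) for h in chunk_wgts}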

  • All these calculations were carried out to find the similarity between the documents D1 and D2, which is reported on a scale from 0 to 100 percent (a sketch of this final step follows below).
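Assuming the final comparison is the paper's cosine similarity of the two digest vectors, scaled to a percentage, it can be sketched as follows (aligning the digests on the union of their chunk hashes):

    from math import sqrt

    def similarity(digest1, digest2):
        """Cosine similarity of two digests, scaled to a 0-100 percentage."""
        keys = set(digest1) | set(digest2)
        dot = sum(digest1.get(h, 0.0) * digest2.get(h, 0.0) for h in keys)
        norm1 = sqrt(sum(v * v for v in digest1.values()))
        norm2 = sqrt(sum(v * v for v in digest2.values()))
        if norm1 == 0 or norm2 == 0:
            return 0.0
        return 100.0 * dot / (norm1 * norm2)

Called on the digests of D1 and D2, this yields the similarity percentage reported at the end of a run.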

Assumptions, Limitations, ToDos and Lazy Implementations

  • There are a few assumptions and lazy implementations that I've made while implementing the algorithm.
  • The lengths of D1 and D2 are expected to be the same; if they are not, the code will throw an index-out-of-range error.
  • The rolling hash function doesn't use the update equation above to derive each chunk's hash from the preceding chunk. Instead, the same equation used to calculate the rolling hash of the first chunk is reused to generate the hash of every other chunk as well. I found it easier and more convenient to do so instead of implementing the second equation. This does come with a drawback, though: generating the hashes becomes significantly more expensive as the document grows.
  • The calculations are done in memory and are not stored anywhere, so the same computations have to be repeated every single time. This has another disadvantage as well: it will not be feasible to hold the hashes, hash tables and data objects in physical memory as the document size grows. The simplest solution would be to store these on disk for later reads.
  • The documents that are randomly generated at the top of the source file will cause a minor difference in the end results on every run.
  • The extra data objects appended to the document list generated by the random function will cause a minor difference in the final results as well.
  • Printing the processed data. I did include print statements that just print out the lists and dictionaries, but it is not the prettiest thing to see. Printing them as tables, or even as graphs using modules like matplotlib, would solve that issue.
  • No export function. There is no export function, so the actual results can't be used for anything useful. Exporting to JSON format would be a good thing to implement.
  • No control over verbosity of the output. There are no options in the code to control the verbosity of the output of the functions being called. Implementing this would help in understanding what is happening under the hood.
  • No command line arguments. All the data being used is hardcoded in the source code itself, which is an ugly thing to do. I will have to clean that up in the future.
