FBHash : A New Similarity Hashing Scheme for Digital Forensics

[D. Chang et al. / Digital Investigation 29 (2019) S113–S123]

This is an implementation of the paper mentioned above. The core idea behind the FbHash scheme is to identify frequently occurring chunks in a document and use them to measure the similarity of that document to another. In other words, it fingerprints a document to identify interesting pieces of information that will aid in digital forensic investigations.

The whole process occurs as follows.

  • Take two documents of length N. For ease of demonstration, I've added two simple data objects within the source code itself. Each of them is 52 characters long and includes only uppercase and lowercase ASCII characters.
  • For each document, do the following.
  • First, find the chunks in the document. For that we slice the document into numerous pieces of length 7. Each new chunk is generated by removing the first character of the preceding chunk and appending the next character from the document. For example, if our document were ABCDEFGHIJKLM..., the first chunk would be ABCDEFG, the second BCDEFGH, and so on.
  • The chunks are then passed to the Rabin-Karp rolling hash function, where a hash of each chunk is calculated using the following equation (a short code sketch of this chunking and hashing step appears after the parameter discussion below):

H = (C1*a^(k-1) + C2*a^(k-2) + C3*a^(k-3) + ... + Ck*a^0) mod n

Where "a" is a constant, k is window size (chunk length), n is a large prime number,

C1,C2,..,Ck are the input characters (ASCII Values)

The hash of each subsequent chunk is then computed from the previous one by dropping the outgoing byte and adding the incoming byte:

Hnew = (a*H - C0*a^k + Cin) mod n

EQN: RollingHash(Ch(i+1)) = (a*RollingHash(Ch(i)) - Ci*a^k + C(i+k)) mod n

The maximum value of H should fit in an unsigned 64-bit integer.

Number of possible values of Ci (the input characters) = 256 (extended ASCII)

Value of the constant "a" = 256 (the size of the alphabet)

The value of k should satisfy the following:

2^64 - 1 >= 256 * 256^(k-1)

Let k = 7:

2^64 > 2^8 * (2^8)^6 = 2^56. And n = 801385653117583579, a prime number larger than 2^56 = 72057594037927936. This large modulus is needed to make hash collisions very unlikely while generating hashes.
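To make the chunking and hashing step concrete, here is a minimal Python sketch (the function and variable names are illustrative, not taken from the actual source file). It hashes the first chunk directly and then rolls the window forward with the update equation above, using a = 256, k = 7 and n = 801385653117583579:

    A = 256                      # constant "a": size of the alphabet
    K = 7                        # window size / chunk length k
    N_PRIME = 801385653117583579 # large prime modulus n

    def rolling_hashes(document, a=A, k=K, n=N_PRIME):
        """Return the Rabin-Karp rolling hash of every k-length chunk of `document`."""
        if len(document) < k:
            return []

        # Hash of the first chunk: H = C1*a^(k-1) + ... + Ck*a^0 (mod n)
        h = 0
        for ch in document[:k]:
            h = (h * a + ord(ch)) % n
        hashes = [h]

        a_k = pow(a, k, n)  # a^k mod n, reused by every update
        # Roll the window: Hnew = (a*H - C_out*a^k + C_in) mod n
        for i in range(k, len(document)):
            h = (h * a - ord(document[i - k]) * a_k + ord(document[i])) % n
            hashes.append(h)
        return hashes

    # Example: the chunks of "ABCDEFGHIJ" are ABCDEFG, BCDEFGH, CDEFGHI, DEFGHIJ
    print(rolling_hashes("ABCDEFGHIJ"))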

  • Using this list of rolling hashes, filter out the unique chunks and find each chunk's frequency. By chunk frequency we mean the number of times a chunk repeats itself within the same document. This functions as a hash table whose indexes are the rolling hashes and whose values are the frequencies of the corresponding chunks.
  • Once the chunk frequency is calculated, it is used to compute the chunk weight: a weight is assigned to each chunk based on its frequency, using the following equation (a short code sketch follows below).

Chunk weight of the ith chunk = 1 + log10(ChfDi)

Where ChfDi is the chunk frequency of that ith chunk within the document
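A short sketch of this step, continuing the illustrative code above: the chunk-frequency table is just a dictionary keyed by rolling hash, and each weight follows the 1 + log10 formula.

    from collections import Counter
    from math import log10

    def chunk_frequencies(hashes):
        """Map each rolling hash to how often its chunk occurs within the document."""
        return dict(Counter(hashes))

    def chunk_weights(frequencies):
        """Chunk weight = 1 + log10(chunk frequency) for every chunk hash."""
        return {h: 1 + log10(freq) for h, freq in frequencies.items()}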

  • The value of a chunk is also determined by its frequency in other documents: the more frequently a chunk occurs across documents, the less valuable it becomes. To determine the value of a chunk, its rolling hash is checked against the rolling hashes of the other documents in the database, and its document frequency is calculated just like the chunk frequency. Document frequency is represented as dfCh.
  • Using the document frequency, the document weight is calculated. The document weight measures the informativeness of a chunk: as mentioned above, if the document frequency of a chunk is high, then it is less valuable, and the less frequent chunks are more interesting for fingerprinting documents. The document weight is calculated as follows, with a short code sketch after the equation.

idfChi = log10(N/dfChi)

Where N is the number of documents the chunk was checked against when calculating the document frequency. For ease of demonstration, we generate the N documents within the code itself using the randStr() function at the top of the source file. In real life these documents would be fetched from a database.
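A sketch of the document-frequency and document-weight step, assuming the "database" is simply a list of per-document rolling-hash sets held in memory (which is what the randStr()-generated corpus amounts to):

    from math import log10

    def document_frequencies(chunk_hashes, corpus):
        """df_Ch: for each chunk hash, count how many documents in `corpus` contain it.
        `corpus` is a list of sets, one set of rolling hashes per document."""
        return {h: sum(1 for doc_hashes in corpus if h in doc_hashes) for h in chunk_hashes}

    def document_weights(doc_freqs, n_documents):
        """idf_Ch = log10(N / df_Ch); rarer chunks are more informative."""
        return {h: log10(n_documents / df) for h, df in doc_freqs.items() if df > 0}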

NB: In our case we generate 1000 data objects using this function and use them to calculate the document frequency. This has a few problems. The results may vary between runs for the same input document. A weird-looking edge case also occurs because 3-4 extra documents were manually added to the set for testing purposes, which causes a minor difference in the final similarity percentage. In one of my test runs, D1 and D2 were identical, yet because of the above-mentioned extra documents the similarity score was 100.000000003%. This can be fixed by removing the extra datasets; I've left them in for ease of testing.

  • Using the values we have obtained so far, we can calculate the chunk score. The chunk score is the final relevance of a chunk. It's calculated as follows:

Chunk Score WDxCh = Chwgt * idfCh

Where x is the document number (that is, 1 or 2 for D1 and D2 respectively)

  • All the chunk scores collectively represent the digest of the document. The digest of the document is the chunk scores in vector form (a short code sketch follows the example below).

digest(D1) = (WD1C0, WD1C1, ..., WD1Ch-1)
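Continuing the sketch above (names again illustrative), the digest is simply a mapping from chunk hash to chunk score, i.e. chunk weight times document weight:

    def digest(chunk_wgts, doc_wgts):
        """Digest of a document: chunk score W = Chwgt * idfCh for every chunk hash."""
        return {h: chunk_wgts[h] * doc_wgts.get(h, 0.0) for h in chunk_wgts}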

  • All these calculations were carried out to find the similarity between the documents D1 and D2, which is reported on a scale from 0 to 100 percent (a sketch of this final step follows below).
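Assuming the final comparison is the paper's cosine similarity of the two digest vectors, scaled to a percentage, it can be sketched as follows (aligning the digests on the union of their chunk hashes):

    from math import sqrt

    def similarity(digest1, digest2):
        """Cosine similarity of two digests, scaled to a 0-100 percentage."""
        keys = set(digest1) | set(digest2)
        dot = sum(digest1.get(h, 0.0) * digest2.get(h, 0.0) for h in keys)
        norm1 = sqrt(sum(v * v for v in digest1.values()))
        norm2 = sqrt(sum(v * v for v in digest2.values()))
        if norm1 == 0 or norm2 == 0:
            return 0.0
        return 100.0 * dot / (norm1 * norm2)

Called on the digests of D1 and D2, this yields the similarity percentage reported at the end of a run.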

Assumptions, Limitations, ToDos and Lazy Implementations

  • There are a few assumptions and lazy implementations that I've made while implementing the algorithm.
  • The lengths of D1 and D2 are expected to be the same; if they are not, the code will throw an index-out-of-range error.
  • The rolling hash function doesn't use the update equation above to derive each chunk's hash from the preceding chunk. Instead, the same equation used to calculate the rolling hash of the first chunk is reused to generate the hash of every other chunk as well. I found it easier and more convenient to do so instead of implementing the second equation. This does come with a drawback, though: generating the hashes becomes significantly more expensive as the document grows.
  • The calculations are done in memory and are not stored anywhere, so the same computations have to be repeated every single time. This has another disadvantage as well: it will not be feasible to hold the hashes, hash tables and data objects in physical memory as the document size grows. The simplest solution would be to store these on disk for later reads.
  • The documents that are randomly generated at the top of the source file will cause a minor difference in the end results on every run.
  • The extra data objects appended to the document list generated by the random function will cause a minor difference in the final results as well.
  • Printing the processed data. I did include print statements that just print out the lists and dictionaries, but it is not the prettiest thing to see. Printing them as tables, or even as graphs using modules like matplotlib, would solve that issue.
  • No export function. There is no export function, so the actual results can't be used for anything useful. Exporting to JSON format would be a good thing to implement.
  • No control over verbosity of the output. There are no options in the code to control the verbosity of the output of the functions being called. Implementing this would help in understanding what is happening under the hood.
  • No command line arguments. All the data being used is hardcoded in the source code itself, which is an ugly thing to do. I will have to clean that up in the future.
