This project is a demonstration of an information retrieval tool.
Textbank is the folder containing the MEDLINE files.
These files are passed through the following steps:
1. A stopword remover based on a stoplist file. It generates a .stp file.
2. A stemmer using the Porter algorithm. It generates a .stem file.
3. A tf-idf generation algorithm that stores the values in an inverted file.
The inverted file has a custom JSON data structure:
{
  "word": {
    "1 (document frequency)": {
      "doc1.stp": 0.01 (tf-idf value)
    }
  }
}
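As a rough illustration of how such an inverted file could be filled, here is a minimal sketch. It is not the project's actual code: the class name `TfIdfSketch`, the method `buildIndex`, and the weighting (term frequency as a share of the document length, idf as the natural log of N/df) are all assumptions. The nested map is also simplified to word -> document -> weight; the document frequency is simply the size of each inner map.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TfIdfSketch {
    // Builds word -> (document name -> tf-idf weight).
    // Assumed weighting: tf = count / document length, idf = ln(N / df).
    static Map<String, Map<String, Double>> buildIndex(Map<String, List<String>> docs) {
        // First pass: raw occurrence counts per word and document.
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            for (String term : doc.getValue()) {
                counts.computeIfAbsent(term, t -> new TreeMap<>())
                      .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        // Second pass: turn counts into tf-idf weights.
        int totalDocs = docs.size();
        Map<String, Map<String, Double>> index = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> entry : counts.entrySet()) {
            int df = entry.getValue().size(); // documents containing the word
            double idf = Math.log((double) totalDocs / df);
            Map<String, Double> postings = new TreeMap<>();
            for (Map.Entry<String, Integer> p : entry.getValue().entrySet()) {
                double tf = (double) p.getValue() / docs.get(p.getKey()).size();
                postings.put(p.getKey(), tf * idf);
            }
            index.put(entry.getKey(), postings);
        }
        return index;
    }
}
```

With this weighting, a word that occurs in every document gets a weight of zero, so it does not influence the ranking.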
4. A cosine similarity calculation and document retrieval tool. These are the top three results for the following query on the MEDLINE collection:
the
crystallin
len
vertebr
includ
human
Document Name : doc72.stp and cosine value : 0.357
Document Name : doc500.stp and cosine value : 0.284
Document Name : doc965.stp and cosine value : 0.265
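The ranking above can be pictured with a small sketch of cosine similarity over sparse term-weight vectors. This is an illustration under assumptions, not the project's implementation: `CosineSketch`, `cosine`, and `topK` are hypothetical names, and both the query and each document are assumed to be available as word -> weight maps.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CosineSketch {
    // Cosine of the angle between two sparse vectors (word -> weight).
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        // An all-zero vector has no direction; treat its similarity as 0.
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    // Names of the k documents most similar to the query, best first.
    static List<String> topK(Map<String, Double> query,
                             Map<String, Map<String, Double>> docs, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Map<String, Double>> d : docs.entrySet()) {
            scores.put(d.getKey(), cosine(query, d.getValue()));
        }
        List<String> names = new ArrayList<>(scores.keySet());
        names.sort((x, y) -> Double.compare(scores.get(y), scores.get(x)));
        return names.subList(0, Math.min(k, names.size()));
    }
}
```

Calling `topK(query, docs, 3)` would return the three best-matching document names, which could then be printed together with their cosine values.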
5. A precision and recall calculation tool. These are the graphs showing precision and recall for the results of the first query on the MEDLINE collection.
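For reference, precision and recall at a given rank could be computed as in the sketch below. The names `PrecisionRecallSketch`, `precisionAt`, and `recallAt` are hypothetical, and the sketch assumes a ranked list of distinct document names plus a set of documents judged relevant for the query; how the project itself loads the relevance judgments is not shown here.

```java
import java.util.List;
import java.util.Set;

class PrecisionRecallSketch {
    // Number of relevant documents among the first k retrieved ones (k >= 0).
    static int hitsAt(List<String> ranked, Set<String> relevant, int k) {
        int hits = 0;
        for (String doc : ranked.subList(0, Math.min(k, ranked.size()))) {
            if (relevant.contains(doc)) {
                hits++;
            }
        }
        return hits;
    }

    // Precision: share of the retrieved documents that are relevant.
    static double precisionAt(List<String> ranked, Set<String> relevant, int k) {
        int cutoff = Math.min(k, ranked.size());
        return cutoff == 0 ? 0.0 : (double) hitsAt(ranked, relevant, k) / cutoff;
    }

    // Recall: share of the relevant documents that were retrieved.
    static double recallAt(List<String> ranked, Set<String> relevant, int k) {
        return relevant.isEmpty() ? 0.0 : (double) hitsAt(ranked, relevant, k) / relevant.size();
    }
}
```

Evaluating both functions for k = 1, 2, 3, and so on gives the points needed to plot a precision and recall graph for one query.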
The code performs well on older machines, averaging about 28 seconds for the entire process. The tool is still a work in progress and will be optimized.
Because the inverted file uses a custom JSON data structure, this project could be converted into a Firebase project to improve performance.
The code uses several programming concepts, such as multithreading, concurrent skip list maps, hash maps, hash tables, and iterators.
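To show how multithreading and a concurrent skip list map might work together, here is a small sketch that counts document frequencies in parallel. It is only an assumption about how these pieces could be combined, not a description of the project's code; `ConcurrentDfSketch` and `documentFrequencies` are hypothetical names.

```java
import java.util.HashSet;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ConcurrentDfSketch {
    // Counts, for each word, how many documents contain it.
    // Every document is handled by its own task on a fixed-size thread pool.
    static ConcurrentSkipListMap<String, Integer> documentFrequencies(
            List<List<String>> docs, int threads) {
        ConcurrentSkipListMap<String, Integer> df = new ConcurrentSkipListMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (List<String> doc : docs) {
            pool.execute(() -> {
                // A set ensures each word is counted once per document.
                for (String term : new HashSet<>(doc)) {
                    df.merge(term, 1, Integer::sum); // thread-safe update
                }
            });
        }
        pool.shutdown();
        try {
            // Wait for all submitted tasks to finish before reading the map.
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("indexing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
        return df;
    }
}
```

The skip list map keeps its keys sorted, so iterating over the result visits the words in alphabetical order without a separate sorting step.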