SamerDiab/java-information-retrieval-project

java-information-retrieval-project

This project demonstrates an information retrieval tool.

Textbank is the folder containing the MEDLINE files.
These files are passed through the following steps:

  1. A stopword remover based on a stoplist file. It generates a .stp file.

  2. A stemmer using the Porter algorithm. It generates a .stem file.

  3. A tf-idf generation algorithm that stores the values in an inverted file.
     The inverted file has a custom JSON data structure:

         {
             "word": {
                 "1 (document frequency)": {
                     "doc1.stp": 0.01 (tf-idf value)
                 }
             }
         }

  4. A cosine value generation and document retrieval tool. These are the top three results for the following query on the MEDLINE collection:

         the crystallin len vertebr includ human

         Document Name : doc72.stp and cosine value : 0.357
         Document Name : doc500.stp and cosine value : 0.284
         Document Name : doc965.stp and cosine value : 0.265

  5. A precision and recall calculation tool. These are the graphs showing the values of the precision and recall for the first query results of the MEDLINE collection.

Precision and Recall

[Figure: PrecisionAndRecall]

Average precision per recall

[Figure: AvgPrecisionAndRecall]
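As a rough illustration of how precision and recall values like the ones plotted above can be computed, here is a minimal Java sketch. It is not taken from the project's source. The class name `PrecisionRecall`, the method `curve`, and the relevance judgments in `main` are assumptions made for the example; only the three document names come from the sample results above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PrecisionRecall {

    /**
     * Walks down a ranked result list and returns one {precision, recall}
     * pair per rank. Precision is hits / documents seen so far, recall is
     * hits / total number of relevant documents.
     */
    static List<double[]> curve(List<String> ranked, Set<String> relevant) {
        List<double[]> points = new ArrayList<>();
        int hits = 0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
            }
            double precision = (double) hits / (i + 1);
            double recall = relevant.isEmpty() ? 0.0 : (double) hits / relevant.size();
            points.add(new double[] {precision, recall});
        }
        return points;
    }

    public static void main(String[] args) {
        // Ranked output for one query (names taken from the sample results).
        List<String> ranked = Arrays.asList("doc72.stp", "doc500.stp", "doc965.stp");
        // Hypothetical relevance judgments, invented for this example only.
        Set<String> relevant = new HashSet<>(Arrays.asList("doc72.stp", "doc965.stp", "doc10.stp"));

        List<double[]> points = curve(ranked, relevant);
        for (int i = 0; i < points.size(); i++) {
            System.out.printf("rank %d: precision=%.3f recall=%.3f%n",
                    i + 1, points.get(i)[0], points.get(i)[1]);
        }
    }
}
```

Plotting the pairs in rank order gives a precision/recall curve for a single query. Averaging precision at fixed recall levels across several queries would give a graph of average precision per recall.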


The code performs well on older machines, averaging about 28 seconds for the entire process. The tool is still a work in progress and will be optimized further.

With the custom JSON data structure, this project could be converted into a Firebase project to improve performance.
The code uses several programming concepts, such as multithreading, concurrent skip list maps, hash maps, hash tables and iterators.
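To make the inverted file and the concurrent data structures more concrete, here is a small Java sketch of one way to build a tf-idf index in a `ConcurrentSkipListMap` and rank documents by cosine value. It is not the project's actual code. The class `InvertedFileSketch`, its methods, the weighting (tf = count / document length, idf = ln(N / df)), the binary query vector, and the toy documents are all assumptions for illustration. The map layout is also simplified: each word maps directly to its documents, and the document frequency is the size of that inner map.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

class InvertedFileSketch {

    /** Builds word -> (document name -> tf-idf weight) from already stemmed documents. */
    static Map<String, Map<String, Double>> build(Map<String, List<String>> docs) {
        Map<String, Map<String, Double>> index = new ConcurrentSkipListMap<>();
        // First pass, in parallel: accumulate tf = count / document length.
        docs.entrySet().parallelStream().forEach(doc -> {
            double share = 1.0 / doc.getValue().size();
            for (String term : doc.getValue()) {
                index.computeIfAbsent(term, t -> new ConcurrentHashMap<>())
                     .merge(doc.getKey(), share, Double::sum);
            }
        });
        // Second pass: scale every tf by idf = ln(N / df) (assumed weighting).
        int n = docs.size();
        for (Map<String, Double> postings : index.values()) {
            double idf = Math.log((double) n / postings.size());
            postings.replaceAll((doc, tf) -> tf * idf);
        }
        return index;
    }

    /** Cosine value between a binary query vector and every document sharing a term with it. */
    static Map<String, Double> cosine(Map<String, Map<String, Double>> index, List<String> query) {
        // Squared length of each document vector, summed over all of its terms.
        Map<String, Double> docNormSq = new HashMap<>();
        for (Map<String, Double> postings : index.values()) {
            postings.forEach((doc, w) -> docNormSq.merge(doc, w * w, Double::sum));
        }
        // Dot product: each distinct query term has weight 1 (assumption).
        Set<String> terms = new HashSet<>(query);
        Map<String, Double> dot = new HashMap<>();
        for (String term : terms) {
            Map<String, Double> postings =
                    index.getOrDefault(term, Collections.<String, Double>emptyMap());
            postings.forEach((doc, w) -> dot.merge(doc, w, Double::sum));
        }
        double queryNorm = Math.sqrt(terms.size());
        Map<String, Double> scores = new HashMap<>();
        dot.forEach((doc, d) -> {
            double norm = Math.sqrt(docNormSq.get(doc)) * queryNorm;
            if (norm > 0) {
                scores.put(doc, d / norm);
            }
        });
        return scores;
    }

    public static void main(String[] args) {
        // Toy documents invented for this example (not from the MEDLINE collection).
        Map<String, List<String>> docs = new HashMap<>();
        docs.put("docA.stp", Arrays.asList("crystallin", "len", "human"));
        docs.put("docB.stp", Arrays.asList("len", "vertebr", "cell"));
        docs.put("docC.stp", Arrays.asList("heart", "cell", "human"));

        Map<String, Map<String, Double>> index = build(docs);
        List<String> query = Arrays.asList("the", "crystallin", "len", "vertebr", "includ", "human");
        cosine(index, query).forEach((doc, score) ->
                System.out.printf("Document Name : %s and cosine value : %.3f%n", doc, score));
    }
}
```

A skip list map keeps the words sorted, which makes it straightforward to write the index out in a stable order, and it tolerates concurrent writers, so documents can be processed on several threads without explicit locking.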
