Seun Suberu
Info: Data Structures final project at Southern Methodist University, remade in December 2022 as a remodel of the original project. This is a COVID document search engine written in C++ and built with CMake. It uses a self-implemented AVLTree to store stemmed words and a self-implemented HashTable to store authors. Documents are indexed into these data structures and ranked by the term frequency/inverse document frequency (TF-IDF) metric. A command-line user interface is provided.
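The exact TF-IDF weighting used by the project is not stated here, so the following is only a minimal sketch of the common textbook form (term frequency normalized by document length, multiplied by the natural log of the inverse document frequency). The function name tfidf and its parameters are hypothetical and do not come from the project source.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper (not from the project): score one term in one document.
//   termCount    - occurrences of the term in the document
//   docLength    - total number of terms in the document
//   docsWithTerm - number of indexed documents that contain the term
//   totalDocs    - number of indexed documents
double tfidf(std::size_t termCount, std::size_t docLength,
             std::size_t docsWithTerm, std::size_t totalDocs) {
    // Guard against division by zero; such inputs carry no ranking signal.
    if (docLength == 0 || docsWithTerm == 0 || totalDocs == 0) return 0.0;

    const double tf = static_cast<double>(termCount) /
                      static_cast<double>(docLength);
    const double idf = std::log(static_cast<double>(totalDocs) /
                                static_cast<double>(docsWithTerm));
    return tf * idf;
}
```

Under this weighting, a term that appears in every document scores zero, while a term that is frequent in one document but rare across the collection scores highest.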
Set up:
Link to Dataset for covid information files
The program argument should be the path to the directory containing the files to index.
For example:
$ mkdir build
$ cd build
$ cmake ..
$ cmake --build .
$ ./SearchEngine {directory}
From there the program should run as expected. It loads and indexes all the files, after which it is ready for queries.
The keywords recognized by this Search Engine are "AUTHOR", "AND", "OR", and "NOT".
The only keywords that can appear at the beginning of a query are AUTHOR and NOT. The others must have a search word both before and after them.
AUTHOR: Returns all documents by a particular author. When AUTHOR is preceded by a word, the results are that author's documents that contain the word.
AND: Returns documents that contain both the word before and the word after the keyword.
OR: Returns documents that contain either the word before or the word after the keyword.
NOT: Returns all documents that do not contain the given word. It can be combined with AND or OR, but must be the last keyword in the query and cannot be the entire query.
Examples (THESE QUERIES ARE NOT GUARANTEED TO RETURN RESULTS):
AUTHOR Suess
covid AND chicken
coronavirus OR pizza
pizza NOT covid
pizza AUTHOR grant
pizza
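As an illustration of how queries like these could be broken down, here is a minimal sketch that tags each search word with the keyword preceding it. The names Op, Clause, and parseQuery are hypothetical, and the sketch assumes keywords are typed in upper case as in the examples. It does not validate keyword placement, lowercase, or stem the words, all of which a real implementation would likely need.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical types for illustration; not taken from the project source.
enum class Op { None, And, Or, Not, Author };

struct Clause {
    Op op;             // keyword that came directly before the word
    std::string word;  // the search word or author name
};

// Split a query on whitespace and attach each word to the keyword before it.
// A word with no preceding keyword is tagged Op::None.
std::vector<Clause> parseQuery(const std::string& query) {
    std::vector<Clause> clauses;
    std::istringstream in(query);
    std::string token;
    Op pending = Op::None;

    while (in >> token) {
        if (token == "AND")         pending = Op::And;
        else if (token == "OR")     pending = Op::Or;
        else if (token == "NOT")    pending = Op::Not;
        else if (token == "AUTHOR") pending = Op::Author;
        else {
            clauses.push_back({pending, token});
            pending = Op::None;  // each keyword applies to one word only
        }
    }
    return clauses;
}
```

With this sketch, "pizza AUTHOR grant" yields two clauses: the plain word "pizza" and the author name "grant". The search step would then combine the matching document sets according to each clause's tag.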
Co-Written and Co-Implemented by Seun Suberu
Used for storing and retrieving author information.
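A rough sketch of how an author lookup table could work, using separate chaining. The class name AuthorTable and its methods are hypothetical. For brevity it relies on std::hash and standard containers, whereas the project describes its hash table as self-implemented, and it never resizes.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: maps an author name to the ids of their documents.
class AuthorTable {
public:
    explicit AuthorTable(std::size_t bucketCount = 101)
        : buckets_(bucketCount == 0 ? 1 : bucketCount) {}

    // Record that `author` wrote the document `docId`.
    void add(const std::string& author, const std::string& docId) {
        auto& bucket = buckets_[indexFor(author)];
        for (auto& entry : bucket) {
            if (entry.first == author) {
                entry.second.push_back(docId);
                return;
            }
        }
        bucket.emplace_back(author, std::vector<std::string>{docId});
    }

    // Return the author's document ids, or nullptr if the author is unknown.
    const std::vector<std::string>* lookup(const std::string& author) const {
        const auto& bucket = buckets_[indexFor(author)];
        for (const auto& entry : bucket) {
            if (entry.first == author) return &entry.second;
        }
        return nullptr;
    }

private:
    using Entry = std::pair<std::string, std::vector<std::string>>;

    std::size_t indexFor(const std::string& key) const {
        return std::hash<std::string>{}(key) % buckets_.size();
    }

    std::vector<std::list<Entry>> buckets_;  // one chain per bucket
};
```

Colliding authors share a bucket and are told apart by comparing names, so lookups stay correct even with a small bucket count.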
Written and Implemented by Seun Suberu
Used for storing and retrieving stop word values.
Written and Implemented by Seun Suberu
Used for storing and retrieving stemmed words with their associated document identifiers for indexing.
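To show how a self-balancing tree can hold stemmed words with their document identifiers, here is a compact AVL insertion sketch. All names (Node, avlInsert, avlFind, and so on) are hypothetical, std::unique_ptr is used for brevity (requires C++14 for std::make_unique), and the project's own AVLTree may be organized quite differently.

```cpp
#include <algorithm>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node layout: one stemmed word plus the ids of documents containing it.
struct Node {
    std::string word;
    std::vector<std::string> docIds;
    int height = 1;
    std::unique_ptr<Node> left, right;
    explicit Node(std::string w) : word(std::move(w)) {}
};
using NodePtr = std::unique_ptr<Node>;

int heightOf(const NodePtr& n) { return n ? n->height : 0; }

void updateHeight(Node& n) {
    n.height = 1 + std::max(heightOf(n.left), heightOf(n.right));
}

NodePtr rotateRight(NodePtr y) {
    NodePtr x = std::move(y->left);
    y->left = std::move(x->right);
    updateHeight(*y);
    x->right = std::move(y);
    updateHeight(*x);
    return x;
}

NodePtr rotateLeft(NodePtr x) {
    NodePtr y = std::move(x->right);
    x->right = std::move(y->left);
    updateHeight(*x);
    y->left = std::move(x);
    updateHeight(*y);
    return y;
}

// Restore the AVL property at n (subtree heights differ by at most one).
NodePtr rebalance(NodePtr n) {
    updateHeight(*n);
    const int balance = heightOf(n->left) - heightOf(n->right);
    if (balance > 1) {  // left-heavy
        if (heightOf(n->left->left) < heightOf(n->left->right))
            n->left = rotateLeft(std::move(n->left));  // left-right case
        return rotateRight(std::move(n));
    }
    if (balance < -1) {  // right-heavy
        if (heightOf(n->right->right) < heightOf(n->right->left))
            n->right = rotateRight(std::move(n->right));  // right-left case
        return rotateLeft(std::move(n));
    }
    return n;
}

// Insert word; if it already exists, just record another document id.
NodePtr avlInsert(NodePtr n, const std::string& word, const std::string& docId) {
    if (!n) {
        n = std::make_unique<Node>(word);
        n->docIds.push_back(docId);
        return n;
    }
    if (word < n->word)      n->left = avlInsert(std::move(n->left), word, docId);
    else if (n->word < word) n->right = avlInsert(std::move(n->right), word, docId);
    else { n->docIds.push_back(docId); return n; }
    return rebalance(std::move(n));
}

// Return the node holding word, or nullptr if it is not indexed.
const Node* avlFind(const NodePtr& root, const std::string& word) {
    const Node* cur = root.get();
    while (cur && cur->word != word)
        cur = (word < cur->word) ? cur->left.get() : cur->right.get();
    return cur;
}
```

Because the tree rebalances on every insertion, lookups stay logarithmic in the number of distinct words even when words arrive in sorted order. This sketch does not deduplicate document ids for a word that appears several times in the same document.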
Written and Implemented by Seun Suberu
Used for storing and retrieving document information.
Written and Implemented by Seun Suberu
Used for storing and retrieving author information.
Written and Implemented by Seun Suberu
Wrapper class for a stemmed word. It also holds a collection of InnerDoc objects that track the documents containing the stemmed word.