
senseiseun/SearchEngine


COVID Search Engine

Seun Suberu

Info: Data Structures final project at Southern Methodist University. Remade in December 2022 as a remodel of the original project.

Kick Covid in the Ass

Description

This is a COVID document search engine written in C++ and built with CMake. It stores stemmed words in a self-implemented AVL tree (AVLTree) and authors in a self-implemented hash table (HashTable). Documents are indexed into these structures and ranked by the term frequency-inverse document frequency (TF-IDF) metric. A command-line user interface is provided.
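The README does not spell out the exact TF-IDF formula the project uses. As a rough sketch, assuming the common textbook form (term count divided by document length, multiplied by the natural log of total documents over documents containing the term), the score could be computed as below. The function names are hypothetical and not taken from the project.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Term frequency: the share of a document's words that are the query term.
double termFrequency(std::size_t termCount, std::size_t wordsInDoc) {
    assert(wordsInDoc > 0);
    return static_cast<double>(termCount) / static_cast<double>(wordsInDoc);
}

// Inverse document frequency: terms found in fewer documents weigh more.
double inverseDocFrequency(std::size_t totalDocs, std::size_t docsWithTerm) {
    assert(docsWithTerm > 0 && docsWithTerm <= totalDocs);
    return std::log(static_cast<double>(totalDocs) /
                    static_cast<double>(docsWithTerm));
}

// Assumed scoring rule: tf * idf. The project's variant may differ.
double tfidf(std::size_t termCount, std::size_t wordsInDoc,
             std::size_t totalDocs, std::size_t docsWithTerm) {
    return termFrequency(termCount, wordsInDoc) *
           inverseDocFrequency(totalDocs, docsWithTerm);
}
```

Under this form, a word that appears in every document scores zero, so very common words contribute nothing to the ranking.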

How to use

Set up:
Link to Dataset for covid information files

The program takes one argument: the path to the directory containing the files to index.
For example:

$ mkdir build
$ cd build
$ cmake ..
$ cmake --build .
$ ./SearchEngine {directory}

From there the program should run as expected. It loads and indexes all the files, after which it is ready for queries. The keywords recognized by this search engine are "AUTHOR", "AND", "OR", and "NOT". The only keywords that can appear at the beginning of a query are AUTHOR and NOT. The others must have a search word both before and after them.
AUTHOR: Returns all documents by a particular author. When AUTHOR is preceded by a word, the results are that author's documents that contain the word.
AND: Returns documents that contain both the word before and the word after the keyword.
OR: Returns documents that contain either the word before or the word after the keyword.
NOT: Returns all documents that do not contain the particular word. Can be combined with AND or OR, but must be the last keyword used in the query and cannot be the only query.
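One way to picture the keywords above is as set operations on document identifiers: AND as intersection, OR as union, and a trailing NOT as set difference. The sketch below assumes that reading and uses std::set purely for illustration. The project uses its own data structures, and the names DocSet, docsAnd, docsOr, and docsNot are hypothetical.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>

using DocSet = std::set<std::string>;  // document identifiers (assumed to be strings)

// "a AND b": documents that contain both words.
DocSet docsAnd(const DocSet& a, const DocSet& b) {
    DocSet out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(out, out.begin()));
    return out;
}

// "a OR b": documents that contain either word.
DocSet docsOr(const DocSet& a, const DocSet& b) {
    DocSet out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::inserter(out, out.begin()));
    return out;
}

// "a NOT b": documents that contain a but not b.
DocSet docsNot(const DocSet& a, const DocSet& b) {
    DocSet out;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                        std::inserter(out, out.begin()));
    return out;
}
```

Under this reading, a query such as "pizza NOT covid" would first look up the documents for each word and then call docsNot on the two result sets.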

Examples (THESE QUERIES ARE NOT GUARANTEED TO RETURN RESULTS):

AUTHOR Suess

covid AND chicken

coronavirus OR pizza

pizza NOT covid

pizza AUTHOR grant

pizza

HashTable

Co-Written and Co-Implemented by Seun Suberu
Used for storing and retrieving author information.
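The README does not describe how the HashTable resolves collisions or what it stores per author. A minimal sketch, assuming separate chaining and a mapping from author name to a list of document identifiers, might look like this. AuthorTable and its members are hypothetical names, and std::hash stands in for whatever hash function the project implements.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

class AuthorTable {
public:
    explicit AuthorTable(std::size_t bucketCount = 101) : buckets_(bucketCount) {
        assert(bucketCount > 0);
    }

    // Record that `author` wrote the document `docId`.
    void add(const std::string& author, const std::string& docId) {
        auto& bucket = buckets_[indexFor(author)];
        for (auto& entry : bucket) {
            if (entry.first == author) {
                entry.second.push_back(docId);
                return;
            }
        }
        bucket.emplace_back(author, std::vector<std::string>{docId});
    }

    // Returns the author's documents, or nullptr if the author is unknown.
    const std::vector<std::string>* find(const std::string& author) const {
        const auto& bucket = buckets_[indexFor(author)];
        for (const auto& entry : bucket) {
            if (entry.first == author) return &entry.second;
        }
        return nullptr;
    }

private:
    using Entry = std::pair<std::string, std::vector<std::string>>;

    // Map a key to a bucket; colliding keys share that bucket's chain.
    std::size_t indexFor(const std::string& key) const {
        return std::hash<std::string>{}(key) % buckets_.size();
    }

    std::vector<std::list<Entry>> buckets_;
};
```

A query such as "AUTHOR grant" would then reduce to a single find call on the author's name.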

HashSet

Written and Implemented by Seun Suberu
Used for storing and retrieving stop word values.

AVLTree

Written and Implemented by Seun Suberu
Used for storing and retrieving stemmed words with their associated document identifiers for indexing.
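For readers unfamiliar with AVL trees, the sketch below shows standard AVL insertion with rebalancing rotations. It is not the project's AVLTree: it stores only a string key, whereas the project's nodes also carry document identifiers, and all names here (Node, heightOf, insert, and so on) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <string>
#include <utility>

struct Node {
    std::string key;
    int height = 1;
    std::unique_ptr<Node> left, right;
    explicit Node(std::string k) : key(std::move(k)) {}
};
using NodePtr = std::unique_ptr<Node>;

int heightOf(const NodePtr& n) { return n ? n->height : 0; }
void update(Node& n) { n.height = 1 + std::max(heightOf(n.left), heightOf(n.right)); }
int balance(const Node& n) { return heightOf(n.left) - heightOf(n.right); }

// Right rotation: the left child becomes the new subtree root.
NodePtr rotateRight(NodePtr y) {
    assert(y && y->left);
    NodePtr x = std::move(y->left);
    y->left = std::move(x->right);
    update(*y);
    x->right = std::move(y);
    update(*x);
    return x;
}

// Left rotation: the right child becomes the new subtree root.
NodePtr rotateLeft(NodePtr x) {
    assert(x && x->right);
    NodePtr y = std::move(x->right);
    x->right = std::move(y->left);
    update(*x);
    y->left = std::move(x);
    update(*y);
    return y;
}

// Insert `key` and rebalance on the way back up; duplicates are ignored.
NodePtr insert(NodePtr node, const std::string& key) {
    if (!node) return std::make_unique<Node>(key);
    if (key < node->key) node->left = insert(std::move(node->left), key);
    else if (key > node->key) node->right = insert(std::move(node->right), key);
    else return node;

    update(*node);
    const int b = balance(*node);
    if (b > 1) {  // left-heavy
        if (balance(*node->left) < 0) node->left = rotateLeft(std::move(node->left));
        return rotateRight(std::move(node));
    }
    if (b < -1) {  // right-heavy
        if (balance(*node->right) > 0) node->right = rotateRight(std::move(node->right));
        return rotateLeft(std::move(node));
    }
    return node;
}
```

Because the tree rebalances after every insertion, lookups stay logarithmic in the number of distinct stemmed words even when words arrive in sorted order.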

Article Class

Written and Implemented by Seun Suberu
Used for storing and retrieving document information.

Author Class

Written and Implemented by Seun Suberu
Used for storing and retrieving author information.

Word Class

Written and Implemented by Seun Suberu
Wrapper class for a stemmed word, which also contains a collection of InnerDoc objects that track the documents containing the stemmed word.
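The README names the Word and InnerDoc types but not their members. The sketch below is one guess at the shape, assuming each InnerDoc holds a document identifier and an occurrence count. Every field and method name here is hypothetical.

```cpp
#include <string>
#include <utility>
#include <vector>

// Assumed contents of InnerDoc: which document, and how often the word occurs in it.
struct InnerDoc {
    std::string docId;
    int count = 0;
};

class Word {
public:
    explicit Word(std::string stem) : stem_(std::move(stem)) {}

    // Record one more occurrence of this word in `docId`.
    void addOccurrence(const std::string& docId) {
        for (auto& doc : docs_) {
            if (doc.docId == docId) {
                ++doc.count;
                return;
            }
        }
        docs_.push_back(InnerDoc{docId, 1});
    }

    const std::string& stem() const { return stem_; }
    const std::vector<InnerDoc>& docs() const { return docs_; }

    // Ordering by stem would let Word objects serve as keys in a search tree.
    bool operator<(const Word& other) const { return stem_ < other.stem_; }

private:
    std::string stem_;
    std::vector<InnerDoc> docs_;
};
```

Keeping a per-document count alongside each identifier is what would make a term-frequency calculation possible at query time without rereading the files.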
