This project is a web crawler and inverted indexer that extracts information from web pages and creates an inverted index for efficient keyword-based searching.
To compile and run this project, you need the following dependencies. The JAR files are included automatically when you clone the repo; the JDK must be installed on your machine.
- Java Development Kit (JDK)
- htmlparser.jar
- jsoup-1.17.2.jar
- jdbm-1.0.jar
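Before compiling, it can help to confirm that the three JAR files are actually present. The following is a minimal sketch, not part of the project: check_jars is a hypothetical helper name, and it assumes the JARs sit directly in the directory you pass to it (the current directory by default).

```shell
# Hypothetical helper: print each missing JAR and return non-zero if any is absent.
# Assumes the JAR files live directly in the given directory (default: current one).
check_jars() {
  dir="${1:-.}"
  missing=0
  for jar in htmlparser.jar jsoup-1.17.2.jar jdbm-1.0.jar; do
    if [ ! -f "$dir/$jar" ]; then
      echo "missing: $jar"
      missing=1
    fi
  done
  return "$missing"
}
```

Calling check_jars with no argument checks the current directory; javac -version is a quick way to confirm that a JDK is on your PATH.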
To compile the project, use the following command:
javac -cp htmlparser.jar:jsoup-1.17.2.jar:jdbm-1.0.jar:. *.java
Then copy the class files to the PROJECT folder so that the crawler can run:
cp *.class ./PROJECT
Also move all class files to the classes folder of the webapp:
mv *.class ./apache-tomcat-10.1.20/webapps/comp4321/WEB-INF/classes/PROJECT
To run the program, use the following command:
java -cp htmlparser.jar:jsoup-1.17.2.jar:jdbm-1.0.jar:. PROJECT.Inverter ""
Then move the generated database files (*.lg and *.db) to the database folder of the webapp:
mv *.lg *.db ./apache-tomcat-10.1.20/webapps/comp4321/WEB-INF/database/
Optional: run other classes for testing (e.g. SearchEngine):
java -cp htmlparser.jar:jsoup-1.17.2.jar:jdbm-1.0.jar:. PROJECT.SearchEngine
To start the webapp, first set the following environment variables:
- Set CATALINA_HOME to {path to this project}/apache-tomcat-10.1.20/
- Set JAVA_HOME to {path to your JDK}
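On Linux or macOS, the two variables can be set for the current shell session roughly as follows. This is only a sketch: it assumes you run it from the root of this project, and the JDK path is a placeholder that you must replace with your own.

```shell
# Assumes the current directory is the root of this project.
export CATALINA_HOME="$(pwd)/apache-tomcat-10.1.20"
# Placeholder only: replace with the actual path to your JDK.
export JAVA_HOME="/path/to/your/jdk"
```

Variables set this way last only for the current terminal session; add the lines to your shell profile if you want them to persist.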
Next, change to the Tomcat bin folder and run the startup script (startup.sh on Linux and macOS, startup.bat on Windows):
cd ./apache-tomcat-10.1.20/bin
./startup.sh
Then open http://localhost:8080/comp4321/ in your browser.
You can now run searches.
Shut down Apache Tomcat when you are done, using the shutdown script in the same bin folder (shutdown.sh on Linux and macOS, shutdown.bat on Windows).
This project is licensed under the MIT License.