This project is a web crawler and inverted indexer that extracts information from web pages and creates an inverted index for efficient keyword-based searching.
To compile and run this project, you need to have the following dependencies installed. Clone the repo, and the files below will automatically be included
- Java Development Kit (JDK)
- htmlparser.jar
- jsoup-1.17.2.jar
- jdbm-1.0.jar
To compile the project and move the class files to the appropriate destination, use the following command:
javac -cp htmlparser.jar:jsoup-1.17.2.jar:jdbm-1.0.jar:. *.java
Then, copy class files to the appropriate folder in order to crawl pages
cp *.class ./PROJECT
Also, move all class files to the appropriate folder for the webapp
mv *.class ./apache-tomcat-10.1.20/webapps/comp4321/WEB-INF/classes/PROJECT
To run the program, use the following command
java -cp htmlparser.jar:jsoup-1.17.2.jar:jdbm-1.0.jar:. PROJECT.Inverter "https://www.cse.ust.hk/~kwtleung/COMP4321/testpage.htm"
mv *.lg *.db ./apache-tomcat-10.1.20/webapps/comp4321/WEB-INF/database/
Optional: Running other java files for testing (e.g: SearchEngine)
java -cp htmlparser.jar:jsoup-1.17.2.jar:jdbm-1.0.jar:. PROJECT.SearchEngine
Firstly, add environment variables.
- Set CATALINA_HOME to {path to this project}/apache-tomcat-10.1.20/
- Set JAVA_HOME to {Path to your JDK}
Next, change directory to the correct folder to start apache tomcat, and run the startup.sh file
cd ./apache-tomcat-10.1.20/bin
./startup.sh
Head over to your browser: http://localhost:8080/comp4321/
You may then perform the searching
You should then shutdown apache-tomcat when you are done
./shutdown.sh
This project is licensed under the MIT License. Feel free to copy and paste the above content into your README file, making any necessary adjustments or additions.