Web Crawler and Inverted Indexer

This project is a web crawler and inverted indexer that extracts information from web pages and creates an inverted index for efficient keyword-based searching.
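For readers new to the idea, the sketch below shows what an inverted index is: a map from each term to the documents that contain it, so a keyword lookup does not have to scan every page. It is a minimal illustration only. The class name `InvertedIndexSketch` and its methods are hypothetical and are not taken from the project's `InvertedIndex` class, which may store positions or frequencies and may persist to disk.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration; not the project's InvertedIndex class.
public class InvertedIndexSketch {
    // term -> sorted set of IDs of the documents containing that term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Lowercase the text, split it on non-alphanumeric characters,
    // and record docId under every term found.
    public void addDocument(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Return the IDs of the documents containing the keyword, in ascending order.
    public List<Integer> search(String keyword) {
        TreeSet<Integer> ids = postings.get(keyword.toLowerCase(Locale.ROOT));
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(1, "Web crawler and indexer");
        index.addDocument(2, "Inverted index for keyword search");
        index.addDocument(3, "A crawler feeds the index");
        System.out.println(index.search("crawler")); // [1, 3]
        System.out.println(index.search("index"));   // [2, 3]
        System.out.println(index.search("tomcat"));  // []
    }
}
```

Note that "indexer" and "index" are distinct terms here because the sketch does no stemming or stop-word removal.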

Prerequisites

To compile and run this project, you need the dependencies listed below. The JAR files are included in the repository, so cloning it is enough to obtain them; the JDK must be installed separately.

Java Development Kit (JDK)
htmlparser.jar
jsoup-1.17.2.jar
jdbm-1.0.jar

Compilation

To compile the project, run the following command from the repository root:

javac -cp htmlparser.jar:jsoup-1.17.2.jar:jdbm-1.0.jar:. *.java

Then copy the class files into the PROJECT folder (create it first with mkdir -p PROJECT if it does not exist), so that the crawler can be run as PROJECT.Inverter:

cp *.class ./PROJECT

Finally, move the class files into the web application's classes folder:

mv *.class ./apache-tomcat-10.1.20/webapps/comp4321/WEB-INF/classes/PROJECT

Running the Crawler and Inverter

To crawl and index pages, run the Inverter class with the URL of the page to start from:

java -cp htmlparser.jar:jsoup-1.17.2.jar:jdbm-1.0.jar:. PROJECT.Inverter "https://www.cse.ust.hk/~kwtleung/COMP4321/testpage.htm"
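The README does not describe how the crawler walks outward from the start URL passed above, so the following is only a sketch of one common strategy: a breadth-first traversal that visits each page once and stops after a fixed number of pages. Everything here is an assumption for illustration. The class `CrawlOrderSketch`, the `maxPages` limit, and the in-memory `links` map (a stand-in for fetching a page and extracting its links) are hypothetical, as are the page names other than testpage.htm.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical illustration; the project's Crawler class may work differently.
public class CrawlOrderSketch {
    // Visit pages breadth-first from startUrl, enqueueing each URL at most once,
    // and stop once maxPages pages have been visited.
    // `links` maps a page to the pages it links to (assumed stand-in for real fetching).
    public static List<String> crawl(String startUrl,
                                     Map<String, List<String>> links,
                                     int maxPages) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Queue<String> frontier = new ArrayDeque<>();
        frontier.add(startUrl);
        seen.add(startUrl);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.remove();
            visited.add(url);
            for (String next : links.getOrDefault(url, List.of())) {
                // Set.add returns false for URLs already seen, which prevents loops.
                if (seen.add(next)) {
                    frontier.add(next);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "testpage.htm", List.of("a.htm", "b.htm"),
            "a.htm", List.of("b.htm", "c.htm"),
            "b.htm", List.of("testpage.htm"),
            "c.htm", List.of());
        System.out.println(crawl("testpage.htm", links, 3)); // [testpage.htm, a.htm, b.htm]
    }
}
```

The `seen` set is what keeps the link from b.htm back to testpage.htm from causing an endless loop.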

Moving the database files to the correct location:

mv *.lg *.db ./apache-tomcat-10.1.20/webapps/comp4321/WEB-INF/database/

Optional: running other Java classes for testing (e.g. SearchEngine):

java -cp htmlparser.jar:jsoup-1.17.2.jar:jdbm-1.0.jar:. PROJECT.SearchEngine

Using the Web interface:

First, set the following environment variables:

Set CATALINA_HOME to {path to this project}/apache-tomcat-10.1.20/
Set JAVA_HOME to {path to your JDK}

Next, change to Tomcat's bin directory and run the startup.sh script:

cd ./apache-tomcat-10.1.20/bin
./startup.sh

Then open http://localhost:8080/comp4321/ in your browser.

You can now run keyword searches from that page.

Shutting down Apache Tomcat

When you are done, shut down Tomcat by running the following from the same bin directory:

./shutdown.sh

License

This project is licensed under the MIT License.
