This project was developed as part of my Large Data Analysis course.
My objective was to implement Google's PageRank algorithm on the www.uiowa.edu domain and determine which URLs are the most and least important. Additionally, the most common words in each URL were extracted to simulate a search engine.
For more information about Google's PageRank algorithm, see here.
- Developed a web crawler to extract links and the top 9 most common words from each page (see the crawler sketch after this list)
- Removed stop words
- Reduced each remaining word to its root form
- Stored the links and URLs (along with the top common words for each URL) in an SQLite database and two CSV files (see the storage sketch below)
- Prepared DataFrames and ran the PageRank algorithm (see the PageRank sketch below)
- Filtered out the most important and least important URLs
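
A minimal sketch of the crawling step under stated assumptions: the function name, seed URL handling, and the `www.uiowa.edu` domain filter shown here are illustrative, not the exact code used in this project.

```python
import re
from collections import Counter
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup
from nltk.corpus import stopwords      # requires: nltk.download("stopwords")
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def crawl_page(url):
    """Return (outgoing links, top 9 stemmed non-stop words) for one page."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Keep only links that stay inside the www.uiowa.edu domain.
    links = {
        urljoin(url, a["href"])
        for a in soup.find_all("a", href=True)
        if urlparse(urljoin(url, a["href"])).netloc == "www.uiowa.edu"
    }

    # Tokenize visible text, drop stop words, reduce each word to its root.
    words = re.findall(r"[a-z]+", soup.get_text(separator=" ").lower())
    roots = [STEMMER.stem(w) for w in words if w not in STOP_WORDS]
    top_words = [w for w, _ in Counter(roots).most_common(9)]

    return links, top_words
```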
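
A rough sketch of the storage step, assuming a simple two-table layout (`pages` and `links`); the actual schema and file names in this project may differ.

```python
import sqlite3

import pandas as pd

def save_results(pages, edges, db_path="crawl.db"):
    """pages: {url: [top words]}, edges: list of (from_url, to_url) pairs."""
    pages_df = pd.DataFrame(
        [(url, " ".join(words)) for url, words in pages.items()],
        columns=["url", "top_words"],
    )
    links_df = pd.DataFrame(edges, columns=["from_url", "to_url"])

    # One SQLite database plus the two CSV files mentioned above.
    with sqlite3.connect(db_path) as conn:
        pages_df.to_sql("pages", conn, if_exists="replace", index=False)
        links_df.to_sql("links", conn, if_exists="replace", index=False)
    pages_df.to_csv("pages.csv", index=False)
    links_df.to_csv("links.csv", index=False)
```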
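
A minimal power-iteration PageRank sketch over the stored link table; the damping factor and convergence threshold below are common defaults, not necessarily the values used in this project.

```python
import numpy as np
import pandas as pd

def pagerank_from_links(links_df, damping=0.85, tol=1e-8, max_iter=100):
    """Return a Series of PageRank scores indexed by URL, sorted high to low."""
    urls = pd.unique(pd.concat([links_df["from_url"], links_df["to_url"]]))
    index = {url: i for i, url in enumerate(urls)}
    n = len(urls)

    # Column-stochastic transition matrix: column j spreads page j's rank
    # evenly over its outgoing links.
    M = np.zeros((n, n))
    for _, row in links_df.iterrows():
        M[index[row["to_url"]], index[row["from_url"]]] += 1.0
    out_degree = M.sum(axis=0)
    M[:, out_degree > 0] /= out_degree[out_degree > 0]
    M[:, out_degree == 0] = 1.0 / n          # dangling pages link everywhere

    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = (1 - damping) / n + damping * (M @ rank)
        if np.abs(new_rank - rank).sum() < tol:
            rank = new_rank
            break
        rank = new_rank

    return pd.Series(rank, index=urls).sort_values(ascending=False)

# The most and least important URLs are the two ends of the sorted series:
# ranks = pagerank_from_links(links_df)
# print(ranks.head(10))   # most important
# print(ranks.tail(10))   # least important
```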
- Python 3
- pandas
- numpy
- BeautifulSoup
- sklearn
- nltk