pagerank_uiowa_domain

This project was developed as part of my Large Data Analysis course.

My objective was to implement Google's PageRank algorithm on the www.uiowa.edu domain and determine which URLs are most important and which are least important. Additionally, the most common words on each page were extracted to simulate a search engine.

For more information about Google's PageRank algorithm, see here.

Approach

Developing a web crawler to extract links and the top 9 most common words on each page

Removed stop words
Considered only the root words

Storing links and URLs (along with the top common words for each URL) in an SQLite database and 2 CSV files
Preparing dataframes and running the algorithm
Filtering out the most important and least important URLs
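
The last three steps above (storing links, running the algorithm, picking out the extremes) can be sketched as follows. This is a minimal illustration, not the notebooks' actual code: the `link` table, its `source` and `target` columns, the sample page names, and the damping factor of 0.85 are all assumptions, and the power iteration is written in plain Python rather than with pandas or numpy so that it runs without extra dependencies.

```python
import sqlite3

def load_links(conn):
    """Return (source, target) pairs from the hypothetical `link` table."""
    return conn.execute("SELECT source, target FROM link").fetchall()

def pagerank(edges, damping=0.85, tol=1e-10, max_iter=200):
    """Compute PageRank by power iteration over (source, target) edges."""
    nodes = sorted({page for edge in edges for page in edge})
    out_links = {page: set() for page in nodes}
    for src, dst in edges:
        if src != dst:  # ignore self-links
            out_links[src].add(dst)
    n = len(nodes)
    rank = {page: 1.0 / n for page in nodes}
    for _ in range(max_iter):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(rank[p] for p in nodes if not out_links[p])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {page: base for page in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        delta = sum(abs(new_rank[p] - rank[p]) for p in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank

# Toy link graph stored in an in-memory database (assumed schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE link (source TEXT, target TEXT)")
conn.executemany("INSERT INTO link VALUES (?, ?)", [
    ("home", "admissions"), ("home", "research"),
    ("admissions", "home"), ("research", "home"),
    ("research", "admissions"), ("archive", "home"),
])

ranks = pagerank(load_links(conn))
ordered = sorted(ranks, key=ranks.get, reverse=True)
most_important, least_important = ordered[0], ordered[-1]
print(most_important, least_important)
```

In this toy graph every other page links to `home`, so it ranks highest, while `archive` has no incoming links and ranks lowest. Real crawl data would replace the hand-written rows.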

Technologies

Python3
pandas
numpy
beautifulsoup
sklearn
nltk
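
As a rough illustration of the word-extraction step from the approach (stop-word removal, root words, top 9 words), here is a sketch that uses only the standard library. The project itself lists beautifulsoup and nltk, which would normally handle HTML parsing, stop words, and stemming; the small `STOP_WORDS` set, the `crude_stem` helper, and the sample HTML below are hypothetical stand-ins, not the project's code.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stop-word set (assumption); nltk ships a full English list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "at", "are"}

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)

def crude_stem(word):
    """Hypothetical stand-in for a real stemmer: strip a couple of suffixes."""
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def top_words(html, k=9):
    """Return up to k of the most frequent stemmed, non-stop words in a page."""
    parser = TextExtractor()
    parser.feed(html)
    tokens = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    stems = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    return [word for word, _ in Counter(stems).most_common(k)]

sample = (
    "<html><body><h1>Courses</h1>"
    "<p>The course catalog lists courses and programs.</p>"
    "<script>var x = 1;</script></body></html>"
)
words = top_words(sample)
print(words)
```

A real stemmer reduces far more word forms than this two-rule helper, so the counts it produces on actual pages would differ.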

hpitawela/pagerank_uiowa_domain
