The task has three parts -- data collection, data exploration / algorithm development, then finally predictive modeling.
We want you to query the wikipedia API and collect all of the articles under the following wikipedia categories:
The results of the query will be written to PostgreSQL tables, page
and category
.
Use Latent Semantic Analysis to search the pages. Given a search query, find the top 5 related articles to the search query.
Build a predictive model from the data. When a new article from wikipedia comes along, predict what category the article should fall into.
The docker-compose.yml
file was used to build this project.