Text mining, and text classification in particular, is an essential skill for a data scientist. In this project I develop an algorithm that classifies news articles from digital newspapers in two languages, English and Spanish, assigning each article the correct label based on its topic.
The pipeline for this project is: scrape the articles, clean them and extract the features relevant to the task, and perform agglomerative clustering. The main tool is Python, with the packages NLTK, Newspaper3k, Beautiful Soup and scikit-learn.
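
The feature extraction and clustering steps could be sketched as follows. This is only a minimal illustration, not the project's actual code: it assumes TF-IDF as the feature representation and Ward linkage for the clustering, neither of which is stated above. The function name `cluster_articles` and the four toy sentences are hypothetical stand-ins for scraped articles.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_articles(texts, n_clusters):
    """Group raw article texts into n_clusters topic clusters.

    Hypothetical helper: TF-IDF features and Ward linkage are assumptions.
    """
    # Turn each text into a TF-IDF weighted bag-of-words vector.
    vectorizer = TfidfVectorizer(lowercase=True)
    # AgglomerativeClustering needs a dense array, not a sparse matrix.
    features = vectorizer.fit_transform(texts).toarray()
    # Merge the closest clusters bottom-up until n_clusters remain.
    model = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward")
    return model.fit_predict(features)


# Toy stand-ins for scraped and cleaned articles (two sports, two economy).
articles = [
    "The team won the football match after a late goal",
    "The striker scored a goal in the football final",
    "The central bank raised interest rates to fight inflation",
    "Inflation and interest rates worry the central bank",
]

labels = cluster_articles(articles, n_clusters=2)
print(labels)
```

The cluster ids themselves are arbitrary; what matters is which articles share an id. In a real run the texts would come from the scraping step, and the English and Spanish articles would probably need language-specific cleaning (for example, separate stop-word lists) before vectorizing.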