Abstract — This notebook presents PySpark code that processes and classifies text data from a corpus of emails. The code applies data wrangling and a classification technique (i.e., logistic regression) to build a process that recognises whether a given email is spam. Alternative specifications of the classifier are explored to assess the performance and efficiency of the process.
See the IPython Notebook.
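To make the idea in the abstract concrete, here is a minimal sketch of the same kind of process (tokenise emails, turn them into bag-of-words counts, fit a logistic regression) written in plain Python with only the standard library. It is not the notebook's PySpark code: the toy emails are made up, and the names `TOY_EMAILS`, `tokenize`, `build_vocab`, `featurize`, `train`, and `predict_proba` are hypothetical helpers chosen for illustration.

```python
import math
import re
from collections import Counter

# Made-up toy corpus (label 1 = spam, 0 = not spam); not taken from the notebook.
TOY_EMAILS = [
    ("win free money now", 1),
    ("free prize claim now", 1),
    ("cheap loans win cash", 1),
    ("meeting agenda for monday", 0),
    ("please review the attached report", 0),
    ("lunch on friday", 0),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocab(examples):
    """Map every token seen in the training texts to a feature index."""
    vocab = {}
    for text, _ in examples:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab


def featurize(text, vocab):
    """Return sparse bag-of-words counts {index: count}; unknown tokens are dropped."""
    return Counter(vocab[t] for t in tokenize(text) if t in vocab)


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(examples, epochs=50, lr=0.5):
    """Fit logistic regression weights with plain stochastic gradient descent."""
    vocab = build_vocab(examples)
    weights = [0.0] * len(vocab)
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:
            x = featurize(text, vocab)
            p = sigmoid(bias + sum(weights[i] * c for i, c in x.items()))
            err = p - label  # gradient of the log loss with respect to the score
            bias -= lr * err
            for i, c in x.items():
                weights[i] -= lr * err * c
    return vocab, weights, bias


def predict_proba(model, text):
    """Estimated probability that `text` is spam under the fitted model."""
    vocab, weights, bias = model
    x = featurize(text, vocab)
    return sigmoid(bias + sum(weights[i] * c for i, c in x.items()))
```

Calling `model = train(TOY_EMAILS)` and then `predict_proba(model, "claim free cash now")` gives a spam probability for a new message. On a real corpus the same steps would run distributed; PySpark's `pyspark.ml` package offers comparable building blocks (`Tokenizer`, `HashingTF`, `LogisticRegression`), though this section does not say which of them the notebook uses.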