GD Data Analytics competition - Virginia Tech

As part of the Data Analytics course at Virginia Tech, General Dynamics organized a data analytics competition.
This repository contains our source code and the project report, which ultimately earned our team the Silver Medal :)
GD is also a sponsor of the Discovery Analytics Center at Virginia Tech.

Getting Started

Dataset

The original dataset for analysis consisted of 16GB of data.
However, the records in the http_info.csv file have been reduced from 28 million to 1 million. All other files remain intact.
The records were reduced so that the dataset could be compressed to ~500MB.
The reduced dataset for the project can be downloaded here.
Dataset Readme
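
This README does not say how the http_info.csv records were reduced. As a rough sketch only, assuming the reduction simply keeps the header and the first N data rows, something like the following would work with the Python standard library. The function name `truncate_csv` and the sample columns are hypothetical and are not taken from the project.

```python
import csv
import io
from itertools import islice


def truncate_csv(src, dst, max_rows):
    """Copy the header plus at most max_rows data rows from src to dst.

    src and dst are open text file objects.
    Returns the number of data rows written.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader, None)
    if header is None:
        return 0  # empty input, nothing to copy
    writer.writerow(header)
    written = 0
    # islice stops after max_rows rows without reading the rest of the file
    for row in islice(reader, max_rows):
        writer.writerow(row)
        written += 1
    return written


# Tiny in-memory demonstration with made-up columns (not the real schema).
sample = io.StringIO("id,url\n1,a.com\n2,b.com\n3,c.com\n")
out = io.StringIO()
kept = truncate_csv(sample, out, max_rows=2)
```

For the real file, you would open the input and output with `open(path, newline="")` and pass `max_rows=1_000_000`. Because rows are streamed one at a time, the full 16GB dataset never has to fit in memory. If a representative sample matters more than speed, random sampling would be a better choice than keeping the first rows.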

Project report

The project report describes some interesting discoveries made while analyzing the dataset.
The report can be viewed/downloaded here.

Project structure

The important directories in this repository include:

  • jupyter-notebooks : contains main source code of our project
  • email-topic-modelling : LDA topic modelling of email contents
  • url-sentiment-analysis : Sentiment analysis using Google Cloud Natural Language API
  • url-topic-modelling : URL content classifier using Google Cloud Natural Language API
  • miscellaneous, utility

Technologies explored

  1. Google Cloud Natural Language API
  2. Pandas data analytics
  3. Jupyter Notebook
  4. Seaborn: data visualization
  5. Latent Dirichlet Allocation (LDA)
  6. Vader Sentiment Analysis Library

Team Members

License

This project is licensed under the Apache License 2.0 - check the LICENSE.md file for details.

Acknowledgments

Special thanks to Dr. Leman, our DA course instructor, and to General Dynamics for organizing this competition.