Data-Analytics-for-Big-Data

UCI_Adult_Dataset_Analytics.ipynb

Data analytics in PySpark (the Python API for Apache Spark) on the UCI Adult dataset.

Data cleaning
Feature engineering
- Extract the features and transform them into vectors.
- Use a one-hot encoder to process the categorical features.
Fit a logistic regression model and a gradient-boosted tree model to the dataset.
Tune and evaluate both the logistic regression and the gradient-boosted tree models.
Make predictions on the test set and report areaUnderROC (the area under the ROC curve).
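
For readers unfamiliar with the metric, areaUnderROC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below illustrates that definition in plain Python, independent of Spark. It is not the notebook's code: the function name area_under_roc and the sample labels and scores are made up for illustration. In PySpark, BinaryClassificationEvaluator with metricName="areaUnderROC" reports this metric.

```python
def area_under_roc(labels, scores):
    """Pairwise definition of AUC: fraction of (positive, negative) pairs
    in which the positive example has the higher score; ties count as 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = positive class) and model scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = area_under_roc(labels, scores)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```

The pairwise loop is quadratic in the number of examples, so it only suits small illustrations; on a real dataset the evaluator built into the ML library is the practical choice.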

Retail_Data_Analytics.ipynb

Data analytics in PySpark on the Databricks Online Retail dataset.

Measuring the number of items per invoice
Computing total spending per customer
Analyzing the quantity sold for each item
Checking whether a returning customer spends more or less than on their previous purchase
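
The last check, comparing each purchase with the same customer's previous one, can be sketched in plain Python as follows. This is an illustration under assumptions, not the notebook's code: the function name compare_with_previous, the (customer_id, invoice_date, total) tuple layout, and the sample rows are all hypothetical. In PySpark the same idea is usually expressed with the lag function over a window partitioned by customer and ordered by invoice date.

```python
from collections import defaultdict


def compare_with_previous(invoices):
    """invoices: iterable of (customer_id, invoice_date, total) tuples,
    with invoice_date as an ISO string so it sorts chronologically.
    Returns one row per repeat purchase:
    (customer_id, invoice_date, total, previous_total, trend)."""
    by_customer = defaultdict(list)
    for customer_id, invoice_date, total in invoices:
        by_customer[customer_id].append((invoice_date, total))

    results = []
    for customer_id in sorted(by_customer):
        history = sorted(by_customer[customer_id])  # oldest invoice first
        # Pair every invoice with the one immediately before it.
        for (_, prev_total), (date, total) in zip(history, history[1:]):
            if total > prev_total:
                trend = "more"
            elif total < prev_total:
                trend = "less"
            else:
                trend = "same"
            results.append((customer_id, date, total, prev_total, trend))
    return results


# Hypothetical invoice totals; C2 has a single invoice, so it yields no row.
invoices = [
    ("C1", "2011-01-05", 20.0),
    ("C2", "2011-01-06", 15.0),
    ("C1", "2011-02-01", 35.0),
    ("C1", "2011-03-10", 10.0),
]
for row in compare_with_previous(invoices):
    print(row)
```

Customers with only one invoice produce no output row, since they have no previous purchase to compare against.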