Data analytics done in PySpark (the Python API for Apache Spark). Dataset used is the UCI Adult dataset.
- Data cleaning
- Feature engineering
- Distill and transform the features into vectors
- Use a one-hot encoder to process categorical features
- Build a logistic regression model and a gradient-boosted tree model, and fit both to the dataset
- Tune and evaluate both models (logistic regression and gradient-boosted tree)
- Make predictions on the test set and display the areaUnderROC
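
To make two of the steps above concrete, here is a minimal sketch in plain Python (no Spark required) of what one-hot encoding produces and what the areaUnderROC metric measures. It is not the project's PySpark code: the function names `one_hot` and `area_under_roc` and the sample values are hypothetical and chosen only for illustration. Note that Spark's `OneHotEncoder` emits sparse vectors and by default omits the last category (`dropLast=True`), whereas this sketch keeps one slot per category.

```python
from typing import Dict, List, Sequence


def one_hot(values: Sequence[str]) -> List[List[int]]:
    """Encode each categorical value as a 0/1 vector with one slot per category."""
    categories = sorted(set(values))
    index: Dict[str, int] = {c: i for i, c in enumerate(categories)}
    return [
        [1 if index[v] == i else 0 for i in range(len(categories))]
        for v in values
    ]


def area_under_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive example is scored above a randomly chosen negative one
    (ties count as one half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Illustrative values only; they are not taken from the dataset.
encoded = one_hot(["Private", "State-gov", "Private"])
auc = area_under_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An areaUnderROC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives, which is why it is a convenient single number for comparing the two models.
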
Data analytics done in PySpark (the Python API for Apache Spark). Dataset used is the Databricks Online Retail dataset.
- Measuring the number of items per invoice
- Computing total spending per customer
- Analyzing the number of units sold for each product
- Checking whether a returning customer spends less or more than on their previous purchase
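
The four analyses above can be sketched on a handful of rows in plain Python, which shows the logic without needing a Spark session. This is an illustration under assumptions, not the project's code: the field names mirror the columns the Online Retail data is assumed to have (`InvoiceNo`, `StockCode`, `Quantity`, `InvoiceDate`, `UnitPrice`, `CustomerID`), "items per invoice" is interpreted as total quantity per invoice, and every row and identifier below is made up. In PySpark the first three would typically be `groupBy(...).agg(...)` calls and the last one a window function such as `lag`.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Row(NamedTuple):
    invoice_no: str
    stock_code: str
    quantity: int
    invoice_date: str  # ISO date string, so string order equals date order
    unit_price: float
    customer_id: str


def items_per_invoice(rows: List[Row]) -> Dict[str, int]:
    """Total quantity of items on each invoice."""
    out: Dict[str, int] = defaultdict(int)
    for r in rows:
        out[r.invoice_no] += r.quantity
    return dict(out)


def spend_per_customer(rows: List[Row]) -> Dict[str, float]:
    """Total of quantity * unit price for each customer."""
    out: Dict[str, float] = defaultdict(float)
    for r in rows:
        out[r.customer_id] += r.quantity * r.unit_price
    return dict(out)


def units_per_product(rows: List[Row]) -> Dict[str, int]:
    """Total quantity sold for each product code."""
    out: Dict[str, int] = defaultdict(int)
    for r in rows:
        out[r.stock_code] += r.quantity
    return dict(out)


def change_from_previous(rows: List[Row]) -> Dict[str, List[Tuple[str, float]]]:
    """For each returning customer, the difference between each invoice total
    and that customer's previous invoice total (negative means they spent less)."""
    totals: Dict[Tuple[str, str, str], float] = defaultdict(float)
    for r in rows:
        key = (r.customer_id, r.invoice_date, r.invoice_no)
        totals[key] += r.quantity * r.unit_price
    changes: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    previous: Dict[str, float] = {}
    # Sorting the keys orders invoices by customer, then by date.
    for (customer, _date, invoice), total in sorted(totals.items()):
        if customer in previous:
            changes[customer].append((invoice, total - previous[customer]))
        previous[customer] = total
    return dict(changes)


# Made-up sample rows for illustration only.
sample = [
    Row("INV1", "A", 6, "2010-12-01", 2.5, "C1"),
    Row("INV1", "B", 2, "2010-12-01", 1.0, "C1"),
    Row("INV2", "A", 4, "2010-12-03", 2.5, "C1"),
    Row("INV3", "B", 10, "2010-12-02", 1.0, "C2"),
]
```

With these rows, customer `C1` has two invoices and appears in the result of `change_from_previous`, while `C2` has only one purchase and is left out because there is nothing to compare against.
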