Data-Analytics-for-Big-Data

Data analytics with PySpark (the Python API for Apache Spark) on the UCI Adult dataset.

  • Data cleaning
  • Feature engineering
    • Select the features and assemble them into feature vectors.
    • Encode categorical features with a one-hot encoder.
  • Build a logistic regression model and a gradient-boosted tree model and fit both to the dataset.
  • Tune and evaluate both models.
  • Make predictions on the test set and report the areaUnderROC.

Data analytics with PySpark on the Databricks Online Retail dataset.

  • Measure the number of items per invoice
  • Compute total spending per customer
  • Count the units sold of each product
  • Check whether a returning customer spends more or less than on their previous purchase