Miniproject for NTU SC1015
Group 9, Lab Group SC2
Zhang Danxu • Lohia Vardhan • Sannabhadti Shikha Deepak
This miniproject uses the UCI Machine Learning Repository Wine Quality Data Set and tries to predict the quality of a wine from its various properties.
numpy
pandas
matplotlib
seaborn
sklearn
tabulate
torch
- data/: datasets
  - raw/winequality-red.csv: raw dataset downloaded from the UCI website
  - refined_wine.csv: dataset after cleaning
  - reclassified_wine.csv: dataset after reclassification
  - predictors.csv: cleaned and scaled predictor variables
  - response.csv: cleaned response variable
  - winequality.txt: description of the datasets
- src/: Jupyter notebooks and Python scripts
  - data-clean-up.ipynb: notebook for data cleaning
  - eda.ipynb: notebook for exploratory data analysis
  - ml.ipynb: notebook for machine learning
  - SGD.py: script for Stochastic Gradient Descent
  - loss_history.png: learning curve of SGD (loss against number of iterations)
- .gitignore: files to be ignored by git
- LICENSE: licensing information
- README.md: basic information about our project
In data clean-up, we first dropped null values, duplicated rows, and outliers. Since our data was imbalanced, we reclassified our response variable quality to make it more balanced. Then we applied feature scaling so that the predictor variables are on similar scales.
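The clean-up steps described above could look roughly like the sketch below. It is not the code from data-clean-up.ipynb: the function name `clean_wine`, the 1.5 × IQR outlier rule, the binary split at `threshold = 6`, the use of `StandardScaler`, and the toy data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def clean_wine(df, threshold=6):
    """Drop nulls, duplicates and outliers, binarise quality, scale predictors."""
    df = df.dropna().drop_duplicates()

    predictors = df.drop(columns="quality")
    # Assumed outlier rule: keep rows within 1.5 * IQR of the quartiles in every column.
    q1, q3 = predictors.quantile(0.25), predictors.quantile(0.75)
    iqr = q3 - q1
    inside = ((predictors >= q1 - 1.5 * iqr) & (predictors <= q3 + 1.5 * iqr)).all(axis=1)
    df = df[inside]

    # Assumed reclassification: 1 if quality >= threshold, otherwise 0.
    y = (df["quality"] >= threshold).astype(int).rename("quality")

    # Standardise each predictor to zero mean and unit variance.
    kept = df.drop(columns="quality")
    X = pd.DataFrame(StandardScaler().fit_transform(kept), columns=kept.columns, index=kept.index)
    return X, y


# Synthetic stand-in for the wine data (not real measurements).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "alcohol": rng.normal(10.0, 1.0, 200),
    "pH": rng.normal(3.3, 0.15, 200),
    "quality": rng.integers(3, 9, 200),
})
toy.loc[0, "alcohol"] = np.nan                             # a missing value
toy.loc[1, "alcohol"] = 40.0                               # an obvious outlier
toy = pd.concat([toy, toy.iloc[[2]]], ignore_index=True)   # a duplicated row

X, y = clean_wine(toy)
print(X.shape, y.value_counts().to_dict())
```

With the real data, the same function would be called on the frame loaded from the raw CSV, and the threshold would be whatever split the reclassification actually used.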
Exploratory data analysis includes histograms with KDE plots, boxplots of the data before and after reclassification, correlations between variables, and the point-biserial correlation.
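For readers unfamiliar with the point-biserial correlation, here is a minimal sketch of how it can be computed between a binary response and a continuous predictor. The helper `point_biserial` and the synthetic `label` and `alcohol` arrays are hypothetical and are not taken from eda.ipynb. The coefficient is numerically the same as the Pearson correlation when one of the variables is binary, which the last line checks.

```python
import numpy as np


def point_biserial(binary, values):
    """Point-biserial r: (M1 - M0) / s_n * sqrt(n1 * n0 / n^2), with s_n the population std."""
    binary = np.asarray(binary).astype(bool)
    values = np.asarray(values, dtype=float)
    n1 = int(binary.sum())
    n0 = int((~binary).sum())
    n = n1 + n0
    mean_gap = values[binary].mean() - values[~binary].mean()
    return mean_gap / values.std() * np.sqrt(n1 * n0 / n**2)


# Synthetic example: the continuous variable is shifted upwards for label 1.
rng = np.random.default_rng(1)
label = rng.integers(0, 2, 300)
alcohol = 10.0 + 1.0 * label + rng.normal(0.0, 1.0, 300)

r_pb = point_biserial(label, alcohol)
r_pearson = np.corrcoef(label, alcohol)[0, 1]
print(np.isclose(r_pb, r_pearson))
```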
In machine learning, we first used decision trees, but since we had a relatively large number of features, the decision trees overfitted somewhat. The second model was a simple linear layer trained by stochastic gradient descent, using PyTorch's BCE loss function and SGD optimizer. Because a linear layer cannot capture non-linear relationships that may exist in the data, the last model we built was a support vector machine with an RBF kernel. The final training accuracy of the SVM is 90%, and the testing accuracy is 89%.
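The comparison between a decision tree and an RBF-kernel SVM can be sketched as follows. This is only an illustration on synthetic data from `make_classification`, not the code in ml.ipynb. The hyperparameters (`C=1.0`, `gamma="scale"`, an unpruned tree) and the 75/25 split are assumptions, and the scores it prints are unrelated to the accuracies reported above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem standing in for the scaled wine predictors.
X, y = make_classification(n_samples=600, n_features=11, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),      # unpruned, prone to overfitting
    "SVM (RBF kernel)": SVC(kernel="rbf", C=1.0, gamma="scale"),  # non-linear decision boundary
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Store (training accuracy, testing accuracy) to expose any train/test gap.
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train={scores[name][0]:.2f}, test={scores[name][1]:.2f}")
```

A large gap between training and testing accuracy is the usual sign of overfitting. Limiting `max_depth` on the tree is one common way to reduce it.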
- Sannabhadti Shikha Deepak: Data Clean-up
- Lohia Vardhan: Exploratory Data Analysis
- Zhang Danxu: Machine Learning