Antoine Madrona, Loïc Bruchez, Valentin Bigot, October 2020
The dataset is divided into a training set and a test set of 250’000 and 568’238 samples respectively, each with 30 features. The training set comes with labels: each sample is assigned a category (−1 for background noise, 1 for the presence of a Higgs boson).
The model implemented in run.py loads the training set from DATA_TRAIN_PATH (see Code execution) and applies the following preprocessing steps, in this order:
- Splitting of the dataset based on the PRI_jet_num categories (0, 1, or 2 and 3)
- Logarithmic transformation of selected features
- Polynomial augmentation of the features
- Standardization of the features
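The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code from implementations.py: the helper names (split_by_jet_num, log_transform, build_poly, standardize, preprocess), the use of log(1 + x), and the choice to add the bias column after standardization are all assumptions made for the example.

```python
import numpy as np

def split_by_jet_num(x, jet_col):
    """Boolean masks for the groups PRI_jet_num == 0, == 1 and in {2, 3}."""
    jet = x[:, jet_col]
    return [jet == 0, jet == 1, jet >= 2]

def log_transform(x, cols):
    """Apply log(1 + x) to the selected columns (assumed non-negative)."""
    x = x.copy()
    x[:, cols] = np.log1p(x[:, cols])
    return x

def build_poly(x, degree):
    """Stack the powers 1..degree of every feature side by side."""
    return np.hstack([x ** d for d in range(1, degree + 1)])

def standardize(x, mean=None, std=None):
    """Center and scale each column; reuse the training mean/std on test data."""
    if mean is None:
        mean = x.mean(axis=0)
    if std is None:
        std = x.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # leave constant columns unscaled
    return (x - mean) / std, mean, std

def preprocess(x, log_cols, degree, mean=None, std=None):
    """Log transform, polynomial augmentation, standardization, then bias column."""
    x = log_transform(x, log_cols)
    x = build_poly(x, degree)
    x, mean, std = standardize(x, mean, std)
    x = np.hstack([np.ones((x.shape[0], 1)), x])
    return x, mean, std
```

In this sketch, preprocess would be called once per PRI_jet_num group, and the mean and std returned for a training group would be passed back in when transforming the matching test group.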
Afterwards, the model is trained with the ridge regression algorithm to obtain the weights. These weights are used to predict the labels of each subset of the split dataset. Finally, the predictions are merged and the submission file is created.
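As a rough illustration of the training and merging step, the sketch below fits ridge regression in closed form on each subset and writes the predictions back into a single vector. It assumes one weight vector per PRI_jet_num subset, a 2N scaling of the regularizer, and a decision threshold at 0; the names predict_labels and train_and_merge are hypothetical and may differ from the actual code.

```python
import numpy as np

def ridge_regression(y, tx, lambda_):
    """Solve (X^T X + 2 N lambda I) w = X^T y for w."""
    n, d = tx.shape
    a = tx.T @ tx + 2 * n * lambda_ * np.eye(d)
    return np.linalg.solve(a, tx.T @ y)

def predict_labels(w, tx):
    """Threshold the linear score at 0 to get labels in {-1, 1}."""
    return np.where(tx @ w >= 0, 1, -1)

def train_and_merge(groups, n_test):
    """groups: list of (y_train, tx_train, tx_test, test_mask, lambda_) per subset.

    Fits one weight vector per subset and places each subset's predictions
    at the positions selected by its boolean test mask.
    """
    y_pred = np.zeros(n_test, dtype=int)
    for y_tr, tx_tr, tx_te, test_mask, lambda_ in groups:
        w = ridge_regression(y_tr, tx_tr, lambda_)
        y_pred[test_mask] = predict_labels(w, tx_te)
    return y_pred
```

Solving the linear system with np.linalg.solve avoids forming an explicit matrix inverse, which is usually the more stable choice.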
The code is split into 3 distinct files containing all the functions needed to reproduce our results:
- implementations.py
- run.py
implementations.py contains all the functions required to reproduce our preprocessing pipeline and to use our regression model. Specifically, the file is divided into 3 sections:
- "IMPLEMENTATIONS" consists of the 6 functions least_squares_GD, least_squares_SGD, least_squares, ridge_regression, logistic_regression and reg_logistic_regression, which form a toolbox for developing the regression model.
- "UTILITARIES" contains complementary functions that support the methods in the "IMPLEMENTATIONS" section, as well as functions needed for prediction and for loading the datasets.
- "PREPROCESSING" contains all the preprocessing steps used in this work to optimize the model's performance.
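To give an idea of what one of the toolbox functions might look like, here is a minimal sketch of least squares fitted by gradient descent. The signature and the (w, loss) return value are assumptions for the example; the version in implementations.py may differ.

```python
import numpy as np

def least_squares_GD(y, tx, initial_w, max_iters, gamma):
    """Gradient descent on the MSE loss L(w) = 1/(2N) * ||y - tx w||^2.

    Returns the final weights and the loss evaluated at those weights.
    """
    w = np.asarray(initial_w, dtype=float)
    n = len(y)
    for _ in range(max_iters):
        err = y - tx @ w
        grad = -tx.T @ err / n  # gradient of the MSE loss with respect to w
        w = w - gamma * grad    # step of size gamma against the gradient
    loss = np.mean((y - tx @ w) ** 2) / 2
    return w, loss
```

The step size gamma has to be small enough for the iterations to converge, which is easier to achieve once the features are standardized.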
run.py reproduces the best prediction accuracy stated in the report; the optimal hyperparameters are already provided. The function load_csv_data(), provided by the teachers for loading the training set, is also included in this file, together with the functions to predict labels and to create a submission file in .csv format.
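For illustration, writing a submission file could look like the sketch below. The function name write_submission and the column names "Id" and "Prediction" are assumptions for the example, not taken from the helper provided by the teachers.

```python
import csv

def write_submission(ids, y_pred, path):
    """Write one row per test sample: its id and its predicted label (-1 or 1)."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["Id", "Prediction"])
        writer.writeheader()
        for sample_id, label in zip(ids, y_pred):
            writer.writerow({"Id": int(sample_id), "Prediction": int(label)})
```

Here path would play the role of OUTPUT_PATH, for example '../sub.csv'.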
Code execution:
- Download and unzip the .zip archives of train.csv and test.csv from https://github.com/epfml/ML_course/tree/master/projects/project1/data
- Set DATA_TRAIN_PATH and DATA_TEST_PATH to your own paths (e.g. '../data/train.csv', '../data/test.csv') in the run.py file; all the optimal hyperparameters are already provided
- Set OUTPUT_PATH (e.g. '../sub.csv') to define where the submission file must be saved
- Run the following command in the terminal: python3 run.py, to obtain the .csv file for submission