Machine learning (CS-433)

Project 1: Higgs Boson detection challenge

Antoine Madrona, Loïc Bruchez, Valentin Bigot, October 2020

Dataset

The dataset is split into a training set of 250’000 samples and a test set of 568’238 samples, each with 30 features. Every training sample is paired with a label: −1 for background noise and 1 for the presence of a Higgs boson.

Model's basic operations

The model implemented in run.py loads the training set from DATA_TRAIN_PATH (see Code execution) and applies the following preprocessing steps, in this order:

Splitting of the dataset by PRI_jet_num category (0, 1, or 2 and 3 together)
Logarithmic transformation of selected features
Polynomial augmentation of the features
Standardization of the features
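The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code from implementations.py: the function names, the log(1 + v) form of the transformation, the choice of columns, the polynomial degree and the bias column added at the end are all assumptions made for the example.

```python
import numpy as np

def split_by_jet_num(x, jet_col):
    """Boolean masks for the three groups: jet number 0, 1, and 2 or 3."""
    jet = x[:, jet_col]
    return [jet == 0, jet == 1, jet >= 2]

def log_transform(x, cols):
    """Apply log(1 + v) to the selected columns (assumed non-negative)."""
    x = np.array(x, dtype=float)  # float copy, the input is left untouched
    x[:, cols] = np.log1p(x[:, cols])
    return x

def build_poly(x, degree):
    """Stack the powers 1..degree of every feature side by side."""
    return np.hstack([x ** d for d in range(1, degree + 1)])

def standardize(x, mean=None, std=None):
    """Zero-mean, unit-variance scaling; constant columns are only centred."""
    mean = x.mean(axis=0) if mean is None else mean
    std = x.std(axis=0) if std is None else std
    std = np.where(std == 0, 1.0, std)
    return (x - mean) / std, mean, std

def preprocess(x, log_cols, degree):
    """Log transform, polynomial augmentation, then standardization."""
    x = log_transform(x, log_cols)
    x = build_poly(x, degree)
    x, mean, std = standardize(x)
    bias = np.ones((x.shape[0], 1))  # assumption: bias added after scaling
    return np.hstack([bias, x]), mean, std

# Toy data: 8 samples, 3 features, the last column plays the jet number.
rng = np.random.default_rng(0)
x_toy = np.abs(rng.normal(size=(8, 3)))
x_toy[:, 2] = [0, 1, 2, 3, 0, 1, 2, 3]
masks = split_by_jet_num(x_toy, jet_col=2)
phi, mean, std = preprocess(x_toy[masks[2]], log_cols=[0], degree=2)
```

In a sketch like this, the mean and standard deviation returned for a training group would be passed back into standardize when preparing the matching test group, so that both are scaled identically.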

Afterwards, the model is trained with the ridge regression algorithm to obtain the weights. These weights are used to predict the labels of each subset of the split dataset; the predictions are then merged and the submission file is created.
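A minimal sketch of this train, predict and merge stage is given below. It assumes one ridge model per PRI_jet_num group, the normal-equation form with a 2·N·lambda penalty, and a decision threshold at 0; the helper fit_predict_by_group is hypothetical and the actual code in run.py may be organised differently.

```python
import numpy as np

def ridge_regression(y, tx, lambda_):
    """Solve (X^T X + 2 N lambda I) w = X^T y for the weights w."""
    n, d = tx.shape
    a = tx.T @ tx + 2 * n * lambda_ * np.eye(d)
    return np.linalg.solve(a, tx.T @ y)

def predict_labels(w, tx):
    """Threshold the linear output at 0 to obtain labels in {-1, 1}."""
    return np.where(tx @ w >= 0, 1, -1)

def fit_predict_by_group(y_tr, x_tr, masks_tr, x_te, masks_te, lambda_):
    """Fit one model per group and write predictions back in test order."""
    y_pred = np.zeros(x_te.shape[0], dtype=int)
    for m_tr, m_te in zip(masks_tr, masks_te):
        w = ridge_regression(y_tr[m_tr], x_tr[m_tr], lambda_)
        y_pred[m_te] = predict_labels(w, x_te[m_te])
    return y_pred

# Toy example: a bias column plus one feature whose sign gives the label.
feat = np.array([-2.0, -1.0, 1.0, 2.0] * 2)
x_toy = np.column_stack([np.ones(8), feat])
y_toy = np.where(feat > 0, 1, -1)
group = np.arange(8) < 4
masks = [group, ~group]
y_hat = fit_predict_by_group(y_toy, x_toy, masks, x_toy, masks, lambda_=1e-3)
```

Writing the predictions through the boolean masks keeps them in the original sample order, which is what makes the final merge a simple assignment.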

Useful files

The code is split across the following files, which contain all the functions needed to reproduce our results:

implementations.py

run.py

implementations.py:

This file contains all the functions required to reproduce our preprocessing pipeline and to use our regression model. Specifically, the file is divided into 3 sections:

"IMPLEMENTATIONS" contains the 6 functions least_squares_GD, least_squares_SGD, least_squares, ridge_regression, logistic_regression and reg_logistic_regression, which form a toolbox for developing the regression model.
"UTILITARIES" contains helper functions that support the methods in the "IMPLEMENTATIONS" section, as well as the functions needed for prediction and for loading the datasets.
"PREPROCESSING" contains all the preprocessing steps used in this work to optimize the model's performance.
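To give an idea of what a function in the "IMPLEMENTATIONS" section might look like, here is a minimal sketch of least_squares_GD. The signature (y, tx, initial_w, max_iters, gamma), the (w, loss) return value and the compute_mse helper are assumptions made for this example; the version in implementations.py may differ.

```python
import numpy as np

def compute_mse(y, tx, w):
    """Mean squared error with the 1/(2N) convention."""
    e = y - tx @ w
    return e @ e / (2 * len(y))

def least_squares_GD(y, tx, initial_w, max_iters, gamma):
    """Linear regression by full-batch gradient descent on the MSE."""
    w = np.asarray(initial_w, dtype=float)
    for _ in range(max_iters):
        grad = -tx.T @ (y - tx @ w) / len(y)  # gradient of the MSE loss
        w = w - gamma * grad                  # step of size gamma
    return w, compute_mse(y, tx, w)

# Toy example: points on the line y = 1 + 2 x, with a bias column in tx.
tx_toy = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y_toy = np.array([1.0, 3.0, 5.0])
w_gd, loss_gd = least_squares_GD(y_toy, tx_toy, np.zeros(2), 2000, 0.1)
```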

run.py:

run.py reproduces the best prediction accuracy stated in the report. The optimal hyperparameters are already provided. The functions provided by the teachers for loading the training set (load_csv_data()), predicting labels and creating a submission file in .csv format are also included in this file.
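As an illustration of the last of these helpers, a submission writer could look like the sketch below. The function name create_csv_submission, the column names Id and Prediction, and the sample ids are assumptions for the example, not taken from run.py.

```python
import csv
import os
import tempfile

def create_csv_submission(ids, y_pred, path):
    """Write one row per sample with its id and predicted label (-1 or 1)."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["Id", "Prediction"])
        writer.writeheader()
        for sample_id, label in zip(ids, y_pred):
            writer.writerow({"Id": int(sample_id), "Prediction": int(label)})

# Write two made-up predictions to a temporary file and read them back.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "sub.csv")
    create_csv_submission([350000, 350001], [1, -1], out_path)
    with open(out_path, newline="") as f:
        rows = list(csv.reader(f))
```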

Code execution

Download and unzip the .zip archives containing train.csv and test.csv from https://github.com/epfml/ML_course/tree/master/projects/project1/data
Set DATA_TRAIN_PATH and DATA_TEST_PATH to your own paths (e.g. '../data/train.csv', '../data/test.csv') in the run.py file; all the optimal hyperparameters are already provided
Set the OUTPUT_PATH (e.g. '../sub.csv') to define where the submission file must be saved
Run the following command in the terminal to obtain the .csv file for submission: python3 run.py
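For orientation, the path settings in run.py could look like the sketch below, using the example values given above. The missing_inputs helper is hypothetical and only illustrates a quick check that both data files were unzipped before the script is launched.

```python
import os

# Example values from the steps above; replace them with your own paths.
DATA_TRAIN_PATH = "../data/train.csv"
DATA_TEST_PATH = "../data/test.csv"
OUTPUT_PATH = "../sub.csv"

def missing_inputs(*paths):
    """Return the subset of paths that do not point to an existing file."""
    return [p for p in paths if not os.path.isfile(p)]

missing = missing_inputs(DATA_TRAIN_PATH, DATA_TEST_PATH)
if missing:
    print("Missing input files:", ", ".join(missing))
```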
