Goal

This project compares the timing and accuracy of different machine learning and deep learning algorithms running on the multi-language engine Apache Spark. It was developed in collaboration with Prof. Reforgiato for the BigData and AdvancedBigDataArchitectures courses at the University of Cagliari.

Requirements

Spark (Hadoop is optional), Python 3, and the dedicated libraries are required.

Models and Datasets

We used three models: two machine learning models (Random Forest and Logistic Regression) and one deep learning model (Keras).
The creditcard dataset is from Kaggle. Please download the full dataset from Kaggle, as the version in this repository is partial.

Machine

The machine we used for these tests has the following characteristics:

Intel® Core™ i5-6500 CPU @ 3.20GHz × 4
16.0 GiB of memory
Ubuntu 22.04.4 LTS

Spark Configuration

spark-submit accepts several parameters to customise the configuration. These are the ones we use:

--master spark://master:7077
--executor-memory 8G
--total-executor-cores 4
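
As a rough illustration, the flags above could be assembled into a complete command as in the sketch below. Only the three flags come from our configuration; the script path `src/main.py` and the helper `build_submit_command` are hypothetical names introduced for this example.

```python
import shlex

# The three flags used for the tests in this README.
SPARK_FLAGS = {
    "--master": "spark://master:7077",
    "--executor-memory": "8G",
    "--total-executor-cores": "4",
}


def build_submit_command(script_path):
    """Return the spark-submit invocation as a list of arguments."""
    command = ["spark-submit"]
    for flag, value in SPARK_FLAGS.items():
        command += [flag, value]
    command.append(script_path)
    return command


# "src/main.py" is a hypothetical script name used only for illustration.
command = build_submit_command("src/main.py")
print(shlex.join(command))
```

On a machine where `spark-submit` is on the PATH, the resulting list can be passed to `subprocess.run(command, check=True)`.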

Results:

Repeated tests yielded the following results:

RandomForest:
- Test Error = 0.36113
- Time: 7.6 minutes
LogisticRegression:
- Test Error = 0.233181
- Time: 28 seconds
KerasModel:
- Test Error = 0.111258
- Time: 24.13 seconds
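
For readers who want to reproduce this kind of measurement, here is a minimal sketch of how a wall-clock time and a test error might be obtained. It assumes that "Test Error" means the fraction of misclassified test samples (1 - accuracy), which this README does not state explicitly. The helpers `timed` and `error_rate` and the toy predictor are hypothetical and are not taken from the project code.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using wall-clock time."""
    start = time.time()
    result = fn(*args, **kwargs)
    return result, time.time() - start


def error_rate(predictions, labels):
    """Fraction of misclassified samples, i.e. 1 - accuracy (assumed definition)."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equally long")
    wrong = sum(1 for p, y in zip(predictions, labels) if p != y)
    return wrong / len(labels)


# Toy stand-in for a trained model: always predicts class 0 (illustration only).
def predict_all_zero(rows):
    return [0 for _ in rows]


labels = [0, 1, 0, 0]
predictions, elapsed = timed(predict_all_zero, labels)
print(f"Test Error = {error_rate(predictions, labels):g}")
print(f"{elapsed} seconds")
```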

On this dataset, these results show both the advantage of the deep learning model over the classical ones and the importance of choosing the right model when making a prediction: the Random Forest took roughly 16 to 19 times longer than the other two models and also had the highest error rate.
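
The slowdown of the Random Forest can be checked with a few lines of arithmetic on the reported times. This is only a sketch of the calculation; the Keras time is rounded to two decimals here.

```python
# Wall-clock times reported in the results, converted to seconds.
random_forest_s = 7.6 * 60        # 7.6 minutes
logistic_regression_s = 28.0
keras_model_s = 24.13             # rounded from the measured value

ratio_vs_lr = random_forest_s / logistic_regression_s
ratio_vs_keras = random_forest_s / keras_model_s

print(f"RandomForest vs LogisticRegression: {ratio_vs_lr:.1f}x slower")
print(f"RandomForest vs KerasModel: {ratio_vs_keras:.1f}x slower")
```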

Disclaimer

Tests were run in client mode, as Spark standalone clusters do not support cluster deploy mode for Python applications.
