CUDA and OpenMP Accelerated Implementation of Naive Bayes and Its Variants
This project implements four variants of the Naive Bayes algorithm, separately in CUDA and OpenMP, to leverage hardware acceleration for the best efficiency. A further objective is to compare how well the two implementations (OpenMP vs. CUDA) fare against each other.
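For reference, all four variants share the same prediction-time structure: each test sample is scored independently against per-class statistics (log prior plus log likelihoods) and assigned the highest-scoring class, which is why the workload parallelizes well. The sketch below is a minimal NumPy illustration of the multinomial case, not the project's actual C++/CUDA code.

```python
import numpy as np

def multinomial_nb_predict(X, log_prior, log_likelihood):
    """Minimal sketch of multinomial Naive Bayes prediction (illustrative only).

    X:              (n_samples, n_features) bag-of-words counts
    log_prior:      (n_classes,)            log P(class)
    log_likelihood: (n_classes, n_features) log P(word | class)
    """
    preds = np.empty(X.shape[0], dtype=int)
    # Each sample is scored independently of the others, so a loop like this is
    # the natural target for OpenMP threads or CUDA threads in an accelerated version.
    for i in range(X.shape[0]):
        scores = log_prior + log_likelihood @ X[i]  # log-posterior up to a constant
        preds[i] = int(np.argmax(scores))
    return preds
```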
The following steps are needed to run this code; more details on each are provided below:
- Download the pre-processed dataset (preferred; ~30 seconds) OR pre-process the dataset yourself (30-45 minutes).
- Compile and run the code.
- (Optional) Check functionality against scikit-learn's Naive Bayes implementations.
The following sections expand on the above steps.
We use the IMDb movie review dataset for the document classification task (Bernoulli NB, Multinomial NB, Complement NB) and the Iris dataset for flower classification (Gaussian NB). The review text is converted into suitable numeric formats, such as one-hot and bag-of-words representations, using Python packages such as NLTK and scikit-learn.
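As a rough illustration of that conversion (a minimal scikit-learn sketch; the actual tokenization and cleaning in preprocessData.py may differ), bag-of-words counts suit MultinomialNB/ComplementNB, while a binary presence/absence matrix suits BernoulliNB:

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["a great movie", "a terrible terrible movie"]  # toy stand-ins for IMDb reviews

# Bag-of-words counts, suitable for MultinomialNB / ComplementNB
bow = CountVectorizer()
X_counts = bow.fit_transform(reviews)      # sparse matrix of word counts

# Binary presence/absence ("one-hot" per word) features, suitable for BernoulliNB
onehot = CountVectorizer(binary=True)
X_binary = onehot.fit_transform(reviews)   # entries are 0/1

print(bow.get_feature_names_out())
print(X_counts.toarray())
print(X_binary.toarray())
```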
If you download the pre-processed dataset (recommended), you do not need to pre-process the data yourself:
bash download_dataset.sh
Choose [y] whenever asked whether to replace old files.
Alternatively, in case you would rather not download files from the internet (we understand), you can pre-process the data yourself.
To run the data preprocessing Python script on Euler, follow this link to install Anaconda (and thus Python) on Euler, then set the --algoID argument to one of the following options to create the dataset for that particular algorithm:
- --algoID 1 for GaussianNB
- --algoID 2 for BernoulliNB
- --algoID 3 for MultinomialNB
- --algoID 4 for ComplementNB
- Install all dependencies using
pip3 install -r requirements.txt
- Run the script using
python preprocessData.py --algoID 2
Note that this may take 30-40 minutes for the Bernoulli, Multinomial, and Complement variants; it creates .csv files in the data folder. The long runtime comes from writing a huge matrix to a CSV file, and since this project focuses on HPC with CUDA and OpenMP rather than on data preprocessing, feel free to use download_dataset.sh to fetch the data instead.
If you are running this code on Windows/Mac, install Python and install the dependencies with
pip3 install -r requirements.txt
and then run the data preprocessing step using the command below (change --algoID to generate the dataset for a different algorithm; storing the data in CSV format takes some time):
python preprocessData.py --algoID 2
Note: For GaussianNB, we use the Iris dataset, which is already in numerical format, so no preprocessing step is needed; the data is stored directly in the data folder.
To run on Euler via SLURM, go to the OpenMP_NB folder and submit the job:
sbatch NBopenmp.sh
For the CUDA version, go to the Cuda_NB folder, load the CUDA module, and submit the job:
module load cuda/10.0
sbatch NBcuda.sh
You can edit the *.sh script to choose which variant you wish to run. For instance, in the CUDA version:
- ./CudaNB 0 # Gaussian Naive Bayes on the Iris (Flower Classification) dataset
| Naive Bayes Variant | Dataset | AlgoID |
|---|---|---|
| Gaussian | Iris (Flower Classification) Dataset | 0 |
| Bernoulli | IMDb (Movie Sentiment Classification) | 1 |
| Multinomial | IMDb (Movie Sentiment Classification) | 2 |
| Complement | IMDb (Movie Sentiment Classification) | 3 |
The results will be written to a *.out file in the same folder.
- To compile and run the OpenMP version manually, go to the OpenMP_NB folder and use the following:
g++ -std=c++0x main.cpp classifier.cpp -Wall -O3 -o OpenMP_NB -fopenmp
./OpenMP_NB <algoID>
Example:
./OpenMP_NB 4
- To compile and run the CUDA version manually, go to the Cuda_NB folder and use the following:
nvcc main.cu classifier.cu -Xcompiler -O3 -Xcompiler -fopenmp -Xcompiler -Wall -Xptxas -O3 -o CudaNB
./CudaNB <algoID>
Example:
./CudaNB 4
To check the functionality of our C++ implementations of the Naive Bayes variants, we also run the Python machine learning package scikit-learn and compare accuracy on the test set. For example, to test ComplementNB, use algoID 4.
Install the packages (if you have not already done so by following the steps above) and change the --algoID value in the following command:
pip3 install --user -r requirements.txt
python test_algos.py --algoID 2
NOTE: Please use a SLURM script if you are running this on Euler rather than running the command directly!
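For reference, the kind of accuracy check test_algos.py performs can be sketched as follows; the CSV file names and the assumption that the label sits in the last column are illustrative guesses, not the script's actual interface:

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import accuracy_score

# Hypothetical file names/layout: adjust to match the CSVs produced in the data folder.
train = np.loadtxt("data/train.csv", delimiter=",")
test = np.loadtxt("data/test.csv", delimiter=",")
X_train, y_train = train[:, :-1], train[:, -1]
X_test, y_test = test[:, :-1], test[:, -1]

clf = ComplementNB().fit(X_train, y_train)
print("ComplementNB test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```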