# Data Science TU Wien Machine Learning Project (Group 42)
This project is designed to automate and streamline the process of running machine learning experiments on various datasets using different classification algorithms and evaluation methods. It provides tools for data preprocessing, model training, evaluation, and visualization, facilitating comprehensive analysis and comparison of model performance across datasets.
## Table of Contents
- Datasets
- Models
- Evaluation Methods
- Features and Capabilities
- Repository Structure
- How to Use the Repository
- Detailed Description
- Contributing
- License
## Datasets
The project includes the following datasets:
- Wine Reviews: Contains wine reviews with features like description, country, points, price, variety, and winery.
- Amazon Reviews: A dataset of Amazon product reviews with various attributes.
- Congressional Voting: Data on U.S. congressional voting patterns.
- Traffic Prediction: Traffic data including time, date, day of the week, and traffic situation.
## Models
Implemented machine learning models:
- Support Vector Machines (SVM): Various kernels (linear, polynomial, RBF, sigmoid) and hyperparameters.
- K-Nearest Neighbors (KNN): Different numbers of neighbors, weight functions, algorithms, and distance metrics.
- Random Forests (RF): Varying numbers of trees, depths, splitting criteria, and other hyperparameters.
## Evaluation Methods
Supported evaluation methods:
- Holdout: Splits the dataset into training and validation sets.
- Cross-Validation: Performs stratified K-fold cross-validation.
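These two evaluation strategies can be sketched with scikit-learn's standard splitting utilities; the dataset and model below are placeholders for illustration, not the project's own:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

# Holdout: a single stratified train/validation split.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model.fit(X_train, y_train)
holdout_acc = model.score(X_val, y_val)

# Cross-validation: stratified K-fold, averaging accuracy over the folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"holdout: {holdout_acc:.3f}, CV mean: {cv_scores.mean():.3f}")
```

Stratification keeps the class proportions of the full dataset in every split, which matters for the imbalanced datasets in this project.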
## Features and Capabilities
- Automated Experimentation: Run experiments across combinations of datasets, models, and evaluation methods.
- Data Preprocessing: Load, clean, and preprocess data for model training.
- Model Training and Evaluation: Train models with specified hyperparameters and evaluate using various metrics.
- Visualization: Generate plots for model performance, including confusion matrices and ROC curves.
- Result Aggregation: Collect and aggregate results from experiments.
- Performance Metrics: Measure and report execution time and memory usage.
## Repository Structure
```text
ds-ml-project/
├── data/
│   ├── raw/
│   │   ├── amazon-reviews/
│   │   ├── congressional-voting/
│   │   ├── traffic-data/
│   │   └── wine-reviews.arff
│   └── processed/
│       └── wine_reviews_processed.csv
├── output_results_holdout/
├── output_results_cross_val/
├── plots/
├── run_all_experiments.sh
├── src/
│   ├── data_processing/
│   │   └── preprocess.py
│   ├── evaluation/
│   │   ├── metrics.py
│   │   └── visualisation.py
│   ├── experiments/
│   │   └── run_experiments.py
│   └── models/
│       ├── knn.py
│       ├── random_forest.py
│       └── svm.py
└── README.md
```
- `data/`: Contains raw and processed datasets.
- `output_results_holdout/` and `output_results_cross_val/`: Directories where experiment results are saved.
- `plots/`: Directory for generated plots.
- `run_all_experiments.sh`: Shell script that automates running all experiments.
- `src/`: Source code directory.
  - `data_processing/preprocess.py`: Functions for loading and preprocessing data.
  - `evaluation/metrics.py`: Functions for evaluating models and saving metrics.
  - `evaluation/visualisation.py`: Scripts for generating visualizations.
  - `experiments/run_experiments.py`: Main script for running experiments.
  - `models/`: Model definitions for KNN, Random Forest, and SVM.
## How to Use the Repository

### Prerequisites
- Python 3.x
- Required Python packages (see `requirements.txt`, if available)
- NLTK data packages: WordNet, stopwords, etc.
### Installation
1. Clone the repository:
   ```shell
   git clone https://github.com/woerndle/ds-ml-project.git
   cd ds-ml-project
   ```
2. Install the required Python packages:
   ```shell
   pip install -r requirements.txt
   ```
3. Download NLTK data: the required NLTK data packages are downloaded automatically by the code.
### Running All Experiments
To reproduce the results, execute the script:
```shell
chmod +x run_all_experiments.sh
./run_all_experiments.sh
```
### Running Custom Experiments
Run custom experiments with specific arguments:
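For example, an invocation might look like the following; the flag names below are assumptions based on the options the script parses (dataset, model, evaluation method, subset size), so consult the script's `--help` output for the actual interface:

```shell
# Illustrative only -- verify flag names with:
#   python src/experiments/run_experiments.py --help
python src/experiments/run_experiments.py \
    --dataset wine-reviews \
    --model svm \
    --evaluation cross_val \
    --subset-size 1000
```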
## Detailed Description

### Data Preprocessing (`src/data_processing/preprocess.py`)
- Loading Datasets: From various formats such as CSV and ARFF.
- Handling Missing Values: Data imputation and cleaning.
- Text Preprocessing: Tokenization, lemmatization, and stopword removal for textual data.
- Feature Encoding: Label encoding and one-hot encoding for categorical features.
- Feature Scaling: Standardization of numerical features.
- Dimensionality Reduction: Using PCA for high-dimensional data.
- Data Splitting: Into training and validation sets or preparing for cross-validation.
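The encoding, scaling, and dimensionality-reduction steps above can be sketched as a scikit-learn pipeline; this is an illustrative sketch on toy data, not the project's actual `preprocess.py` code:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: one numerical and one categorical column (values are made up).
X = np.array([[10.0, "red"], [20.0, "white"], [15.0, "red"], [30.0, "rose"]],
             dtype=object)

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), [0]),   # standardize numerical features
        ("cat", OneHotEncoder(), [1]),    # one-hot encode categorical features
    ],
    sparse_threshold=0.0,                 # force a dense output so PCA can consume it
)
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("pca", PCA(n_components=2)),         # reduce dimensionality
])
X_reduced = pipeline.fit_transform(X)
print(X_reduced.shape)  # → (4, 2)
```

Wrapping the steps in a single `Pipeline` ensures the same transformations fitted on the training split are applied, unfitted, to the validation split.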
### Models (`src/models/`)
- SVM (`svm.py`): Defines SVM models with various kernels and hyperparameters.
- KNN (`knn.py`): Defines KNN models with different configurations.
- Random Forest (`random_forest.py`): Defines Random Forest models with varying hyperparameters.
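One way such model families can be expressed is as lists of named estimator configurations; the `get_models` helper and the hyperparameter values below are illustrative assumptions, not the project's actual code:

```python
from itertools import product
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def get_models(model_type):
    """Return a list of (name, estimator) pairs for one model family."""
    if model_type == "svm":
        return [(f"svm_{k}_C{c}", SVC(kernel=k, C=c, probability=True))
                for k, c in product(["linear", "poly", "rbf", "sigmoid"], [0.1, 1, 10])]
    if model_type == "knn":
        return [(f"knn_k{k}_{w}", KNeighborsClassifier(n_neighbors=k, weights=w))
                for k, w in product([3, 5, 11], ["uniform", "distance"])]
    if model_type == "rf":
        return [(f"rf_n{n}_d{d}", RandomForestClassifier(n_estimators=n, max_depth=d))
                for n, d in product([50, 100], [5, None])]
    raise ValueError(f"unknown model type: {model_type}")

print(len(get_models("svm")))  # 4 kernels x 3 C values = 12 configurations
```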
### Experiment Runner (`src/experiments/run_experiments.py`)
- Argument Parsing: Determines the dataset, model, evaluation method, and subset size.
- Data Loading: Calls preprocessing functions to load and prepare data.
- Model Retrieval: Gets the list of models based on the specified type.
- Training and Evaluation: Trains models and evaluates them using the specified evaluation method.
- Result Saving: Saves metrics and generates plots for each model.
### Evaluation (`src/evaluation/metrics.py`)
- Metrics Calculation: Accuracy, F1-score, confusion matrix, ROC curve, etc.
- Performance Tracking: Measures elapsed time and memory usage.
- Result Serialization: Saves metrics to JSON files for analysis.
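A minimal sketch of this metrics-and-serialization flow, assuming scikit-learn metrics and the standard library's `time`/`tracemalloc` for performance tracking (the dataset, model, and output file name are placeholders):

```python
import json
import time
import tracemalloc
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=42)

# Track elapsed time and peak memory around training and prediction.
tracemalloc.start()
start = time.perf_counter()
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
y_pred = model.predict(X_val)
elapsed = time.perf_counter() - start
_, peak_mem = tracemalloc.get_traced_memory()
tracemalloc.stop()

metrics = {
    "accuracy": accuracy_score(y_val, y_pred),
    "f1_weighted": f1_score(y_val, y_pred, average="weighted"),
    "confusion_matrix": confusion_matrix(y_val, y_pred).tolist(),
    "elapsed_seconds": elapsed,
    "peak_memory_bytes": peak_mem,
}
with open("metrics_rf.json", "w") as f:  # output file name is illustrative
    json.dump(metrics, f, indent=2)
```

Converting the confusion matrix to a plain list makes the whole dictionary JSON-serializable for later aggregation.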
### Visualization (`src/evaluation/visualisation.py`)
- Data Collection: Gathers metrics from result directories.
- Plot Generation: Creates plots like box plots, scatter plots, and bar charts.
- Summary Tables: Generates tables summarizing model performance.
- Output: Saves plots and tables in the plots/ directory.
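The data-collection step might look like the following sketch; the `collect_results` helper, directory layout, and file naming are assumptions for illustration:

```python
import glob
import json
import os
from collections import defaultdict

def collect_results(*result_dirs):
    """Gather per-model metrics JSON files into one mapping:
    model name -> list of metric dicts (one per evaluation run)."""
    results = defaultdict(list)
    for d in result_dirs:
        for path in glob.glob(os.path.join(d, "*.json")):
            name = os.path.splitext(os.path.basename(path))[0]
            with open(path) as f:
                results[name].append(json.load(f))
    return results

# Usage sketch:
# results = collect_results("output_results_holdout", "output_results_cross_val")
```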
## Contributing
If you wish to contribute to this project, please fork the repository and submit a pull request.
*This README was created by a language model that was presented with the codebase and tasked with creating this file.*