# Data Science TU Wien Machine Learning Project (Group 42)
This project is designed to automate and streamline the process of running machine learning experiments on various datasets using different classification algorithms and evaluation methods. It provides tools for data preprocessing, model training, evaluation, and visualization, facilitating comprehensive analysis and comparison of model performance across datasets.
## Table of Contents
- Datasets
- Models
- Evaluation Methods
- Features and Capabilities
- Repository Structure
- How to Use the Repository
- Detailed Description
- Contributing
- License
## Datasets
The project includes the following datasets:
- Wine Reviews: Contains wine reviews with features like description, country, points, price, variety, and winery.
- Amazon Reviews: A dataset of Amazon product reviews with various attributes.
- Congressional Voting: Data on U.S. congressional voting patterns.
- Traffic Prediction: Traffic data including time, date, day of the week, and traffic situation.
## Models
Implemented machine learning models:
- Support Vector Machines (SVM): Various kernels (linear, polynomial, RBF, sigmoid) and hyperparameters.
- K-Nearest Neighbors (KNN): Different numbers of neighbors, weight functions, algorithms, and distance metrics.
- Random Forests (RF): Varying numbers of trees, depths, splitting criteria, and other hyperparameters.
## Evaluation Methods
Supported evaluation methods:
- Holdout: Splits the dataset into training and validation sets.
- Cross-Validation: Performs stratified K-fold cross-validation.
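These two evaluation strategies can be sketched with scikit-learn's standard splitting utilities; the dataset and model below are placeholders for illustration, not the project's own:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

# Holdout: a single stratified train/validation split.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model.fit(X_train, y_train)
holdout_acc = model.score(X_val, y_val)

# Cross-validation: stratified K-fold, averaging accuracy over the folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"holdout: {holdout_acc:.3f}, CV mean: {cv_scores.mean():.3f}")
```

Stratification keeps the class proportions of the full dataset in every split, which matters for the imbalanced datasets in this project.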
## Features and Capabilities
- Automated Experimentation: Run experiments across combinations of datasets, models, and evaluation methods.
- Data Preprocessing: Load, clean, and preprocess data for model training.
- Model Training and Evaluation: Train models with specified hyperparameters and evaluate using various metrics.
- Visualization: Generate plots for model performance, including confusion matrices and ROC curves.
- Result Aggregation: Collect and aggregate results from experiments.
- Performance Metrics: Measure and report execution time and memory usage.
## Repository Structure
```text
ds-ml-project/
├── data/
│   ├── raw/
│   │   ├── amazon-reviews/
│   │   ├── congressional-voting/
│   │   ├── traffic-data/
│   │   └── wine-reviews.arff
│   └── processed/
│       └── wine_reviews_processed.csv
├── output_results_holdout/
├── output_results_cross_val/
├── plots/
├── run_all_experiments.sh
├── src/
│   ├── data_processing/
│   │   └── preprocess.py
│   ├── evaluation/
│   │   ├── metrics.py
│   │   └── visualisation.py
│   ├── experiments/
│   │   └── run_experiments.py
│   └── models/
│       ├── knn.py
│       ├── random_forest.py
│       └── svm.py
└── README.md
```
- `data/`: Contains raw and processed datasets.
- `output_results_holdout/` and `output_results_cross_val/`: Directories where experiment results are saved.
- `plots/`: Directory for generated plots.
- `run_all_experiments.sh`: Shell script that automates running all experiments.
- `src/`: Source code directory.
  - `data_processing/preprocess.py`: Functions for loading and preprocessing data.
  - `evaluation/metrics.py`: Functions for evaluating models and saving metrics.
  - `evaluation/visualisation.py`: Scripts for generating visualizations.
  - `experiments/run_experiments.py`: Main script for running experiments.
  - `models/`: Model definitions for KNN, Random Forest, and SVM.
## How to Use the Repository

### Prerequisites
- Python 3.x
- Required Python packages (see `requirements.txt`, if available)
- NLTK data packages: WordNet, stopwords, etc.
### Installation
1. Clone the repository:
   ```shell
   git clone https://github.com/woerndle/ds-ml-project.git
   cd ds-ml-project
   ```
2. Install the required Python packages:
   ```shell
   pip install -r requirements.txt
   ```
3. Download NLTK data: the required NLTK data packages are downloaded automatically by the code.
### Running All Experiments
To reproduce the results, execute the script:
```shell
chmod +x run_all_experiments.sh
./run_all_experiments.sh
```
### Running Custom Experiments
Run custom experiments with specific arguments:
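For example, an invocation might look like the following; the flag names below are assumptions based on the options the script parses (dataset, model, evaluation method, subset size), so consult the script's `--help` output for the actual interface:

```shell
# Illustrative only -- verify flag names with:
#   python src/experiments/run_experiments.py --help
python src/experiments/run_experiments.py \
    --dataset wine-reviews \
    --model svm \
    --evaluation cross_val \
    --subset-size 1000
```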
## Detailed Description

### Data Preprocessing (`src/data_processing/preprocess.py`)
- Loading Datasets: From various formats such as CSV and ARFF.
- Handling Missing Values: Data imputation and cleaning.
- Text Preprocessing: Tokenization, lemmatization, and stopword removal for textual data.
- Feature Encoding: Label encoding and one-hot encoding for categorical features.
- Feature Scaling: Standardization of numerical features.
- Dimensionality Reduction: Using PCA for high-dimensional data.
- Data Splitting: Into training and validation sets or preparing for cross-validation.
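The encoding, scaling, and dimensionality-reduction steps above can be sketched as a scikit-learn pipeline; this is an illustrative sketch on toy data, not the project's actual `preprocess.py` code:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: one numerical and one categorical column (values are made up).
X = np.array([[10.0, "red"], [20.0, "white"], [15.0, "red"], [30.0, "rose"]],
             dtype=object)

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), [0]),   # standardize numerical features
        ("cat", OneHotEncoder(), [1]),    # one-hot encode categorical features
    ],
    sparse_threshold=0.0,                 # force a dense output so PCA can consume it
)
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("pca", PCA(n_components=2)),         # reduce dimensionality
])
X_reduced = pipeline.fit_transform(X)
print(X_reduced.shape)  # → (4, 2)
```

Wrapping the steps in a single `Pipeline` ensures the same transformations fitted on the training split are applied, unfitted, to the validation split.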
### Models (`src/models/`)
- SVM (`svm.py`): Defines SVM models with various kernels and hyperparameters.
- KNN (`knn.py`): Defines KNN models with different configurations.
- Random Forest (`random_forest.py`): Defines Random Forest models with varying hyperparameters.
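One way such model families can be expressed is as lists of named estimator configurations; the `get_models` helper and the hyperparameter values below are illustrative assumptions, not the project's actual code:

```python
from itertools import product
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def get_models(model_type):
    """Return a list of (name, estimator) pairs for one model family."""
    if model_type == "svm":
        return [(f"svm_{k}_C{c}", SVC(kernel=k, C=c, probability=True))
                for k, c in product(["linear", "poly", "rbf", "sigmoid"], [0.1, 1, 10])]
    if model_type == "knn":
        return [(f"knn_k{k}_{w}", KNeighborsClassifier(n_neighbors=k, weights=w))
                for k, w in product([3, 5, 11], ["uniform", "distance"])]
    if model_type == "rf":
        return [(f"rf_n{n}_d{d}", RandomForestClassifier(n_estimators=n, max_depth=d))
                for n, d in product([50, 100], [5, None])]
    raise ValueError(f"unknown model type: {model_type}")

print(len(get_models("svm")))  # 4 kernels x 3 C values = 12 configurations
```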
### Experiment Runner (`src/experiments/run_experiments.py`)
- Argument Parsing: Determines the dataset, model, evaluation method, and subset size.
- Data Loading: Calls preprocessing functions to load and prepare data.
- Model Retrieval: Gets the list of models based on the specified type.
- Training and Evaluation: Trains models and evaluates them using the specified evaluation method.
- Result Saving: Saves metrics and generates plots for each model.
### Evaluation (`src/evaluation/metrics.py`)
- Metrics Calculation: Accuracy, F1-score, confusion matrix, ROC curve, etc.
- Performance Tracking: Measures elapsed time and memory usage.
- Result Serialization: Saves metrics to JSON files for analysis.
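A minimal sketch of this metrics-and-serialization flow, assuming scikit-learn metrics and the standard library's `time`/`tracemalloc` for performance tracking (the dataset, model, and output file name are placeholders):

```python
import json
import time
import tracemalloc
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=42)

# Track elapsed time and peak memory around training and prediction.
tracemalloc.start()
start = time.perf_counter()
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
y_pred = model.predict(X_val)
elapsed = time.perf_counter() - start
_, peak_mem = tracemalloc.get_traced_memory()
tracemalloc.stop()

metrics = {
    "accuracy": accuracy_score(y_val, y_pred),
    "f1_weighted": f1_score(y_val, y_pred, average="weighted"),
    "confusion_matrix": confusion_matrix(y_val, y_pred).tolist(),
    "elapsed_seconds": elapsed,
    "peak_memory_bytes": peak_mem,
}
with open("metrics_rf.json", "w") as f:  # output file name is illustrative
    json.dump(metrics, f, indent=2)
```

Converting the confusion matrix to a plain list makes the whole dictionary JSON-serializable for later aggregation.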
### Visualization (`src/evaluation/visualisation.py`)
- Data Collection: Gathers metrics from result directories.
- Plot Generation: Creates plots like box plots, scatter plots, and bar charts.
- Summary Tables: Generates tables summarizing model performance.
- Output: Saves plots and tables in the plots/ directory.
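The data-collection step might look like the following sketch; the `collect_results` helper, directory layout, and file naming are assumptions for illustration:

```python
import glob
import json
import os
from collections import defaultdict

def collect_results(*result_dirs):
    """Gather per-model metrics JSON files into one mapping:
    model name -> list of metric dicts (one per evaluation run)."""
    results = defaultdict(list)
    for d in result_dirs:
        for path in glob.glob(os.path.join(d, "*.json")):
            name = os.path.splitext(os.path.basename(path))[0]
            with open(path) as f:
                results[name].append(json.load(f))
    return results

# Usage sketch:
# results = collect_results("output_results_holdout", "output_results_cross_val")
```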
## Contributing
If you wish to contribute to this project, please fork the repository and submit a pull request.
*This README was created by a language model that was presented with the codebase and tasked with creating this file.*