Mastering Snowflake: Sentiment Analysis and Performance Experiments

Introduction

This repository contains the code, benchmark results, plots, and report for Advanced Data Systems Project 1, "Mastering Snowflake: Sentiment Analysis and Performance Experiments".

Project Structure

Code and Data

python_udtf/naive_bayes_udtf.py: Implementation of Naive Bayes using a UDTF in Python.
python_udtf/naive_bayes_udtf.sql: Naive Bayes implementation using a UDTF in Python, adapted for Snowflake.
snowflake_sql/naive_bayes_sql.sql: Naive Bayes implementation in SQL.
tpch_benchmark/tpch_benchmark.ipynb: Jupyter notebook that runs the TPC-H performance experiments.
tpch_benchmark/query_execution_times.csv: Execution times of every query run across all possible combinations.
tpch_benchmark/average_query_execution_times.csv: Average execution time of each query over three runs, across all possible combinations.
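
To illustrate how per-run timings relate to the averaged file, here is a minimal sketch of averaging over runs. The column names (configuration, query, run, execution_time_s) and the sample values are hypothetical, chosen only for illustration; the actual CSV layout in this repository may differ.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def average_execution_times(rows):
    """Group per-run timings by (configuration, query) and average each group."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["configuration"], row["query"])
        grouped[key].append(float(row["execution_time_s"]))
    return {key: mean(times) for key, times in grouped.items()}


# Hypothetical sample in the assumed layout; the numbers are made up.
SAMPLE_CSV = """configuration,query,run,execution_time_s
config_a,Q1,1,1.0
config_a,Q1,2,2.0
config_a,Q1,3,3.0
config_b,Q1,1,4.0
config_b,Q1,2,4.0
config_b,Q1,3,4.0
"""

averages = average_execution_times(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for (configuration, query), avg in sorted(averages.items()):
    print(configuration, query, avg)
```

To apply the same function to a real file, pass csv.DictReader(open(path, newline="")) after adjusting the column names to match the file.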

Note: The training and test data from the Yelp Review dataset (https://huggingface.co/datasets/Yelp/yelp_review_full) were uploaded to Snowflake at the beginning of this project and are not available here on GitHub due to their size.

Additional Folders

"plots": Directory containing images and plots used in the report.
"report": Directory containing the report of this assignment in PDF format.

Installation

Prerequisites: Ensure Python 3.12.6 is installed on your machine. Other versions might work, but this project was developed with 3.12.6.

1. Create and Activate a Virtual Environment

Create a virtual environment in the root directory of this project by running the following commands:

For macOS/Linux:
- python3 -m venv .venv
- source .venv/bin/activate
For Windows:
- python -m venv .venv
- .venv\Scripts\activate

2. Install Required Packages

With the virtual environment activated, install all necessary packages by running:

pip install -r requirements.txt

Usage

1. Running the TPC-H Benchmark Notebook:

To reproduce the results, run the entire tpch_benchmark/tpch_benchmark.ipynb notebook, supplying your own "user", "account", and "password" when connecting to Snowflake (more details inside the notebook).
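
One way to keep credentials out of the notebook is to read them from environment variables. The sketch below is not taken from the notebook: it assumes the connection is made with the snowflake-connector-python package, and the environment variable names (SNOWFLAKE_USER, SNOWFLAKE_ACCOUNT, SNOWFLAKE_PASSWORD) and helper names are hypothetical.

```python
import os

# Hypothetical variable names; pick whatever suits your setup.
REQUIRED_VARS = ("SNOWFLAKE_USER", "SNOWFLAKE_ACCOUNT", "SNOWFLAKE_PASSWORD")


def load_snowflake_credentials(env=None):
    """Return the user/account/password parameters, failing early if any is unset."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "user": env["SNOWFLAKE_USER"],
        "account": env["SNOWFLAKE_ACCOUNT"],
        "password": env["SNOWFLAKE_PASSWORD"],
    }


def connect(env=None):
    """Open a Snowflake connection (assumes snowflake-connector-python is installed)."""
    # Imported lazily so the helper above also works without the connector.
    import snowflake.connector

    return snowflake.connector.connect(**load_snowflake_credentials(env))
```

Calling connect() with the three variables exported in your shell returns a connection object; a missing variable raises a RuntimeError before any network call is attempted.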

2. Executing Naive Bayes Implementations:

The Naive Bayes implementations, both in SQL and as a UDTF in Python, need to be executed within Snowflake.
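
For readers unfamiliar with how a Naive Bayes trainer fits the shape of a Python UDTF handler (a class with a per-row process method and an end_partition method that emits rows), here is a minimal local sketch. It is not the code in python_udtf/naive_bayes_udtf.py: the class name, the whitespace tokenizer, the add-one smoothing, the output row layout, and the toy 0/1 labels are all assumptions made for illustration, and the example runs in plain Python without Snowflake.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTrainer:
    """Handler-style class: process() sees one row, end_partition() emits the model."""

    def __init__(self):
        self.doc_counts = Counter()              # label -> number of reviews
        self.word_counts = defaultdict(Counter)  # label -> word -> occurrences
        self.vocabulary = set()

    def process(self, label, text):
        # Accumulate counts only; this sketch emits no rows per input row.
        self.doc_counts[label] += 1
        for word in text.lower().split():
            self.word_counts[label][word] += 1
            self.vocabulary.add(word)

    def end_partition(self):
        # Emit (label, word, log_probability); word=None marks the log prior.
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        for label in sorted(self.doc_counts):
            yield (label, None, math.log(self.doc_counts[label] / total_docs))
            total_words = sum(self.word_counts[label].values())
            for word in sorted(self.vocabulary):
                count = self.word_counts[label][word]
                # Add-one (Laplace) smoothing avoids log(0) for unseen words.
                yield (label, word, math.log((count + 1) / (total_words + vocab_size)))


def predict(model_rows, text):
    """Return the label with the highest summed log probability for the text."""
    priors, likelihoods = {}, defaultdict(dict)
    for label, word, log_prob in model_rows:
        if word is None:
            priors[label] = log_prob
        else:
            likelihoods[label][word] = log_prob
    scores = {}
    for label, prior in priors.items():
        # Words outside the training vocabulary are ignored.
        scores[label] = prior + sum(
            likelihoods[label][w] for w in text.lower().split() if w in likelihoods[label]
        )
    return max(scores, key=scores.get)


# Toy training data (made up): 1 = positive, 0 = negative.
trainer = NaiveBayesTrainer()
for label, text in [
    (1, "good great food"),
    (0, "bad awful service"),
    (1, "great service"),
    (0, "awful food"),
]:
    trainer.process(label, text)

model = list(trainer.end_partition())
prediction = predict(model, "great food")
print(prediction)
```

Emitting the model as rows keeps the trained parameters in tabular form, which is one possible way to store them in a table and join against test reviews at prediction time.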
