Skip to content

This project explores implementing the Naive Bayes algorithm in Snowflake using SQL and user-defined table functions, alongside conducting performance experiments using the TPC-H benchmark. The objective is to measure and understand the precision and efficiency of the algorithm within a modern cloud data platform.

Notifications You must be signed in to change notification settings

veronhoxha/snowflake-sentiment-analysis-and-performance-experiments

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

27 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Mastering Snowflake: Sentiment Analysis and Performance Experiments

Introduction

This repository hosts all the necessary resources for the Advanced Data Systems Project 1 titled "Mastering Snowflake: Sentiment Analysis and Performance Experiments".

Table of Contents

Project Structure

Code and Data

  • python_udtf/naive_bayes_udtf.py: Implementation of Naive Bayes using a UDTF in Python.
  • python_udtf/naive_bayes_udtf.sql: Naive Bayes implementation using a UDTF in Python, adapted for Snowflake.
  • snowflake_sql/naive_bayes_sql.sql: Naive Bayes implementation in SQL.
  • tpch_benchmark/tpch_benchmark.ipynb: Jupyter notebook where Performance Experiments Using TPC-H are performed.
  • tpch_benchmark/query_execution_times.csv: Query execution times for all queries across all possible combinations.
  • tpch_benchmark/average_query_execution_times.csv: Average query execution times after three runs across all possible combinations.

Note: The training and test data from the Yelp Review dataset (https://huggingface.co/datasets/Yelp/yelp_review_full) were uploaded to Snowflake at the beginning of this project and are not available here on GitHub due to their size.

Additional Folders

  • "plots": Directory containing images and plots used in the report.
  • "report": Directory containing the report of this assignment in PDF format.

Installation

Prerequisites: Ensure Python 3.12.6 is installed on your machine. Other versions might work, but this project was developed with 3.12.6.

1. Create and Activate a Virtual Environment

Create a Virtual Environment in the root directory of this project by running the following commands:

  • For macOS/Linux:
    • python3 -m venv .venv
    • source .venv/bin/activate
  • For Windows:
    • python -m venv .venv
    • .venv\Scripts\activate

2. Install Required Packages

When the virtual environment is activated, install all necessary packages by running:

  • pip install -r requirements.txt

Usage

1. Running the TPCH Benchmark Notebook:

To recreate the results using the tpch_benchmark/tpch_benchmark.ipynb notebook:

  • Run the whole tpch_benchmark.ipynb notebook, ensuring to use your own “user”, "account" and “password” when connecting to Snowflake (more details inside the notebook).

2. Executing Naive Bayes Implementations:

The Naive Bayes implementations, both in SQL and as a UDTF in Python, need to be executed within Snowflake.

About

This project explores implementing the Naive Bayes algorithm in Snowflake using SQL and user-defined table functions, alongside conducting performance experiments using the TPC-H benchmark. The objective is to measure and understand the precision and efficiency of the algorithm within a modern cloud data platform.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published