Skip to content

Latest commit

 

History

History
230 lines (158 loc) · 7.97 KB

README.md

File metadata and controls

230 lines (158 loc) · 7.97 KB

Loading Kaggle dataset files to a database

This repository contains an example of using dbd database prototyping tool for loading Kaggle dataset files to a database.

The dbd tool supports Kaggle datasets since its version 0.8.3.

This example supports loading of Kaggle files to SQLite, Postgres, and MySQL databases. The dbd tool supports more database engines. Modifying the example to support Snowflake, Redshift, or BigQuery should be trivial. Let me know (by submitting an Issue) if you need help.

There are two examples available:

Running examples

Here are quick steps that assume that you already have Python 3.8+ and Python virtual environment installed on your computer. If not, please refer to the Installation chapter below.

Also, the instructions below are for SQLite. Read the Using MySQL or Postgres database chapter below if you have MySQL or Postgres.

  1. Modify the KAGGLE_USERNAME and KAGGLE_KEY environment variables in the bin/env.sh (Linux and MacOS) or bin\env.bat scripts.

To use the Kaggle API, sign up for a Kaggle account at https://www.kaggle.com. Then go to the 'Account' tab of your user profile (https://www.kaggle.com/<username>/account) and select 'Create API Token'. This will trigger the download of kaggle.json, a file containing your API credentials. The file looks like this:

{"username": "<kaggle-username>", "key": "<kaggle-key>"}

Set the KAGGLE_USERNAME and KAGGLE_KEY environment variables in the bin/env.sh or bin\env.bat script to to the <kaggle-username> and <kaggle-key> values.

NOTE: You can double-quote (") these values on Linux or MacOS, but don't quote them on Windows.

  1. Install dbd by running bin/install.sh on Linux or MacOS or bin\install.bat on Windows.

  2. Set your environment by calling bin/setenv.sh or bin\setenv.bat

  3. Run the example using a script in the bin directory

on Linux or MacOS:

cd etl
../bin/omicron_sqlite.sh

on Windows:

cd etl
..\bin\omicron_sqlite.bat

OR run the example directly

on Linux or MacOS:

source bin/setenv.sh
cd etl/omicron
dbd --project sqlite.project run .

on Windows:

bin\setenv.bat
cd etl\omicron
dbd --project sqlite.project run .

You need to execute the bin/setenv.sh or bin\setenv.bat script just once per your terminal session.

Loading data from other Kaggle datasets

You can easily modify files in the model directory of these examples to load your favorite Kaggle dataset. Here are the high-level steps:

  1. Start your terminal or cmd and execute the bin/setenv.sh or bin\setenv.bat script.

  2. Create a new dbd project by executing dbd init command

dbd init my-kaggle-dataset
  1. Modify the contents of the my-kaggle-dataset/model directory that has been generated. You most probably want to delete the demo data files and create a new my-kaggle-dataset/model/name.ref file with a reference to a Kaggle dataset:
kaggle://yamqwe/omicron-covid19-variant-daily-cases>covid-variants.csv

The reference is an URL with kaggle:// schema (dbd supports many other sources), the Kaggle dataset identificator (e.g. yamqwe/omicron-covid19-variant-daily-cases) and finally the name of a file inside the dataset (e.g. covid-variants.csv).

As the reference is lovcated in the file my-kaggle-dataset/model/name.ref, it creates a new name table in the target database and populates it with the data.

  1. Edit database connection in the model/dbd.profile file. The dbd uses SQLite by default as it doesn't require any database server. See the chapter below on how to modify the model/dbd.profile file to load data to MySQL or Postgres.

Here is an example dbd.profile file:

databases:
  mysql:
    db.url: "mysql+pymysql://{{ MYSQL_USER }}:{{ MYSQL_PASSWORD }}@{{ MYSQL_HOST }}/{{ MYSQL_DB }}?charset=utf8mb4"
  pgsql:
    db.url: "postgresql://{{ POSTGRES_USER }}:{{ POSTGRES_PASSWORD }}@{{ POSTGRES_HOST }}/{{ POSTGRES_DB }}"
  sqlite:
    db.url: "sqlite:///omicron.db"

The dbd.profile file can reference your environment variables for database parameters (e.g. host, port, password).

The dbd.profile is usually located in your home directory and contains all database connections that you work with.

  1. Make sure that your dbd.project file references and existing database connection from the dbd.profile config file. For example:
model: model
database: sqlite
  1. Load data from the referenced Kaggle dataset to a database
cd my-kaggle-dataset
dbd run .

dbd picks up the dbd.profile and dbd.project files from the current working directory. You can use --profile and --project commandline options to use different configuration files.

  1. If you want to modify the default column's data types or add database constraints like primary or foreign keys, checks or indexeas, you can create a YAML file with the same base name to override the defaults.

Installation

This chapter describes how to install Python3 and virtual environment on your computer.

Windows

Just install Python from the Microsoft Store. Search for Python 3.9 app and install it.

Then proceed with the steps in the Running the examples chapter.

MacOS

Use Homebrew

# install homebrew itself
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# install homebrew Python
brew install [email protected]
brew install virtualenv

Then proceed with the steps in the Running the examples chapter.

Fedora/RedHat

sudo yum install python3
sudo yum install python3-virtualenv

Then proceed with the steps in the Running the examples chapter.

Ubuntu/Debian

sudo apt install python3
sudo apt install python3-venv

Then proceed with the steps in the Running the examples chapter.

Using MySQL or Postgres database

First, you have to specify database connection parameters in the bin/env.sh script on Linux or MacOS:

# Postgres environment
export POSTGRES_USER=zsvoboda
export POSTGRES_PASSWORD=
export POSTGRES_HOST=localhost
export POSTGRES_DB=postgres

# MySQL environment
export MYSQL_USER=root
export MYSQL_PASSWORD=
export MYSQL_HOST=localhost
export MYSQL_DB=public

or bin/env.bat on Windows:

rem Postgres environment
set POSTGRES_USER=zsvoboda
set POSTGRES_PASSWORD=
set POSTGRES_HOST=localhost
set POSTGRES_DB=postgres

rem MySQL environment
set MYSQL_USER=root
set MYSQL_PASSWORD=
set MYSQL_HOST=localhost
set MYSQL_DB=public

Then you can run the examples by either use a corresponding script in the bin directory. For example, bin/omicron_pgsql.sh / bin\omicron_pgsql.bat or bin/covid_mysql.sh / bin\covid_mysql.bat.

OR

you can run dbd with a corresponding profile file (e.g. pgsql.project or mysql.project) directly.

For example:

on Linux or MacOS:

source bin/setenv.sh
cd etl/omicron
dbd --project pgsql.project run .

on Windows:

bin\setenv.bat
cd etl\omicron
dbd --project mysql.project run .

Documentation

Refer to the complete dbd documentation here

License

These examples are licensed under BSD license.