Merge pull request #5 from gretelai/aw/trainer-module
DRAFT - Aw/trainer module
zredlined authored Jun 10, 2022
2 parents 88222ce + 7974930 commit 017dc40
Showing 9 changed files with 417 additions and 65 deletions.
71 changes: 60 additions & 11 deletions README.md
@@ -1,18 +1,67 @@
# Gretel Trainer

This code is designed to help users successfully train synthetic models on complex datasets with high row and column counts. The code works by intelligently dividing a dataset into a set of smaller datasets of correlated columns that can be parallelized and then joined together.
This module is designed to provide a simple interface to help users successfully train synthetic models on complex datasets with high row and column counts, and offers features such as Cloud SaaS based training and multi-GPU based parallelization. Get started for free with an API key from [Gretel.ai](https://console.gretel.cloud).
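The idea of dividing a dataset into smaller groups of correlated columns and joining them back together can be illustrated with a minimal, standard-library sketch. This is not the Gretel Trainer implementation: the function names (`pearson`, `cluster_columns`, `split_by_cluster`, `join_parts`), the greedy grouping strategy, and the `max_cluster_size` parameter are all assumptions made for illustration, and the per-group model training step is replaced by a simple pass-through.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric columns (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def cluster_columns(table, max_cluster_size):
    """Greedy grouping (hypothetical): seed a cluster with the first unassigned
    column, then add the remaining columns most correlated with the seed."""
    remaining = list(table)
    clusters = []
    while remaining:
        seed = remaining.pop(0)
        ranked = sorted(
            remaining,
            key=lambda col: abs(pearson(table[seed], table[col])),
            reverse=True,
        )
        members = [seed] + ranked[: max_cluster_size - 1]
        for col in members[1:]:
            remaining.remove(col)
        clusters.append(members)
    return clusters


def split_by_cluster(table, clusters):
    """Split a column-oriented table into one smaller table per cluster."""
    return [{col: table[col] for col in cluster} for cluster in clusters]


def join_parts(parts, column_order):
    """Join the per-cluster tables back into one table with the original column order."""
    joined = {}
    for part in parts:
        joined.update(part)
    return {col: joined[col] for col in column_order}


# Toy column-oriented dataset: "b" tracks "a", and "d" tracks "c".
table = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [4, 1, 3, 2],
    "d": [8, 2, 6, 4],
}

clusters = cluster_columns(table, max_cluster_size=2)
parts = split_by_cluster(table, clusters)  # each part could be handled independently
rejoined = join_parts(parts, list(table))
print(clusters)
```

In a real workflow each element of `parts` would be passed to its own model and the generated outputs joined, rather than the original columns; the sketch only shows the split and join bookkeeping.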

# Get Started
## Current functionality and features:

## Running the notebook
1. Launch the [Notebook](https://github.com/gretelai/trainer/blob/main/notebooks/gretel-trainer.ipynb) in [Google Colab](https://colab.research.google.com/github/gretelai/trainer/blob/main/notebooks/gretel-trainer.ipynb) or your preferred environment.
2. Add your dataset and [Gretel API](https://console.gretel.cloud) key to the notebook.
3. Generate synthetic data!
* Synthetic data generators for text, tabular, and time-series data with the following
features:
* Balance datasets or boost a minority class using Conditional Data Generation.
* Automated data validation.
* Synthetic data quality reports.
* Privacy filters and optional differential privacy support.
* Multiple [model types supported](https://docs.gretel.ai/synthetics/models):
* `Gretel-LSTM` model type supports text, tabular, time-series, and conditional data generation.
* `Gretel-CTGAN` model type supports tabular and conditional data generation.
* `Gretel-GPT` natural language synthesis based on an open-source implementation of GPT-3 (coming soon).
* `Gretel-DGAN` multi-variate time series based on DoppelGANger (coming soon).

## Try it out now!

**NOTE**: If you are starting a dataset run from scratch, either delete the existing
cache file or choose a new cache file name.
If you want to quickly get started synthesizing data with **Gretel.ai**, simply click the button below and follow the examples. See additional Python3 and Jupyter Notebook examples in the `./notebooks` folder.

# TODOs / Roadmap
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gretelai/gretel-trainer/blob/master/notebooks/trainer-examples.ipynb)

- [ ] Enable additional sampling from trained models.
- [ ] Detect and label encode random UIDs (preprocessing).
## Join our Slack Workspace

If you want to be part of the Gretel synthetic data community to receive announcements of the latest releases,
ask questions, suggest new features or participate in the development meetings, please join
our Slack Workspace!

[![Slack](https://img.shields.io/badge/Slack%20Workspace-Join%20now!-36C5F0?logo=slack)](https://gretel.ai/slackinvite)

# Install

**Using `pip`:**

```bash
pip install -U gretel-trainer
```

# Quickstart

### 1. Add your [Gretel API](https://console.gretel.cloud) key via the Gretel CLI.
Use the Gretel client to store your API key to disk. This step is optional; the trainer will prompt for an API key in the next step.
```bash
gretel configure
```

### 2. Train or fine-tune a model using the Gretel API

```python3
from gretel_trainer import trainer

dataset = "https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/USAdultIncome5k.csv"

model = trainer.Trainer()
model.train(dataset)
```

### 3. Generate synthetic data!
```python3
df = model.generate()
```

## TODOs / Roadmap

- [ ] Enable conditional generation via SDK interface (supported in Notebooks currently).
7 changes: 6 additions & 1 deletion notebooks/gretel-trainer.ipynb
@@ -307,7 +307,12 @@
"id": "38e44df3"
},
"outputs": [],
"source": []
"source": [
"# Use the model to generate additional data\n",
"\n",
"run.generate_data(num_records=5000, max_invalid=None, clear_cache=True)\n",
"run.get_synthetic_data()"
]
}
],
"metadata": {
29 changes: 29 additions & 0 deletions notebooks/trainer-examples.py
@@ -0,0 +1,29 @@
from gretel_trainer import trainer, runner

dataset = "https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/USAdultIncome5k.csv"

# Simplest example
model = trainer.Trainer()
model.train(dataset)
df = model.generate()

# Specify underlying model
#model = trainer.Trainer(model_type="GretelLSTM")
#model.train(dataset)
#df = model.generate()

# Update trainer parameters
#model = trainer.Trainer(max_header_clusters=20, max_rows=50000)
#model.train(dataset)
#df = model.generate()

# Specify synthetic model and update config params
#model = trainer.Trainer(model_type="GretelCTGAN", model_params={'epochs':2})
#model.train(dataset)
#df = model.generate()

# Load and generate data from an existing model
#model = trainer.Trainer.load()
#df = model.generate(num_records=70)

print(df)
18 changes: 16 additions & 2 deletions setup.py
@@ -4,9 +4,23 @@
local_path = pathlib.Path(__file__).parent
install_requires = (local_path / "requirements.txt").read_text().splitlines()

setup(name="trainer",
setup(name="gretel-trainer",
version="0.0.1",
package_dir={'': 'src'},
install_requires=install_requires,
packages=find_packages("src")
python_requires=">=3.7",
packages=find_packages("src"),
package_data={'': ['*.yaml']},
include_package_data=True,
description="Synthetic Data Generation with optional Differential Privacy",
url="https://github.com/gretelai/gretel-trainer",
license="http://www.apache.org/licenses/LICENSE-2.0",
classifiers=[
"Programming Language :: Python :: 3",
"License :: OSI Approved :: Apache Software License",
"Operating System :: POSIX :: Linux",
"Operating System :: MacOS",
"Operating System :: Microsoft :: Windows",
"Topic :: Scientific/Engineering :: Artificial Intelligence",
]
)
File renamed without changes.