Merge pull request #5 from gretelai/aw/trainer-module

DRAFT - Aw/trainer module
gretelai · Jun 10, 2022 · 017dc40
2 parents 88222ce + 7974930
commit 017dc40

Showing 9 changed files with 417 additions and 65 deletions.
diff --git a/README.md b/README.md
@@ -1,18 +1,67 @@
 # Gretel Trainer
 
-This code is designed to help users successfully train synthetic models on complex datasets with high row and column counts. The code works by intelligently dividing a dataset into a set of smaller datasets of correlated columns that can be parallelized and then joined together.
+This module is designed to provide a simple interface to help users successfully train synthetic models on complex datasets with high row and column counts, and offers features such as Cloud SaaS based training and multi-GPU based parallelization. Get started for free with an API key from [Gretel.ai](https://console.gretel.cloud).
 
-# Get Started
+## Current functionality and features:
 
-## Running the notebook
-1. Launch the [Notebook](https://github.com/gretelai/trainer/blob/main/notebooks/gretel-trainer.ipynb) in [Google Colab](https://colab.research.google.com/github/gretelai/trainer/blob/main/notebooks/gretel-trainer.ipynb) or your preferred environment.
-2. Add your dataset and [Gretel API](https://console.gretel.cloud) key to the notebook.
-3. Generate synthetic data! 
+* Synthetic data generators for text, tabular, and time-series data with the following
+  features:
+    * Balance datasets or boost a minority class using Conditional Data Generation.
+    * Automated data validation.
+    * Synthetic data quality reports.
+    * Privacy filters and optional differential privacy support.
+* Multiple [model types supported](https://docs.gretel.ai/synthetics/models):
+    * `Gretel-LSTM` model type supports text, tabular, time-series, and conditional data generation.
+    * `Gretel-CTGAN` model type supports tabular and conditional data generation.
+    * `Gretel-GPT` natural language synthesis based on an open-source implementation of GPT-3 (coming soon).
+    * `Gretel-DGAN` multi-variate time series based on DoppelGANger (coming soon).
+
+## Try it out now!
 
-**NOTE**: Either delete the existing or choose a new cache file name if you are starting
-a dataset run from scratch.
+If you want to quickly get started synthesizing data with **Gretel.ai**, simply click the button below and follow the examples. See additional Python3 and Jupyter Notebook examples in the `./notebooks` folder.
 
-# TODOs / Roadmap
+[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gretelai/gretel-trainer/blob/master/notebooks/trainer-examples.ipynb)
 
-- [ ] Enable additional sampling from trained models.
-- [ ] Detect and label encode random UIDs (preprocessing).
+## Join our Slack Workspace
+
+If you want to be part of the Gretel synthetic data community to receive announcements of the latest releases,
+ask questions, suggest new features or participate in the development meetings, please join
+our Slack Workspace!
+
+[![Slack](https://img.shields.io/badge/Slack%20Workspace-Join%20now!-36C5F0?logo=slack)](https://gretel.ai/slackinvite)
+
+# Install
+
+**Using `pip`:**
+
+```bash
+pip install -U gretel-trainer
+```
+
+# Quickstart
+
+### 1. Add your [Gretel API](https://console.gretel.cloud) key via the Gretel CLI.
+Use the Gretel client to store your API key to disk. This step is optional; if you skip it, the trainer will prompt for an API key in the next step.
+```bash
+gretel configure
+```
+
+### 2. Train or fine-tune a model using the Gretel API
+
+```python3
+from gretel_trainer import trainer
+
+dataset = "https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/USAdultIncome5k.csv"
+
+model = trainer.Trainer()
+model.train(dataset)
+```
+
+### 3. Generate synthetic data! 
+```python3
+df = model.generate()
+```
+
+## TODOs / Roadmap
+
+- [ ] Enable conditional generation via SDK interface (supported in Notebooks currently).
diff --git a/notebooks/gretel-trainer.ipynb b/notebooks/gretel-trainer.ipynb
@@ -307,7 +307,12 @@
     "id": "38e44df3"
    },
    "outputs": [],
-   "source": []
+   "source": [
+    "# Use the model to generate additional data\n",
+    "\n",
+    "run.generate_data(num_records=5000, max_invalid=None, clear_cache=True)\n",
+    "run.get_synthetic_data()"
+   ]
   }
  ],
  "metadata": {

diff --git a/notebooks/trainer-examples.py b/notebooks/trainer-examples.py
@@ -0,0 +1,29 @@
+from gretel_trainer import trainer, runner
+
+dataset = "https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/USAdultIncome5k.csv"
+
+# Simplest example
+model = trainer.Trainer()
+model.train(dataset)
+df = model.generate()
+
+# Specify underlying model
+#model = trainer.Trainer(model_type="GretelLSTM")
+#model.train(dataset)
+#df = model.generate()
+
+# Update trainer parameters
+#model = trainer.Trainer(max_header_clusters=20, max_rows=50000)
+#model.train(dataset)
+#df = model.generate()
+
+# Specify synthetic model and update config params
+#model = trainer.Trainer(model_type="GretelCTGAN", model_params={'epochs':2})
+#model.train(dataset)
+#df = model.generate()
+
+# Load and generate data from an existing model
+#model = trainer.Trainer.load()
+#df = model.generate(num_records=70)
+
+print(df)
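
Review note: the README text removed in this PR described the approach as dividing a dataset into smaller datasets of columns that can be trained in parallel and then joined together. The snippet below is a minimal, plain-Python sketch of that split-and-join idea only. It is not the gretel-trainer implementation: the helper names (`split_columns`, `project`, `join_columns`) and the `max_columns` parameter are hypothetical, and it groups columns by position rather than by correlation.

```python
from typing import Dict, List

Row = Dict[str, str]


def split_columns(headers: List[str], max_columns: int) -> List[List[str]]:
    """Group headers into consecutive chunks of at most max_columns.

    Hypothetical helper: a naive positional grouping, standing in for the
    correlation-based clustering the README mentions.
    """
    if max_columns < 1:
        raise ValueError("max_columns must be >= 1")
    return [headers[i:i + max_columns] for i in range(0, len(headers), max_columns)]


def project(rows: List[Row], columns: List[str]) -> List[Row]:
    """Keep only the given columns in every row."""
    return [{column: row[column] for column in columns} for row in rows]


def join_columns(parts: List[List[Row]]) -> List[Row]:
    """Join column-wise parts back together by row position."""
    joined = []
    for pieces in zip(*parts):
        merged: Row = {}
        for piece in pieces:
            merged.update(piece)
        joined.append(merged)
    return joined


# Illustrative rows only; not taken from the dataset referenced in the diff.
rows = [
    {"age": "39", "workclass": "State-gov", "education": "Bachelors", "income": "<=50K"},
    {"age": "50", "workclass": "Private", "education": "HS-grad", "income": ">50K"},
]
headers = list(rows[0])
groups = split_columns(headers, max_columns=3)
parts = [project(rows, group) for group in groups]
rejoined = join_columns(parts)
```

In this sketch each entry in `parts` could be handed to a separate training job, and joining by row position only works if every part keeps the same number of rows in the same order.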
diff --git a/setup.py b/setup.py
@@ -4,9 +4,23 @@
 local_path = pathlib.Path(__file__).parent
 install_requires = (local_path / "requirements.txt").read_text().splitlines()
 
-setup(name="trainer",
+setup(name="gretel-trainer",
       version="0.0.1",
       package_dir={'': 'src'}, 
       install_requires=install_requires, 
-      packages=find_packages("src")
+      python_requires=">=3.7",
+      packages=find_packages("src"),
+      package_data={'': ['*.yaml']},
+      include_package_data=True,
+      description="Synthetic Data Generation with optional Differential Privacy",
+      url="https://github.com/gretelai/gretel-trainer",
+      license="http://www.apache.org/licenses/LICENSE-2.0",
+      classifiers=[
+        "Programming Language :: Python :: 3",
+        "License :: OSI Approved :: Apache Software License",
+        "Operating System :: POSIX :: Linux",
+        "Operating System :: MacOS",
+        "Operating System :: Microsoft :: Windows",
+        "Topic :: Scientific/Engineering :: Artificial Intelligence",
+      ]
 )
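
Review note: `setup.py` builds `install_requires` with `read_text().splitlines()`, which passes every line of `requirements.txt` through unchanged, including any blank lines or comments. Whether that matters depends on the contents of `requirements.txt`, which this diff does not show. A defensive variant might look like the sketch below; `parse_requirements` is a hypothetical helper name and the sample text is made up for illustration.

```python
from typing import List


def parse_requirements(text: str) -> List[str]:
    """Return requirement lines, skipping blank lines and full-line comments.

    Hypothetical helper; it does not handle pip options such as `-r` or
    inline trailing comments.
    """
    requirements = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        requirements.append(stripped)
    return requirements


# Made-up sample standing in for the contents of requirements.txt.
sample = "# runtime dependencies\nexample-package\n\nother-package>=1.0\n"
parsed = parse_requirements(sample)
```

In `setup.py` this would be called as `parse_requirements((local_path / "requirements.txt").read_text())` in place of the bare `splitlines()` call.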
diff --git a/src/trainer/__init__.py b/src/gretel_trainer/__init__.py