Merge branch 'main' into me/RDS-736

Merge main into the branch.
gretelai · Oct 25, 2023 · eba5420
2 parents 58244a8 + 7f22113
Showing 5 changed files with 573 additions and 10 deletions.
diff --git a/config_templates/gretel/synthetics/natural-language.yml b/config_templates/gretel/synthetics/natural-language.yml
@@ -8,14 +8,15 @@ name: "natural-language-gpt"
 models:
   - gpt_x:
       data_source: "__temp__"
-      pretrained_model: "gretelai/mpt-7b"
-      batch_size: 4
-      steps: 750
-      weight_decay: 0.01
-      warmup_steps: 100
-      lr_scheduler: "linear"
-      learning_rate: 0.0002
+      pretrained_model: "meta-llama/Llama-2-7b-chat-hf"
       column_name: null
+      params:
+        batch_size: 4
+        steps: 750
+        weight_decay: 0.01
+        warmup_steps: 100
+        lr_scheduler: "linear"
+        learning_rate: 0.0002
       generate:
         num_records: 80
         maximum_text_length: 100
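
Note on the hunk above: besides swapping `pretrained_model`, it moves the six training hyperparameters from the top level of `gpt_x` into a nested `params` block. For anyone carrying a custom config in the old flat layout, the same restructuring can be sketched in Python on a plain dict. This is only an illustration under stated assumptions: `nest_training_params` and `TRAINING_KEYS` are hypothetical names introduced here, not part of `gretel-client`, and the helper leaves `pretrained_model` untouched.

```python
from copy import deepcopy

# Keys that the hunk relocates under `params` (taken from the diff itself).
TRAINING_KEYS = (
    "batch_size",
    "steps",
    "weight_decay",
    "warmup_steps",
    "lr_scheduler",
    "learning_rate",
)


def nest_training_params(model_cfg: dict) -> dict:
    """Return a copy of a flat `gpt_x` config with training keys moved under `params`.

    Hypothetical helper for illustration; the input dict is not modified.
    """
    cfg = deepcopy(model_cfg)
    params = dict(cfg.get("params") or {})
    for key in TRAINING_KEYS:
        if key in cfg:
            value = cfg.pop(key)
            # An already nested value wins over a stale top-level one.
            params.setdefault(key, value)
    if params:
        cfg["params"] = params
    return cfg


# The pre-merge layout of the `gpt_x` block, written as a dict.
old_cfg = {
    "data_source": "__temp__",
    "pretrained_model": "gretelai/mpt-7b",
    "batch_size": 4,
    "steps": 750,
    "weight_decay": 0.01,
    "warmup_steps": 100,
    "lr_scheduler": "linear",
    "learning_rate": 0.0002,
    "column_name": None,
    "generate": {"num_records": 80, "maximum_text_length": 100},
}

new_cfg = nest_training_params(old_cfg)
print(new_cfg["params"])
```

Returning a copy rather than mutating in place keeps the old layout available for comparison, which is convenient when checking a migrated config against the template.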
diff --git a/sdk_blueprints/Gretel_101_Blueprint.ipynb b/sdk_blueprints/Gretel_101_Blueprint.ipynb
@@ -1 +1,245 @@
-{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[],"authorship_tag":"ABX9TyNdgbgbwGute8moA1O4zTW7"},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"}},"cells":[{"cell_type":"markdown","source":["[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/sdk_blueprints/Gretel_101_Blueprint.ipynb)\n","\n","<br>\n","\n","<center><a href=https://gretel.ai/><img src=\"https://global-uploads.webflow.com/5ea8b9202fac2ea6211667a4/62dae7c82eb3a22ac4bd415e_gretel.ai%20logo.svg\" alt=\"Gretel\" width=\"350\"/></a></center>\n","\n","<br>\n","\n","## Welcome to the Gretel 101 Blueprint!\n","\n","In this Blueprint, we will use Gretel to train a deep generative model and use it to generate high-quality synthetic (tabular) data. We will accomplish this by submitting training and generation jobs to the [Gretel Cloud](https://gretel.ai/faqs/gretel-cloud) via [Gretel's Python SDK](https://docs.gretel.ai/guides/environment-setup/cli-and-sdk).\n","\n","Behind the scenes, Gretel will spin up workers with the necessary compute resources, set up the model with your desired configuration, and perform the submitted task.\n","\n","## Create your Gretel account\n","\n","To get started, you will need to [sign up for a free Gretel account](https://console.gretel.ai/).\n","\n","<br>\n","\n","#### Ready? 
Let's go 🚀"],"metadata":{"id":"nwpvdB3Jn5hG"}},{"cell_type":"markdown","source":["## 💾 Install `gretel-client` and its dependencies"],"metadata":{"id":"MPHEAxLufyEo"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"zFeKqpkunEo1"},"outputs":[],"source":["%%capture\n","!pip install gretel-client"]},{"cell_type":"markdown","source":["## 🛜 Configure your Gretel session\n","\n","- The `Gretel` object provides a high-level interface for streamlining interactions with Gretel's APIs.\n","\n","- Each `Gretel` instance is bound to a single [Gretel project](https://docs.gretel.ai/guides/gretel-fundamentals/projects).\n","\n","- Running the cell below will prompt you for your Gretel API key, which you can retrieve [here](https://console.gretel.ai/users/me/key).\n","\n","- With `validate=True`, your login credentials will be validated immediately at instantiation."],"metadata":{"id":"DNdDXiI-Xkf1"}},{"cell_type":"code","source":["from gretel_client import Gretel\n","\n","gretel = Gretel(api_key=\"prompt\", validate=True)"],"metadata":{"id":"5qnVwoPZx4j0"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["# @title 🗂️ Pick a tabular dataset 👇 { display-mode: \"form\" }\n","dataset_path_dict = {\n","    \"adult income in the USA (14000 records, 15 fields)\": \"https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/us-adult-income.csv\",\n","    \"hospital length of stay (9999 records, 18 fields)\": \"https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/sample-synthetic-healthcare.csv\",\n","    \"customer churn (7032 records, 21 fields)\": \"https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/monthly-customer-payments.csv\"\n","}\n","\n","dataset = \"adult income in the USA (14000 records, 15 fields)\" # @param [\"adult income in the USA (14000 records, 15 fields)\", \"hospital length of stay (9999 records, 18 fields)\", \"customer churn (7032 records, 21 
fields)\"]\n","dataset = dataset_path_dict[dataset]\n"],"metadata":{"id":"uRbY7vk3tSBg"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["import pandas as pd\n","\n","# explore the data using pandas\n","df = pd.read_csv(dataset)\n","df.head()"],"metadata":{"id":"cW3VKpyPvm6W"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## 🏋️‍♂️ Train a generative model\n","\n","- The [tabular-actgan](https://github.com/gretelai/gretel-blueprints/blob/main/config_templates/gretel/synthetics/tabular-actgan.yml) base config tells Gretel which model to train and how to configure it.\n","\n","- You can replace `tabular-actgan` with the path to a custom config or select any of the tabular configs [listed here](https://github.com/gretelai/gretel-blueprints/tree/main/config_templates/gretel/synthetics).\n","\n","- The training data is passed in using the `data_source` argument. Its type can be a file path or `DataFrame`.\n","\n","- **Tip:** Click the printed Console URL to monitor your job's progress in the Gretel Console."],"metadata":{"id":"SwROZthrvXil"}},{"cell_type":"code","source":["trained = gretel.submit_train(\"tabular-actgan\", data_source=dataset)"],"metadata":{"id":"i89eGZwIxSCW"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## 🧐 Evaluate the synthetic data quality\n","\n","- Gretel automatically creates a [synthetic data quality report](https://docs.gretel.ai/reference/evaluate/synthetic-data-quality-report) for each model you train.\n","\n","- The training results object returned by `submit_train` has a `GretelReport` attribute for viewing the quality report.\n"],"metadata":{"id":"eljkfb8jb_hK"}},{"cell_type":"code","source":["# view the quality scores\n","print(trained.report)"],"metadata":{"id":"bNZqhFPOclrV"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["# display the full report within this 
notebook\n","trained.report.display_in_notebook()"],"metadata":{"id":"3QMiP7lKecE5"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["# inspect the synthetic data used to create the report\n","df_synth_report = trained.fetch_report_synthetic_data()\n","df_synth_report.head()"],"metadata":{"id":"2dHuQT_cuIno"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## 🤖 Generate synthetic data\n","\n","- The `model_id` argument can be the ID of any trained model within the current project.\n"],"metadata":{"id":"ZIeY7TczxvDV"}},{"cell_type":"code","source":["generated = gretel.submit_generate(trained.model_id, num_records=1000)"],"metadata":{"id":"J6XZUuR2eguX"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["# inspect the generated synthetic data\n","generated.synthetic_data.head()"],"metadata":{"id":"-_do0Kvvunv2"},"execution_count":null,"outputs":[]}]}
+{
+    "nbformat": 4,
+    "nbformat_minor": 0,
+    "metadata": {
+        "colab": {
+            "provenance": [],
+            "authorship_tag": "ABX9TyNosAwAWvwVU9i43TeCxQrP"
+        },
+        "kernelspec": {
+            "name": "python3",
+            "display_name": "Python 3"
+        },
+        "language_info": {
+            "name": "python"
+        }
+    },
+    "cells": [
+        {
+            "cell_type": "markdown",
+            "source": [
+                "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/sdk_blueprints/Gretel_101_Blueprint.ipynb)\n",
+                "\n",
+                "<br>\n",
+                "\n",
+                "<center><a href=https://gretel.ai/><img src=\"https://gretel-public-website.s3.us-west-2.amazonaws.com/assets/brand/gretel_brand_wordmark.svg\" alt=\"Gretel\" width=\"350\"/></a></center>\n",
+                "\n",
+                "<br>\n",
+                "\n",
+                "## Welcome to the Gretel 101 Blueprint!\n",
+                "\n",
+                "In this Blueprint, we will use Gretel to train a deep generative model and use it to generate high-quality synthetic (tabular) data. We will accomplish this by submitting training and generation jobs to the [Gretel Cloud](https://gretel.ai/faqs/gretel-cloud) via [Gretel's Python SDK](https://docs.gretel.ai/guides/environment-setup/cli-and-sdk).\n",
+                "\n",
+                "Behind the scenes, Gretel will spin up workers with the necessary compute resources, set up the model with your desired configuration, and perform the submitted task.\n",
+                "\n",
+                "## Create your Gretel account\n",
+                "\n",
+                "To get started, you will need to [sign up for a free Gretel account](https://console.gretel.ai/).\n",
+                "\n",
+                "<br>\n",
+                "\n",
+                "#### Ready? Let's go 🚀"
+            ],
+            "metadata": {
+                "id": "nwpvdB3Jn5hG"
+            }
+        },
+        {
+            "cell_type": "markdown",
+            "source": [
+                "## 💾 Install `gretel-client` and its dependencies"
+            ],
+            "metadata": {
+                "id": "MPHEAxLufyEo"
+            }
+        },
+        {
+            "cell_type": "code",
+            "execution_count": null,
+            "metadata": {
+                "id": "zFeKqpkunEo1"
+            },
+            "outputs": [],
+            "source": [
+                "%%capture\n",
+                "!pip install gretel-client"
+            ]
+        },
+        {
+            "cell_type": "markdown",
+            "source": [
+                "## 🛜 Configure your Gretel session\n",
+                "\n",
+                "- The `Gretel` object provides a high-level interface for streamlining interactions with Gretel's APIs.\n",
+                "\n",
+                "- Each `Gretel` instance is bound to a single [Gretel project](https://docs.gretel.ai/guides/gretel-fundamentals/projects).\n",
+                "\n",
+                "- Running the cell below will prompt you for your Gretel API key, which you can retrieve [here](https://console.gretel.ai/users/me/key).\n",
+                "\n",
+                "- With `validate=True`, your login credentials will be validated immediately at instantiation."
+            ],
+            "metadata": {
+                "id": "DNdDXiI-Xkf1"
+            }
+        },
+        {
+            "cell_type": "code",
+            "source": [
+                "from gretel_client import Gretel\n",
+                "\n",
+                "gretel = Gretel(api_key=\"prompt\", validate=True)"
+            ],
+            "metadata": {
+                "id": "5qnVwoPZx4j0"
+            },
+            "execution_count": null,
+            "outputs": []
+        },
+        {
+            "cell_type": "code",
+            "source": [
+                "# @title 🗂️ Pick a tabular dataset 👇 { display-mode: \"form\" }\n",
+                "dataset_path_dict = {\n",
+                "    \"adult income in the USA (14000 records, 15 fields)\": \"https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/us-adult-income.csv\",\n",
+                "    \"hospital length of stay (9999 records, 18 fields)\": \"https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/sample-synthetic-healthcare.csv\",\n",
+                "    \"customer churn (7032 records, 21 fields)\": \"https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/monthly-customer-payments.csv\"\n",
+                "}\n",
+                "\n",
+                "dataset = \"adult income in the USA (14000 records, 15 fields)\" # @param [\"adult income in the USA (14000 records, 15 fields)\", \"hospital length of stay (9999 records, 18 fields)\", \"customer churn (7032 records, 21 fields)\"]\n",
+                "dataset = dataset_path_dict[dataset]\n"
+            ],
+            "metadata": {
+                "id": "uRbY7vk3tSBg"
+            },
+            "execution_count": null,
+            "outputs": []
+        },
+        {
+            "cell_type": "code",
+            "source": [
+                "import pandas as pd\n",
+                "\n",
+                "# explore the data using pandas\n",
+                "df = pd.read_csv(dataset)\n",
+                "df.head()"
+            ],
+            "metadata": {
+                "id": "cW3VKpyPvm6W"
+            },
+            "execution_count": null,
+            "outputs": []
+        },
+        {
+            "cell_type": "markdown",
+            "source": [
+                "## 🏋️‍♂️ Train a generative model\n",
+                "\n",
+                "- The [tabular-actgan](https://github.com/gretelai/gretel-blueprints/blob/main/config_templates/gretel/synthetics/tabular-actgan.yml) base config tells Gretel which model to train and how to configure it.\n",
+                "\n",
+                "- You can replace `tabular-actgan` with the path to a custom config file, or you can select any of the tabular configs [listed here](https://github.com/gretelai/gretel-blueprints/tree/main/config_templates/gretel/synthetics).\n",
+                "\n",
+                "- The training data is passed in using the `data_source` argument. Its type can be a file path or `DataFrame`.\n",
+                "\n",
+                "- **Tip:** Click the printed Console URL to monitor your job's progress in the Gretel Console."
+            ],
+            "metadata": {
+                "id": "SwROZthrvXil"
+            }
+        },
+        {
+            "cell_type": "code",
+            "source": [
+                "trained = gretel.submit_train(\"tabular-actgan\", data_source=dataset)"
+            ],
+            "metadata": {
+                "id": "i89eGZwIxSCW"
+            },
+            "execution_count": null,
+            "outputs": []
+        },
+        {
+            "cell_type": "markdown",
+            "source": [
+                "## 🧐 Evaluate the synthetic data quality\n",
+                "\n",
+                "- Gretel automatically creates a [synthetic data quality report](https://docs.gretel.ai/reference/evaluate/synthetic-data-quality-report) for each model you train.\n",
+                "\n",
+                "- The training results object returned by `submit_train` has a `GretelReport` attribute for viewing the quality report.\n"
+            ],
+            "metadata": {
+                "id": "eljkfb8jb_hK"
+            }
+        },
+        {
+            "cell_type": "code",
+            "source": [
+                "# view the quality scores\n",
+                "print(trained.report)"
+            ],
+            "metadata": {
+                "id": "bNZqhFPOclrV"
+            },
+            "execution_count": null,
+            "outputs": []
+        },
+        {
+            "cell_type": "code",
+            "source": [
+                "# display the full report within this notebook\n",
+                "trained.report.display_in_notebook()"
+            ],
+            "metadata": {
+                "id": "3QMiP7lKecE5"
+            },
+            "execution_count": null,
+            "outputs": []
+        },
+        {
+            "cell_type": "code",
+            "source": [
+                "# inspect the synthetic data used to create the report\n",
+                "df_synth_report = trained.fetch_report_synthetic_data()\n",
+                "df_synth_report.head()"
+            ],
+            "metadata": {
+                "id": "2dHuQT_cuIno"
+            },
+            "execution_count": null,
+            "outputs": []
+        },
+        {
+            "cell_type": "markdown",
+            "source": [
+                "## 🤖 Generate synthetic data\n",
+                "\n",
+                "- The `model_id` argument can be the ID of any trained model within the current project.\n"
+            ],
+            "metadata": {
+                "id": "ZIeY7TczxvDV"
+            }
+        },
+        {
+            "cell_type": "code",
+            "source": [
+                "generated = gretel.submit_generate(trained.model_id, num_records=1000)"
+            ],
+            "metadata": {
+                "id": "J6XZUuR2eguX"
+            },
+            "execution_count": null,
+            "outputs": []
+        },
+        {
+            "cell_type": "code",
+            "source": [
+                "# inspect the generated synthetic data\n",
+                "generated.synthetic_data.head()"
+            ],
+            "metadata": {
+                "id": "-_do0Kvvunv2"
+            },
+            "execution_count": null,
+            "outputs": []
+        }
+    ]
+}
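
Note on the notebook hunk above: most of the 245 added lines are the same cells as the removed single line, re-serialized with 4-space indentation; the substantive changes are the logo URL, the `authorship_tag`, and the rewording of the custom-config bullet. How the reformatting was actually produced is not stated in the commit. A minimal sketch of one way to get the same layout with the standard library, assuming nothing beyond `json` (the function name `pretty_print_notebook` is hypothetical):

```python
import json


def pretty_print_notebook(raw: str, indent: int = 4) -> str:
    """Re-serialize notebook JSON with indentation, keeping key order and emoji."""
    notebook = json.loads(raw)
    # ensure_ascii=False keeps characters such as emoji readable instead of \u escapes.
    return json.dumps(notebook, indent=indent, ensure_ascii=False)


# A tiny stand-in for the minified notebook; only one cell is reproduced here.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {"kernelspec": {"name": "python3", "display_name": "Python 3"}},
    "cells": [
        {
            "cell_type": "markdown",
            "source": ["#### Ready? Let's go 🚀"],
            "metadata": {"id": "nwpvdB3Jn5hG"},
        }
    ],
}
minified = json.dumps(notebook, separators=(",", ":"), ensure_ascii=False)

pretty = pretty_print_notebook(minified)
print(pretty)
```

Because the data is unchanged by the round trip, comparing `json.loads` of the old and new files is a quick way to separate real content changes from whitespace in a diff like this one.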