Skip to content

Commit

Permalink
Merge branch 'main' into me/RDS-736
Browse files Browse the repository at this point in the history
merge main to the branch
  • Loading branch information
marjan_emd committed Oct 25, 2023
2 parents 58244a8 + 7f22113 commit eba5420
Show file tree
Hide file tree
Showing 5 changed files with 573 additions and 10 deletions.
15 changes: 8 additions & 7 deletions config_templates/gretel/synthetics/natural-language.yml
Original file line number Diff line number Diff line change
Expand Up @@ -8,14 +8,15 @@ name: "natural-language-gpt"
models:
- gpt_x:
data_source: "__temp__"
pretrained_model: "gretelai/mpt-7b"
batch_size: 4
steps: 750
weight_decay: 0.01
warmup_steps: 100
lr_scheduler: "linear"
learning_rate: 0.0002
pretrained_model: "meta-llama/Llama-2-7b-chat-hf"
column_name: null
params:
batch_size: 4
steps: 750
weight_decay: 0.01
warmup_steps: 100
lr_scheduler: "linear"
learning_rate: 0.0002
generate:
num_records: 80
maximum_text_length: 100
246 changes: 245 additions & 1 deletion sdk_blueprints/Gretel_101_Blueprint.ipynb
Original file line number Diff line number Diff line change
@@ -1 +1,245 @@
{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[],"authorship_tag":"ABX9TyNdgbgbwGute8moA1O4zTW7"},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"}},"cells":[{"cell_type":"markdown","source":["[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/sdk_blueprints/Gretel_101_Blueprint.ipynb)\n","\n","<br>\n","\n","<center><a href=https://gretel.ai/><img src=\"https://global-uploads.webflow.com/5ea8b9202fac2ea6211667a4/62dae7c82eb3a22ac4bd415e_gretel.ai%20logo.svg\" alt=\"Gretel\" width=\"350\"/></a></center>\n","\n","<br>\n","\n","## Welcome to the Gretel 101 Blueprint!\n","\n","In this Blueprint, we will use Gretel to train a deep generative model and use it to generate high-quality synthetic (tabular) data. We will accomplish this by submitting training and generation jobs to the [Gretel Cloud](https://gretel.ai/faqs/gretel-cloud) via [Gretel's Python SDK](https://docs.gretel.ai/guides/environment-setup/cli-and-sdk).\n","\n","Behind the scenes, Gretel will spin up workers with the necessary compute resources, set up the model with your desired configuration, and perform the submitted task.\n","\n","## Create your Gretel account\n","\n","To get started, you will need to [sign up for a free Gretel account](https://console.gretel.ai/).\n","\n","<br>\n","\n","#### Ready? Let's go 🚀"],"metadata":{"id":"nwpvdB3Jn5hG"}},{"cell_type":"markdown","source":["## 💾 Install `gretel-client` and its dependencies"],"metadata":{"id":"MPHEAxLufyEo"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"zFeKqpkunEo1"},"outputs":[],"source":["%%capture\n","!pip install gretel-client"]},{"cell_type":"markdown","source":["## 🛜 Configure your Gretel session\n","\n","- The `Gretel` object provides a high-level interface for streamlining interactions with Gretel's APIs.\n","\n","- Each `Gretel` instance is bound to a single [Gretel project](https://docs.gretel.ai/guides/gretel-fundamentals/projects).\n","\n","- Running the cell below will prompt you for your Gretel API key, which you can retrieve [here](https://console.gretel.ai/users/me/key).\n","\n","- With `validate=True`, your login credentials will be validated immediately at instantiation."],"metadata":{"id":"DNdDXiI-Xkf1"}},{"cell_type":"code","source":["from gretel_client import Gretel\n","\n","gretel = Gretel(api_key=\"prompt\", validate=True)"],"metadata":{"id":"5qnVwoPZx4j0"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["# @title 🗂️ Pick a tabular dataset 👇 { display-mode: \"form\" }\n","dataset_path_dict = {\n"," \"adult income in the USA (14000 records, 15 fields)\": \"https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/us-adult-income.csv\",\n"," \"hospital length of stay (9999 records, 18 fields)\": \"https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/sample-synthetic-healthcare.csv\",\n"," \"customer churn (7032 records, 21 fields)\": \"https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/monthly-customer-payments.csv\"\n","}\n","\n","dataset = \"adult income in the USA (14000 records, 15 fields)\" # @param [\"adult income in the USA (14000 records, 15 fields)\", \"hospital length of stay (9999 records, 18 fields)\", \"customer churn (7032 records, 21 fields)\"]\n","dataset = dataset_path_dict[dataset]\n"],"metadata":{"id":"uRbY7vk3tSBg"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["import pandas as pd\n","\n","# explore the data using pandas\n","df = pd.read_csv(dataset)\n","df.head()"],"metadata":{"id":"cW3VKpyPvm6W"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## 🏋️‍♂️ Train a generative model\n","\n","- The [tabular-actgan](https://github.com/gretelai/gretel-blueprints/blob/main/config_templates/gretel/synthetics/tabular-actgan.yml) base config tells Gretel which model to train and how to configure it.\n","\n","- You can replace `tabular-actgan` with the path to a custom config or select any of the tabular configs [listed here](https://github.com/gretelai/gretel-blueprints/tree/main/config_templates/gretel/synthetics).\n","\n","- The training data is passed in using the `data_source` argument. Its type can be a file path or `DataFrame`.\n","\n","- **Tip:** Click the printed Console URL to monitor your job's progress in the Gretel Console."],"metadata":{"id":"SwROZthrvXil"}},{"cell_type":"code","source":["trained = gretel.submit_train(\"tabular-actgan\", data_source=dataset)"],"metadata":{"id":"i89eGZwIxSCW"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## 🧐 Evaluate the synthetic data quality\n","\n","- Gretel automatically creates a [synthetic data quality report](https://docs.gretel.ai/reference/evaluate/synthetic-data-quality-report) for each model you train.\n","\n","- The training results object returned by `submit_train` has a `GretelReport` attribute for viewing the quality report.\n"],"metadata":{"id":"eljkfb8jb_hK"}},{"cell_type":"code","source":["# view the quality scores\n","print(trained.report)"],"metadata":{"id":"bNZqhFPOclrV"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["# display the full report within this notebook\n","trained.report.display_in_notebook()"],"metadata":{"id":"3QMiP7lKecE5"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["# inspect the synthetic data used to create the report\n","df_synth_report = trained.fetch_report_synthetic_data()\n","df_synth_report.head()"],"metadata":{"id":"2dHuQT_cuIno"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## 🤖 Generate synthetic data\n","\n","- The `model_id` argument can be the ID of any trained model within the current project.\n"],"metadata":{"id":"ZIeY7TczxvDV"}},{"cell_type":"code","source":["generated = gretel.submit_generate(trained.model_id, num_records=1000)"],"metadata":{"id":"J6XZUuR2eguX"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["# inspect the generated synthetic data\n","generated.synthetic_data.head()"],"metadata":{"id":"-_do0Kvvunv2"},"execution_count":null,"outputs":[]}]}
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"authorship_tag": "ABX9TyNosAwAWvwVU9i43TeCxQrP"
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/sdk_blueprints/Gretel_101_Blueprint.ipynb)\n",
"\n",
"<br>\n",
"\n",
"<center><a href=https://gretel.ai/><img src=\"https://gretel-public-website.s3.us-west-2.amazonaws.com/assets/brand/gretel_brand_wordmark.svg\" alt=\"Gretel\" width=\"350\"/></a></center>\n",
"\n",
"<br>\n",
"\n",
"## Welcome to the Gretel 101 Blueprint!\n",
"\n",
"In this Blueprint, we will use Gretel to train a deep generative model and use it to generate high-quality synthetic (tabular) data. We will accomplish this by submitting training and generation jobs to the [Gretel Cloud](https://gretel.ai/faqs/gretel-cloud) via [Gretel's Python SDK](https://docs.gretel.ai/guides/environment-setup/cli-and-sdk).\n",
"\n",
"Behind the scenes, Gretel will spin up workers with the necessary compute resources, set up the model with your desired configuration, and perform the submitted task.\n",
"\n",
"## Create your Gretel account\n",
"\n",
"To get started, you will need to [sign up for a free Gretel account](https://console.gretel.ai/).\n",
"\n",
"<br>\n",
"\n",
"#### Ready? Let's go 🚀"
],
"metadata": {
"id": "nwpvdB3Jn5hG"
}
},
{
"cell_type": "markdown",
"source": [
"## 💾 Install `gretel-client` and its dependencies"
],
"metadata": {
"id": "MPHEAxLufyEo"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "zFeKqpkunEo1"
},
"outputs": [],
"source": [
"%%capture\n",
"!pip install gretel-client"
]
},
{
"cell_type": "markdown",
"source": [
"## 🛜 Configure your Gretel session\n",
"\n",
"- The `Gretel` object provides a high-level interface for streamlining interactions with Gretel's APIs.\n",
"\n",
"- Each `Gretel` instance is bound to a single [Gretel project](https://docs.gretel.ai/guides/gretel-fundamentals/projects).\n",
"\n",
"- Running the cell below will prompt you for your Gretel API key, which you can retrieve [here](https://console.gretel.ai/users/me/key).\n",
"\n",
"- With `validate=True`, your login credentials will be validated immediately at instantiation."
],
"metadata": {
"id": "DNdDXiI-Xkf1"
}
},
{
"cell_type": "code",
"source": [
"from gretel_client import Gretel\n",
"\n",
"gretel = Gretel(api_key=\"prompt\", validate=True)"
],
"metadata": {
"id": "5qnVwoPZx4j0"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# @title 🗂️ Pick a tabular dataset 👇 { display-mode: \"form\" }\n",
"dataset_path_dict = {\n",
" \"adult income in the USA (14000 records, 15 fields)\": \"https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/us-adult-income.csv\",\n",
" \"hospital length of stay (9999 records, 18 fields)\": \"https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/sample-synthetic-healthcare.csv\",\n",
" \"customer churn (7032 records, 21 fields)\": \"https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/sample_data/monthly-customer-payments.csv\"\n",
"}\n",
"\n",
"dataset = \"adult income in the USA (14000 records, 15 fields)\" # @param [\"adult income in the USA (14000 records, 15 fields)\", \"hospital length of stay (9999 records, 18 fields)\", \"customer churn (7032 records, 21 fields)\"]\n",
"dataset = dataset_path_dict[dataset]\n"
],
"metadata": {
"id": "uRbY7vk3tSBg"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"import pandas as pd\n",
"\n",
"# explore the data using pandas\n",
"df = pd.read_csv(dataset)\n",
"df.head()"
],
"metadata": {
"id": "cW3VKpyPvm6W"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## 🏋️‍♂️ Train a generative model\n",
"\n",
"- The [tabular-actgan](https://github.com/gretelai/gretel-blueprints/blob/main/config_templates/gretel/synthetics/tabular-actgan.yml) base config tells Gretel which model to train and how to configure it.\n",
"\n",
"- You can replace `tabular-actgan` with the path to a custom config file, or you can select any of the tabular configs [listed here](https://github.com/gretelai/gretel-blueprints/tree/main/config_templates/gretel/synthetics).\n",
"\n",
"- The training data is passed in using the `data_source` argument. Its type can be a file path or `DataFrame`.\n",
"\n",
"- **Tip:** Click the printed Console URL to monitor your job's progress in the Gretel Console."
],
"metadata": {
"id": "SwROZthrvXil"
}
},
{
"cell_type": "code",
"source": [
"trained = gretel.submit_train(\"tabular-actgan\", data_source=dataset)"
],
"metadata": {
"id": "i89eGZwIxSCW"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## 🧐 Evaluate the synthetic data quality\n",
"\n",
"- Gretel automatically creates a [synthetic data quality report](https://docs.gretel.ai/reference/evaluate/synthetic-data-quality-report) for each model you train.\n",
"\n",
"- The training results object returned by `submit_train` has a `GretelReport` attribute for viewing the quality report.\n"
],
"metadata": {
"id": "eljkfb8jb_hK"
}
},
{
"cell_type": "code",
"source": [
"# view the quality scores\n",
"print(trained.report)"
],
"metadata": {
"id": "bNZqhFPOclrV"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# display the full report within this notebook\n",
"trained.report.display_in_notebook()"
],
"metadata": {
"id": "3QMiP7lKecE5"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# inspect the synthetic data used to create the report\n",
"df_synth_report = trained.fetch_report_synthetic_data()\n",
"df_synth_report.head()"
],
"metadata": {
"id": "2dHuQT_cuIno"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## 🤖 Generate synthetic data\n",
"\n",
"- The `model_id` argument can be the ID of any trained model within the current project.\n"
],
"metadata": {
"id": "ZIeY7TczxvDV"
}
},
{
"cell_type": "code",
"source": [
"generated = gretel.submit_generate(trained.model_id, num_records=1000)"
],
"metadata": {
"id": "J6XZUuR2eguX"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# inspect the generated synthetic data\n",
"generated.synthetic_data.head()"
],
"metadata": {
"id": "-_do0Kvvunv2"
},
"execution_count": null,
"outputs": []
}
]
}
Loading

0 comments on commit eba5420

Please sign in to comment.