Add NavDD text-to-code demo notebooks (#458)
* add notebooks

* fix colab link

* try again
johnnygreco authored Nov 12, 2024
1 parent f261306 commit 0d75883
Showing 2 changed files with 685 additions and 0 deletions.
@@ -0,0 +1,308 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "view-in-github"
},
"source": [
"<a href=\"https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/demo/navigator/navigator-data-designer-sdk-text-to-python.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MMMHiDmEcYZY"
},
"source": [
"# 🎨 Navigator Data Designer SDK: Text-to-Python\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"id": "mNoaC7dX28y0"
},
"outputs": [],
"source": [
"%%capture\n",
"!pip install -U git+https://github.com/gretelai/gretel-python-client"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"id": "1k5NjjtzPQJi"
},
"outputs": [],
"source": [
"from gretel_client.navigator import DataDesigner\n",
"\n",
"session_kwargs = {\n",
"    \"api_key\": \"prompt\",\n",
"    \"endpoint\": \"https://api.gretel.cloud\",\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "K2NzEYJedJeA"
},
"source": [
"## 📘 Text-to-Python Configuration\n",
"\n",
"Below we show an example Text-to-Python `DataDesigner` configuration. The main sections are as follows:\n",
"\n",
"- **model_suite:** You can use `apache-2.0` or `llama-3.x` depending on the type of license you want associated with the data you generate. Selecting `apache-2.0` ensures that all models used by Data Designer comply with the `apache-2.0` license, while selecting `llama-3.x` means the models used by Data Designer fall under the `Llama 3` license.\n",
"\n",
"- **special_system_instructions:** This is an optional use-case-specific instruction to be added to the system prompt of all LLMs used during synthetic data generation.\n",
"\n",
"- **categorical_seed_columns:** Specifies categorical data seed columns that will be used to seed the synthetic data generation process. Here we fully specify all seed categories and subcategories. It is also possible to generate category values using the `num_new_values_to_generate` parameter.\n",
"\n",
"- **generated_data_columns:** Specifies data columns that are fully generated using LLMs, seeded by the categorical seed columns. The `generation_prompt` field is the prompt template that will be used to generate the data column. All data seeds and previously defined data columns can be used as template keyword arguments.\n",
"\n",
"- **post_processors:** Specifies validation / evaluation / processing that is applied to the dataset after generation. Here, we define a code validator and the `text_to_python` evaluation suite."
]
},
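{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the seeding idea above concrete, here is a minimal plain-Python sketch of how a seed category, one of its subcategory values, and a `generation_prompt` template could combine into a final prompt. This is not Data Designer's actual implementation: the `SEEDS` dictionary, the `sample_seed_row` helper, and the use of `str.format` are assumptions made purely for illustration.\n",
"\n",
"```python\n",
"import random\n",
"\n",
"# Hypothetical, trimmed-down copy of one seed column and its subcategory.\n",
"SEEDS = {\n",
"    'Healthcare': ['Telemedicine Platforms', 'AI-Powered Diagnostic Tools'],\n",
"    'Finance': ['Fraud Detection Software', 'Personal Finance Apps'],\n",
"}\n",
"\n",
"# Template keywords mirror the seed column names used in the config.\n",
"TEMPLATE = 'Make sure the question is related to {topic} in the {industry_sector} sector.'\n",
"\n",
"\n",
"def sample_seed_row(rng):\n",
"    # Pick a category first, then a subcategory value that belongs to it.\n",
"    industry_sector = rng.choice(sorted(SEEDS))\n",
"    topic = rng.choice(SEEDS[industry_sector])\n",
"    return {'industry_sector': industry_sector, 'topic': topic}\n",
"\n",
"\n",
"row = sample_seed_row(random.Random(0))\n",
"prompt = TEMPLATE.format(**row)\n",
"print(prompt)\n",
"```\n",
"\n",
"Because the subcategory values are keyed by their parent category, a sampled `topic` always belongs to the sampled `industry_sector`, which is the same nesting the `subcategories` entries in the config express."
]
},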
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0cxx8ensOqLl"
},
"outputs": [],
"source": [
"config_string = \"\"\"\n",
"model_suite: apache-2.0\n",
"\n",
"special_system_instructions: >-\n",
"  You are an expert at writing, analyzing, and editing Python code. You know what\n",
"  high-quality, clean, efficient, and maintainable Python code looks like. You\n",
"  excel at transforming natural language into Python, as well as Python back into\n",
"  natural language. Your job is to assist the user with their Python-related tasks.\n",
"\n",
"categorical_seed_columns:\n",
"  - name: industry_sector\n",
"    values: [Healthcare, Finance, Technology]\n",
"    subcategories:\n",
"      - name: topic\n",
"        values:\n",
"          Healthcare:\n",
"            - Electronic Health Records (EHR) Systems\n",
"            - Telemedicine Platforms\n",
"            - AI-Powered Diagnostic Tools\n",
"          Finance:\n",
"            - Fraud Detection Software\n",
"            - Automated Trading Systems\n",
"            - Personal Finance Apps\n",
"          Technology:\n",
"            - Cloud Computing Platforms\n",
"            - Artificial Intelligence and Machine Learning Platforms\n",
"            - DevOps and Continuous Integration/Continuous Deployment (CI/CD) Tools\n",
"\n",
"  - name: code_complexity\n",
"    values: [Intermediate, Advanced, Expert]\n",
"    subcategories:\n",
"      - name: code_concept\n",
"        values:\n",
"          Intermediate: [Functions, List Comprehensions, Classes]\n",
"          Advanced: [Object-oriented programming, Error Handling, Lambda Functions]\n",
"          Expert: [Decorators, Multithreading, Context Managers]\n",
"\n",
"  - name: prompt_type\n",
"    values: [instruction, question]\n",
"    subcategories:\n",
"      - name: prompt_creation_instruction\n",
"        values:\n",
"          instruction:\n",
"            - Write an instruction for a user to write Python code for a specific task.\n",
"            - Generate a clear and concise instruction for a Python programming challenge.\n",
"          question:\n",
"            - Ask a specific question about how to solve a problem using Python code.\n",
"            - Generate a question about how to perform a general task in Python.\n",
"\n",
"generated_data_columns:\n",
"  - name: text\n",
"    generation_prompt: >-\n",
"      {prompt_creation_instruction} \n",
"\n",
"      ### Important Guidelines ###\n",
"      * Make sure the {prompt_type} is related to {topic} in the {industry_sector} sector.\n",
"      * Do not write any code as part of the {prompt_type}.\n",
"    columns_to_list_in_prompt: all_categorical_seed_columns\n",
"\n",
"  - name: code\n",
"    generation_prompt: >-\n",
"      Write Python code that will be paired with the following prompt:\n",
"      {text} \n",
"\n",
"      ### Important Guidelines ###\n",
"      * Your code should be self-contained and executable.\n",
"      * Remember to import any necessary libraries.\n",
"      * The code should be written at a {code_complexity} level and make use of {code_concept}.\n",
"    llm_type: code\n",
"    columns_to_list_in_prompt: [industry_sector, topic]\n",
"\n",
"post_processors:\n",
"  - validator: code\n",
"    settings:\n",
"      code_lang: python\n",
"      code_columns: [code]\n",
"\n",
"  - evaluator: text_to_python\n",
"    settings:\n",
"      text_column: text\n",
"      code_column: code\n",
"\"\"\"\n",
"\n",
"data_designer = DataDesigner.from_config(config_string, **session_kwargs)"
]
},
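{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `validator: code` post-processor in the config above checks the generated `code` column. How Data Designer performs that check is not shown in this notebook. As a rough, assumed stand-in, the sketch below uses Python's built-in `ast.parse` to flag snippets that do not even parse; the `is_valid_python` helper is hypothetical and only covers syntax, not style or runtime behavior.\n",
"\n",
"```python\n",
"import ast\n",
"\n",
"\n",
"def is_valid_python(source):\n",
"    # Return (ok, message); this only parses the source, nothing is executed.\n",
"    try:\n",
"        ast.parse(source)\n",
"    except SyntaxError as exc:\n",
"        return False, f'line {exc.lineno}: {exc.msg}'\n",
"    return True, ''\n",
"\n",
"\n",
"# One well-formed snippet and one with a deliberate syntax error.\n",
"samples = ['x = [i * 2 for i in range(3)]', 'def broken(:']\n",
"results = [is_valid_python(code) for code in samples]\n",
"\n",
"for code, (ok, message) in zip(samples, results):\n",
"    print(ok, message or 'no syntax errors')\n",
"```\n",
"\n",
"A check like this is cheap to run locally on the `code` column of a preview before submitting a larger batch, although the hosted validator may apply stricter or different rules."
]
},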
{
"cell_type": "markdown",
"metadata": {
"id": "NDxYu6azd3c4"
},
"source": [
"## 👀 Generating a dataset preview\n",
"\n",
"- Preview mode allows you to quickly iterate on your data design.\n",
"\n",
"- Each preview generation call creates 10 records for inspection."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Ef8Ws90cPbIu"
},
"outputs": [],
"source": [
"preview = data_designer.generate_dataset_preview()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Y5I2GjczNh_s"
},
"outputs": [],
"source": [
"# The preview dataset is accessible as a DataFrame\n",
"preview.dataset"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CjYiKmcWd_2t"
},
"source": [
"## 🔎 Easily inspect individual records\n",
"\n",
"- Run the cell below to display individual records for inspection.\n",
"\n",
"- Run the cell multiple times to cycle through the 10 preview records.\n",
"\n",
"- Alternatively, you can pass the `index` argument to `display_sample_record` to display a specific record."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "fAWaJKnAP8ZJ"
},
"outputs": [],
"source": [
"preview.display_sample_record()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "eMjFAR0Yenrk"
},
"source": [
"## 🤔 Like what you see?\n",
"\n",
"- Submit a batch workflow!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "VziAxDPtQEes"
},
"outputs": [],
"source": [
"batch_job = data_designer.submit_batch_workflow(num_records=25)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "dY1XI8q-Ru4z"
},
"outputs": [],
"source": [
"batch_job.status"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "fDAG5KmQeQ0m"
},
"outputs": [],
"source": [
"df = batch_job.fetch_dataset(wait_for_completion=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "y4joRe9aJZCM"
},
"outputs": [],
"source": [
"path = batch_job.download_evaluation_report()"
]
}
],
"metadata": {
"colab": {
"authorship_tag": "ABX9TyOULXxjB7a5FBgCdNl8vi0v",
"include_colab_link": true,
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.4"
}
},
"nbformat": 4,
"nbformat_minor": 0
}