Add NavDD text-to-code demo notebooks (#458)

* add notebooks
* fix colab link
* try again
gretelai · Nov 12, 2024 · 0d75883
1 parent f261306
commit 0d75883
Showing 2 changed files with 685 additions and 0 deletions.
diff --git a/docs/notebooks/demo/navigator/navigator-data-designer-sdk-text-to-python.ipynb b/docs/notebooks/demo/navigator/navigator-data-designer-sdk-text-to-python.ipynb
@@ -0,0 +1,308 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "view-in-github"
+   },
+   "source": [
+    "<a href=\"https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/demo/navigator/navigator-data-designer-sdk-text-to-python.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "MMMHiDmEcYZY"
+   },
+   "source": [
+    "# 🎨 Navigator Data Designer SDK: Text-to-Python\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "id": "mNoaC7dX28y0"
+   },
+   "outputs": [],
+   "source": [
+    "%%capture\n",
+    "!pip install -U git+https://github.com/gretelai/gretel-python-client"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {
+    "id": "1k5NjjtzPQJi"
+   },
+   "outputs": [],
+   "source": [
+    "from gretel_client.navigator import DataDesigner\n",
+    "\n",
+    "session_kwargs = {\n",
+    "    \"api_key\": \"prompt\",\n",
+    "    \"endpoint\": \"https://api.gretel.cloud\",\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "K2NzEYJedJeA"
+   },
+   "source": [
+    "## 📘 Text-to-Python Configuration\n",
+    "\n",
+    "Below we show an example Text-to-Python `DataDesigner` configuration. The main sections are as follows:\n",
+    "\n",
+    "- **model_suite:** Choose `apache-2.0` or `llama-3.x` depending on the license you want associated with the data you generate. Selecting `apache-2.0` ensures that all models used by Data Designer comply with the `apache-2.0` license; selecting `llama-3.x` means the models used by Data Designer fall under the `Llama 3` license.\n",
+    "\n",
+    "- **special_system_instructions:** This is an optional use-case-specific instruction to be added to the system prompt of all LLMs used during synthetic data generation.\n",
+    "\n",
+    "- **categorical_seed_columns:** Specifies categorical data seed columns that will be used to seed the synthetic data generation process. Here we fully specify all seed categories and subcategories. It is also possible to generate category values using the `num_new_values_to_generate` parameter.\n",
+    "\n",
+    "- **generated_data_columns:** Specifies data columns that are fully generated using LLMs, seeded by the categorical seed columns. The `generation_prompt` field is the prompt template that will be used to generate the data column. All data seeds and previously defined data columns can be used as template keyword arguments.\n",
+    "\n",
+    "- **post_processors:** Specifies the validation, evaluation, and processing steps applied to the dataset after generation. Here, we define a code validator and the `text_to_python` evaluation suite."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "0cxx8ensOqLl"
+   },
+   "outputs": [],
+   "source": [
+    "config_string = \"\"\"\n",
+    "model_suite: apache-2.0\n",
+    "\n",
+    "special_system_instructions: >-\n",
+    "  You are an expert at writing, analyzing, and editing Python code. You know what\n",
+    "  high-quality, clean, efficient, and maintainable Python code looks like. You\n",
+    "  excel at transforming natural language into Python, as well as Python back into\n",
+    "  natural language. Your job is to assist the user with their Python-related tasks.\n",
+    "\n",
+    "categorical_seed_columns:\n",
+    "  - name: industry_sector\n",
+    "    values: [Healthcare, Finance, Technology]\n",
+    "    subcategories:\n",
+    "      - name: topic\n",
+    "        values:\n",
+    "          Healthcare:\n",
+    "            - Electronic Health Records (EHR) Systems\n",
+    "            - Telemedicine Platforms\n",
+    "            - AI-Powered Diagnostic Tools\n",
+    "          Finance:\n",
+    "            - Fraud Detection Software\n",
+    "            - Automated Trading Systems\n",
+    "            - Personal Finance Apps\n",
+    "          Technology:\n",
+    "            - Cloud Computing Platforms\n",
+    "            - Artificial Intelligence and Machine Learning Platforms\n",
+    "            - DevOps and Continuous Integration/Continuous Deployment (CI/CD) Tools\n",
+    "\n",
+    "  - name: code_complexity\n",
+    "    values: [Intermediate, Advanced, Expert]\n",
+    "    subcategories:\n",
+    "      - name: code_concept\n",
+    "        values:\n",
+    "          Intermediate: [Functions, List Comprehensions, Classes]\n",
+    "          Advanced: [Object-oriented programming, Error Handling, Lambda Functions]\n",
+    "          Expert: [Decorators, Multithreading, Context Managers]\n",
+    "\n",
+    "  - name: prompt_type\n",
+    "    values: [instruction, question]\n",
+    "    subcategories:\n",
+    "      - name: prompt_creation_instruction\n",
+    "        values:\n",
+    "          instruction:\n",
+    "            - Write an instruction for a user to write Python code for a specific task.\n",
+    "            - Generate a clear and concise instruction for a Python programming challenge.\n",
+    "          question:\n",
+    "            - Ask a specific question about how to solve a problem using Python code.\n",
+    "            - Generate a question about how to perform a general task in Python.\n",
+    "\n",
+    "generated_data_columns:\n",
+    "    - name: text\n",
+    "      generation_prompt: >-\n",
+    "        {prompt_creation_instruction} \\n\n",
+    "\n",
+    "        ### Important Guidelines ###\n",
+    "            * Make sure the {prompt_type} is related to {topic} in the {industry_sector} sector.\n",
+    "            * Do not write any code as part of the {prompt_type}.\n",
+    "      columns_to_list_in_prompt: all_categorical_seed_columns\n",
+    "\n",
+    "    - name: code\n",
+    "      generation_prompt: >-\n",
+    "        Write Python code that will be paired with the following prompt:\n",
+    "        {text} \\n\n",
+    "\n",
+    "        ### Important Guidelines ###\n",
+    "            * Your code should be self-contained and executable.\n",
+    "            * Remember to import any necessary libraries.\n",
+    "            * The code should be written at a {code_complexity} level and make use of {code_concept}.\n",
+    "      llm_type: code\n",
+    "      columns_to_list_in_prompt: [industry_sector, topic]\n",
+    "\n",
+    "post_processors:\n",
+    "    - validator: code\n",
+    "      settings:\n",
+    "        code_lang: python\n",
+    "        code_columns: [code]\n",
+    "\n",
+    "    - evaluator: text_to_python\n",
+    "      settings:\n",
+    "        text_column: text\n",
+    "        code_column: code\n",
+    "\"\"\"\n",
+    "\n",
+    "data_designer = DataDesigner.from_config(config_string, **session_kwargs)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "NDxYu6azd3c4"
+   },
+   "source": [
+    "## 👀 Generating a dataset preview\n",
+    "\n",
+    "- Preview mode allows you to quickly iterate on your data design.\n",
+    "\n",
+    "- Each preview generation call creates 10 records for inspection."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "Ef8Ws90cPbIu"
+   },
+   "outputs": [],
+   "source": [
+    "preview = data_designer.generate_dataset_preview()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "Y5I2GjczNh_s"
+   },
+   "outputs": [],
+   "source": [
+    "# The preview dataset is accessible as a DataFrame\n",
+    "preview.dataset"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "CjYiKmcWd_2t"
+   },
+   "source": [
+    "## 🔎 Easily inspect individual records\n",
+    "\n",
+    "- Run the cell below to display individual records for inspection.\n",
+    "\n",
+    "- Run the cell multiple times to cycle through the 10 preview records.\n",
+    "\n",
+    "- Alternatively, you can pass the `index` argument to `display_sample_record` to display a specific record."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "fAWaJKnAP8ZJ"
+   },
+   "outputs": [],
+   "source": [
+    "preview.display_sample_record()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "eMjFAR0Yenrk"
+   },
+   "source": [
+    "## 🤔 Like what you see?\n",
+    "\n",
+    "- Submit a batch workflow!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "VziAxDPtQEes"
+   },
+   "outputs": [],
+   "source": [
+    "batch_job = data_designer.submit_batch_workflow(num_records=25)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "dY1XI8q-Ru4z"
+   },
+   "outputs": [],
+   "source": [
+    "batch_job.status"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "fDAG5KmQeQ0m"
+   },
+   "outputs": [],
+   "source": [
+    "df = batch_job.fetch_dataset(wait_for_completion=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "y4joRe9aJZCM"
+   },
+   "outputs": [],
+   "source": [
+    "path = batch_job.download_evaluation_report()"
+   ]
+  }
+ ],
+ "metadata": {
+  "colab": {
+   "authorship_tag": "ABX9TyOULXxjB7a5FBgCdNl8vi0v",
+   "include_colab_link": true,
+   "provenance": []
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
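Note on the notebook above: the configuration cell says that data seeds and previously defined columns can be used as template keyword arguments in `generation_prompt`. The sketch below illustrates that idea with plain Python only. It is not the Data Designer implementation: `str.format` stands in for whatever templating the service uses, uniform sampling is an assumption, and the names `INDUSTRY_TOPICS`, `PROMPT_INSTRUCTIONS`, `sample_seed_row`, and `render_prompt` are hypothetical helpers introduced for illustration. The seed values are a subset of those in the config.

```python
import random

# Subset of the seed categories from the notebook config (category -> subcategory values).
INDUSTRY_TOPICS = {
    "Healthcare": ["Electronic Health Records (EHR) Systems", "Telemedicine Platforms"],
    "Finance": ["Fraud Detection Software", "Automated Trading Systems"],
}
PROMPT_INSTRUCTIONS = {
    "instruction": ["Write an instruction for a user to write Python code for a specific task."],
    "question": ["Ask a specific question about how to solve a problem using Python code."],
}

# Same placeholders as the `text` column's generation_prompt in the config.
TEXT_PROMPT = (
    "{prompt_creation_instruction}\n\n"
    "### Important Guidelines ###\n"
    "* Make sure the {prompt_type} is related to {topic} in the {industry_sector} sector.\n"
    "* Do not write any code as part of the {prompt_type}.\n"
)


def sample_seed_row(rng):
    """Pick one value per seed column, then a subcategory value tied to that pick.

    Assumption: uniform sampling; the real service may sample differently.
    """
    industry_sector = rng.choice(sorted(INDUSTRY_TOPICS))
    prompt_type = rng.choice(sorted(PROMPT_INSTRUCTIONS))
    return {
        "industry_sector": industry_sector,
        "topic": rng.choice(INDUSTRY_TOPICS[industry_sector]),
        "prompt_type": prompt_type,
        "prompt_creation_instruction": rng.choice(PROMPT_INSTRUCTIONS[prompt_type]),
    }


def render_prompt(template, row):
    """Fill the template placeholders from the sampled seed row."""
    return template.format(**row)


if __name__ == "__main__":
    row = sample_seed_row(random.Random(0))
    print(render_prompt(TEXT_PROMPT, row))
```

Each rendered prompt pairs a subcategory value with its parent category, which is why the config nests `topic` under `industry_sector` rather than listing the two columns independently.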