diff --git a/docs/notebooks/demo/navigator/navigator-data-designer-sdk-text-to-python.ipynb b/docs/notebooks/demo/navigator/navigator-data-designer-sdk-text-to-python.ipynb new file mode 100644 index 00000000..99fcde21 --- /dev/null +++ b/docs/notebooks/demo/navigator/navigator-data-designer-sdk-text-to-python.ipynb @@ -0,0 +1,308 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "view-in-github" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MMMHiDmEcYZY" + }, + "source": [ + "# 🎨 Navigator Data Designer SDK: Text-to-Python\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "id": "mNoaC7dX28y0" + }, + "outputs": [], + "source": [ + "%%capture\n", + "!pip install -U git+https://github.com/gretelai/gretel-python-client" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "id": "1k5NjjtzPQJi" + }, + "outputs": [], + "source": [ + "from gretel_client.navigator import DataDesigner\n", + "\n", + "session_kwargs = {\n", + " \"api_key\": \"prompt\",\n", + " \"endpoint\": \"https://api.gretel.cloud\",\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K2NzEYJedJeA" + }, + "source": [ + "## 📘 Text-to-Python Configuration\n", + "\n", + "Below we show an example Text-to-Python `DataDesigner` configuration. The main sections are as follows:\n", + "\n", + "- **model_suite:** You can use `apache-2.0` or `llama-3.x` depending on the type of license you want associated with the data you generate. 
Selecting `apache-2.0` ensures that all models used by Data Designer comply with the `apache-2.0` license, and using `llama-3.x` means the models used by Data Designer will fall under the `Llama 3` license.\n", + "\n", + "- **special_system_instructions:** This is an optional use-case-specific instruction to be added to the system prompt of all LLMs used during synthetic data generation.\n", + "\n", + "- **categorical_seed_columns:** Specifies categorical data seed columns that will be used to seed the synthetic data generation process. Here we fully specify all seed categories and subcategories. It is also possible to generate category values using the `num_new_values_to_generate` parameter.\n", + "\n", + "- **generated_data_columns:** Specifies data columns that are fully generated using LLMs, seeded by the categorical seed columns. The `generation_prompt` field is the prompt template that will be used to generate the data column. All data seeds and previously defined data columns can be used as template keyword arguments.\n", + "\n", + "- **post_processors:** Specifies the validation, evaluation, and processing steps applied to the dataset after generation. Here, we define a code validator and the `text_to_python` evaluation suite." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "0cxx8ensOqLl" + }, + "outputs": [], + "source": [ + "config_string = \"\"\"\n", + "model_suite: apache-2.0\n", + "\n", + "special_system_instructions: >-\n", + " You are an expert at writing, analyzing, and editing Python code. You know what\n", + " a high-quality, clean, efficient, and maintainable Python code looks like. You\n", + " excel at transforming natural language into Python, as well as Python back into\n", + " natural language. 
Your job is to assist the user with their Python-related tasks.\n", + "\n", + "categorical_seed_columns:\n", + " - name: industry_sector\n", + " values: [Healthcare, Finance, Technology]\n", + " subcategories:\n", + " - name: topic\n", + " values:\n", + " Healthcare:\n", + " - Electronic Health Records (EHR) Systems\n", + " - Telemedicine Platforms\n", + " - AI-Powered Diagnostic Tools\n", + " Finance:\n", + " - Fraud Detection Software\n", + " - Automated Trading Systems\n", + " - Personal Finance Apps\n", + " Technology:\n", + " - Cloud Computing Platforms\n", + " - Artificial Intelligence and Machine Learning Platforms\n", + " - DevOps and Continuous Integration/Continuous Deployment (CI/CD) Tools\n", + "\n", + " - name: code_complexity\n", + " values: [Intermediate, Advanced, Expert]\n", + " subcategories:\n", + " - name: code_concept\n", + " values:\n", + " Intermediate: [Functions, List Comprehensions, Classes]\n", + " Advanced: [Object-oriented programming, Error Handling, Lambda Functions]\n", + " Expert: [Decorators, Multithreading, Context Managers]\n", + "\n", + " - name: prompt_type\n", + " values: [instruction, question]\n", + " subcategories:\n", + " - name: prompt_creation_instruction\n", + " values:\n", + " instruction:\n", + " - Write an instruction for a user to write Python code for a specific task.\n", + " - Generate a clear and concise instruction for a Python programming challenge.\n", + " question:\n", + " - Ask a specific question about how to solve a problem using Python code.\n", + " - Generate a question about how to perform a general task in Python.\n", + "\n", + "generated_data_columns:\n", + " - name: text\n", + " generation_prompt: >-\n", + " {prompt_creation_instruction} \\n\n", + "\n", + " ### Important Guidelines ###\n", + " * Make sure the {prompt_type} is related to {topic} in the {industry_sector} sector.\n", + " * Do not write any code as part of the {prompt_type}.\n", + " columns_to_list_in_prompt: 
all_categorical_seed_columns\n", + "\n", + " - name: code\n", + " generation_prompt: >-\n", + " Write Python code that will be paired with the following prompt:\n", + " {text} \\n\n", + "\n", + " ### Important Guidelines ###\n", + " * Your code should be self-contained and executable.\n", + " * Remember to import any necessary libraries.\n", + " * The code should be written at a {code_complexity} level and make use of {code_concept}.\n", + " llm_type: code\n", + " columns_to_list_in_prompt: [industry_sector, topic]\n", + "\n", + "post_processors:\n", + " - validator: code\n", + " settings:\n", + " code_lang: python\n", + " code_columns: [code]\n", + "\n", + " - evaluator: text_to_python\n", + " settings:\n", + " text_column: text\n", + " code_column: code\n", + "\"\"\"\n", + "\n", + "data_designer = DataDesigner.from_config(config_string, **session_kwargs)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NDxYu6azd3c4" + }, + "source": [ + "## 👀 Generating a dataset preview\n", + "\n", + "- Preview mode allows you to quickly iterate on your data design.\n", + "\n", + "- Each preview generation call creates 10 records for inspection." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Ef8Ws90cPbIu" + }, + "outputs": [], + "source": [ + "preview = data_designer.generate_dataset_preview()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Y5I2GjczNh_s" + }, + "outputs": [], + "source": [ + "# The preview dataset is accessible as a DataFrame\n", + "preview.dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CjYiKmcWd_2t" + }, + "source": [ + "## 🔎 Easily inspect individual records\n", + "\n", + "- Run the cell below to display individual records for inspection.\n", + "\n", + "- Run the cell multiple times to cycle through the 10 preview records.\n", + "\n", + "- Alternatively, you can pass the `index` argument to `display_sample_record` to display a specific record." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "fAWaJKnAP8ZJ" + }, + "outputs": [], + "source": [ + "preview.display_sample_record()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eMjFAR0Yenrk" + }, + "source": [ + "## 🤔 Like what you see?\n", + "\n", + "- Submit a batch workflow!" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VziAxDPtQEes" + }, + "outputs": [], + "source": [ + "batch_job = data_designer.submit_batch_workflow(num_records=25)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "dY1XI8q-Ru4z" + }, + "outputs": [], + "source": [ + "batch_job.status" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "fDAG5KmQeQ0m" + }, + "outputs": [], + "source": [ + "df = batch_job.fetch_dataset(wait_for_completion=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "y4joRe9aJZCM" + }, + "outputs": [], + "source": [ + "path = batch_job.download_evaluation_report()" + ] + } + ], + "metadata": { + "colab": { + "authorship_tag": "ABX9TyOULXxjB7a5FBgCdNl8vi0v", + "include_colab_link": true, + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.4" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/docs/notebooks/demo/navigator/navigator-data-designer-sdk-text-to-sql.ipynb b/docs/notebooks/demo/navigator/navigator-data-designer-sdk-text-to-sql.ipynb new file mode 100644 index 00000000..379ecbd6 --- /dev/null +++ b/docs/notebooks/demo/navigator/navigator-data-designer-sdk-text-to-sql.ipynb @@ -0,0 +1,377 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "view-in-github" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MMMHiDmEcYZY" + }, + "source": [ + "# 🎨 Navigator Data Designer SDK: Text-to-SQL" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "mNoaC7dX28y0" + }, + 
"outputs": [], + "source": [ + "%%capture\n", + "!pip install -U git+https://github.com/gretelai/gretel-python-client" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "1k5NjjtzPQJi" + }, + "outputs": [], + "source": [ + "from gretel_client.navigator import DataDesigner\n", + "\n", + "session_kwargs = {\n", + " \"api_key\": \"prompt\",\n", + " \"endpoint\": \"https://api.gretel.cloud\",\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K2NzEYJedJeA" + }, + "source": [ + "## 📘 Text-to-SQL Configuration\n", + "\n", + "In this example, we want an LLM to help us generate _values_ for some data seed categories / subcategories, as specified by the `num_new_values_to_generate` parameter.\n", + "\n", + "- `num_new_values_to_generate` indicates that we want to generate this many new values, in addition to any that exist in the config.\n", + "\n", + "- If both `values` and `num_new_values_to_generate` are present, then the existing values are used as examples for generation.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "0cxx8ensOqLl" + }, + "outputs": [], + "source": [ + "config = \"\"\"\n", + "model_suite: llama-3.x\n", + "\n", + "special_system_instructions: >-\n", + " You are an expert at writing, analyzing and editing SQL queries. You know what\n", + " a high-quality, clean, efficient, and maintainable SQL code looks like. You\n", + " excel at transforming natural language into SQL, as well as SQL back into\n", + " natural language. 
Your job is to assist the user with their SQL-related tasks.\n", + " Leverage T-SQL only.\n", + "\n", + "categorical_seed_columns:\n", + " - name: domain\n", + " description: Major industry domain or sector that relies on robust data solutions\n", + " values: [Healthcare, Finance, Education, Science and Technology, Environmental Science, Government]\n", + " num_new_values_to_generate: 5\n", + " subcategories:\n", + " - name: domain_description\n", + " description: High-level description of the domain, highlighting various types of data relevant to writing SQL\n", + " num_new_values_to_generate: 1\n", + " - name: topic\n", + " description: Key topics that professional SQL developers care about in the given domain\n", + " num_new_values_to_generate: 15\n", + "\n", + " - name: sql_complexity\n", + " description: Complexity of the SQL query, ranging from basic operations to advanced data processing techniques\n", + " values:\n", + " - \"Basic SQL\"\n", + " - \"Aggregation\"\n", + " - \"Single Join\"\n", + " - \"Subquery\"\n", + " - \"Multiple Join\"\n", + " - \"Window Functions\"\n", + " subcategories:\n", + " - name: sql_complexity_description\n", + " description: Description of the complexity level of the SQL query\n", + " num_new_values_to_generate: 1\n", + "\n", + " - name: sql_task_type\n", + " description: Type of SQL task that the query represents\n", + " values:\n", + " - \"Data Retrieval\"\n", + " - \"Data Definition\"\n", + " - \"Data Manipulation\"\n", + " - \"Analytics and Reporting\"\n", + " - \"Database Administration\"\n", + " - \"Data Cleaning and Transformation\"\n", + " subcategories:\n", + " - name: sql_task_type_description\n", + " description: Description of the type of SQL task\n", + " num_new_values_to_generate: 1\n", + "\n", + "generated_data_columns:\n", + " - name: sql_prompt\n", + " generation_prompt: >-\n", + " Create a natural language prompt to generate SQL in the field of {domain},\n", + " specifically about the topic of {topic}. 
Feel free to ask for data that\n", + " focus on a smaller subject within the scope of {domain_description}.\n", + " columns_to_list_in_prompt: all_categorical_seed_columns\n", + " llm_type: natural_language\n", + "\n", + " - name: sql_context\n", + " generation_prompt: >-\n", + " Write a SQL query that generates tables and views in a database and are\n", + " pertinent to the natural language prompt in {sql_prompt}.\n", + "\n", + " Include complete executable SQL table CREATE statements and/or view CREATE statements.\n", + " Provide up to five tables/views that are relevant to the user's natural language prompt.\n", + " Table names and schemas should correspond to the {domain} domain and focus on {domain_description}\n", + " columns_to_list_in_prompt: [domain, domain_description, topic, sql_prompt]\n", + " llm_type: code\n", + "\n", + " - name: sql\n", + " generation_prompt: >-\n", + " Write an SQL query to answer/execute the natural language prompt in\n", + " {sql_prompt}.\n", + "\n", + " SQL should be based on the database context generated in {sql_context}.\n", + " SQL should leverage {sql_complexity}.\n", + " columns_to_list_in_prompt: [domain, topic, sql_complexity, sql_task_type]\n", + " llm_type: code\n", + "\n", + "\n", + "post_processors:\n", + " - validator: code\n", + " settings:\n", + " code_lang: tsql\n", + " code_columns: [sql_context, sql]\n", + "\n", + " - evaluator: text_to_sql\n", + " settings:\n", + " text_column: sql_prompt\n", + " code_column: sql\n", + " context_column: sql_context\n", + "\"\"\"\n", + "\n", + "data_designer = DataDesigner.from_config(config, **session_kwargs)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ApZ7xb8dPOO0" + }, + "outputs": [], + "source": [ + "data_designer" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iapaPqrN6w4T" + }, + "source": [ + "## 🌱 Generating categorical seed _values_\n", + "\n", + "If some/all of your categorical data seeds have values that 
need to be generated (as is the case for this example), you have two choices:\n", + "\n", + "1. Generate them every time you generate a preview dataset and/or batch workflow. In this case, you simply call `data_designer.generate_dataset_preview` or `data_designer.submit_batch_workflow` without providing `data_seeds` as input.\n", + "\n", + "2. Generate them once using `data_designer.generate_seed_category_values` and then pass the resulting `data_seeds` as input when generating a preview / batch workflow, as we will show below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "7N_VqvICN9In" + }, + "outputs": [], + "source": [ + "data_seeds = data_designer.generate_seed_category_values()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "7zwFpB24ZAHf" + }, + "outputs": [], + "source": [ + "data_seeds" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "4LDG93_KOcF2" + }, + "outputs": [], + "source": [ + "data_seeds.inspect()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NDxYu6azd3c4" + }, + "source": [ + "## 👀 Generating a dataset preview\n", + "\n", + "- You can run `generate_seed_category_values` multiple times.\n", + "\n", + "- Once you are happy with the results, you can pass `data_seeds` as input to the preview / batch generation methods.\n", + "\n", + "- Notice that Step 1 now loads the data seeds rather than generating them." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Ef8Ws90cPbIu" + }, + "outputs": [], + "source": [ + "preview = data_designer.generate_dataset_preview(data_seeds=data_seeds)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "preview.dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CjYiKmcWd_2t" + }, + "source": [ + "## 🔎 Taking a closer look at single records" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "nSMFirnBMXtb" + }, + "outputs": [], + "source": [ + "preview.display_sample_record(5)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "preview.dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eMjFAR0Yenrk" + }, + "source": [ + "## 🤔 Like what you see?\n", + "\n", + "- Submit a batch workflow!\n", + "\n", + "- Notice we pass `data_seeds` as an argument to `data_designer.submit_batch_workflow` so we use the same data seeds any time we run this workflow." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VziAxDPtQEes" + }, + "outputs": [], + "source": [ + "batch_job = data_designer.submit_batch_workflow(num_records=25, data_seeds=data_seeds)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "zRIrNifpj5vO" + }, + "outputs": [], + "source": [ + "batch_job.status" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "fDAG5KmQeQ0m" + }, + "outputs": [], + "source": [ + "df = batch_job.fetch_dataset(wait_for_completion=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "0I9rT4mNOLTh" + }, + "outputs": [], + "source": [ + "path = batch_job.download_evaluation_report()" + ] + } + ], + "metadata": { + "colab": { + "include_colab_link": true, + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.4" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +}