Standardize Spark Gem format #450

Merged · 3 commits · Nov 27, 2024
2 changes: 1 addition & 1 deletion docs/Spark/gems/custom/delta-table-operations.md
@@ -1,6 +1,6 @@
---
sidebar_position: 4
-title: Delta Table Operations
+title: DeltaTableOperations
id: delta-ops
description: Gem that encompasses some of the import side operations of Delta
tags:
4 changes: 2 additions & 2 deletions docs/Spark/gems/custom/file-operation.md
@@ -1,14 +1,14 @@
---
sidebar_position: 3
-title: File Operation
+title: FileOperation
id: file-operations
description: Perform file operations on different file systems
tags:
- file
- dbfs
---

-Helps perform file operations like `copy` and `move` on different file systems
+Helps perform file operations like `copy` and `move` on different file systems.
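
For comparison, the same operations can be hand-written. A minimal sketch, assuming a Databricks environment where `dbutils` is available; the paths are placeholders:

```python
# Copy a file between locations on DBFS; dbutils is provided by Databricks.
dbutils.fs.cp("dbfs:/raw/orders.csv", "dbfs:/staging/orders.csv")

# Move (copy + delete) a file to an archive folder.
dbutils.fs.mv("dbfs:/staging/orders.csv", "dbfs:/archive/orders.csv")
```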

## Parameters

2 changes: 1 addition & 1 deletion docs/Spark/gems/custom/rest-api-enrich.md
@@ -1,6 +1,6 @@
---
sidebar_position: 5
-title: Rest API Enrich
+title: RestAPIEnrich
id: rest-api-enrich
description: Enrich DataFrame with content from rest API response based on configuration
tags:
4 changes: 2 additions & 2 deletions docs/Spark/gems/custom/sql-statement.md
@@ -1,6 +1,6 @@
---
sidebar_position: 1
-title: SQL Statement
+title: SQLStatement
id: sql-statement
description: Create DataFrames based on custom SQL queries
tags:
@@ -9,7 +9,7 @@ tags:
- custom
---

-Create one or more DataFrame(s) based on provided SQL queries to run against one or more input DataFrame(s).
+Create one or more DataFrame(s) based on provided SQL queries to run against one or more input DataFrames.
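
For comparison, a hand-written PySpark equivalent might look like the following sketch; the view name `in0`, the input DataFrame, and the query are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Expose the input DataFrame to SQL as a temporary view.
in0.createOrReplaceTempView("in0")

# Each query produces one output DataFrame.
out0 = spark.sql("SELECT customer_id, SUM(amount) AS total FROM in0 GROUP BY customer_id")
```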

### Parameters

4 changes: 2 additions & 2 deletions docs/Spark/gems/join-split/compare-columns.md
@@ -1,6 +1,6 @@
---
sidebar_position: 4
-title: Compare Columns
+title: CompareColumns
id: compare-columns
description: Compare columns between two dataframes
tags:
@@ -10,7 +10,7 @@ tags:
- compare-columns
---

-Compare columns between two DataFrame based on the key id columns defined
+The CompareColumns Gem lets you compare columns between two DataFrames based on the key id columns defined.
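
Roughly equivalent hand-written PySpark, assuming hypothetical inputs `in0` and `in1` keyed by an `id` column, with a single compared column `amount`:

```python
from pyspark.sql import functions as F

# Join both inputs on the key column, then flag value mismatches per column.
joined = in0.alias("a").join(in1.alias("b"), on="id", how="outer")

compared = joined.select(
    "id",
    F.col("a.amount").alias("amount_left"),
    F.col("b.amount").alias("amount_right"),
    (F.col("a.amount") == F.col("b.amount")).alias("amount_matches"),
)
```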

## Parameters

4 changes: 2 additions & 2 deletions docs/Spark/gems/join-split/row-distributor.md
@@ -1,6 +1,6 @@
---
sidebar_position: 3
-title: Row Distributor
+title: RowDistributor
id: row-distributor
description: Create multiple DataFrames based on filter conditions
tags:
@@ -10,7 +10,7 @@ tags:
- row distributor
---

-Create multiple DataFrames based on provided filter conditions from an input DataFrame.
+Use the RowDistributor Gem to create multiple DataFrames based on provided filter conditions from an input DataFrame.

This is useful for cases where rows from the input DataFrame need to be distributed into multiple DataFrames in different ways for downstream Gems.
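
A rough hand-written equivalent, with hypothetical filter conditions on a `country` column of an input `in0`:

```python
from pyspark.sql import functions as F

# Each filter condition yields its own output DataFrame from the same input.
out0 = in0.filter(F.col("country") == "US")
out1 = in0.filter(F.col("country") == "CA")
out2 = in0.filter(~F.col("country").isin("US", "CA"))
```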

16 changes: 8 additions & 8 deletions docs/Spark/gems/machine-learning/ml-pinecone-lookup.md
@@ -1,6 +1,6 @@
---
sidebar_position: 3
-title: Pinecone Lookup
+title: PineconeLookup
id: ml-pinecone-lookup
description: Lookup a vector embedding from a Pinecone Database
tags: [generative-ai, machine-learning, llm, pinecone, openai]
@@ -14,7 +14,7 @@ tags: [generative-ai, machine-learning, llm, pinecone, openai]

<br />

-The Pinecone Lookup Gem identifies content that is similar to a provided vector embedding. The Gem calls the Pinecone API and returns a set of IDs with highest similarity to the provided embedding.
+The PineconeLookup Gem identifies content that is similar to a provided vector embedding. The Gem calls the Pinecone API and returns a set of IDs with highest similarity to the provided embedding.
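
Under the hood this corresponds to a Pinecone similarity query. A minimal sketch, assuming the classic `pinecone-client` Python package; the API key, environment, index name, and embedding are all placeholders:

```python
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="us-east-1-aws")
index = pinecone.Index("my-index")  # hypothetical index name

query_embedding = [0.12, -0.45, 0.33]  # placeholder; real embeddings have many more dimensions

# Return the 3 IDs most similar to the query embedding.
result = index.query(vector=query_embedding, top_k=3)
# result["matches"] -> e.g. [{"id": "web-223", "score": 0.84}, ...]
```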

- [**Parameters:**](https://docs.prophecy.io/Spark/gems/machine-learning/ml-pinecone-lookup#gem-parameters) Configure the parameters needed to call the Pinecone API.

@@ -40,15 +40,15 @@ Hardcoding the Pinecone credential is not recommended. Selecting this option cou

#### Properties

-Pinecone DB uses indexing to map the vectors to a data structure that will enable faster searching. The Pinecone Lookup Gem searches through a Pinecone index to identify embeddings with similarity to the input embedding. Enter the Pinecone **[(4) Index name](https://docs.prophecy.io/Spark/gems/machine-learning/ml-pinecone-lookup#faq)** which you’d like to use for looking up embeddings.
+Pinecone DB uses indexing to map the vectors to a data structure that will enable faster searching. The PineconeLookup Gem searches through a Pinecone index to identify embeddings with similarity to the input embedding. Enter the Pinecone **[(4) Index name](https://docs.prophecy.io/Spark/gems/machine-learning/ml-pinecone-lookup#faq)** which you’d like to use for looking up embeddings.

-Select one of the Gem’s input columns with vector embeddings as the **(5) Vector column** to send to Pinecone’s API. The column [must](https://docs.prophecy.io/Spark/gems/machine-learning/ml-pinecone-lookup#input) be compatible with the Pinecone Index. To change the column’s datatype and properties, [configure](https://docs.prophecy.io/Spark/gems/machine-learning/ml-pinecone-lookup#faq) the Gem(s) preceding the Pinecone Lookup Gem.
+Select one of the Gem’s input columns with vector embeddings as the **(5) Vector column** to send to Pinecone’s API. The column [must](https://docs.prophecy.io/Spark/gems/machine-learning/ml-pinecone-lookup#input) be compatible with the Pinecone Index. To change the column’s datatype and properties, [configure](https://docs.prophecy.io/Spark/gems/machine-learning/ml-pinecone-lookup#faq) the Gem(s) preceding the PineconeLookup Gem.

Pinecone’s API can return multiple results. Depending on the use case, select the desired **(6) Number of results** sorted by similarity score. The result with highest similarity to the user’s text question will be listed first.

### Input

-Pinecone Lookup requires a model_embedding column as input. Use one of Prophecy's Machine Learning Gems to provide the model_embedding. For example, the OpenAI Gem can precede the Pinecone Lookup Gem in the Pipeline. The OpenAI Gem, configured to `Compute a text embedding`, will output an openai_embedding column. This is a suitable input for the Pinecone Lookup Gem.
+PineconeLookup requires a model_embedding column as input. Use one of Prophecy's Machine Learning Gems to provide the model_embedding. For example, the OpenAI Gem can precede the PineconeLookup Gem in the Pipeline. The OpenAI Gem, configured to `Compute a text embedding`, will output an openai_embedding column. This is a suitable input for the PineconeLookup Gem.

| Column | Description | Required |
| --------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------- |
@@ -61,9 +61,9 @@ The output Dataset contains the pinecone_matches and pinecone_error columns. For
| Column | Description |
| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| pinecone_matches | array - an array of several content IDs and their scores. Example: `[{"id":"web-223","score":0.8437653},{"id":"web-224","score":0.8403446}, ...{"id":"web-237","score":0.82916564}]` |
-| pinecone_error | string - this column is provided to show any error message returned from Pinecone’s API; helpful for troubleshooting errors related to the Pinecone Lookup Gem. |
+| pinecone_error | string - this column is provided to show any error message returned from Pinecone’s API; helpful for troubleshooting errors related to the PineconeLookup Gem. |

-Prophecy converts the visual design into Spark code available on the Prophecy user's Git repository. Find the Spark code for the Pinecone Lookup Gem below.
+Prophecy converts the visual design into Spark code available on the Prophecy user's Git repository. Find the Spark code for the PineconeLookup Gem below.

````mdx-code-block
import Tabs from '@theme/Tabs';
@@ -105,7 +105,7 @@ def vector_lookup(Spark: SparkSession, in0: DataFrame) -> DataFrame:

#### Troubleshooting

-To troubleshoot the Gem preceding Pinecone Lookup, open the data preview output from the previous Gem. For example if the embedding structure is incorrect then try adjusting the previous Gem, run, and view that Gem’s output data preview.
+To troubleshoot the Gem preceding PineconeLookup, open the data preview output from the previous Gem. For example if the embedding structure is incorrect then try adjusting the previous Gem, run, and view that Gem’s output data preview.

#### Creating a Pinecone Index

2 changes: 1 addition & 1 deletion docs/Spark/gems/machine-learning/ml-text-processing.md
@@ -1,6 +1,6 @@
---
sidebar_position: 1
-title: Text Processing
+title: TextProcessing
id: ml-text-processing
description: Text processing to prepare data to submit to a foundational model API.
tags:
2 changes: 1 addition & 1 deletion docs/Spark/gems/subgraph/basicSubgraph.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
---
sidebar_position: 1
-title: Basic Subgraph
+title: Basic subgraph
id: basic-subgraph
description: Basic Subgraph, Group your Gems in reusable Parent Gems.
tags:
14 changes: 7 additions & 7 deletions docs/Spark/gems/subgraph/tableIterator.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
---
sidebar_position: 2
-title: Table Iterator
+title: TableIterator
id: table-iterator
description: Loop over each row of an input Dataframe
tags:
@@ -9,21 +9,21 @@ tags:
- iterator
---

-Table Iterator allows you to iterate over one or more Gems for each row of the first input DataFrame.
+TableIterator allows you to iterate over one or more Gems for each row of the first input DataFrame.
Let's see how to create a Basic Loop that loops over a Metadata Table and, for each row of the table, runs the Gems inside the Subgraph.
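
Conceptually, the loop behaves like the following sketch; `metadata_df`, the `table_name` column, and `run_subgraph` are placeholders, with `run_subgraph` standing in for the Gems inside the Subgraph:

```python
# Run the Subgraph body once per row of the first input DataFrame.
for row in metadata_df.collect():
    table_name = row["table_name"]  # any column of the row can be referenced inside the loop
    run_subgraph(table_name)        # stand-in for the Gems inside the Subgraph
```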

-## Creating a Table Iterator Gem
+## Creating a TableIterator Gem

First, add the Input Gem that you want to iterate over. For this, simply use an existing Dataset or create a new [Source Gem](/docs/Spark/gems/source-target/source-target.md) pointing to your Metadata table.
You can run this Source Gem to preview the data your loop will iterate over.

-Now, Drag and Drop the **(1) Table Iterator** Gem from the Subgraph menu, and connect it to the above created Source Gem.
+Now, Drag and Drop the **(1) TableIterator** Gem from the Subgraph menu, and connect it to the above created Source Gem.

![Create_table_iterator](img/Create_table_iterator.png)

-## Configure the Table Iterator
+## Configure the TableIterator

-Open the Table Iterator Gem, and click on **(1) Configure** to open the Settings dialog.
+Open the TableIterator Gem, and click on **(1) Configure** to open the Settings dialog.
Here, on the left side panel, you can edit the **(2) Name** of your Gem and check the **(3) Input Schema** for your DataFrame on which the loop will iterate.

On the right side, you can define your Iterator Settings, and any other Subgraph Configs you want to use in the Subgraph.
@@ -70,7 +70,7 @@ Click on the **(2) Iteration** button, and it will open up the Iterations table

## Adding Inputs and Outputs to TableIterator

-For a Table Iterator Gem, the first input port is for your DataFrame on which you want to Iterate Over.
+For a TableIterator Gem, the first input port is for your DataFrame on which you want to Iterate Over.
You can **(1) Add** more Inputs or switch to the **(2) Output** tab to add more Outputs as needed. These extra inputs do not change between iterations.
Also, the output will be a union of the outputs of all iterations. You can **(3) Delete** any port by hovering over it.
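
As a sketch, the union of iteration outputs can be pictured like this, with `run_subgraph` again a placeholder for the Subgraph body:

```python
from functools import reduce

# Collect one output DataFrame per iteration, then union them all.
outputs = [run_subgraph(row) for row in metadata_df.collect()]
result = reduce(lambda a, b: a.unionByName(b), outputs)
```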

6 changes: 3 additions & 3 deletions docs/Spark/gems/transform/bulk-column-expressions.md
@@ -1,6 +1,6 @@
---
sidebar_position: 11
-title: Bulk Column Expressions
+title: BulkColumnExpressions
id: bulk-column-expressions
description: Change the data type of multiple columns at once.
tags:
@@ -9,7 +9,7 @@ tags:
- columns
---

-The Bulk Column Expressions Gem primarily lets you cast or change the data type of multiple columns at once. It provides additional functionality, including:
+The BulkColumnExpressions Gem primarily lets you cast or change the data type of multiple columns at once. It provides additional functionality, including:

- Adding a prefix or suffix to selected columns.
- Applying a custom expression to selected columns.
@@ -28,7 +28,7 @@ The Bulk Column Expressions Gem primarily lets you cast or change the data type

Assume you have some columns in a table that represent zero-based indices and are stored as long data types. You want them to represent one-based indices and be stored as integers to optimize memory use.

-Using the Bulk Column Expressions Gem, you can:
+Using the BulkColumnExpressions Gem, you can:

- Filter your columns by long data types.
- Select the columns you wish to transform.
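
A hand-written PySpark sketch of the example above (zero-based `bigint` index columns shifted to one-based integers); `df` is a placeholder:

```python
from pyspark.sql import functions as F

# Find the long (bigint) columns, then shift to one-based and downcast.
index_cols = [name for name, dtype in df.dtypes if dtype == "bigint"]
for name in index_cols:
    df = df.withColumn(name, (F.col(name) + 1).cast("int"))
```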
4 changes: 2 additions & 2 deletions docs/Spark/gems/transform/bulk-column-rename.md
@@ -1,6 +1,6 @@
---
sidebar_position: 10
-title: Bulk Column Rename
+title: BulkColumnRename
id: bulk-column-rename
description: Rename multiple columns in your Dataset in a systematic way.
tags:
@@ -9,7 +9,7 @@ tags:
- columns
---

-Use the Bulk Column Rename Gem to rename multiple columns in your Dataset in a systematic way.
+Use the BulkColumnRename Gem to rename multiple columns in your Dataset in a systematic way.
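
A minimal hand-written equivalent of one such systematic rename, adding a hypothetical `src_` prefix to every column of a placeholder `df`:

```python
# Rename all columns at once by rebuilding the DataFrame with new names.
renamed = df.toDF(*[f"src_{name}" for name in df.columns])
```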

## Parameters

6 changes: 3 additions & 3 deletions docs/Spark/gems/transform/data-cleansing.md
@@ -1,6 +1,6 @@
---
sidebar_position: 12
-title: Data Cleansing
+title: DataCleansing
id: data-cleansing
description: Standardize data formats and address missing or null values in the data.
tags:
@@ -9,7 +9,7 @@ tags:
- format
---

-Use the Data Cleansing Gem to standardize data formats and address missing or null values in the data.
+Use the DataCleansing Gem to standardize data formats and address missing or null values in the data.

## Parameters

@@ -22,6 +22,6 @@ Use the Data Cleansing Gem to standardize data formats and address missing or nu

## Example

-Assume you have a table that includes customer feedback on individual orders. In this scenario, some customers may not provide feedback, resulting in null values in the data. You can use the Data Cleansing Gem to replace null values with the string `NA`.
+Assume you have a table that includes customer feedback on individual orders. In this scenario, some customers may not provide feedback, resulting in null values in the data. You can use the DataCleansing Gem to replace null values with the string `NA`.
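
The hand-written replacement for this example is a one-liner; the `feedback` column name is hypothetical:

```python
# Replace nulls in the feedback column with the string "NA".
cleansed = df.fillna({"feedback": "NA"})
```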

![Replace null with string](./img/replace-null-with-string.png)
8 changes: 4 additions & 4 deletions docs/Spark/gems/transform/dynamic-select.md
@@ -1,6 +1,6 @@
---
sidebar_position: 13
-title: Dynamic Select
+title: DynamicSelect
id: dynamic-select
description: Dynamically filter columns of your dataset based on a set of conditions.
tags:
@@ -9,11 +9,11 @@ tags:
- dynamic
---

-Use the Dynamic Select Gem to dynamically filter columns of your Dataset based on a set of conditions.
+Use the DynamicSelect Gem to dynamically filter columns of your Dataset based on a set of conditions.

## Configuration

-There are two ways to configure the Dynamic Select.
+There are two ways to configure the DynamicSelect.

| Configuration | Description |
| --------------------- | --------------------------------------------------------------------------------------------- |
@@ -22,7 +22,7 @@ There are two ways to configure the Dynamic Select.

## Examples

-You’ll use Dynamic Select when you want to avoid hard-coding your choice of columns. In other words, rather than define each column to keep in your Pipeline, you let the system automatically choose the columns based on certain conditions or rules.
+You’ll use DynamicSelect when you want to avoid hard-coding your choice of columns. In other words, rather than define each column to keep in your Pipeline, you let the system automatically choose the columns based on certain conditions or rules.
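
In plain PySpark, that kind of rule-based selection looks roughly like this sketch, here keeping every column that is not a date or timestamp; `df` is a placeholder:

```python
# Choose columns by data type instead of hard-coding their names.
keep = [name for name, dtype in df.dtypes if dtype not in ("date", "timestamp")]
df = df.select(*keep)
```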

### Remove date columns using field type

12 changes: 6 additions & 6 deletions docs/Spark/gems/transform/flattenschema.md
@@ -1,6 +1,6 @@
---
sidebar_position: 5
-title: Flatten Schema
+title: FlattenSchema
id: flatten-schema
description: Flatten nested data
tags:
@@ -10,7 +10,7 @@ tags:
- flatten
---

-When processing raw data it can be useful to flatten complex data types like `Struct`s and `Array`s into simpler, flatter schemas.
+When processing raw data it can be useful to flatten complex data types like structures and arrays into simpler, flatter schemas.

![The FlattenSchema gem](./img/flatten_gem.png)

@@ -26,19 +26,19 @@ And the data looks like so:

![Input data](./img/flatten_input_interim.png)

-We want to extract `count`, and all of the columns from the `struct`s in `events` into a flattened schema.
+We want to extract `count` from _result_ and all of the columns from _events_ into a flattened schema.
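
In hand-written PySpark, flattening typically combines `explode` for arrays with struct-field selection. A sketch, assuming a `result` struct holding `count` and an `events` array of structs, as in the example above:

```python
from pyspark.sql import functions as F

# One output row per array element; struct fields become top-level columns.
flat = (
    df
    .withColumn("event", F.explode("events"))
    .select(F.col("result.count").alias("count"), "event.*")
)
```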

## The Expressions

-Having added a `FlattenSchema` Gem to your Pipeline, all you need to do is click the column names you wish to extract and they'll be added to the `Expressions` section. Once added you can change the `Target Column` for a given row to change the name of the Column in the output.
+Having added a FlattenSchema Gem to your Pipeline, all you need to do is click the column names you wish to extract and they'll be added to the **Expressions** section. Then, you can change the values in the **Target Column** to change the name of output columns.

![Adding Expressions](./img/flatten_add_exp.gif)

-The `Columns Delimiter` dropdown allows you to control how the names of the new columns are derived. Currently dashes and underscores are supported.
+The **Columns Delimiter** dropdown allows you to control how the names of the new columns are derived. Currently dashes and underscores are supported.

## The Output

-If we check the `Output` tab in the Gem, you'll see the schema that we've created using the selected columns.
+If we check the **Output** tab in the Gem, you'll see the schema that we've created using the selected columns.

![Output schema](./img/flatten_output.png)

2 changes: 1 addition & 1 deletion docs/Spark/gems/transform/order-by.md
@@ -1,6 +1,6 @@
---
sidebar_position: 3
-title: Order By
+title: OrderBy
id: order-by
description: Sort your data based on one or more Columns
tags:
8 changes: 4 additions & 4 deletions docs/Spark/gems/transform/schema-transform.md
@@ -1,6 +1,6 @@
---
sidebar_position: 5
-title: Schema Transform
+title: SchemaTransform
id: schema-transform
description: Add, Edit, Rename or Drop Columns
tags:
@@ -80,19 +80,19 @@ object transform {

## Advanced Import

-The Advanced Import feature allows you to bulk import statements that are structured similarly to CSV/TSV files. This can be useful if you have your expressions/transformation logic in another format and just want to quickly configure a `Schema Transform` Gem based on existing logic.
+The Advanced Import feature allows you to bulk import statements that are structured similarly to CSV/TSV files. This can be useful if you have your expressions/transformation logic in another format and just want to quickly configure a SchemaTransform Gem based on existing logic.

### Using Advanced Import

-1. Click the `Advanced` button in the `Schema Transform` Gem UI
+1. Click the **Advanced** button in the SchemaTransform Gem UI

![Advanced import toggle](./img/schematransform_advanced_1.png)

2. Enter the expressions into the text area using the format as described below:

![Advanced import mode](./img/schematransform_advanced_2.png)

-3. Use the button at the top (labeled `Expressions`) to switch back to the expressions view. This will translate the expressions from the CSV format to the table format and will show any errors detected.
+3. Use the button at the top (labeled **Expressions**) to switch back to the expressions view. This will translate the expressions from the CSV format to the table format and will show any errors detected.

### Format

4 changes: 2 additions & 2 deletions docs/Spark/gems/transform/set-operation.md
@@ -1,6 +1,6 @@
---
sidebar_position: 8
-title: Set Operation
+title: SetOperation
id: set-operation
description: Union, Intersect and Difference
tags:
@@ -11,7 +11,7 @@ tags:
- difference
---

-Allows you to perform addition or subtraction of rows from DataFrames with identical schemas and different data.
+Use the SetOperation Gem to perform addition or subtraction of rows from DataFrames with identical schemas and different data.
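
The corresponding hand-written PySpark operations, for two placeholder inputs `in0` and `in1` with identical schemas:

```python
# Union keeps all rows; intersect keeps common rows; subtract removes in1's rows.
union_df     = in0.unionByName(in1)
intersect_df = in0.intersect(in1)
subtract_df  = in0.subtract(in1)
```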

### Parameters
