add back dedupe and flatten schema #431

Merged
merged 5 commits into from
Nov 26, 2024
@@ -1,5 +1,5 @@
{
"label": "Transformations",
"label": "Transform",
"position": 1,
"collapsible": true,
"collapsed": true
98 changes: 98 additions & 0 deletions docs/SQL/gems/transform/deduplicate.md
@@ -0,0 +1,98 @@
---
title: Deduplicate
id: deduplicate
description: Remove rows with duplicate values of specified columns
sidebar_position: 3
tags:
- gems
- dedupe
- distinct
- unique
---

Removes rows with duplicate values of specified columns.

## Parameters

| Parameter | Description | Required |
| :--------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------- |
| Source | Input source | True |
| Row to keep | - `Distinct Rows`: Keeps all distinct rows. This is equivalent to performing a `select distinct` operation <br/>- `Unique Only`: Keeps rows that don't have duplicates <br/>- `First`: Keeps first occurrence of the duplicate row <br/>- `Last`: Keeps last occurrence of the duplicate row <br/>Default is `Distinct Rows` | True |
| Deduplicate On Columns | Columns to consider while removing duplicate rows (not required for `Distinct Rows`) | True |

## Row to keep options

As listed in the Parameters table above, the Deduplicate Gem offers four **Row to keep** options.

![Deduplicate row to keep](./img/deduplicate_row_to_keep.png)

In the Code view, you can see that the Deduplicate Gem contains `SELECT DISTINCT *` when using the `Distinct Rows` option.

![Deduplicate code view](./img/deduplicate_code_view.png)

## Example

Suppose you're deduplicating the following table.

| First_Name | Last_Name | Type | Contact |
| :--------- | :-------- | :---- | :---------------- |
| John | Doe | phone | 123-456-7890 |
| John | Doe | phone | 123-456-7890 |
| John | Doe | phone | 123-456-7890 |
| Alice | Johnson | phone | 246-135-0987 |
| Alice | Johnson | phone | 246-135-0987 |
| Alice | Johnson | email | [email protected] |
| Alice | Johnson | email | [email protected] |
| Bob | Smith | email | [email protected] |

For `Distinct Rows`, the interim data will show the following:

| First_Name | Last_Name | Type | Contact |
| :--------- | :-------- | :---- | :---------------- |
| John | Doe | phone | 123-456-7890 |
| Alice | Johnson | phone | 246-135-0987 |
| Alice | Johnson | email | [email protected] |
| Bob | Smith | email | [email protected] |

The `First` and `Last` options work similarly to `Distinct Rows`, but they keep the first and last occurrence of the duplicate rows respectively.

For `Unique Only`, the interim data will look like the following:

| First_Name | Last_Name | Type | Contact |
| :--------- | :-------- | :---- | :------------ |
| Bob | Smith | email | [email protected] |

You'll be left with only one unique row since the rest were all duplicates.
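
The difference between `Distinct Rows` and `Unique Only` can be sketched outside the Gem with plain SQL. The following is a minimal illustration using Python's built-in `sqlite3` module, not the SQL that the Gem generates. The `contacts` table name and the `example.com` email addresses are placeholders chosen for this sketch, and the `GROUP BY ... HAVING COUNT(*) = 1` query is only one assumed way to express `Unique Only`.

```python
import sqlite3

# Sample rows mirroring the table above; the email addresses are placeholders.
ROWS = [
    ("John", "Doe", "phone", "123-456-7890"),
    ("John", "Doe", "phone", "123-456-7890"),
    ("John", "Doe", "phone", "123-456-7890"),
    ("Alice", "Johnson", "phone", "246-135-0987"),
    ("Alice", "Johnson", "phone", "246-135-0987"),
    ("Alice", "Johnson", "email", "alice@example.com"),
    ("Alice", "Johnson", "email", "alice@example.com"),
    ("Bob", "Smith", "email", "bob@example.com"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contacts (First_Name TEXT, Last_Name TEXT, Type TEXT, Contact TEXT)"
)
conn.executemany("INSERT INTO contacts VALUES (?, ?, ?, ?)", ROWS)

# Distinct Rows: keep one copy of every distinct row.
distinct_rows = conn.execute("SELECT DISTINCT * FROM contacts").fetchall()

# Unique Only: keep only the rows that occur exactly once.
unique_only = conn.execute(
    """
    SELECT First_Name, Last_Name, Type, Contact
    FROM contacts
    GROUP BY First_Name, Last_Name, Type, Contact
    HAVING COUNT(*) = 1
    """
).fetchall()

print(len(distinct_rows))  # 4
print(unique_only)         # only the Bob Smith row
```

`Distinct Rows` collapses each group of identical rows into one, while `Unique Only` discards every group that has more than one member.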

---

You can add `First_Name` and `Last_Name` to Deduplicate On Columns if you want to further deduplicate the table.

For `Distinct Rows`, the interim data will show the following:

| First_Name | Last_Name |
| :--------- | :-------- |
| John | Doe |
| Alice | Johnson |
| Bob | Smith |
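
The result above matches what a `SELECT DISTINCT` over only the chosen columns would return. Here is a small sketch with Python's `sqlite3` module, assuming that the Gem projects just the selected columns for `Distinct Rows`. The `contacts` table name is hypothetical, and the `Contact` column is left out because it does not affect the outcome.

```python
import sqlite3

# Name and type columns from the example table.
ROWS = [
    ("John", "Doe", "phone"),
    ("John", "Doe", "phone"),
    ("John", "Doe", "phone"),
    ("Alice", "Johnson", "phone"),
    ("Alice", "Johnson", "phone"),
    ("Alice", "Johnson", "email"),
    ("Alice", "Johnson", "email"),
    ("Bob", "Smith", "email"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (First_Name TEXT, Last_Name TEXT, Type TEXT)")
conn.executemany("INSERT INTO contacts VALUES (?, ?, ?)", ROWS)

# Deduplicate on First_Name and Last_Name only.
names = conn.execute(
    "SELECT DISTINCT First_Name, Last_Name FROM contacts"
).fetchall()

print(sorted(names))
```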

:::note

For `First`, `Last`, and `Unique Only`, the interim data will contain all columns, irrespective of the columns that were added.

For `First` and `Last`, the interim data will look like the following:

| First_Name | Last_Name | Type | Contact |
| :--------- | :-------- | :---- | :---------------- |
| John | Doe | phone | 123-456-7890 |
| Alice | Johnson | phone | 246-135-0987 |
| Alice | Johnson | email | [email protected] |
| Bob | Smith | email | [email protected] |

For `Unique Only`, the interim data will look like the following:

| First_Name | Last_Name | Type | Contact |
| :--------- | :-------- | :---- | :------------ |
| Bob | Smith | email | [email protected] |

:::
68 changes: 68 additions & 0 deletions docs/SQL/gems/transform/flattenschema.md
@@ -0,0 +1,68 @@
---
title: Flatten Schema
id: flattenschema
description: Flatten nested data
sidebar_position: 4
tags:
- gems
- schema
- explode
- flatten
---

When processing raw data, it can be useful to flatten complex data types like `Struct`s and `Array`s into simpler, flatter schemas. This allows you to preserve all schemas, not just the first one. You can use FlattenSchema with Snowflake Models.

![The FlattenSchema gem](./img/flatten_gem.png)

## The Input

FlattenSchema works on Snowflake sources that have nested columns that you'd like to extract into a flat schema.

For example, consider the following input schema:

![Input schema](./img/flatten_input.png)

The input data looks like this:

![Input data](./img/flatten_input_interim.png)

We want to extract the `contact` and all of the columns from the `struct`s in `content` into a flattened schema.

## The Expressions

Having added a `FlattenSchema` Gem to your Model, all you need to do is click the column names you wish to extract and they'll be added to the `Expressions` section.

:::tip

You can click to add all columns, which would make all nested leaf level values of an object visible as columns.

:::

Once a column is added, you can edit the `Output Column` for a given row to rename the column in the output.

![Adding expressions](./img/flatten_add_exp.png)

## The Output

If you check the `Output` tab in the Gem, you'll see the schema created from the selected columns.

And here's what the output data looks like:

![Output interim](./img/flatten_output_interim.png)

The nested contact information has been flattened so that you have individual rows for each content type.

## Advanced settings

If you're familiar with Snowflake's `FLATTEN` table function, you can use the advanced settings to customize the optional column arguments.

To use the advanced settings, hover over a column, and click the dropdown arrow.

![Advanced settings](./img/flatten_advanced_settings.png)

You can customize the following options:

- Path to the element: The path to the element within the variant data structure that you want to flatten.
- Flatten all elements recursively: If set to `false`, only the element mentioned in the path is expanded. If set to `true`, all sub-elements are expanded recursively. This is set to false by default.
- Preserve rows with missing fields: If set to `false`, rows with missing fields are omitted from the output. If set to `true`, rows with missing fields are generated with `null` in the key, index, and value columns. This is set to false by default.
- Datatype that needs to be flattened: The data type that you want to flatten. You can choose `Object`, `Array`, or `Both`. This is set to `Both` by default.
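
To build intuition for how the recursive, preserve-rows, and datatype options interact, here is a rough Python model of the behavior described above. It is a simplified sketch, not Snowflake's implementation and not the code the Gem generates. The function names `flatten` and `_expand` are hypothetical, the path option is omitted, and only the key, index, and value output columns are modeled.

```python
from typing import Any, Dict, Iterator, List


def _expand(value: Any, recursive: bool, mode: str) -> Iterator[Dict[str, Any]]:
    """Yield one row per element of an object (dict) or array (list)."""
    if isinstance(value, dict) and mode in ("object", "both"):
        items = [(key, None, child) for key, child in value.items()]
    elif isinstance(value, list) and mode in ("array", "both"):
        items = [(None, index, child) for index, child in enumerate(value)]
    else:
        return  # scalars, or a type excluded by `mode`, produce no rows
    for key, index, child in items:
        yield {"key": key, "index": index, "value": child}
        if recursive:
            # Expand sub-elements as well when recursion is enabled.
            yield from _expand(child, recursive, mode)


def flatten(
    value: Any, *, recursive: bool = False, outer: bool = False, mode: str = "both"
) -> List[Dict[str, Any]]:
    """Simplified model of the flatten options listed above."""
    rows = list(_expand(value, recursive, mode))
    if not rows and outer:
        # Preserve the input row with nulls in key, index, and value.
        rows.append({"key": None, "index": None, "value": None})
    return rows


# Hypothetical nested value: an array of structs.
content = [
    {"type": "phone", "value": "123-456-7890"},
    {"type": "email", "value": "bob@example.com"},
]

top_level = flatten(content)                    # one row per array element
everything = flatten(content, recursive=True)   # array elements plus their fields
preserved = flatten([], outer=True)             # a single all-null row
objects_only = flatten(content, mode="object")  # the input is an array: no rows
```

With the defaults, only the top-level array is expanded. Enabling recursion also emits a row for each field inside each struct, and the preserve-rows option keeps an otherwise empty input as a single row of nulls.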
@@ -1,6 +1,6 @@
---
title: SQL Transformations
id: sql-transformations
title: Transform
id: transform
description: Data transformation steps in SQL
sidebar_position: 1
tags:
2 changes: 1 addition & 1 deletion docs/getting-started/getting-started-sql-snowflake.md
@@ -251,7 +251,7 @@ Here we create a `customers_nations` model that’s going to enrich our customer

The `customers_nations` model is stored as a `.sql` file on Git. The table or view defined by the model is stored on the SQL warehouse, database, and schema defined in the attached Fabric.

Suggestions are provided each step of the way. If Copilot's suggestions aren't exactly what you need, just select and configure the Gems as desired. Click [here](../SQL/gems/joins.md) for details on configuring joins or [here](../SQL/gems/transformations/sql-aggregate) for aggregations.
Suggestions are provided each step of the way. If Copilot's suggestions aren't exactly what you need, just select and configure the Gems as desired. Click [here](../SQL/gems/joins.md) for details on configuring joins or [here](../SQL/gems/transform/aggregate.md) for aggregations.

### 4.5 Interactively Test
