Merge remote-tracking branch 'origin/main' into standardize-gem-name-…
…format
kathweinschenkprophecy committed Nov 27, 2024
2 parents d164450 + f258943 commit dcff75a
Showing 7 changed files with 131 additions and 0 deletions.
36 changes: 36 additions & 0 deletions docs/Spark/gems/transform/bulk-column-expressions.md
@@ -0,0 +1,36 @@
---
sidebar_position: 11
title: Bulk Column Expressions
id: bulk-column-expressions
description: Change the data type of multiple columns at once.
tags:
- gems
- type
- columns
---

The Bulk Column Expressions Gem primarily lets you cast or change the data type of multiple columns at once. It provides additional functionality, including:

- Adding a prefix or suffix to selected columns.
- Applying a custom expression to selected columns.

## Parameters

| Parameter | Description |
| -------------------------------------------- | ------------------------------------------------------------------ |
| Data Type of the columns to do operations on | The data type of columns to select |
| Selected Columns | The columns on which to apply transformations |
| Change output column name | An option to add a prefix or suffix to the selected column names |
| Change output column type | The data type that the columns will be transformed into |
| Output Expression | A Spark SQL expression that can be applied to the selected columns |

## Example

Assume you have some columns in a table that represent zero-based indices and are stored as long data types. You want them to represent one-based indices and be stored as integers to optimize memory use.

Using the Bulk Column Expressions Gem, you can:

- Filter your columns by long data types.
- Select the columns you wish to transform.
- Cast the output column(s) to be integers.
- Include `column_value + 1` in the expression field to shift the indices.
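
The steps above can be sketched in plain Python as a function that builds one Spark SQL select expression per column. This is only an illustration of the idea, not the code the Gem generates: the schema, the table name `my_table`, the helper `bulk_expressions`, and the assumption that `column_value` is replaced by each selected column's name are all hypothetical.

```python
# Hypothetical schema: (column name, Spark SQL type) pairs.
# Spark SQL refers to the long data type as "bigint".
schema = [("order_idx", "bigint"), ("item_idx", "bigint"), ("note", "string")]


def bulk_expressions(schema, source_type, target_type, template):
    """Build one Spark SQL select expression per column.

    Columns whose type equals `source_type` get the template applied and are
    cast to `target_type`; all other columns pass through unchanged.
    """
    exprs = []
    for name, dtype in schema:
        if dtype == source_type:
            # Assumption: the placeholder stands for the current column.
            body = template.replace("column_value", name)
            exprs.append(f"CAST({body} AS {target_type}) AS {name}")
        else:
            exprs.append(name)
    return exprs


exprs = bulk_expressions(schema, "bigint", "int", "column_value + 1")
query = "SELECT " + ", ".join(exprs) + " FROM my_table"
print(query)
```

Only the two `bigint` columns are shifted and cast; the string column is left as it is.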
33 changes: 33 additions & 0 deletions docs/Spark/gems/transform/bulk-column-rename.md
@@ -0,0 +1,33 @@
---
sidebar_position: 10
title: Bulk Column Rename
id: bulk-column-rename
description: Rename multiple columns in your Dataset in a systematic way.
tags:
- gems
- rename
- columns
---

Use the Bulk Column Rename Gem to rename multiple columns in your Dataset in a systematic way.

## Parameters

| Parameter | Description |
| ----------------- | ---------------------------------------------------------------------------------------- |
| Columns to rename | Select one or more columns to rename from the dropdown. |
| Method | Choose to add a prefix, add a suffix, or use a custom expression to change column names. |

Based on the method you select, you will see an option to enter the prefix, suffix, or expression of your choice.

## Examples

### Add a prefix

For example, you can add the prefix `meta_` to mark the columns that contain metadata.

![Add prefix to multiple columns](./img/bulk-add-prefix.png)

### Use a custom expression

You can accomplish the same or more complex changes using a custom expression like `concat('meta_', column_name)`.
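
As a rough sketch of what a prefix rename amounts to, the plain-Python snippet below maps each column to its output name and builds the matching select list. The column names and the helper `rename_with_prefix` are hypothetical, and this is not the code the Gem produces.

```python
def rename_with_prefix(columns, selected, prefix):
    """Map every column name to its output name, prefixing the selected ones."""
    return {c: (prefix + c if c in selected else c) for c in columns}


columns = ["id", "source", "ingested_at"]  # hypothetical column names
renamed = rename_with_prefix(columns, {"source", "ingested_at"}, "meta_")

# Turn the mapping into a select list; unchanged columns need no alias.
select_list = [
    old if old == new else f"{old} AS {new}" for old, new in renamed.items()
]
print(", ".join(select_list))
```

Columns that are not selected keep their original names, so the rename only touches the ones you choose.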
27 changes: 27 additions & 0 deletions docs/Spark/gems/transform/data-cleansing.md
@@ -0,0 +1,27 @@
---
sidebar_position: 12
title: Data Cleansing
id: data-cleansing
description: Standardize data formats and address missing or null values in the data.
tags:
- gems
- clean
- format
---

Use the Data Cleansing Gem to standardize data formats and address missing or null values in the data.

## Parameters

| Parameter | Description |
| -------------------------------- | --------------------------------------------------------------- |
| Select columns you want to clean | The set of columns on which to perform cleaning transformations |
| Remove null data | The method used to remove null data |
| Replace null values in column | The method used to replace null values |
| Clean data | Different ways to standardize the format of data in columns |

## Example

Assume you have a table that includes customer feedback on individual orders. In this scenario, some customers may not provide feedback, resulting in null values in the data. You can use the Data Cleansing Gem to replace null values with the string `NA`.

![Replace null with string](./img/replace-null-with-string.png)
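
To make the effect concrete, here is a minimal plain-Python sketch of the same replacement on a few in-memory rows. The rows, the column names, and the helper `replace_nulls` are hypothetical. In Spark SQL, a comparable result can be written as `coalesce(feedback, 'NA')`; whether the Gem uses that function internally is not stated here.

```python
def replace_nulls(rows, columns, default):
    """Return copies of `rows` with None replaced by `default` in `columns`."""
    return [
        {k: (default if k in columns and v is None else v) for k, v in row.items()}
        for row in rows
    ]


# Hypothetical order feedback data; None stands in for a null value.
orders = [
    {"order_id": 1, "feedback": "Great"},
    {"order_id": 2, "feedback": None},
]
cleaned = replace_nulls(orders, {"feedback"}, "NA")
print(cleaned)
```

The input rows are not modified, and columns outside the selected set keep any nulls they contain.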
35 changes: 35 additions & 0 deletions docs/Spark/gems/transform/dynamic-select.md
@@ -0,0 +1,35 @@
---
sidebar_position: 13
title: Dynamic Select
id: dynamic-select
description: Dynamically filter columns of your dataset based on a set of conditions.
tags:
- gems
- filter
- dynamic
---

Use the Dynamic Select Gem to dynamically filter columns of your Dataset based on a set of conditions.

## Configuration

There are two ways to configure the Dynamic Select Gem.

| Configuration | Description |
| --------------------- | --------------------------------------------------------------------------------------------- |
| Select field types | Choose one or more types of columns to keep in the Dataset, such as string, decimal, or date. |
| Select via expression | Create an expression that limits the type of columns to keep in the Dataset. |

## Examples

You’ll use Dynamic Select when you want to avoid hard-coding your choice of columns. In other words, rather than define each column to keep in your Pipeline, you let the system automatically choose the columns based on certain conditions or rules.

### Remove date columns using field type

Assume you would like to remove irrelevant date and timestamp columns from your Dataset. You can do so with the **Select field types** method by selecting every field type you want to keep and leaving date and timestamp unselected.

![Keep all columns except Date and Timestamp column using the visual interface](./img/remove-date-timestamp.png)

### Remove date columns with an expression

Using the same example, you can accomplish the same task with the **Select via expression** method by entering the expression `column_type NOT IN ('date', 'timestamp')`.
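
The selection rule can be sketched in plain Python as a filter over the schema. The schema and the helper `dynamic_select` are hypothetical and only illustrate the logic, not how the Gem evaluates the expression.

```python
# Hypothetical schema: (column name, column type) pairs.
schema = [
    ("order_id", "bigint"),
    ("order_date", "date"),
    ("updated_at", "timestamp"),
    ("status", "string"),
]


def dynamic_select(schema, excluded_types):
    """Keep the names of columns whose type is not in `excluded_types`."""
    return [name for name, dtype in schema if dtype not in excluded_types]


kept = dynamic_select(schema, {"date", "timestamp"})
print(kept)
```

Because the rule depends on column types rather than names, a new date column added upstream would be dropped without any change to the configuration.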