
Merge branch 'main' into 3.4.2-release-notes
alexanderahn authored Dec 12, 2024
2 parents 08193fa + ffae0c7 commit cfdf552
Showing 160 changed files with 1,296 additions and 1,179 deletions.
2 changes: 2 additions & 0 deletions docs/SQL/gems/custom/custom.md
@@ -8,6 +8,8 @@ tags:
- sql
---

<h3><span class="badge">SQL Gem</span></h3>

:::caution
This page about Custom SQL Gems is under construction. Please pardon our dust.
:::
8 changes: 5 additions & 3 deletions docs/SQL/gems/gems.md
@@ -1,5 +1,5 @@
---
title: Gems
title: SQL Gems
id: sql-gems
description: Gems are data seeds, sources, transformations, and targets
sidebar_position: 2
@@ -11,9 +11,11 @@ tags:
- cte
---

In Prophecy and dbt, Data [Models](/docs/concepts/project/models.md) are SQL statements that build a single table or view. Prophecy visualizes Data Models to illustrate the many steps needed to generate a single table or view. Gems represent the individual steps. A Gem is a unit of functionality ranging from reading, transforming, writing, and various other ad-hoc operations on data.
In Prophecy and dbt, data [models](/docs/concepts/project/models.md) are groups of SQL statements used to create a single table or view. Prophecy simplifies data modeling by visualizing the data model as a series of steps, each represented by a [Gem](/docs/concepts/project/gems.md). Gems are functional units that perform tasks such as reading, transforming, writing, or handling other data operations.

Each Gem represents a SQL statement, and allows users to construct that statement by configuring a visual interface. Prophecy is smart about whether to construct a CTE or subquery for each Gem; users just configure the visual interface, and Prophecy includes the Gem's SQL statement as part of the Model. Here is a nice [overview](/docs/concepts/project/gems.md) of all the aspects of the Gem user interface. The table below outlines each Gem category:
Each Gem corresponds to a SQL statement, which users can construct through an intuitive visual interface. Prophecy handles the underlying complexity by deciding whether each Gem should generate a CTE or a subquery. Users simply configure the Gem's interface, and Prophecy integrates the resulting SQL into the larger data model.
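
For a rough sense of what this looks like in generated code, consider a single filter step. Depending on context, Prophecy might emit it as a CTE along these lines (an illustrative sketch only; the `customers` table and its columns are hypothetical):

```sql
-- Illustrative only: how a single "filter" step might be emitted as a CTE.
-- The customers table and its columns are hypothetical.
WITH filtered_customers AS (
    SELECT
        customer_id,
        country
    FROM customers           -- output of the upstream source Gem
    WHERE country = 'US'     -- condition configured in the Gem's visual interface
)

SELECT *
FROM filtered_customers
```

The same step could just as well be inlined as a subquery; either way, you only configure the Gem, and Prophecy decides which form to generate when it compiles the model.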

The table below outlines the different SQL Gem categories.

<div class="gems-table">

8 changes: 5 additions & 3 deletions docs/SQL/gems/joins.md
@@ -1,5 +1,5 @@
---
title: Joins
title: Join
id: data-joins
description: Join data from multiple tables
sidebar_position: 3
@@ -10,7 +10,9 @@ tags:
- transformation
---

Upon opening the join Gem, you can see a pop-up which provides several helpful features.
<h3><span class="badge">SQL Gem</span></h3>

Upon opening the Join Gem, you can see a pop-up which provides several helpful features.

![Join definition](img/JoinCondition.png)

@@ -20,7 +22,7 @@ To fill-in our **(5) Join condition** within the **(4) Conditions** section, sta

When you’re writing your join conditions, you’ll see available functions and columns to speed up your development. When the autocomplete appears, press ↑, ↓ to navigate between the suggestions and press tab to accept the suggestion.

Select the **(6)Join Type** according to the provider, eg [Databricks](https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-qry-select-join.html) or [Snowflake.](https://docs.snowflake.com/en/user-guide/querying-joins)
Select the **(6) Join Type** according to the provider, e.g. [Databricks](https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-qry-select-join.html) or [Snowflake](https://docs.snowflake.com/en/user-guide/querying-joins).

The **(7) Expressions** tab allows you to define the set of output columns that are going to be returned from the Gem. Here we leave it empty, which by default passes through all the input columns, from both of the joined sources, without any modifications.
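
Putting these pieces together, a configured Join Gem compiles down to an ordinary SQL join. A hedged sketch, assuming hypothetical `orders` and `customers` sources and a handful of columns listed for readability:

```sql
-- Illustrative sketch of the statement a Join Gem might generate.
-- The orders and customers tables, their columns, and the join key are hypothetical.
SELECT
    orders.order_id,
    orders.amount,
    customers.customer_name
FROM orders
LEFT JOIN customers                                -- the selected Join Type
    ON orders.customer_id = customers.customer_id  -- the configured Join condition
```

With the Expressions tab left empty, the generated select list would instead pass through every column from both inputs.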

4 changes: 3 additions & 1 deletion docs/SQL/gems/subgraph/subgraph.md
@@ -8,7 +8,9 @@ tags:
- SQL
---

Subgraph allows you to take multiple distinct Gems and wrap them under a single parent Gem. Doing so can help you decompose complex logic into more manageable components and simplify the Visual view of your model.
<h3><span class="badge">SQL Gem</span></h3>

Subgraph Gems let you take multiple different Gems and wrap them under a single reusable parent Gem. In other words, they allow you to decompose complex logic into reusable components and simplify the visual view of your data model.

## Basic Subgraph

2 changes: 2 additions & 0 deletions docs/SQL/gems/transform/aggregate.md
@@ -11,6 +11,8 @@ tags:
- transformation
---

<h3><span class="badge">SQL Gem</span></h3>

Together let's deconstruct a commonly used Transformation, the Aggregate Gem. Follow along in the `HelloWorld_SQL` Project.
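
Before opening the Gem, it helps to keep in mind the shape of SQL it produces: a `GROUP BY` statement. A rough sketch, using a hypothetical `orders` table rather than the actual HelloWorld_SQL data:

```sql
-- Illustrative only: the kind of GROUP BY an Aggregate Gem boils down to.
-- The orders table and its columns are hypothetical, not the HelloWorld_SQL data.
SELECT
    customer_id,                    -- group-by column configured in the Gem
    COUNT(*)    AS order_count,     -- aggregate expressions configured in the Gem
    SUM(amount) AS total_spend
FROM orders
GROUP BY customer_id
```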

## Using the Gem
2 changes: 2 additions & 0 deletions docs/SQL/gems/transform/deduplicate.md
@@ -10,6 +10,8 @@ tags:
- unique
---

<h3><span class="badge">SQL Gem</span></h3>

Removes rows with duplicate values of specified columns.
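
In SQL terms, the effect is roughly a "keep one row per combination of the chosen columns" query. A minimal sketch, assuming a hypothetical `customers` table and tie-breaking column:

```sql
-- Illustrative sketch of deduplication on a chosen set of columns.
-- The customers table, its columns, and the ORDER BY tie-breaker are hypothetical.
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id, email    -- columns to deduplicate on
            ORDER BY updated_at DESC           -- which duplicate row to keep
        ) AS row_num
    FROM customers
)

SELECT *
FROM ranked
WHERE row_num = 1
```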

## Parameters
2 changes: 2 additions & 0 deletions docs/SQL/gems/transform/flattenschema.md
@@ -10,6 +10,8 @@ tags:
- flatten
---

<h3><span class="badge">SQL Gem</span></h3>

When processing raw data it can be useful to flatten complex data types like `Struct`s and `Array`s into simpler, flatter schemas. This allows you to preserve all schemas, and not just the first one. You can use FlattenSchema with Snowflake Models.
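
To give a sense of the underlying operation, flattening one nested array by hand in Snowflake looks roughly like this (a sketch only; the `events` table, its `payload` column, and the field names are hypothetical):

```sql
-- Illustrative sketch of flattening an array held in a variant column (Snowflake).
-- The events table, payload column, and field names are hypothetical.
SELECT
    e.event_id,
    item.value:name::string  AS item_name,    -- pull fields out of each array element
    item.value:price::number AS item_price
FROM events AS e,
    LATERAL FLATTEN(input => e.payload:items) AS item
```

FlattenSchema spares you from writing these expressions by hand for every nested field.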

![The FlattenSchema gem](./img/flatten_gem.png)
Binary file added docs/Spark/extensibility/img/add-function.png
Binary file added docs/Spark/extensibility/img/call-function.png
Binary file added docs/Spark/extensibility/img/define-function.png
56 changes: 21 additions & 35 deletions docs/Spark/extensibility/user-defined-functions.md
@@ -9,46 +9,32 @@ tags:
- udafs
---

Allows you to create user defined functions (UDF) which are then usable anywhere in the Pipeline
Prophecy lets you create user-defined functions (UDFs) which can be used anywhere in the Pipeline.

### Parameters
## Parameters

| Parameter | Description | Required |
| :---------------------- | :--------------------------------------------------------------------------------------------------------------------------------------- | :------- |
| UDF Name | Name of the UDF to be used to register it. All calls to the UDF will use this name | True |
| Definition | Definition of the UDF function. <br/> Eg: `udf((value:Int)=>value*value)` | True |
| UDF initialization code | Code block that contains initialization of entities used by UDFs. This could for example contain any static mapping that a UDF might use | False |
| Parameter | Description | Required |
| :---------------------- | :------------------------------------------------------------------------------------------------------------------------------------------ | :------- |
| Function name | The name of the function as it appears in your project. | True |
| UDF Name | The name of the UDF that will register it. All calls to the UDF will use this name. | True |
| Definition | Definition of the UDF function. <br/> For example, `udf((value:Int)=>value*value)` | True |
| UDF initialization code | Code block that contains initialization of entities used by UDFs. This could, for example, contain any static mapping that a UDF might use. | False |

### Examples
## Steps

---
There are a few steps to take to create and use a new UDF.

#### Defining and Using UDF

```mdx-code-block
import App from '@site/src/components/slider';
export const ImageData = [
{
"image":"/img/udf/1.png",
"description":<h3 style={{padding:'10px'}}>Step 1 - Open UDF definition window</h3>,
},
{
"image":"/img/udf/2.1.png",
"description":<h3 style={{padding:'10px'}}>Step 2 (Python)- Define Python UDF</h3>,
},
{
"image":"/img/udf/2.2.png",
"description":<h3 style={{padding:'10px'}}> Step 2 (Scala) - Define Scala UDf</h3>
},
{
"image":"/img/udf/3.png",
"description":<h3 style={{padding:'10px'}}>Step 3 - UDFs can now be called by their defined names</h3>,
},
];
<App ImageData={ImageData}></App>
```
1. Create a new function. You can find the **Functions** section in the left sidebar of a project page.

![Add a function to the pipeline](img/add-function.png)

2. Define the function.

![Define the function](img/define-function.png)

3. Call the function.

![Call the function](img/call-function.png)

````mdx-code-block
import Tabs from '@theme/Tabs';
6 changes: 6 additions & 0 deletions docs/Spark/fabrics/dataproc/_category_.json
@@ -0,0 +1,6 @@
{
"label": "Google Cloud Dataproc",
"position": 8,
"collapsible": true,
"collapsed": true
}
44 changes: 44 additions & 0 deletions docs/Spark/fabrics/dataproc/dataproc-tips.md
@@ -0,0 +1,44 @@
---
title: "Connectivity Tips"
id: gcp-dataproc-fabric-tips
description: If your cluster doesn't connect, try these tips
sidebar_position: 1
tags:
- deployment
- configuration
- google
- gcp
- dataproc
- livy
---

:::tip
Sometimes the Livy Cluster cannot access the Scala or Python libraries.
:::

### Error

```
Creating new Livy Session...
Using prophecy libs path...repo1.maven.org...
Using python libraries...files.pythonhosted.org...
...
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)\n\nYARN Diagnostics: ","level":"error"
```

### Corrective Actions

**Option 1:**
Adjust network settings on the Livy Cluster to allow traffic from the Scala Prophecy Library URL
`repo1.maven.org` and the Python Prophecy Library URL
`files.pythonhosted.org`.

**Option 2:**
Configure the Scala and Python Library Paths as mentioned [here](./dataproc.md):
set the Scala Library Path to `gs://prophecy-public-gcp/prophecy-scala-libs/`
and the Python Library Path to `gs://prophecy-public-gcp/prophecy-python-libs/`.

**Option 3:**
Set up a GCS bucket internally. Create two folders as in the previous option, and add `prophecy-scala-libs` and `prophecy-python-libs` to those folders.
@@ -2,7 +2,7 @@
title: "Google Cloud Dataproc"
id: gcp-dataproc-fabric-guide
description: Configuring GCP Dataproc Fabric
sidebar_position: 7
sidebar_position: 8
tags:
- deployment
- configuration
@@ -26,7 +26,7 @@ Livy is required for the Fabric. Prophecy provides a script required to deploy a

1. If you don't already have a private key, create a private key for the service account that you're using.
<br/><br/>
<img src={require('./img/createkey.png').default} alt="dataproc security" width="75%" />
<img src={require('./../img/createkey.png').default} alt="dataproc security" width="75%" />
<br/><br/>
2. Ensure you have the following permissions configured.

@@ -79,35 +79,42 @@ gcloud config set account [email protected]

1. Create a Fabric and select **Dataproc**.
<br/><br/>
<img src={require('./img/selectdataproc.png').default} alt="select dataproc" width="75%" />
<img src={require('./../img/selectdataproc.png').default} alt="select dataproc" width="75%" />
<br/><br/>
2. Fill out your **Project Name** and **Region**, and upload the **Private Key**.
<br/><br/>
<img src={require('./img/configuredataproc.png').default} alt="configure dataproc" width="75%" />
<img src={require('./../img/configuredataproc.png').default} alt="configure dataproc" width="75%" />
<br/><br/>
3. Click on **Fetch environments** and select the Dataproc **cluster** that you created earlier.
<br/><br/>
<img src={require('./img/selectenv.png').default} alt="select cluster" width="75%" />
<img src={require('./../img/selectenv.png').default} alt="select cluster" width="75%" />
<br/><br/>
4. Leave everything as default and provide the **Livy URL**. Locate the **External IP** of your cluster instance. Optionally, you may configure the DNS instead of using the IP. The URL is `http://<external-ip>:8998`.
<br/><br/>
<img src={require('./img/externalip.png').default} alt="livy ip" width="75%" />
<img src={require('./../img/externalip.png').default} alt="livy ip" width="75%" />
<br/><br/>
5. Configure the bucket associated with your cluster.
<br/><br/>
<img src={require('./img/bucketloc.png').default} alt="bucket location" width="75%" />
<img src={require('./../img/bucketloc.png').default} alt="bucket location" width="75%" />
<br/><br/>
6. Add the **Job Size**.
<br/><br/>
<img src={require('./img/procjobsize.png').default} alt="Job Size" width="55%" />
<img src={require('./../img/procjobsize.png').default} alt="Job Size" width="55%" />
<br/><br/>
7. Configure Scala Library Path.
`gs://prophecy-public-gcp/prophecy-scala-libs/`.
8. Configure Python Library Path.
`gs://prophecy-public-gcp/prophecy-python-libs/`.
<br/><br/>
<img src={require('./img/proclib.png').default} alt="dependences" width="85%" />
<img src={require('./../img/proclib.png').default} alt="dependencies" width="85%" />
<br/><br/>
9. Click on **Complete**.
<br/><br/>
Run a simple Pipeline and make sure that the interim returns data properly.

```mdx-code-block
import DocCardList from '@theme/DocCardList';
import {useCurrentSidebarCategory} from '@docusaurus/theme-common';
<DocCardList items={useCurrentSidebarCategory().items}/>
```
2 changes: 1 addition & 1 deletion docs/Spark/fabrics/diagnostics.md
@@ -2,7 +2,7 @@
title: "Diagnostics"
id: fabric-diagnostics
description: Troubleshooting Fabrics using diagnostics
sidebar_position: 8
sidebar_position: 9
tags:
- diagnostics
- fabric
101 changes: 0 additions & 101 deletions docs/Spark/fabrics/emr-fabric-serverless.md

This file was deleted.

