Skip to content

Commit

Permalink
[yaml] Add Beam YAML Blog (#30898)
Browse files Browse the repository at this point in the history
* [yaml] Add Beam YAML Blog

Signed-off-by: Jeffrey Kinard <[email protected]>

* remove whitespace

Signed-off-by: Jeffrey Kinard <[email protected]>

* remove incorrect apostrophes

Signed-off-by: Jeffrey Kinard <[email protected]>

* Update website/www/site/content/en/blog/beam-yaml-release.md

---------

Signed-off-by: Jeffrey Kinard <[email protected]>
Co-authored-by: Danny McCormick <[email protected]>
  • Loading branch information
Polber and damccorm authored Apr 11, 2024
1 parent 8ca15ef commit cc71921
Show file tree
Hide file tree
Showing 2 changed files with 321 additions and 0 deletions.
318 changes: 318 additions & 0 deletions website/www/site/content/en/blog/beam-yaml-release.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,318 @@
---
title: "Introducing Beam YAML: Apache Beam's First No-code SDK"
date: 2024-04-11 10:00:00 -0400
categories:
- blog
authors:
- jkinard

---
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

Writing a Beam pipeline can be a daunting task. Learning the Beam model, downloading dependencies for the SDK language
of choice, debugging the pipeline, and maintaining the pipeline code is a lot of overhead for users who want to write a
simple to intermediate data processing pipeline. There have been strides in making the SDK's entry points easier, but
for many, it is still a long way from being a painless process.

To address some of these issues and simplify the entry point to Beam, we have introduced a new way to specify Beam
pipelines by using configuration files rather than code. This new SDK, known as
[Beam YAML](https://beam.apache.org/documentation/sdks/yaml/), employs a declarative approach to creating
data processing pipelines using [YAML](https://yaml.org/), a widely used data serialization language.

<!--more-->

# Benefits of using Beam YAML

The primary goal of Beam YAML is to make the entry point to Beam as welcoming as possible. However, this should not
come at the expense of sacrificing the rich features that Beam offers.

Here are some of the benefits of using Beam YAML:

* **No-code development:** Allows users to develop pipelines without writing any code. This makes it easier to get
started with Beam and to develop pipelines quickly and easily.
* **Maintainability**: Configuration-based pipelines are easier to maintain than code-based pipelines. YAML format
enables clear separation of concerns, simplifying changes and updates without affecting other code sections.
* **Declarative language:** Provides a declarative language, which means that it is based on the description of the
desired outcome rather than expressing the intent through code. This makes it easy to understand the structure and
flow of a pipeline. The YAML syntax is also widely used with a rich community of resources for learning and
leveraging the YAML syntax.
* **Powerful features:** Supports a wide range of features, including a variety of data sources and sinks, turn-key
transforms, and execution parameters. This makes it possible to develop complex data processing pipelines with Beam
YAML.
* **Reusability**: Beam YAML promotes code reuse by providing a way to define and share common pipeline patterns. You
can create reusable YAML snippets or blocks that can be easily shared and reused in different pipelines. This reduces
the need to write repetitive tasks and helps maintain consistency across pipelines.
* **Extensibility**: Beam YAML offers a structure for integrating custom transformations into a pipeline, enabling
organizations to contribute or leverage a pre-existing catalog of transformations that can be seamlessly accessed
using the Beam YAML syntax across multiple pipelines. It is also possible to build third-party extensions, including
custom parsers and other tools, that do not need to depend on Beam directly.
* **Backwards Compatibility**: Beam YAML is still being actively worked on, bringing exciting new features and
capabilities, but as these features are added, backwards compatibility will be preserved. This way, once a pipeline
is written, it will continue to work despite future released versions of the SDK.

Overall, using Beam YAML provides a number of advantages. It makes pipeline development and management more efficient
and effective, enabling users to focus on the business logic and data processing tasks, rather than spending time on
low-level coding details.


# Case Study: A simple business analytics use-case

Let's take the following sample transaction data for a department store:

<table>
<tr>
<td><strong>transaction_id</strong>
</td>
<td><strong>product_name</strong>
</td>
<td><strong>category</strong>
</td>
<td><strong>price</strong>
</td>
</tr>
<tr>
<td>T0012
</td>
<td>Headphones
</td>
<td>Electronics
</td>
<td>59.99
</td>
</tr>
<tr>
<td>T5034
</td>
<td>Leather Jacket
</td>
<td>Apparel
</td>
<td>109.99
</td>
</tr>
<tr>
<td>T0024
</td>
<td>Aluminum Mug
</td>
<td>Kitchen
</td>
<td>29.99
</td>
</tr>
<tr>
<td>T0104
</td>
<td>Headphones
</td>
<td>Electronics
</td>
<td>59.99
</td>
</tr>
<tr>
<td>T0302
</td>
<td>Monitor
</td>
<td>Electronics
</td>
<td>249.99
</td>
</tr>
</table>

Now, let's say that the business wants to get a record of transactions for all purchases made in the Electronics
department for audit purposes. Assuming the records are stored as a CSV file, a Beam YAML pipeline may look something
like this:

Source code for this example can be found
[here](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/yaml/examples/simple_filter.yaml).
```yaml
pipeline:
transforms:
- type: ReadFromCsv
name: ReadInputFile
config:
path: /path/to/input.csv
- type: Filter
name: FilterWithCategory
input: ReadInputFile
config:
language: python
keep: category == "Electronics"
- type: WriteToCsv
name: WriteOutputFile
input: FilterWithCategory
config:
path: /path/to/output
```
This would leave us with the following data:
<table>
<tr>
<td><strong>transaction_id</strong>
</td>
<td><strong>product_name</strong>
</td>
<td><strong>category</strong>
</td>
<td><strong>price</strong>
</td>
</tr>
<tr>
<td>T0012
</td>
<td>Headphones
</td>
<td>Electronics
</td>
<td>59.99
</td>
</tr>
<tr>
<td>T0104
</td>
<td>Headphones
</td>
<td>Electronics
</td>
<td>59.99
</td>
</tr>
<tr>
<td>T0302
</td>
<td>Monitor
</td>
<td>Electronics
</td>
<td>249.99
</td>
</tr>
</table>
Now, let's say the business wants to determine how much of each Electronics item is being sold to ensure that the
correct number is being ordered from the supplier. Let's also assume that they want to determine the total revenue for
each item. This simple aggregation can follow the Filter from the previous example as such:
Source code for this example can be found
[here](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/yaml/examples/simple_filter_and_combine.yaml).
```yaml
pipeline:
transforms:
- type: ReadFromCsv
name: ReadInputFile
config:
path: /path/to/input.csv
- type: Filter
name: FilterWithCategory
input: ReadInputFile
config:
language: python
keep: category == "Electronics"
- type: Combine
name: CountNumberSold
input: FilterWithCategory
config:
group_by: product_name
combine:
num_sold:
value: product_name
fn: count
total_revenue:
value: price
fn: sum
- type: WriteToCsv
name: WriteOutputFile
input: CountNumberSold
config:
path: /path/to/output
```
This would leave us with the following data:
<table>
<tr>
<td><strong>product_name</strong>
</td>
<td><strong>num_sold</strong>
</td>
<td><strong>total_revenue</strong>
</td>
</tr>
<tr>
<td>Headphones
</td>
<td>2
</td>
<td>119.98
</td>
</tr>
<tr>
<td>Monitor
</td>
<td>1
</td>
<td>249.99
</td>
</tr>
</table>
While this was a relatively simple use-case, it shows the power of Beam YAML and how easy it is to go from business
use-case to a prototype data pipeline in just a few lines of YAML.
# Getting started with Beam YAML
There are several resources that have been compiled to help users get familiar with Beam YAML.
## Day Zero Notebook
<a target="_blank" href="https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
To help get started with Apache Beam, there is a Day Zero Notebook available on
[Google Colab](https://colab.sandbox.google.com/), an online Python notebook environment with a free attachable
runtime, containing some basic YAML pipeline examples.
## Documentation
The Apache Beam website provides a set of [docs](https://beam.apache.org/documentation/sdks/yaml/) that demonstrate the
current capabilities of the Beam YAML SDK. These [docs](https://beam.apache.org/documentation/sdks/yaml/) can be found
on the website and offer a comprehensive overview of the SDK's functionality.
## Examples
A catalog of examples can be found [here](https://beam.apache.org/releases/yamldoc/current/). These examples showcase
all the turnkey transforms that can be utilized in Beam YAML. There are also a number of Dataflow Cookbook examples
that can be found [here](https://github.com/GoogleCloudPlatform/dataflow-cookbook/tree/main/Python/yaml).
## Contributing
Developers who wish to help build out and add functionalities are welcome to start contributing to the effort in the
Beam YAML module found [here](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/yaml).
There is also a list of open [bugs](https://github.com/apache/beam/issues?q=is%3Aopen+is%3Aissue+label%3Ayaml) found
on the GitHub repo - now marked with the 'yaml' tag.
While Beam YAML has been marked stable as of Beam 2.52, it is still under heavy development, with new features being
added with each release. Those who wish to be part of the design decisions and give insights to how the framework is
being used are highly encouraged to join the dev mailing list as those discussions will be directed there. A link to
the dev list can be found [here](https://beam.apache.org/community/contact-us/).
3 changes: 3 additions & 0 deletions website/www/site/data/authors.yml
Original file line number Diff line number Diff line change
Expand Up @@ -278,3 +278,6 @@ namitasharma:
talat:
name: Talat Uyarer
email: [email protected]
jkinard:
name: Jeff Kinard
email: [email protected]

0 comments on commit cc71921

Please sign in to comment.