Unit Testing in Beam Blog Post #32412

Merged · 16 commits · Sep 12, 2024
Changes from 3 commits
24 changes: 12 additions & 12 deletions examples/notebooks/blog/unittests_in_beam.ipynb
@@ -4,7 +4,7 @@
"metadata": {
"colab": {
"provenance": [],
"authorship_tag": "ABX9TyMh77PihysZUlOcgZAje/i2",
"authorship_tag": "ABX9TyNKlk6MKeCAFiaFkcs9pvkB",
"include_colab_link": true
},
"kernelspec": {
@@ -85,7 +85,7 @@
"source": [
"**Example 1**\n",
"\n",
"The following example shows how we can use the `Map` construct to calculate median house value per bedroom.\n"
"The following example shows how to use the `Map` construct to calculate median house value per bedroom.\n"
],
"metadata": {
"id": "IVjBkewt1sLA"
@@ -123,7 +123,7 @@
"source": [
"**Example 2**\n",
"\n",
"The following code is an extension of example 1, but with more complex pipeline logic. Thus, you will see that the `median_house_value_per_bedroom` function is now more complex, and involves writing to various keys."
"The following code is an extension of example 1, but with more complex pipeline logic. The `median_house_value_per_bedroom` function is now more complex, and involves writing to various keys."
],
"metadata": {
"id": "Mh3nZZ1_12sX"
@@ -133,7 +133,7 @@
"cell_type": "code",
"source": [
"import random\n",
"# The following code computes the median house value per bedroom\n",
"# The following code computes the median house value per bedroom.\n",
"counter=-1 #define a counter globally\n",
"\n",
"\n",
@@ -186,7 +186,7 @@
" | beam.Map(multiply_by_factor)\n",
" | beam.CombinePerKey(sum))\n",
"\n",
"# Define a new class that inherits from beam.PTransform\n",
"# Define a new class that inherits from beam.PTransform.\n",
"class MapAndCombineTransform(beam.PTransform):\n",
" def expand(self, pcoll):\n",
" return transform_data_set(pcoll)\n",
@@ -254,7 +254,7 @@
"source": [
"**Example 3**\n",
"\n",
"This `DoFn` (and corresponding pipeline) is used to convey a situation in which a `DoFn` makes an API call. Note that an error is raised here if the length of the API response (returned_record) is less than length 10."
"This `DoFn` and the corresponding pipeline demonstrate a `DoFn` making an API call. An error occurs if the length of the API response (`returned_record`) is less than the length `10`."
],
"metadata": {
"id": "Z8__izORM3r8"
@@ -282,7 +282,7 @@
{
"cell_type": "markdown",
"source": [
"**Note:** The following cell may take about 2 minutes to run"
"**Note:** The following cell can take about 2 minutes to run"
],
"metadata": {
"id": "3tGnPucbzmEx"
@@ -291,7 +291,7 @@
{
"cell_type": "code",
"source": [
"#The following packages are used to run the example pipelines\n",
"# The following packages are used to run the example pipelines.\n",
"from apache_beam.options.pipeline_options import PipelineOptions\n",
"\n",
"class MyDoFn(beam.DoFn):\n",
@@ -320,7 +320,7 @@
"source": [
"**Mocking Example**\n",
"\n",
"The following blocks of code illustrate how we can mock an API response, to test out the error message we've written. Note that we can use mocking to avoid making the actual API call in our test."
"To test the error message, mock an API response, as demonstrated in the following blocks of code. Use mocking to avoid making the actual API call in the test."
],
"metadata": {
"id": "58GVMyMa2PwE"
@@ -329,7 +329,7 @@
{
"cell_type": "code",
"source": [
"!pip install mock # Install the 'mock' module"
"!pip install mock # Install the 'mock' module."
],
"metadata": {
"id": "ESclJ_G-6JcW"
@@ -340,13 +340,13 @@
{
"cell_type": "code",
"source": [
"# We import the mock package for mocking functionality.\n",
"# Import the mock package for mocking functionality.\n",
"from unittest.mock import Mock,patch\n",
"# from MyApiCall import get_data\n",
"import mock\n",
"\n",
"\n",
"# MyApiCall is a function that calls get_data to fetch some data via an API call.\n",
"# MyApiCall is a function that calls get_data to fetch some data by using an API call.\n",
"@patch('MyApiCall.get_data')\n",
"def test_error_message_wrong_length(self, mock_get_data):\n",
" response = ['field1','field2']\n",
61 changes: 31 additions & 30 deletions website/www/site/content/en/blog/unit-testing-in-beam.md
@@ -21,28 +21,28 @@ limitations under the License.
-->

Testing remains one of the most fundamental components of software engineering. In this blog post, we shed light on some of the constructs that Apache Beam provides to allow for testing.
We cover an opinionated set of best practices to write unit tests for your data pipeline in this post. Note that this post does not include integration tests, and those should be authored separately.
All snippets in this post are included in [this notebook](https://github.com/apache/beam/blob/master/examples/notebooks/blog/unittests_in_beam.ipynb). Additionally, please take a look at the [Beam starter projects](https://beam.apache.org/blog/beam-starter-projects/), as these contain tests that exhibit best practices.
We cover an opinionated set of best practices to write unit tests for your data pipeline. This post doesn't include integration tests, and you need to author those separately.
All snippets in this post are included in [this notebook](https://github.com/apache/beam/blob/master/examples/notebooks/blog/unittests_in_beam.ipynb). Additionally, look at the [Beam starter projects](https://beam.apache.org/blog/beam-starter-projects/), as these contain tests that exhibit best practices.

## Best practices

When testing Beam pipelines, we recommend the following best practices:

####When testing Beam pipelines, we recommend the following best practices:
1) You don’t need to write any unit tests for the already supported connectors in the Beam Library, such as `ReadFromBigQuery` and `WriteToText`. These connectors are already tested in Beam’s test suite to ensure correct functionality. They add unnecessary cost and dependencies to a unit test.

1) You don’t need to write any unit tests for the already supported connectors in the Beam Library, such as `ReadFromBigQuery` and `WriteToText`. These connectors are already tested in Beam’s test suite to ensure correct functionality and add unnecessary cost and dependencies to a unit test.

2) You should ensure your function is well tested when using it with `Map`, `FlatMap`, or `Filter`. You can assume your function will work as intended when using `Map(your_function)`.
2) Ensure that your function is well tested when using it with `Map`, `FlatMap`, or `Filter`. You can assume your function will work as intended when using `Map(your_function)`.
3) For more complex transforms such as `ParDo`s, side inputs, and timestamp inspection, treat the entire transform as a unit, and test it.
4) If needed, use mocking to simulate any API calls that might be present in your DoFn. The purpose of mocking is to test your functionality extensively, even if this testing requires a specific response from an API call.

1) Be sure to modularize your API calls in separate functions, rather than making the API call directly in the `DoFn`. This will allow for a cleaner experience when mocking the external API calls.
1) Be sure to modularize your API calls in separate functions, rather than making the API call directly in the `DoFn`. This step provides a cleaner experience when mocking the external API calls, as sketched after this list.
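As a rough illustration of point 4.1, a modularized call might look like the following sketch. The names `fetch_record` and `EnrichRecordFn`, and the endpoint URL, are hypothetical, not from the post:

```python
import apache_beam as beam
import requests

def fetch_record(record_id):
    # Hypothetical helper: keeping the external call in its own function
    # lets a unit test patch fetch_record instead of intercepting HTTP.
    return requests.get(f"https://api.example.com/records/{record_id}").json()

class EnrichRecordFn(beam.DoFn):
    def process(self, element):
        # The DoFn delegates to the helper rather than calling the API inline.
        yield fetch_record(element)
```

A test can then apply `unittest.mock.patch` to `fetch_record` and never touch the network.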


###Example 1
## Example 1

Let’s use the following pipeline as an example. We do not have to write a separate unit test to test this function in the context of this pipeline, assuming the function `median_house_value_per_bedroom` is unit tested elsewhere in the code. We can trust that the Map primitive will work as expected (this illustrates point #2 from above).
Use the following pipeline as an example. You don't have to write a separate unit test to test this function in the context of this pipeline, assuming the function `median_house_value_per_bedroom` is unit tested elsewhere in the code. You can trust that the `Map` primitive works as expected (this illustrates point #2 noted previously). A sketch of such a function-level test follows the snippet.

```python
# The following code computes the median house value per bedroom
# The following code computes the median house value per bedroom.

with beam.Pipeline() as p1:
result = (
@@ -53,9 +53,9 @@ with beam.Pipeline() as p1:
)
```
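Such a function-level test needs no Beam machinery at all. A minimal sketch, assuming `median_house_value_per_bedroom` takes a CSV row and divides the house-value field by the bedrooms field, as the lambda variant later in this post suggests:

```python
import unittest

def median_house_value_per_bedroom(row):
    # Assumed implementation for this sketch, mirroring the lambda version:
    # field 8 holds the house value and field 4 the bedroom count.
    fields = row.strip().split(',')
    return float(fields[8]) / float(fields[4])

class TestMedianHouseValuePerBedroom(unittest.TestCase):
    def test_computes_ratio(self):
        row = "a,b,c,d,2.0,f,g,h,300000.0"  # made-up row with 9 fields
        self.assertAlmostEqual(median_house_value_per_bedroom(row), 150000.0)

if __name__ == "__main__":
    unittest.main()
```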

###Example 2
## Example 2

Now let’s use the following function as our example. The functions `median_house_value_per_bedroom`, and `multiply_by_factor` are tested elsewhere, but the pipeline as a whole (which consists of composite transforms) is not.
Use the following function as the example. The functions `median_house_value_per_bedroom` and `multiply_by_factor` are tested elsewhere, but the pipeline as a whole, which consists of composite transforms, is not.

```python
with beam.Pipeline() as p2:
@@ -69,7 +69,7 @@ with beam.Pipeline() as p2:
)
```

The best practice for the above is to create a transform with all functions between the `ReadFromText` and `WriteToText`.This will separate the transformation logic from the IOs, allowing us to unit-test the transformation logic. The following is a refactoring of the code above:
The best practice for the previous code is to create a transform with all functions between `ReadFromText` and `WriteToText`. This step separates the transformation logic from the I/Os, allowing you to unit-test the transformation logic. The following example is a refactoring of the previous code:

```python
def transform_data_set(pcoll):
@@ -78,7 +78,7 @@ def transform_data_set(pcoll):
| beam.Map(multiply_by_factor)
| beam.CombinePerKey(sum))

# Define a new class that inherits from beam.PTransform
# Define a new class that inherits from beam.PTransform.
class MapAndCombineTransform(beam.PTransform):
def expand(self, pcoll):
return transform_data_set(pcoll)
@@ -92,7 +92,7 @@ with beam.Pipeline() as p2:
)
```

Here is the corresponding unit test for the above example:
This code shows the corresponding unit test for the previous example:

```python
import unittest
@@ -121,15 +121,15 @@ class TestBeam(unittest.TestCase):
assert_that(result,equal_to(expected))
```
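The diff truncates the body of `TestBeam`. The full pattern usually has the following shape; the input rows and expected output below are placeholders for illustration, not values from the PR:

```python
import unittest
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to

class TestBeam(unittest.TestCase):
    def test_map_and_combine_transform(self):
        # Placeholder rows and expected output: substitute values that match
        # your median_house_value_per_bedroom logic.
        input_rows = ["<csv row 1>", "<csv row 2>"]
        expected = [(1, 10.0)]
        with TestPipeline() as p:
            result = (
                p
                | beam.Create(input_rows)
                | MapAndCombineTransform()  # composite defined earlier in the post
            )
            assert_that(result, equal_to(expected))
```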

###Example 3
## Example 3

Suppose we write a pipeline that reads data from a JSON file, gets passed through a custom function that makes external API calls for parsing, and is then written to a custom destination (for example, we need to do some custom data formatting to have data prepared for a downstream application).
Suppose we write a pipeline that reads data from a JSON file, passes it through a custom function that makes external API calls for parsing, and then writes it to a custom destination (for example, when we need custom data formatting to prepare data for a downstream application).


The pipeline is structured as follows:
The pipeline has the following structure:

```python
#The following packages are used to run the example pipelines
# The following packages are used to run the example pipelines.

import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText
@@ -151,19 +151,19 @@ with beam.Pipeline() as p3:
)
```
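The diff truncates the `DoFn` body. Reconstructed from the description (raise an error if the API response is shorter than length 10), it plausibly looks like the following sketch; the error type and message are assumptions, and `beam` and `MyApiCall` are assumed to be imported by the surrounding pipeline code:

```python
class MyDoFn(beam.DoFn):
    def process(self, element):
        returned_record = MyApiCall.get_data(element)  # external API call
        if len(returned_record) < 10:
            # Assumed error type and wording, per the post's description.
            raise ValueError("API response is shorter than the expected length 10")
        yield returned_record
```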

This test checks that if the API response is a record of the wrong length, we throw the expected error.
This test checks that when the API response is a record of the wrong length, the pipeline raises the expected error.

```python
!pip install mock # Install the 'mock' module
!pip install mock # Install the 'mock' module.
```
```python
# We import the mock package for mocking functionality.
# Import the mock package for mocking functionality.
from unittest.mock import Mock,patch
# from MyApiCall import get_data
import mock


# MyApiCall is a function that calls get_data to fetch some data via an API call.
# MyApiCall is a function that calls get_data to fetch some data by using an API call.
@patch('MyApiCall.get_data')
def test_error_message_wrong_length(self, mock_get_data):
response = ['field1','field2']
@@ -178,20 +178,21 @@ def test_error_message_wrong_length(self, mock_get_data):
result
```
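The diff also cuts this test short. A fuller sketch of the mocking pattern, assuming an importable `MyApiCall` module and a hypothetical `run_pipeline()` helper that builds and runs the pipeline containing `MyDoFn`:

```python
import unittest
from unittest.mock import patch

class TestMyDoFn(unittest.TestCase):
    @patch('MyApiCall.get_data')
    def test_error_message_wrong_length(self, mock_get_data):
        # The mock returns a record shorter than 10 fields, so the DoFn
        # should raise without any real API traffic.
        mock_get_data.return_value = ['field1', 'field2']
        with self.assertRaises(ValueError):
            run_pipeline()  # hypothetical helper wrapping the p3 pipeline
```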

###The following cover other testing best practices:
## Other testing best practices:

1) Test all error messages you raise.
2) Cover any edge cases that might be present in your data.
3) Notice that in example 1, we could have written the `beam.Map` step with lambda functions:
1) Test all error messages that you raise (see the sketch after this list).
2) Cover any edge cases that might exist in your data.
3) In example 1, the `beam.Map` step could have been written with lambda functions instead of with `beam.Map(median_house_value_per_bedroom)`:

```
beam.Map(lambda x: x.strip().split(',')) | beam.Map(lambda x: float(x[8])/float(x[4]))
```

, instead of `beam.Map(median_house_value_per_bedroom)`. The latter (separating lambdas into a helper function) is the recommended approach for more testable code, as changes to the function would be modularized.
5) Use the `assert_that` statement to ensure that PCollection values match up correctly, such as done above
Replacing the lambdas with a named helper function, as in `beam.Map(median_house_value_per_bedroom)`, is the recommended approach for more testable code, because changes to the function are modularized.

4) Use the `assert_that` statement to ensure that `PCollection` values match correctly, as in the previous example.
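
For the first point, assert on the message, not just the exception type. A small sketch with a hypothetical validator:

```python
import unittest

def validate_record(record):
    # Hypothetical function used only for this illustration.
    if len(record) < 10:
        raise ValueError(f"Expected at least 10 fields, got {len(record)}")
    return record

class TestErrorMessages(unittest.TestCase):
    def test_short_record_message(self):
        # assertRaisesRegex verifies both the error type and the message text.
        with self.assertRaisesRegex(ValueError, "at least 10 fields"):
            validate_record(['field1', 'field2'])
```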

For more pointed guidance on testing on Beam/Dataflow, see the [Google Cloud documentation](https://cloud.google.com/dataflow/docs/guides/develop-and-test-pipelines). Additionally, see some more examples of unit testing in Beam [here](https://github.com/apache/beam/blob/736cf50430b375d32093e793e1556567557614e9/sdks/python/apache_beam/ml/inference/base_test.py#L262).
For more guidance about testing on Beam and Dataflow, see the [Google Cloud documentation](https://cloud.google.com/dataflow/docs/guides/develop-and-test-pipelines). For more examples of unit testing in Beam, see [the `base_test.py` code](https://github.com/apache/beam/blob/736cf50430b375d32093e793e1556567557614e9/sdks/python/apache_beam/ml/inference/base_test.py#L262).
Contributor suggested change:
For more guidance about testing on Beam and Dataflow, see the [Google Cloud documentation](https://cloud.google.com/dataflow/docs/guides/develop-and-test-pipelines). For more examples of unit testing in Beam, see [the `base_test.py` code](https://github.com/apache/beam/blob/736cf50430b375d32093e793e1556567557614e9/sdks/python/apache_beam/ml/inference/base_test.py#L262).
For more guidance about testing on Beam and Dataflow, see the [Google Cloud documentation](https://cloud.google.com/dataflow/docs/guides/develop-and-test-pipelines). For more examples of unit testing in Beam, see [the base_test.py code](https://github.com/apache/beam/blob/736cf50430b375d32093e793e1556567557614e9/sdks/python/apache_beam/ml/inference/base_test.py#L262).

This renders awkwardly with the backticks


Special thanks to Robert Bradshaw, Danny McCormick, XQ Hu, Surjit Singh, and Rebecca Spzer, who helped refine the ideas in this post.
