Improve BatchElements documentation #32082

Merged (7 commits) on Aug 30, 2024
Changes from 4 commits
14 changes: 14 additions & 0 deletions sdks/python/apache_beam/transforms/util.py
@@ -802,6 +802,20 @@ class BatchElements(PTransform):
corresponding to its contents. Each batch is emitted with a timestamp at
the end of their window.

When the max_batch_duration_secs arg is provided, a stateful implementation
of BatchElements is used to batch elements across bundles. This is most
impactful in streaming applications where many bundles only contain one
element. Larger max_batch_duration_secs values can reduce the throughput of
the transform, while smaller values will improve the throughput but make it
more likely that batches are smaller than the target batch size.

As a general recommendation, start with low values (e.g. 0.005 aka 5ms) and
increase as needed to get the desired tradeoff between target batch size
and latency or throughput.

For more information on tuning parameters to this transform, see
https://beam.apache.org/documentation/patterns/batch-elements

Args:
min_batch_size: (optional) the smallest size of a batch
max_batch_size: (optional) the largest size of a batch
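
To make the tradeoff described in the added docstring concrete, here is a minimal plain-Python sketch. It is a toy model, not Beam's stateful implementation: the helper `batch_by_duration` and the simulated arrival times are assumptions made for illustration. A batch is closed either when it reaches `max_batch_size` or when the time since its first element reaches `max_batch_duration_secs`. In an actual pipeline the equivalent setting would be passed as `beam.BatchElements(max_batch_size=..., max_batch_duration_secs=0.005)`.

```python
from typing import List


def batch_by_duration(arrival_times: List[float],
                      max_batch_size: int,
                      max_batch_duration_secs: float) -> List[List[float]]:
    """Group sorted arrival times into batches (toy model, not Beam's code)."""
    batches: List[List[float]] = []
    current: List[float] = []
    for t in arrival_times:
        # Close the open batch if its time budget ran out before t arrived.
        if current and t - current[0] >= max_batch_duration_secs:
            batches.append(current)
            current = []
        current.append(t)
        # Close the batch as soon as it reaches the target size.
        if len(current) >= max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


# Hypothetical stream: one element every 2 ms, target batch size of 4.
arrivals = [i * 0.002 for i in range(10)]
short_wait = batch_by_duration(arrivals, max_batch_size=4,
                               max_batch_duration_secs=0.005)
long_wait = batch_by_duration(arrivals, max_batch_size=4,
                              max_batch_duration_secs=0.050)
print("5 ms budget: ", [len(b) for b in short_wait])
print("50 ms budget:", [len(b) for b in long_wait])
```

With the 5 ms budget the batches close before reaching the target size of 4, while the 50 ms budget lets them fill. The cost is that elements wait longer before being emitted, which is the latency side of the tradeoff the docstring describes.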