Allow one to bound the size of output shards when writing to files. #22130

Merged (2 commits) on Jul 11, 2022



@robertwb robertwb commented Jul 1, 2022

This fixes #22129 by possibly writing multiple shards per bundle.




codecov bot commented Jul 1, 2022

Codecov Report

Merging #22130 (5a66a61) into master (52e1b3f) will increase coverage by 0.71%.
The diff coverage is 96.29%.

@@            Coverage Diff             @@
##           master   #22130      +/-   ##
==========================================
+ Coverage   73.99%   74.71%   +0.71%     
==========================================
  Files         703      703              
  Lines       92936    96503    +3567     
==========================================
+ Hits        68769    72102    +3333     
- Misses      22901    23135     +234     
  Partials     1266     1266              
Flag      Coverage Δ
python    84.08% <96.29%> (+0.50%) ⬆️

Impacted Files                                           Coverage Δ
sdks/python/apache_beam/io/textio.py                     97.05% <ø> (ø)
sdks/python/apache_beam/io/filebasedsink.py              95.90% <95.45%> (-0.08%) ⬇️
sdks/python/apache_beam/io/iobase.py                     86.41% <100.00%> (+0.16%) ⬆️
...python/apache_beam/runners/worker/worker_status.py    78.26% <0.00%> (-1.45%) ⬇️
...hon/apache_beam/runners/worker/bundle_processor.py    93.54% <0.00%> (-0.13%) ⬇️
...apache_beam/runners/dataflow/internal/apiclient.py    77.28% <0.00%> (-0.12%) ⬇️
.../python/apache_beam/typehints/trivial_inference.py    96.41% <0.00%> (ø)
...thon/apache_beam/ml/inference/pytorch_inference.py    0.00% <0.00%> (ø)
...am/examples/inference/pytorch_language_modeling.py    0.00% <0.00%> (ø)
sdks/python/apache_beam/runners/common.py                88.71% <0.00%> (+0.12%) ⬆️
... and 13 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data


robertwb commented Jul 1, 2022

R: @Abacn

@apache apache deleted a comment from github-actions bot Jul 1, 2022

github-actions bot commented Jul 1, 2022

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

@Abacn Abacn left a comment

Thanks for implementing this. Since the request concerns the file sink, I am thinking of keeping the change limited to FileBasedSink so that iobase does not need to be touched. Please see the comments below and let me know if they make sense.

@@ -68,6 +68,9 @@ def __init__(
       shard_name_template=None,
       mime_type='application/octet-stream',
       compression_type=CompressionTypes.AUTO,
+      *,
Contributor

Do we have a code style guide covering the use of the asterisk in function parameters?

Contributor Author

I don't think we have any guidance here (other than that which is generic to Python, which would indicate most of these arguments should be passed by keyword).
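
For readers unfamiliar with the bare asterisk: in Python, every parameter listed after `*` can only be passed by keyword. A minimal sketch of the effect, using a hypothetical `make_sink` function as a stand-in (it is not the FileBasedSink constructor from this PR):

```python
def make_sink(file_path_prefix, *, max_records_per_shard=None,
              max_bytes_per_shard=None):
    """Hypothetical stand-in for a sink constructor (not Beam's API)."""
    return {
        'file_path_prefix': file_path_prefix,
        'max_records_per_shard': max_records_per_shard,
        'max_bytes_per_shard': max_bytes_per_shard,
    }


# Passing the limit by keyword works.
sink = make_sink('/tmp/out', max_records_per_shard=1000)

# Passing it positionally is rejected with a TypeError.
try:
    make_sink('/tmp/out', 1000)
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

Making new options keyword-only means later additions to the signature cannot silently shift the meaning of positional arguments at existing call sites.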

@@ -108,6 +111,8 @@ def __init__(
         shard_name_template)
     self.compression_type = compression_type
     self.mime_type = mime_type
+    self.max_records_per_shard = max_records_per_shard
+    self.max_bytes_per_shard = max_bytes_per_shard
Contributor

Judging from the implementation of write below, only one of them takes effect. Should we log a warning (or an info message) about possible misuse when neither is None? This also needs to be documented.

Contributor Author

Nice catch. Fixed so that both take effect.


   def close(self):
     self.sink.close(self.temp_handle)
     return self.temp_shard_path


+class _ByteCountingWriter:
Contributor

Can bytes_written also be tracked in the write function, like num_records_written, so that the wrapper class is not needed? FileBasedSink.open always used to return an instance of BufferedWriter, but with this wrapper class it may not.

Contributor

I see: if the writer is compressed, a record sent to FileBasedSinkWriter may differ in length from the bytes actually written, which is why a wrapper class is needed here.

Contributor

I found that io.BufferedWriter.write returns the number of bytes written (https://docs.python.org/3.7/library/io.html#io.BufferedWriter.write), so bytes_written can be tracked directly in FileBasedSinkWriter.write.

Contributor Author

Unfortunately io.BufferedWriter.write returns the number of bytes written for that call, not a running total.
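
To make the point concrete, here is a minimal sketch of a wrapper that accumulates the per-call return values into a running total. The class name `ByteCountingWriterSketch` is hypothetical, and this is not the `_ByteCountingWriter` from the PR, only an illustration of the idea:

```python
import io


class ByteCountingWriterSketch:
    """Wraps a binary file-like object and keeps a running byte total.

    Hypothetical illustration; not the _ByteCountingWriter from the PR.
    """
    def __init__(self, writer):
        self.writer = writer
        self.bytes_written = 0

    def write(self, data):
        # write() reports only the bytes accepted by this one call...
        n = self.writer.write(data)
        # ...so the running total has to be accumulated here.
        self.bytes_written += n
        return n

    def flush(self):
        self.writer.flush()

    def close(self):
        self.writer.close()


raw = io.BytesIO()
counting = ByteCountingWriterSketch(io.BufferedWriter(raw))
first = counting.write(b'hello\n')
second = counting.write(b'hi\n')
counting.flush()
```

The total reflects whatever is handed to the wrapped stream, so where such a wrapper sits relative to a compression layer decides whether it counts compressed or uncompressed bytes.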

@@ -848,8 +848,12 @@ class Writer(object):
   See ``iobase.Sink`` for more detailed documentation about the process of
   writing to a sink.
   """
-  def write(self, value):
-    """Writes a value to the sink using the current writer."""
+  def write(self, value) -> Optional[bool]:
Contributor

Better not to change the signature of the base class.

Contributor Author

It's backwards compatible, which is why I made it Optional. But I've moved to using at_capacity as suggested instead.


   def write(self, value):
     self.sink.write_record(self.temp_handle, value)
+    if self.sink.max_records_per_shard:
Contributor

If write still returns nothing, we could add another method such as "at_capacity" that returns true once the writer has reached capacity. That way max_bytes_per_shard and max_records_per_shard can both take effect at the same time.

Contributor Author

Done.
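
The suggestion in this thread could look roughly like the sketch below, in which write keeps returning nothing and a separate at_capacity method checks both limits. `ShardWriterSketch` and its byte accounting via `len(record)` are assumptions made for illustration; the merged code may differ:

```python
class ShardWriterSketch:
    """Hypothetical writer tracking records and bytes for one shard."""
    def __init__(self, max_records_per_shard=None, max_bytes_per_shard=None):
        self.max_records_per_shard = max_records_per_shard
        self.max_bytes_per_shard = max_bytes_per_shard
        self.num_records_written = 0
        self.bytes_written = 0

    def write(self, record: bytes) -> None:
        # write() keeps its original contract and returns nothing.
        self.num_records_written += 1
        self.bytes_written += len(record)

    def at_capacity(self) -> bool:
        # Either limit, when set, is enough to mark the shard as full.
        return bool(
            (self.max_records_per_shard and
             self.num_records_written >= self.max_records_per_shard) or
            (self.max_bytes_per_shard and
             self.bytes_written >= self.max_bytes_per_shard))
```

Because the check is a separate method, the base class signature of write is untouched and a limit left as None is simply ignored.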

@@ -1184,7 +1188,9 @@ def process(self, element, init_result):
     if self.writer is None:
       # We ignore UUID collisions here since they are extremely rare.
       self.writer = self.sink.open_writer(init_result, str(uuid.uuid4()))
-    self.writer.write(element)
+    if self.writer.write(element):
Contributor

Always call self.writer.write, and then test self.writer.at_capacity.

Contributor Author

Done.
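
As a rough sketch of the control flow discussed here (write unconditionally, then check capacity and roll over to a new shard), assuming a hypothetical in-memory `CountingWriter` and a plain function in place of the actual DoFn in iobase:

```python
import uuid


class CountingWriter:
    """Hypothetical in-memory writer that is full after `limit` records."""
    def __init__(self, name, limit):
        self.name = name
        self.limit = limit
        self.records = []

    def write(self, element):
        self.records.append(element)

    def at_capacity(self):
        return len(self.records) >= self.limit

    def close(self):
        return (self.name, list(self.records))


def write_bundle(elements, limit):
    """Writes one bundle, rolling to a new shard whenever one fills up."""
    writer = None
    closed_shards = []
    for element in elements:
        if writer is None:
            writer = CountingWriter(str(uuid.uuid4()), limit)
        # Always write first, then ask whether the shard is now full.
        writer.write(element)
        if writer.at_capacity():
            closed_shards.append(writer.close())
            writer = None
    # Close the partially filled final shard, if any.
    if writer is not None:
        closed_shards.append(writer.close())
    return closed_shards


shards = write_bundle(range(7), limit=3)
```

With seven elements and a limit of three, a single bundle produces three shards, which is the "multiple shards per bundle" behavior described in the PR summary.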

@robertwb robertwb left a comment

Thanks. I've addressed your comments. We do have to push the change down to iobase as that's where the writers are created (and destroyed) but I think the change there should be minimal and natural.


@Abacn Abacn left a comment

Thanks! LGTM


robertwb commented Jul 8, 2022

Unrelated failure in apache_beam.runners.portability.fn_api_runner.fn_runner_test.FnApiRunnerTestWithGrpcAndMultiWorkers.test_pardo_large_input


robertwb commented Jul 8, 2022

Run Python PreCommit


@robertwb robertwb merged commit abc8099 into apache:master Jul 11, 2022

lhoestq commented Oct 24, 2022

Hi! Is this available for WriteToParquet as well? I couldn't find a way to do it with Parquet files.

@robertwb
Contributor Author

This has not been plumbed through for WriteToParquet, but the underlying infrastructure should work there (see the changes to sdks/python/apache_beam/io/textio.py -- just adding arguments and passing them through). I'd be happy to help you with a pull request.
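
The "adding arguments and passing them through" step might look like the sketch below. Both classes are hypothetical stand-ins written for illustration; they are not the Beam Parquet sink or FileBasedSink, and the real constructors take more parameters:

```python
class FileSinkSketch:
    """Hypothetical base sink that already understands the shard limits."""
    def __init__(self, file_path_prefix, *, max_records_per_shard=None,
                 max_bytes_per_shard=None):
        self.file_path_prefix = file_path_prefix
        self.max_records_per_shard = max_records_per_shard
        self.max_bytes_per_shard = max_bytes_per_shard


class ParquetLikeSinkSketch(FileSinkSketch):
    """Hypothetical format-specific sink; only the plumbing is shown."""
    def __init__(self, file_path_prefix, schema, *,
                 max_records_per_shard=None, max_bytes_per_shard=None):
        # Accept the new keyword arguments and hand them to the base class.
        super().__init__(
            file_path_prefix,
            max_records_per_shard=max_records_per_shard,
            max_bytes_per_shard=max_bytes_per_shard)
        self.schema = schema


sink = ParquetLikeSinkSketch(
    '/tmp/out', schema=None, max_bytes_per_shard=1 << 20)
```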

Successfully merging this pull request may close these issues.

[Feature Request]: Allow one to bound the size of output shards when writing to files.