Allow one to bound the size of output shards when writing to files. #22130
Conversation
Codecov Report
@@ Coverage Diff @@
## master #22130 +/- ##
==========================================
+ Coverage 73.99% 74.71% +0.71%
==========================================
Files 703 703
Lines 92936 96503 +3567
==========================================
+ Hits 68769 72102 +3333
- Misses 22901 23135 +234
Partials 1266 1266
R: @Abacn

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control
Thanks for implementing this. Since the request is about the file sink, I am thinking of keeping the scope of the change limited to FileBasedSink, and thus not touching iobase. Please see the following comments and let me know if they make sense.
@@ -68,6 +68,9 @@ def __init__(
      shard_name_template=None,
      mime_type='application/octet-stream',
      compression_type=CompressionTypes.AUTO,
      *,
Do we have a code style guide about the use of the bare asterisk in function parameters?
I don't think we have any guidance here (other than that which is generic to Python, which would indicate most of these arguments should be passed by keyword).
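To illustrate the bare-asterisk syntax being discussed: parameters declared after a lone `*` can only be passed by keyword. A minimal sketch (the function and parameter names here are hypothetical, not Beam's actual API):

```python
# Parameters after the lone `*` are keyword-only; passing them
# positionally raises a TypeError. Names are illustrative only.
def write_to_files(
    file_path_prefix,
    file_name_suffix='',
    *,  # everything below must be passed by keyword
    max_records_per_shard=None,
    max_bytes_per_shard=None):
  return (file_path_prefix, max_records_per_shard, max_bytes_per_shard)


# Keyword call works; a positional call past the `*` would raise TypeError.
print(write_to_files('out', max_records_per_shard=100))  # ('out', 100, None)
```

This makes adding new options backwards compatible, since callers cannot depend on their position.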
@@ -108,6 +111,8 @@ def __init__(
        shard_name_template)
    self.compression_type = compression_type
    self.mime_type = mime_type
    self.max_records_per_shard = max_records_per_shard
    self.max_bytes_per_shard = max_bytes_per_shard
From the implementation of write below, only one of them will take effect. Should we raise a warning (or info-level log) about possible misuse when neither is None? This also needs to be documented.
Nice catch. Fixed so that both take effect.
def close(self):
  self.sink.close(self.temp_handle)
  return self.temp_shard_path


class _ByteCountingWriter:
Can bytes_written also be tracked in the write function, like num_records_written, so the wrapper class isn't needed? FileBasedSink.open used to always return a BufferedWriter instance, but with this wrapper class it may not.
I see — if the writer is compressed, a record sent to FileBasedWriter may have a different length than what is actually written, and that's why a wrapper class is needed here.
Found that io.BufferedWriter.write returns the number of bytes written (https://docs.python.org/3.7/library/io.html#io.BufferedWriter.write), so bytes_written could be tracked directly in FileBasedSinkWriter.write.
Unfortunately io.BufferedWriter.write returns the number of bytes written for that call, not a running total.
sdks/python/apache_beam/io/iobase.py
Outdated
@@ -848,8 +848,12 @@ class Writer(object):
  See ``iobase.Sink`` for more detailed documentation about the process of
  writing to a sink.
  """
  def write(self, value):
    """Writes a value to the sink using the current writer."""
  def write(self, value) -> Optional[bool]:
Better not to change the signature of the base class.
It's backwards compatible, which is why I made it Optional. But I've moved to using at_capacity as suggested instead.
def write(self, value):
  self.sink.write_record(self.temp_handle, value)
  if self.sink.max_records_per_shard:
If write still does not return anything, we could create another method like at_capacity that returns true if the writer has reached capacity. That way, max_bytes_per_shard and max_records_per_shard can both take effect at the same time.
Done.
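A sketch of the at_capacity approach agreed on above, showing how both limits can take effect at once. The class and attribute names follow the diff excerpts, but the implementation is illustrative, not Beam's actual code:

```python
class ShardWriter:
  """Sketch of the at_capacity idea (illustrative, not Beam's code)."""
  def __init__(self, max_records_per_shard=None, max_bytes_per_shard=None):
    self.max_records_per_shard = max_records_per_shard
    self.max_bytes_per_shard = max_bytes_per_shard
    self.num_records_written = 0
    self.bytes_written = 0

  def write(self, record: bytes):
    self.num_records_written += 1
    self.bytes_written += len(record)

  def at_capacity(self):
    # Both limits are checked, so either one (or both) can trigger
    # closing the current shard and opening a new one.
    return bool(
        (self.max_records_per_shard and
         self.num_records_written >= self.max_records_per_shard) or
        (self.max_bytes_per_shard and
         self.bytes_written >= self.max_bytes_per_shard))


w = ShardWriter(max_records_per_shard=2, max_bytes_per_shard=100)
w.write(b'a')
print(w.at_capacity())  # False
w.write(b'b')
print(w.at_capacity())  # True
```

Keeping the capacity test in a separate method also leaves the base class's write signature unchanged, addressing the earlier concern.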
sdks/python/apache_beam/io/iobase.py
Outdated
@@ -1184,7 +1188,9 @@ def process(self, element, init_result):
    if self.writer is None:
      # We ignore UUID collisions here since they are extremely rare.
      self.writer = self.sink.open_writer(init_result, str(uuid.uuid4()))
    self.writer.write(element)
    if self.writer.write(element):
Always call self.writer.write and then test self.writer.at_capacity().
Done.
Thanks, I've addressed your comments. We do have to push the change down to iobase, as that's where the writers are created (and destroyed), but I think the change there should be minimal and natural.
Thanks! LGTM
Unrelated failure in apache_beam.runners.portability.fn_api_runner.fn_runner_test.FnApiRunnerTestWithGrpcAndMultiWorkers.test_pardo_large_input
Run Python PreCommit

Run Python PreCommit
Hi! Is this available for
This has not been plumbed through for
This fixes #22129 by possibly writing multiple shards per bundle.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

- `R: @username` (add a reviewer).
- Mention the appropriate issue in your description (for example: `addresses #123`), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment `fixes #<ISSUE NUMBER>` instead.
- Update `CHANGES.md` with noteworthy changes.

See the Contributor Guide for more tips on how to make the review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.