-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow one to bound the size of output shards when writing to files. #22130
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -68,6 +68,9 @@ def __init__( | |
shard_name_template=None, | ||
mime_type='application/octet-stream', | ||
compression_type=CompressionTypes.AUTO, | ||
*, | ||
max_records_per_shard=None, | ||
max_bytes_per_shard=None, | ||
skip_if_empty=False): | ||
""" | ||
Raises: | ||
|
@@ -108,6 +111,8 @@ def __init__( | |
shard_name_template) | ||
self.compression_type = compression_type | ||
self.mime_type = mime_type | ||
self.max_records_per_shard = max_records_per_shard | ||
self.max_bytes_per_shard = max_bytes_per_shard | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. From the implementation of write below, only one of them will take effect. Do we need to raise a warning (or info) to remind possible misuse when neither is None? Also need to document this. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nice catch. Fixed so that both take effect. |
||
self.skip_if_empty = skip_if_empty | ||
|
||
def display_data(self): | ||
|
@@ -130,7 +135,13 @@ def open(self, temp_path): | |
The returned file handle is passed to ``write_[encoded_]record`` and | ||
``close``. | ||
""" | ||
return FileSystems.create(temp_path, self.mime_type, self.compression_type) | ||
writer = FileSystems.create( | ||
temp_path, self.mime_type, self.compression_type) | ||
if self.max_bytes_per_shard: | ||
self.byte_counter = _ByteCountingWriter(writer) | ||
return self.byte_counter | ||
else: | ||
return writer | ||
|
||
def write_record(self, file_handle, value): | ||
"""Writes a single record go the file handle returned by ``open()``. | ||
|
@@ -406,10 +417,36 @@ def __init__(self, sink, temp_shard_path): | |
self.sink = sink | ||
self.temp_shard_path = temp_shard_path | ||
self.temp_handle = self.sink.open(temp_shard_path) | ||
self.num_records_written = 0 | ||
|
||
def write(self, value): | ||
self.num_records_written += 1 | ||
self.sink.write_record(self.temp_handle, value) | ||
|
||
def at_capacity(self): | ||
return ( | ||
self.sink.max_records_per_shard and | ||
self.num_records_written >= self.sink.max_records_per_shard | ||
) or ( | ||
self.sink.max_bytes_per_shard and | ||
self.sink.byte_counter.bytes_written >= self.sink.max_bytes_per_shard) | ||
|
||
def close(self): | ||
self.sink.close(self.temp_handle) | ||
return self.temp_shard_path | ||
|
||
|
||
class _ByteCountingWriter: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can bytes_written be handled also in write function as num_records_written thus no need for the wrapped class? FileBasedSink.open used to return an instance of BufferedWriter always but if use this wrapped class it now it may not. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I see, if the writer is compressed, record sends to FileBasedWriter may have different length to the record actually written and that's why a wrapped class is needed here There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Found that io.BufferedWriter.write returns the number of bytes written (https://docs.python.org/3.7/library/io.html#io.BufferedWriter.write) so bytes_written can be traced directly in FileBasedSinkWriter.write There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Unfortunately io.BufferedWriter.write returns the number of bytes written for that call, not a running total. |
||
def __init__(self, writer): | ||
self.writer = writer | ||
self.bytes_written = 0 | ||
|
||
def write(self, bs): | ||
self.bytes_written += len(bs) | ||
self.writer.write(bs) | ||
|
||
def flush(self): | ||
self.writer.flush() | ||
|
||
def close(self): | ||
self.writer.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have code style guide about the usage of asterisk in function parameters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we have any guidance here (other than that which is generic to Python, which would indicate most of these arguments should be passed by keyword).