Fix TextIO not fully reading a GCS file when decompressive transcoding happens #33384
For GCS, we determine the splittability based on whether the file meets decompressive transcoding criteria. When decompressive transcoding occurs, the size returned from metadata (gzip file size) does not match the size of the content returned (original data). In this case, we set the source to unsplittable to ensure all its content is read.
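The size mismatch described above can be reproduced locally without GCS. The following is a minimal sketch, assuming a hypothetical payload and using only Python's `gzip` module to stand in for an object stored with `Content-Encoding: gzip`; it does not call Beam or the GCS client library, and the variable names are illustrative only.

```python
import gzip

# Hypothetical payload standing in for a text file uploaded to GCS
# with Content-Encoding: gzip.
original = b"".join(b"line %d\n" % i for i in range(1000))
stored = gzip.compress(original)

# Object metadata reports the stored (gzipped) size...
metadata_size = len(stored)

# ...but under decompressive transcoding the reader receives the
# decompressed bytes, which are longer than the metadata size.
served = gzip.decompress(stored)

# A reader that treats the metadata size as the end of the file
# stops early and silently drops the tail of the content.
truncated = served[:metadata_size]
lost_bytes = len(served) - len(truncated)
```

Any reader that bounds its range by `metadata_size` sees only `truncated`, which is the kind of partial read this PR addresses.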
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #33384 +/- ##
=========================================
Coverage 57.38% 57.39%
Complexity 1475 1475
=========================================
Files 973 973
Lines 154978 154997 +19
Branches 1076 1076
=========================================
+ Hits 88939 88956 +17
- Misses 63829 63831 +2
Partials 2210 2210
@@ -945,3 +945,6 @@ def report_lineage(self, path, unused_lineage, level=None):
    Unless override by FileSystem implementations, default to no-op.
    """
    pass

  def check_splittability(self, path):
    return True
This should probably not always be true. If this is a default, perhaps it should not have a default but be abstract and we implement for various filesystems. If it is the default, comment so we understand that is why it ignores the argument.
The line has been removed in the new change.
  def check_splittability(self, path):
    try:
      file_metadata = self._gcsIO()._status(path)
      if file_metadata.get('content_encoding', None) == 'gzip':
Doesn't the content-type also have to be a particular thing in addition to the content-encoding being set to gzip?
The line has been removed in the new change.
… in gcs client lib
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment
    # object meets the criteria of decompressive transcoding
    # (https://cloud.google.com/storage/docs/transcoding).
    super().__init__(
        blob, chunk_size=chunk_size, retry=retry, raw_download=raw_download)
Will these be a stable API? It is not documented at https://cloud.google.com/python/docs/reference/storage/latest/google.cloud.storage.fileio.BlobReader
Good question, and that is a valid concern.
raw_download is included as one of the valid download parameters within the kwargs:
- https://github.com/googleapis/python-storage/blob/ca998db2db7c7b028f6fc145f1cc6b8b2c2a967b/google/cloud/storage/fileio.py#L108
- https://github.com/googleapis/python-storage/blob/ca998db2db7c7b028f6fc145f1cc6b8b2c2a967b/google/cloud/storage/fileio.py#L38
Let me ask the devs of the GCS library what their plan is for that API.
Just confirmed it is a stable API, and the devs of the library will update their doc to address that.
The change is so simple now. Nice! If the GCS client library breaks us later, then we can issue an update, but I just wanted to ask if it was going to be stable.
When decompressive transcoding occurs, the size returned from metadata (i.e. the gzipped file size) does not match the size of the content returned (i.e. the original data size). This causes data loss.
In this case, we force the source to be unsplittable to ensure all its content is read. To address this, we leverage the GCS client library's ability to retrieve raw data, even when the object meets the criteria for decompressive transcoding. By setting raw_download=True when initializing the BlobReader, we ensure the complete data is retrieved.
This change should not impact performance. The GCS client library already retrieves raw data from GCS and performs any necessary decompression client-side, mimicking the effects of server-side decompressive transcoding. Therefore, the decompression workload always occurs on the client side, which is consistent both before and after the fix.
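The effect of reading raw data can be illustrated with a small local simulation. This is only a sketch under stated assumptions: `io.BytesIO` stands in for a reader opened with `raw_download=True`, `read_all_raw` is a hypothetical helper rather than anything in Beam or the GCS client library, and decompression is done explicitly with `gzip.decompress` to represent the client-side step.

```python
import gzip
import io


def read_all_raw(raw_stream, metadata_size, chunk_size=16):
    """Read at most metadata_size bytes in chunks, like a size-bounded reader."""
    chunks = []
    remaining = metadata_size
    while remaining > 0:
        chunk = raw_stream.read(min(chunk_size, remaining))
        if not chunk:
            # The stream ended before metadata_size bytes were read.
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


# Hypothetical payload and its stored (gzipped) form.
original = b"".join(b"record %d\n" % i for i in range(500))
stored = gzip.compress(original)

# With a raw read, the stream yields the stored bytes, so the metadata
# size and the number of bytes available to read agree.
raw_bytes = read_all_raw(io.BytesIO(stored), metadata_size=len(stored))

# Decompression then happens on the client and recovers the full content.
recovered = gzip.decompress(raw_bytes)
```

Because the bound and the stream now describe the same bytes, the size-limited read consumes the whole object and nothing is dropped before decompression.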
fixes #31040