Implementing Python Bounded Source Reader DoFn #13154
Run Python 3.8 PostCommit
Run PythonDocker PreCommit
I'm curious: do we have a plan to build an actual SDF for BQ instead of still relying on the BoundedSource implementation?
@@ -1618,3 +1628,48 @@ def display_data(self):
        'source': DisplayDataItem(self.source.__class__, label='Read Source'),
        'source_dd': self.source
    }


class SDFBoundedSourceReader(PTransform):
It seems like the major difference between `SDFBoundedSourceWrapper` and `SDFBoundedSourceReader` is that `SDFBoundedSourceWrapper` takes the source as a construction parameter, whereas `SDFBoundedSourceReader` takes the source as an input element. We could change the implementation of `SDFBoundedSourceWrapper` as well.
I've done this, but I've still allowed the source to come in via the constructor as well as an input element. The intention is to keep the display data for simple Read transforms, where the source is known at construction time.
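A minimal sketch of the dual behaviour described in this reply, written in plain Python without Beam so that it stands alone. `ToySource` and `resolve_source` are hypothetical names used only for illustration; they are not part of `iobase.py`, and the real transform may resolve the source differently.

```python
from typing import Iterable, List, Optional


class ToySource:
    """Hypothetical stand-in for a BoundedSource: it just holds records."""
    def __init__(self, records: Iterable[int]):
        self.records: List[int] = list(records)

    def read(self) -> List[int]:
        return list(self.records)


def resolve_source(constructor_source: Optional[ToySource], element):
    """Prefer the construction-time source; else treat the element as one."""
    if constructor_source is not None:
        # Source known at construction time (keeps display data available).
        return constructor_source
    if not isinstance(element, ToySource):
        raise TypeError('expected a source element, got %r' % type(element))
    # Source arrives at runtime as the input element.
    return element
```

With a constructor source the element is ignored; without one, the element itself must be the source.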
I see. I thought we still keep `_SDFBoundedSourceWrapper`. Thanks for the clarification!
I'm wondering whether it would be better for `SDFBoundedSourceReader` to take `data_to_display` as a constructor argument, if any, instead of taking the source directly. What do you think?
In this case, we will have a simple DoFn that starts the read from BQ, but it eventually returns multiple Avro file sources that can be read individually. This is different from what we had before, where all of the BQ reading logic was part of a BoundedSource. In fact, the `_CustomBigQuerySource` will be removed eventually.
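The two-step shape described above can be sketched as a toy model in plain Python. Nothing here calls BigQuery or Beam: `ToyFileSource`, `start_export`, and `read_sources` are hypothetical names, and splitting a list of rows into chunks merely stands in for an export that produces several Avro files.

```python
from typing import Iterable, Iterator, List


class ToyFileSource:
    """Hypothetical stand-in for one exported Avro file source."""
    def __init__(self, rows: List[dict]):
        self._rows = rows

    def read(self) -> Iterator[dict]:
        return iter(self._rows)


def start_export(table_rows: List[dict],
                 rows_per_file: int) -> List[ToyFileSource]:
    """Models step one: a single read request yields several file sources."""
    return [
        ToyFileSource(table_rows[i:i + rows_per_file])
        for i in range(0, len(table_rows), rows_per_file)
    ]


def read_sources(sources: Iterable[ToyFileSource]) -> Iterator[dict]:
    """Models step two: each source arrives as an element and is read alone."""
    for source in sources:
        yield from source.read()
```

The point of the split is that the second step only needs to know how to read a source it receives as input, which is the role the reader transform plays in this PR.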
sdks/python/apache_beam/io/iobase.py
Outdated
  initializes restriction based on input element that is expected to be of
  BoundedSource type.
  """
  def __init__(self, source: BoundedSource = None, desired_chunk_size=None):
We should be able to remove `source` here?
Thanks! Changes look good to me except for some minor comments.
sdks/python/apache_beam/io/iobase.py
Outdated
  """
  A `RestrictionProvider` that is used by SDF for `BoundedSource`.

  If source is provided, uses it for initializing restriction. Otherwise
It seems like we need to update the pydoc here as well.
sdks/python/apache_beam/io/iobase.py
Outdated
    self._desired_chunk_size = desired_chunk_size

  def _check_source(self, src):
    if src is not None and not isinstance(src, BoundedSource):
The `src` cannot be `None`, right?
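To make the question concrete, here is a small sketch contrasting the guard shown in the diff with a stricter variant. It is plain Python with a stub class in place of Beam's `BoundedSource`; the function names and the error message are assumptions for illustration, not the code that was merged.

```python
class BoundedSource:
    """Hypothetical stub standing in for apache_beam.io.iobase.BoundedSource."""


def check_source_lenient(src):
    # Mirrors the condition in the diff: None passes through unchecked.
    if src is not None and not isinstance(src, BoundedSource):
        raise RuntimeError('expected BoundedSource, got %s' % type(src))


def check_source_strict(src):
    # Stricter variant: a missing source is rejected as well.
    if not isinstance(src, BoundedSource):
        raise RuntimeError('expected BoundedSource, got %s' % type(src))
```

If `src` can never be `None` at the call site, the two checks behave identically and the extra condition is dead code; if it can, the lenient form defers the failure to wherever the source is first used.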
I see. It seems like you will use
I think that's reasonable. If any improvements are made to ReadAllFromBQ, we can make sure that they are done without relying on BoundedSource then.
Run Portable_Python PreCommit
This is valuable for BigQuery repeatedly firing side input. This PR is intended to be used here: #13170
This makes the SDF Bounded Source reader available to use. A small change in functionality: if no `source` is provided to the initial restriction in the constructor, then the element is expected to be a source, and it's added to the initial restriction at creation.
r: @boyuanzz