-
Notifications
You must be signed in to change notification settings - Fork 59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make column names optional for parquet tfxio #54
Make column names optional for parquet tfxio #54
Conversation
@iindyk when implementing the changes in tfx to add to the factory method, I noticed that column names are not passed, and realized that they shouldn't be required by the tfxio |
@@ -301,6 +301,34 @@ def _AssertFn(record_batch_list): | |||
record_batch_pcoll = (p | tfxio.BeamSource(batch_size=_NUM_ROWS)) | |||
beam_testing_util.assert_that(record_batch_pcoll, _AssertFn) | |||
|
|||
def testOptionalColumnNames(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, could you please also add a test when only a subset of columns is read
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iindyk happy to add one, but isn't the projected test case covering this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it does, but through project path, which shouldn't be the default when one just wants to read subset of data (e.g. it's should not be necessary to have the whole schema). Users often take examples from tests, so seeing how it can be done here would be useful imo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense, I'll add it
tfxio = ParquetTFXIO( | ||
file_pattern=self._example_file, | ||
column_names=['int_feature'], | ||
schema=_SCHEMA) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this just contain the int_feature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, I was trying to test that even when the schema has all fields, but the requested columns are a subset, the result schema and result records will have just a subset
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added another test case for this
bazel-bin
Outdated
@@ -0,0 +1 @@ | |||
/private/var/tmp/_bazel_martinbomio/17e5cf616981f27d03c506e3e9f0879d/execroot/tfx_bsl/bazel-out/darwin-opt/bin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you please revert files outside of the project
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, removed them in a65d12b
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
imported internally, under review
yup, this was merged in a3ce0b8 |
Making the column names optional, reading all fields of the dataset when columns are not specified. This is the same behaviour as the underlying BeamSource.