Add TFXIO for reading parquet #52

martinbomio · 2022-03-09T22:29:17Z

These changes add a TFXIO for reading parquet files.
The implementation uses a tf.schema to describe the schema of the records being read, but it could be generalized to other schema formats, such as an Avro schema, if needed.

It uses Beam's ReadFromParquetBatched to read the files into pyarrow tables, then calls pa.Table.to_batches to split each table into pa.RecordBatch objects.

google-cla · 2022-03-09T22:29:23Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

For more information, open the CLA check for this pull request.

martinbomio · 2022-03-09T22:30:31Z

tfx_bsl/tfxio/parquet_tfxio.py

+      self,
+      table: pa.Table,
+      batch_size: Optional[int] = None) -> List[pa.RecordBatch]:
+    return table.to_batches(self, max_chunksize=batch_size)


this line here throws the following exception when running tests:

Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1198, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 718, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam/runners/common.py", line 843, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "/Users/martinbomio/Spotify/Personal/tfx-bsl/tfx_bsl/tfxio/parquet_tfxio.py", line 87, in _TableToRecordBatch
    return table.to_batches(self, max_chunksize=batch_size)
  File "pyarrow/table.pxi", line 1701, in pyarrow.lib.Table.to_batches
TypeError: to_batches() got multiple values for keyword argument 'max_chunksize'

I don't think self should be passed to to_batches

🤦 of course
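
For readers following the thread, the TypeError comes from ordinary Python argument binding: the extra positional `self` lands in the `max_chunksize` slot, and the keyword then supplies the same parameter a second time. Below is a minimal sketch of that failure and the fix. It assumes a hypothetical stand-in class `FakeTable` (not part of pyarrow) whose `to_batches(self, max_chunksize=None)` signature mirrors the one in the traceback; `Converter` is likewise a made-up caller, not the code from this PR.

```python
from typing import List, Optional


class FakeTable:
    """Hypothetical stand-in for pa.Table; not part of pyarrow."""

    def __init__(self, rows: List[int]):
        self.rows = rows

    def to_batches(self, max_chunksize: Optional[int] = None) -> List[List[int]]:
        # Same shape as the real method: one optional argument after self.
        if max_chunksize is None:
            return [self.rows]
        return [self.rows[i:i + max_chunksize]
                for i in range(0, len(self.rows), max_chunksize)]


class Converter:
    """Hypothetical caller that mimics the method shown in the diff."""

    def buggy(self, table: FakeTable, batch_size: Optional[int] = None):
        # Bug: `self` is bound positionally to max_chunksize, and the keyword
        # argument then supplies max_chunksize again.
        return table.to_batches(self, max_chunksize=batch_size)

    def fixed(self, table: FakeTable, batch_size: Optional[int] = None):
        # Fix: the bound method already receives the table as its own self.
        return table.to_batches(max_chunksize=batch_size)


table = FakeTable([1, 2, 3, 4, 5])
converter = Converter()

try:
    converter.buggy(table, batch_size=2)
    error_message = None
except TypeError as err:
    error_message = str(err)

fixed_batches = converter.fixed(table, batch_size=2)
```

The exact wording of the error differs between pure Python and pyarrow's compiled method, but in both cases it is a TypeError naming `max_chunksize`.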

martinbomio · 2022-03-09T22:31:37Z

tfx_bsl/tfxio/parquet_tfxio.py

+from tfx_bsl.tfxio.tfxio import TFXIO
+
+
+class ParquetTFXIO(TFXIO):


I also have another working implementation that uses ReadFromParquet instead of ReadFromParquetBatched: it reads the rows as dicts, batches them, and finally converts each batch of dicts into a pa.RecordBatch.

Both are experimental, so there is no difference in API stability. I would prefer to keep batching on Beam's side.

sounds good, I'll keep this one then

martinbomio · 2022-03-09T22:56:38Z

@iindyk

martinbomio · 2022-03-09T23:58:16Z

tfx_bsl/tfxio/parquet_tfxio.py

+class ParquetTFXIO(TFXIO):
+  """TFXIO implementation for Parquet."""
+
+  def __init__(self,


This implementation is not really doing any profiling yet: the Parquet IO does not provide an easy way to get the raw records. Should we implement a custom telemetry DoFn that iterates over each record?

I would not undo Beam's batching just to get telemetry, nor implement custom telemetry for tables. Let's just add telemetry for the resulting RecordBatches in BeamSource, like this

done in bcac68d

iindyk

Thanks Martin!

iindyk · 2022-03-10T19:37:58Z

tfx_bsl/tfxio/parquet_tfxio.py

+from tfx_bsl.tfxio.tfxio import TFXIO
+
+
+class ParquetTFXIO(TFXIO):


Both are experimental, so there is no difference in API stability. I would prefer to keep batching on Beam's side.

iindyk · 2022-03-10T19:38:32Z

tfx_bsl/tfxio/parquet_tfxio.py

+      self,
+      table: pa.Table,
+      batch_size: Optional[int] = None) -> List[pa.RecordBatch]:
+    return table.to_batches(self, max_chunksize=batch_size)


I don't think self should be passed to to_batches

iindyk · 2022-03-10T19:47:54Z

tfx_bsl/tfxio/parquet_tfxio.py

+class ParquetTFXIO(TFXIO):
+  """TFXIO implementation for Parquet."""
+
+  def __init__(self,


I would not undo Beam's batching just to get telemetry, nor implement custom telemetry for tables. Let's just add telemetry for the resulting RecordBatches in BeamSource, like this

martinbomio · 2022-03-11T02:36:14Z

@iindyk I finished the implementation following your suggestion. I believe this is ready for review.

iindyk · 2022-03-14T19:15:36Z

tfx_bsl/tfxio/parquet_tfxio.py

+        schema should contain exactly the same features as column_names.
+      validate: Boolean flag to verify that the files exist during the pipeline
+        creation time.
+    """


Please add a telemetry_descriptors section to the docstring.

iindyk · 2022-03-14T19:30:44Z

tfx_bsl/tfxio/parquet_tfxio.py

+               file_pattern: Text,
+               column_names: List[Text],
+               min_bundle_size: int = 0,
+               schema: Optional[schema_pb2.Schema] = None,


Can Parquet store data types other than dense ones, e.g. varying-length features?
In other words, would it be possible to infer the schema automatically if it is not provided?

Yeah, it can be inferred. If you look at the tests, there's one that does not specify a schema. The missing part is inferring the schema. Do you think that is something we want to do? The inference would need to happen by reading one of the Parquet files.

@iindyk done in 217cb9c

iindyk

Thanks Martin! let me import and submit this internally

iindyk · 2022-03-15T15:59:56Z

tfx_bsl/tfxio/parquet_tfxio.py

+class ParquetTFXIO(TFXIO):
+  """TFXIO implementation for Parquet."""
+
+  def __init__(self,


Actually, one more small usability thing: can we make all args after file_pattern and column_names keyword-only, i.e. def __init__(self, file_pattern:..., column_names:..., *, min_bundle_size:...,...

Done in ae1b00f.

Also note that the other TFXIO implementations do not use this pattern.

Thanks! Yeah, that's unfortunate, but I'd like to have it moving forward: I think it helps prevent some errors.
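
As an illustration of the keyword-only pattern requested above, here is a minimal sketch using a hypothetical class `ExampleIO` (not the actual ParquetTFXIO). The parameter names come from the diff and docstring quoted in this thread, but the types and defaults shown for `validate` are assumptions made for the example.

```python
from typing import List


class ExampleIO:
    """Hypothetical class; only the signature pattern matters here."""

    def __init__(self,
                 file_pattern: str,
                 column_names: List[str],
                 *,  # Everything after the bare * must be passed by keyword.
                 min_bundle_size: int = 0,
                 validate: bool = True):  # Default is assumed for illustration.
        self.file_pattern = file_pattern
        self.column_names = column_names
        self.min_bundle_size = min_bundle_size
        self.validate = validate


# Keyword form: explicit and robust to parameters being added or reordered.
io = ExampleIO("/data/*.parquet", ["id", "name"], min_bundle_size=1024)

# Positional form: rejected, so 1024 cannot silently bind to the wrong parameter.
try:
    ExampleIO("/data/*.parquet", ["id", "name"], 1024)
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

The error-prevention benefit mentioned in the reply is visible in the second call: a third positional value fails loudly instead of binding to whichever optional parameter happens to come next.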

martinbomio · 2022-03-15T22:51:45Z

@iindyk @jay90099 what is the process for cutting a release, so that I can add this new IO to the tfxio factory in tfx?

martinbomio · 2022-03-16T17:27:20Z

Hey @iindyk @jay90099 I saw that the changes introduced in this PR were reverted in 017c17b. Any reason for this?

iindyk · 2022-03-18T17:55:47Z

Re the revert: this was not intentional.
@jay90099 that was my change. It did not remove these files internally, but Copybara added the deletion of these files to the resulting PR. How do we fix this?

iindyk · 2022-03-18T23:11:53Z

It's an issue with our internal tools. Could you please reopen the PR?

martinbomio force-pushed the parquet-tfxio branch from 08bec4a to f798f3a Compare March 9, 2022 22:39

Add TFXIO for reading parquet

9262737

martinbomio force-pushed the parquet-tfxio branch from f798f3a to 9262737 Compare March 9, 2022 22:49

martinbomio closed this Mar 9, 2022

martinbomio reopened this Mar 9, 2022

iindyk reviewed Mar 10, 2022

martinbomio added 2 commits March 10, 2022 19:20

Fix unit tests

28955d8

Implement telemetry

bcac68d

martinbomio marked this pull request as ready for review March 11, 2022 02:34

Format files

7712c9f

iindyk reviewed Mar 14, 2022

martinbomio added 2 commits March 14, 2022 15:43

Address PR comments

8069c93

Infer parquet schema if no tf.schema passed

217cb9c

iindyk approved these changes Mar 15, 2022

iindyk reviewed Mar 15, 2022

Make keyword only arguments to tfxio

ae1b00f

jay90099 merged commit d713de3 into tensorflow:master Mar 15, 2022

martinbomio deleted the parquet-tfxio branch March 15, 2022 22:51

martinbomio restored the parquet-tfxio branch March 19, 2022 00:46

martinbomio mentioned this pull request Mar 19, 2022

Parquet tfxio #53

Merged
