
Extend tfxio factory to use parquet-tfxio #4761

Merged

Conversation

martinbomio
Contributor

This PR depends on tensorflow/tfx-bsl#53 and tensorflow/tfx-bsl#54 being released.

Extend the make_tfxio factory to use ParquetTFXIO if the payload_format is FORMAT_PARQUET.
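In sketch form, the dispatch this PR adds looks roughly like the following. This is a hypothetical, simplified sketch: the real factory returns tfx_bsl TFXIO instances (e.g. ParquetTFXIO, TFExampleRecord) rather than names, and the numeric values only mirror the PayloadFormat diff quoted later in the thread.

```python
# Hedged sketch of the payload-format dispatch added to make_tfxio.
# choose_tfxio and the string return values are illustrative stand-ins.
FORMAT_TF_EXAMPLE = 6   # assumed existing PayloadFormat value
FORMAT_PARQUET = 12     # value added by this PR's proto change

def choose_tfxio(payload_format: int) -> str:
    """Map a PayloadFormat enum value to a TFXIO implementation name."""
    if payload_format == FORMAT_PARQUET:
        return "ParquetTFXIO"
    if payload_format == FORMAT_TF_EXAMPLE:
        return "TFExampleRecord"
    raise NotImplementedError(
        f"Unsupported payload format: {payload_format}")
```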

@google-cla

google-cla bot commented Mar 24, 2022

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

For more information, open the CLA check for this pull request.

@martinbomio martinbomio force-pushed the parquet-tfxio-support branch 2 times, most recently from c222433 to 470fdb3 on March 24, 2022 00:06
@martinbomio
Contributor Author

@iindyk PTAL

@iindyk
Collaborator

iindyk commented Mar 30, 2022

LGTM, I'll loop in people with approval access.

from tfx_bsl.tfxio import raw_tf_record
from tfx_bsl.tfxio import record_to_tensor_tfxio
from tfx_bsl.tfxio import tf_example_record
from tfx_bsl.tfxio import tf_sequence_example_record
from tfx_bsl.tfxio import tfxio
from tensorflow_metadata.proto.v0 import schema_pb2


_SUPPORTED_PAYLOAD_FORMATS = ['parquet', 'tfrecords_gzip']
Collaborator

nit: set or tuple
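The nit above (prefer an immutable set or tuple over a list for a module-level constant) could be addressed as in this sketch; `is_supported` is an illustrative helper, not part of the PR:

```python
# A frozenset gives O(1) membership tests and cannot be mutated
# accidentally, which is why reviewers prefer it over a list constant.
_SUPPORTED_PAYLOAD_FORMATS = frozenset({'parquet', 'tfrecords_gzip'})

def is_supported(payload_format: str) -> bool:
    """Return True if the given payload format string is supported."""
    return payload_format in _SUPPORTED_PAYLOAD_FORMATS
```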

Collaborator

Consider using this:

FORMAT_TFRECORDS_GZIP is currently supported, and FORMAT_PARQUET will be added.
ExampleGen will attach the file format to its output example artifact, which can be used by downstream components to figure out the example artifact's file format.

Contributor Author

ah nice, I missed that FileFormat proto message!

Contributor Author

@1025KB wondering if you also wanted me to change the type of file_format param in make_tfxio to example_gen_pb2.FileFormat?

Collaborator

Oh, I didn't notice it's the payload format; using the payload format is fine too.

What I mean here is: instead of the hardcoded 'parquet' and 'tfrecords_gzip' strings, can we use the enum in the PayloadFormat proto? Otherwise I feel there will be a PayloadFormat.FORMAT_PARQUET to 'parquet' conversion somewhere.

@iindyk, any thoughts?
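The conversion concern can be sketched as follows; the `PayloadFormat` class here is a stand-in for `example_gen_pb2.PayloadFormat`, and the mapping values are illustrative:

```python
# If the factory keeps string constants, some bridge like this has to
# translate PayloadFormat enum values into strings at every call site,
# which is the duplication the reviewer wants to avoid.
import enum

class PayloadFormat(enum.IntEnum):  # stand-in for the proto enum
    FORMAT_TF_EXAMPLE = 6
    FORMAT_PARQUET = 12

_PAYLOAD_FORMAT_TO_STR = {
    PayloadFormat.FORMAT_TF_EXAMPLE: 'tf_example',
    PayloadFormat.FORMAT_PARQUET: 'parquet',
}

def payload_format_name(fmt: PayloadFormat) -> str:
    """Translate a PayloadFormat enum value to its string name."""
    return _PAYLOAD_FORMAT_TO_STR[fmt]
```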

Contributor Author

Done in 465ddff. The variable was actually badly named; it is indeed the supported file formats.

Collaborator

In general I think it makes sense to have an enum for such things instead of a string, but if we want to update this we'd also need to update all call sites (components and tests), which is a bit trickier. If this proves to be messy (I think it may be), I'd keep it as is for now and add a TODO to refactor this.

Collaborator

I see, then a TODO sounds good.

Contributor Author

martinbomio commented Mar 31, 2022

@iindyk should I revert the last commit and add a TODO then? I did check the calls to make_tfxio within this project and none of them were passing the file format.

Collaborator

iindyk commented Apr 1, 2022

> should I revert the last commit and add a TODO then?

sgtm

> I did check the calls to make_tfxio within this project and none of them were passing the file format

Even tests? It's unfortunate that this path is not covered. Aside from the upstream code we'd also need to update the TFXIOs that are produced by the factory, because they take strings (or add a bridge from the proto enum to string).

For payload_format there are places that actually pass an int (and I'm not yet sure if there's a good reason for that; it seems fragile).

@@ -111,7 +111,10 @@ enum PayloadFormat {
   // Serialized any protocol buffer.
   FORMAT_PROTO = 11;

-  reserved 1 to 5, 8 to 10, 12 to max;
+  // Serialized parquet messages.
+  FORMAT_PARQUET = 12;
Collaborator

Is a file format also needed?

e.g., FORMAT_TF_EXAMPLE (payload format) is stored in FORMAT_TFRECORDS_GZIP (file format)

for Parquet, what's the file format?

Contributor Author

I guess PARQUET is also the file format?

Contributor Author

@1025KB these are the file formats supported by pyarrow: https://arrow.apache.org/docs/python/parquet.html#compression-encoding-and-file-compatibility. Should I add one for each, or should we agree on the set we plan to support?

Contributor Author

the default is SNAPPY

Collaborator

Not necessary; one splittable file format (Parquet) in addition to TFRecord (a non-splittable file format) should be enough for now.

Collaborator

Note that we will use beam.io.WriteToParquet to generate output files, so for reading we only need to support the format used by beam.io.WriteToParquet.

Contributor

While we're in this code, to maximize compatibility I think it would make sense to include gzip, brotli, and none, as well as snappy - unless that creates headaches, in which case snappy and none (and maybe gzip?) would seem sufficient.

Collaborator

I think supporting only the formats that the Beam source and sink support should be sufficient for now; other formats could be considered in a follow-up if needed.

@1025KB 1025KB requested a review from rcrowe-google March 30, 2022 18:58
@github-actions
Contributor

github-actions bot commented May 1, 2022

This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days

@github-actions github-actions bot added the stale label May 1, 2022
@martinbomio
Contributor Author

@iindyk any update on the release?

@github-actions github-actions bot removed the stale label May 2, 2022
@iindyk
Collaborator

iindyk commented May 2, 2022

The release is currently blocked on a tf-related breakage, but will likely happen this week

@martinbomio
Contributor Author

@iindyk any update on the release?

@iindyk
Collaborator

iindyk commented May 9, 2022

I think all the release blockers are resolved at this point, so our release engineers are working on it; it should be coming out soon.

@martinbomio
Contributor Author

Hey @iindyk, should we think about wrapping this one up? Are there any outstanding comments in the PR that we need to address before we can merge it?

Collaborator

iindyk left a comment

Left a couple of comments, but generally LGTM. I don't have owner permission here, so @1025KB PTAL.

from tfx_bsl.tfxio import raw_tf_record
from tfx_bsl.tfxio import record_to_tensor_tfxio
from tfx_bsl.tfxio import tf_example_record
from tfx_bsl.tfxio import tf_sequence_example_record
from tfx_bsl.tfxio import tfxio
from tensorflow_metadata.proto.v0 import schema_pb2


_SUPPORTED_FILE_FORMATS = {example_gen_pb2.FileFormat.FORMAT_PARQUET, example_gen_pb2.FileFormat.FORMAT_TFRECORDS_GZIP}
Collaborator

nit: use immutable tuple for constants

@@ -111,7 +111,10 @@ enum PayloadFormat {
   // Serialized any protocol buffer.
   FORMAT_PROTO = 11;

-  reserved 1 to 5, 8 to 10, 12 to max;
+  // Serialized parquet messages.
+  FORMAT_PARQUET = 12;
Collaborator

I think supporting only the formats that the Beam source and sink support should be sufficient for now; other formats could be considered in a follow-up if needed.

@@ -291,10 +294,10 @@ def make_tfxio(
           f'The length of file_pattern and file_formats should be the same.'
           f'Given: file_pattern={file_pattern}, file_format={file_format}')
     else:
-      if any(item != 'tfrecords_gzip' for item in file_format):
+      if any(item in _SUPPORTED_FILE_FORMATS for item in file_format):
Collaborator

not in?

         raise NotImplementedError(f'{file_format} is not supported yet.')
   else:  # file_format is str type.
-    if file_format != 'tfrecords_gzip':
+    if file_format in _SUPPORTED_FILE_FORMATS:
Collaborator

ditto
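The fix the reviewer is asking for in both places (`not in` rather than `in`) could read as below; the string constants stand in for the `example_gen_pb2.FileFormat` enum values used in the actual change:

```python
# The NotImplementedError must fire for formats *not in* the supported
# collection; as written in the diff it fired for supported ones.
_SUPPORTED_FILE_FORMATS = ('FORMAT_TFRECORDS_GZIP', 'FORMAT_PARQUET')

def validate_file_formats(file_format):
    """Raise NotImplementedError if any requested format is unsupported."""
    formats = file_format if isinstance(file_format, list) else [file_format]
    unsupported = [f for f in formats if f not in _SUPPORTED_FILE_FORMATS]
    if unsupported:
        raise NotImplementedError(f'{unsupported} is not supported yet.')
```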

@martinbomio
Contributor Author

Hey @1025KB, would you mind taking a look? The PR has been open for a very long time and I would love for us to wrap up this work.

// Serialized parquet messages.
FORMAT_PARQUET = 12;

reserved 1 to 5, 8 to 10, 13 to max;
Collaborator

Let's use 15

reserved 1 to 5, 8 to 10, 12 to 14, 16 to max;

Contributor Author

Made the change, but wondering why we don't use continuous numbering?

// https://arrow.apache.org/docs/python/parquet.html#compression-encoding-and-file-compatibility
FORMAT_PARQUET = 6;

reserved 1 to 4, 7 to max;
Collaborator

Let's use 16:
reserved 1 to 4, 7 to 15, 17 to max;

By the way, are Parquet's file format and payload format both named "Parquet"? I wonder if there are better names to distinguish them.

Contributor

rcrowe-google left a comment

I think that the changes suggested by @1025KB probably make sense, but otherwise LGTM, and thanks!

@martinbomio
Contributor Author

@iindyk @rcrowe-google @1025KB addressed the latest suggestions. Should be ready to merge. I need one of you to do the merging :)

@iindyk
Collaborator

iindyk commented May 24, 2022

I don't have access to merge here

@martinbomio
Contributor Author

@1025KB build seems to be failing with what I think is an unrelated error. Would you mind taking a look?

@1025KB 1025KB requested a review from venkat2469 May 25, 2022 17:48
@1025KB
Collaborator

1025KB commented May 25, 2022

Hi, Venkat, could you help on merging this PR?

@1025KB 1025KB removed the request for review from venkat2469 May 26, 2022 20:31
@1025KB 1025KB requested a review from jiyongjung0 May 26, 2022 21:37
@jiyongjung0

@martinbomio The build failure seems to be related to 517441d. Could you rebase the PR and retry?

@martinbomio martinbomio force-pushed the parquet-tfxio-support branch from 814c46b to 30bec96 on May 27, 2022 02:07
@martinbomio
Contributor Author

@jiyongjung0 done!

@jiyongjung0

@martinbomio It seems like the current failure is a real problem. Could you take a look?

reserved 1 to 4, 6 to max;
// Indicates parquet format files in any of the supported compressions.
// https://arrow.apache.org/docs/python/parquet.html#compression-encoding-and-file-compatibility
FORMAT_PARQUET = 16;
Contributor Author

@iindyk @1025KB the proto generation is failing because both enums define a value named FORMAT_PARQUET, and enum value names share their enclosing scope in protobuf. I renamed the file-format one to FILE_FORMAT_PARQUET.

@martinbomio
Contributor Author

@jiyongjung0 fixed the issue, can we trigger another build?

@@ -12,6 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""TFXIO (standardized TFX inputs) related utilities."""
from __future__ import annotations
Contributor Author

Had to add this to allow proto enums to be used as types in signatures. Not sure if this is something desirable from the tfx side or not @jiyongjung0 @1025KB

Contributor Author

I can go back to string in the make_tfxio function signature instead of the proto enum type.

@@ -306,7 +307,7 @@ def test_get_tf_dataset_factory_from_artifact(self):
     dataset_factory = tfxio_utils.get_tf_dataset_factory_from_artifact(
         [examples], _TELEMETRY_DESCRIPTORS)
     self.assertIsInstance(dataset_factory, Callable)
-    self.assertEqual(tf.data.Dataset,
+    self.assertEqual('tf.data.Dataset',
Contributor Author

This stringification of the expected annotation is a result of adding `from __future__ import annotations`: with postponed evaluation, annotations are stored as strings rather than evaluated objects.
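The effect can be reproduced without TFX; a minimal sketch of PEP 563 postponed evaluation (all names here are illustrative):

```python
from __future__ import annotations  # must precede other statements

# With postponed evaluation, annotations are recorded as strings and never
# evaluated at function-definition time, so a name defined later in the
# module can still appear in a signature.
def factory() -> LazyDataset:
    return LazyDataset()

class LazyDataset:  # defined *after* the function that annotates with it
    pass

# factory.__annotations__['return'] is now the string 'LazyDataset',
# not the class object, which is why the test compares against a string.
```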

@martinbomio
Contributor Author

@1025KB @jiyongjung0 seems like the build is failing with [tfx/proto/example_gen.proto:16:1] Package name "tfx.components.example_gen" must only contains lowercase letters, digits and/or periods. Is this something you can take a look at on your side?

@1025KB
Collaborator

1025KB commented May 31, 2022

I created an internal pending change and I will make sure it gets submitted (your PR will automatically close once I submit it).

@gbaned gbaned self-assigned this Jun 3, 2022
@martinbomio
Contributor Author

Hi @rcrowe-google @iindyk @1025KB, any update on this PR?

@1025KB
Collaborator

1025KB commented Jun 7, 2022

The reviewer is just back from vacation; the change should be in this week.

copybara-service bot pushed a commit that referenced this pull request Jun 8, 2022
@copybara-service copybara-service bot merged commit d6ab4ff into tensorflow:master Jun 8, 2022