Remove CoGBK in MLTransform's TFTProcessHandler #30146

AnandInguva · 2024-01-29T21:07:08Z

This PR attempts to remove CoGBK from MLTransform's TFTProcessHandler.

Instead of using CoGBK, encode the dict of columns not specified by the user into bytes and pass it along in the original dict under a temporary key that is declared in the schema for TFT. In TFT's AnalyzeDataset or TransformDataset, this temporary column is a no-op. Once TFT is done processing the data, decode the bytes back into their original format.

This only affects the data processing operations defined in apache_beam/ml/transforms/tft.py. The operations support float and string data, and we already have tests for them.

For encoding and decoding purposes, I used apache_beam/pickler for now.
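
The round trip described above can be sketched with the standard-library pickle module standing in for Beam's pickler. The names TEMP_KEY, encode_unused, and decode_unused are hypothetical and are not the identifiers used in handlers.py; this is a minimal illustration under those assumptions, not the PR's implementation.

```python
import pickle

# Hypothetical temporary column name; the PR defines its own internal key.
TEMP_KEY = "__temp_unused_columns__"


def encode_unused(element, columns_to_process):
    """Return a new dict holding the processed columns plus one bytes column."""
    unused = {k: v for k, v in element.items() if k not in columns_to_process}
    encoded = {k: v for k, v in element.items() if k in columns_to_process}
    # All columns that the transforms do not touch travel as a single payload.
    encoded[TEMP_KEY] = pickle.dumps(unused)
    return encoded


def decode_unused(element):
    """Return a new dict with the bytes column expanded back into columns."""
    decoded = {k: v for k, v in element.items() if k != TEMP_KEY}
    decoded.update(pickle.loads(element[TEMP_KEY]))
    return decoded


row = {"x": [1.0, 2.0], "label": "cat", "meta": {"id": 7}}
encoded = encode_unused(row, columns_to_process={"x"})
restored = decode_unused(encoded)
```

Because every element carries its own unused columns, no join step is needed to reattach them after processing.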

Fixes: #29593


github-actions · 2024-01-29T22:05:38Z

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @damccorm for label python.

Available commands:

stop reviewer notifications - opt out of the automated review tooling
remind me after tests pass - tag the comment author after tests pass
waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

tvalentyn · 2024-01-31T19:44:32Z

sdks/python/apache_beam/ml/transforms/handlers.py

+  """
+  def process(self, element):
+    element.update(pickler.loads(element[_TEMP_KEY].item()))
+    del element[_TEMP_KEY]


Let's not mutate elements; emit a copy instead.

rationale: https://beam.apache.org/documentation/programming-guide/#immutability
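
A minimal sketch of the copy-then-emit pattern the reviewer asks for, written as a plain function rather than a DoFn so it runs without Beam installed. TEMP_KEY and decode_without_mutation are hypothetical names, and the standard-library pickle is an assumed stand-in for the pickler in the diff.

```python
import copy
import pickle

TEMP_KEY = "__temp__"  # hypothetical temporary column name


def decode_without_mutation(element):
    """Build the output from a shallow copy so the input dict is untouched."""
    clone = copy.copy(element)
    clone.update(pickle.loads(clone[TEMP_KEY]))
    del clone[TEMP_KEY]
    return clone


original = {"x": 1.0, TEMP_KEY: pickle.dumps({"label": "cat"})}
output = decode_without_mutation(original)
```

Note that copy.copy is shallow: the top-level dict is new, but nested values are still shared with the input, so they should not be modified in place either.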

tvalentyn · 2024-01-31T20:26:42Z

sdks/python/apache_beam/ml/transforms/handlers.py

+      if key in data_to_encode:
+        del data_to_encode[key]
+
+    bytes = pickler.dumps(data_to_encode)


It might be more efficient to use a type-aware, cythonized coder. The performance difference can be tested in a microbenchmark.

We might be able to convert elements to Beam Row and use RowCoder. I see something similar in:

beam/sdks/python/apache_beam/transforms/external.py

Lines 127 to 132 in a221f98

    schema_proto = named_fields_to_schema(named_fields)
    row = named_tuple_from_schema(schema_proto)(**fields_to_values)
    schema = named_tuple_to_schema(type(row))

    payload = RowCoder(schema).encode(row)
    return (schema_proto, payload)

Can we expect the schema of the elements in the dict to be the same? Do we have type information for the elements, or would we need to infer it?

cc: @robertwb who might have a better idea
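
One way such a microbenchmark could be set up is sketched below. It uses only the standard library: pickle plays the generic serializer and struct plays a hypothetical type-aware encoder that knows every value is a float64. Neither is a Beam coder, so the numbers only illustrate the method and say nothing about RowCoder or PickleCoder themselves.

```python
import pickle
import struct
import timeit

values = [float(i) for i in range(1000)]


def encode_generic(vals):
    # Generic serialization: no knowledge of the element type.
    return pickle.dumps(vals)


def encode_typed(vals):
    # Type-aware stand-in: every value is packed as a little-endian float64.
    return struct.pack(f"<{len(vals)}d", *vals)


def decode_typed(payload):
    # Each float64 occupies 8 bytes.
    return list(struct.unpack(f"<{len(payload) // 8}d", payload))


timings = {
    name: timeit.timeit(lambda fn=fn: fn(values), number=200)
    for name, fn in [("pickle", encode_generic), ("struct", encode_typed)]
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f}s for 200 runs")
```

To compare real coders, the two encode functions would be swapped for calls to the coders under test, using elements shaped like the pipeline's actual data.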

For the unused elements, we won't know the schema of the elements.

If we have to use pickling to serialize elements, I would use the native pickler rather than Beam's, perhaps by way of PickleCoder.

For the unused elements, we won't know the schema of the elements.

According to MLTransform docs, elements end up being Rows:

To define a data processing transformation by using MLTransform, create instances of data processing transforms with columns as input parameters. The data in the specified columns is transformed and outputted to the beam.Row object.

Do we infer the types later, then?

Only for the columns that are provided for transformation. For these columns, we infer the schema using TFT.

AnandInguva · 2024-01-31T21:27:56Z

I will make a few changes and request a review soon. Thanks.

tvalentyn

Have we run existing tests on this change?

tvalentyn · 2024-02-03T00:09:38Z

sdks/python/apache_beam/ml/transforms/handlers.py

@@ -447,22 +409,22 @@ def expand(
      raw_data_metadata = metadata_io.read_metadata(
          os.path.join(self.artifact_location, RAW_DATA_METADATA_DIR))

-    keyed_raw_data = (raw_data | beam.ParDo(_ComputeAndAttachUniqueID()))
+    keyed_raw_data = (raw_data)  #  | beam.ParDo(_ComputeAndAttachUniqueID()))


Leftover comment. Also, we no longer add keys, so keyed_ might not be the best name.

Removed the keyed_ prefix from the variable names.

tvalentyn · 2024-02-03T00:28:05Z

sdks/python/apache_beam/ml/transforms/handlers.py


    # To maintain consistency by outputting numpy array all the time,
    # whether a scalar value or list or np array is passed as input,
    #  we will convert scalar values to list values and TFT will ouput
    # numpy array all the time.
+    raw_data_list = (


For my understanding, why is this called raw_data_list? It's modified, so I don't think it is raw. And what does _list refer to here?

Yes, it is modified. I removed raw from the variable name.

_list: we convert each scalar element to a list (len: 1) to maintain uniformity. Users can pass lists/np arrays to TFT ops, and TFT outputs numpy arrays. When users pass scalars, TFT outputs scalars. To keep the output format consistent, we convert scalars to lists.
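
The scalar-to-list normalization can be sketched as follows. The function name to_list_columns is hypothetical, and this simplified version only distinguishes Python lists and tuples from everything else; the handler in the PR also deals with numpy arrays, which are left out here to keep the sketch dependency-free.

```python
def to_list_columns(element):
    """Wrap scalar values in a one-element list; pass sequences through as lists."""
    normalized = {}
    for key, value in element.items():
        if isinstance(value, (list, tuple)):
            normalized[key] = list(value)
        else:
            # A scalar becomes a list of length 1 so every column has the same shape.
            normalized[key] = [value]
    return normalized


normalized = to_list_columns({"x": 3.0, "y": [1.0, 2.0], "name": "a"})
```

With every column in list form, the downstream step sees one input shape regardless of what the user passed.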

tvalentyn · 2024-02-03T00:35:47Z

sdks/python/apache_beam/ml/transforms/handlers.py

+    self.exclude_columns = exclude_columns
+
+  def encode(self, element):
+    if not self.exclude_columns:


Interesting. Is it possible for exclude_columns to be empty? I'd imagine it would rather be the opposite case, where all columns are being processed, so there is nothing to encode/decode.

Yes, that is right, but it errors because we add the temp id column name to the schema during construction, so TFT errors out if the PColl doesn't have the temp id column. So when there are no unused columns, we have to encode an empty dict and pass it to the PColl.
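
The constraint in this reply can be illustrated with a small sketch. Here matches_schema is a hypothetical stand-in for the check that a fixed schema implies, and TEMP_KEY, encode, and columns_to_process are illustrative names rather than the ones in handlers.py.

```python
import pickle

TEMP_KEY = "__temp__"  # hypothetical temporary column name


def encode(element, columns_to_process):
    unused = {k: v for k, v in element.items() if k not in columns_to_process}
    out = {k: v for k, v in element.items() if k in columns_to_process}
    # Always attach the temp column, even when `unused` is empty.
    out[TEMP_KEY] = pickle.dumps(unused)
    return out


def matches_schema(element, schema_columns):
    # Stand-in for a fixed schema: the element must have exactly these columns.
    return set(element) == set(schema_columns)


# The schema is fixed up front and always includes the temp column.
schema_columns = {"x", "y", TEMP_KEY}
element = encode({"x": 1.0, "y": 2.0}, columns_to_process={"x", "y"})
```

Skipping the encode step when nothing is unused would produce an element without the temp column, which no longer matches the schema declared at construction time.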

AnandInguva · 2024-02-05T16:51:02Z

Have we run existing tests on this change?

Yes, they should be covered by py38coverage.

tvalentyn · 2024-02-08T23:27:44Z

sdks/python/apache_beam/ml/transforms/handlers.py

+
+  def decode(self, element):
+    clone = copy.copy(element)
+    clone.update(self.coder.decode(clone[_TEMP_KEY].item()))


What is the function of .item() here? What is the type of clone[_TEMP_KEY]? Given that we call .item() here, will the elements in clone have a consistent type after decoding?

The type of clone[_TEMP_KEY] is a numpy array, and .item() returns the underlying element of the numpy array.

will elements in clone have consistent type after decoding.

They should, depending on the Coder.
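
A short sketch of what .item() does on a one-element numpy array. The object dtype used here is an assumption made for illustration; the thread only states that the value is a numpy array, and pickle stands in for whichever coder produced the payload.

```python
import pickle

import numpy as np

payload = pickle.dumps({"label": "cat"})

# Assumed shape of the value after processing: a one-element object array.
wrapped = np.array([payload], dtype=object)

# .item() unwraps the single element into a plain Python object.
unwrapped = wrapped.item()
decoded = pickle.loads(unwrapped)
```

Because the payload is decoded by the same coder that encoded it, the decoded columns come back with whatever types that coder preserves.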

github-actions bot added the python label Jan 29, 2024

Add _Encode and _DecodeDict

abc1522

AnandInguva force-pushed the remove_cogbk branch from ff0ae49 to 8e4dda7 Compare January 29, 2024 21:12

Replace the CoGBK and utils with Encode and Decode utils

ee770c4

AnandInguva force-pushed the remove_cogbk branch from 8e4dda7 to ee770c4 Compare January 29, 2024 21:14

AnandInguva marked this pull request as ready for review January 29, 2024 21:21

AnandInguva requested a review from tvalentyn January 29, 2024 21:21

github-actions bot added the Next Action: Reviewers label Jan 29, 2024

tvalentyn reviewed Jan 31, 2024

AnandInguva added 2 commits February 1, 2024 14:44

Use PicklerCoder for encoding and decoding elements

48614b3

Remove comments

98404ca

tvalentyn reviewed Feb 1, 2024

update coder

094a888

AnandInguva requested a review from tvalentyn February 2, 2024 23:35

tvalentyn reviewed Feb 3, 2024

AnandInguva added 2 commits February 5, 2024 11:36

Address comments

066c4ce

Remove return comment

cddc98a

AnandInguva requested review from tvalentyn February 5, 2024 19:31

tvalentyn reviewed Feb 8, 2024

Update sdks/python/apache_beam/ml/transforms/handlers.py

b8dd486

Co-authored-by: tvalentyn <[email protected]>

AnandInguva self-assigned this Feb 9, 2024

Make _DataCoder internal

e0fc00b

tvalentyn approved these changes Feb 13, 2024

AnandInguva merged commit c004cc7 into master Feb 13, 2024
71 checks passed

AnandInguva deleted the remove_cogbk branch February 13, 2024 19:19
