Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GeneratedClassRowTypeConstraint #22679

Merged

Conversation

TheNeuralBit
Copy link
Member

A RowTypeConstraint that wraps a generated NamedTuple type cannot be pickled, because the generated type cannot be pickled. This PR adds a specialization, GeneratedClassRowTypeConstraint, with a custom __reduce__ that avoids pickling the user type.

Extracted from #22575

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI.

@codecov
Copy link

codecov bot commented Aug 10, 2022

Codecov Report

Merging #22679 (59686fe) into master (c9c57a7) will decrease coverage by 0.00%.
The diff coverage is 97.29%.

@@            Coverage Diff             @@
##           master   #22679      +/-   ##
==========================================
- Coverage   74.19%   74.19%   -0.01%     
==========================================
  Files         708      709       +1     
  Lines       93465    93498      +33     
==========================================
+ Hits        69347    69367      +20     
- Misses      22843    22856      +13     
  Partials     1275     1275              
Flag Coverage Δ
python 83.58% <97.29%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ks/python/apache_beam/typehints/schema_registry.py 95.23% <95.23%> (ø)
sdks/python/apache_beam/typehints/schemas.py 94.17% <97.14%> (-0.15%) ⬇️
sdks/python/apache_beam/coders/row_coder.py 94.54% <100.00%> (+0.05%) ⬆️
sdks/python/apache_beam/transforms/core.py 92.94% <100.00%> (ø)
sdks/python/apache_beam/typehints/row_type.py 97.01% <100.00%> (+0.46%) ⬆️
sdks/python/apache_beam/utils/interactive_utils.py 95.12% <0.00%> (-2.44%) ⬇️
sdks/python/apache_beam/utils/subprocess_server.py 56.54% <0.00%> (-2.20%) ⬇️
...python/apache_beam/runners/worker/worker_status.py 77.53% <0.00%> (-2.18%) ⬇️
...ks/python/apache_beam/runners/worker/sdk_worker.py 88.62% <0.00%> (-0.32%) ⬇️
...eam/runners/interactive/interactive_environment.py 91.71% <0.00%> (-0.31%) ⬇️
... and 3 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@github-actions
Copy link
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @tvalentyn for label python.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@TheNeuralBit
Copy link
Member Author

R: @yeandy

Since valentyn is out

@github-actions
Copy link
Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

Copy link
Contributor

@yeandy yeandy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Probably need to run formatter. Also seeing some Dataframe schema test errors, but didn't look too much into that.

schema_options: Optional[Sequence[Tuple[str, Any]]] = None,
field_options: Optional[Dict[str, Sequence[Tuple[str, Any]]]] = None,
schema_registry: SchemaTypeRegistry = None,
) -> RowTypeConstraint:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
) -> RowTypeConstraint:
) -> GeneratedClassRowTypeConstraint:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually prefer to use the base-class here. GeneratedClassRowTypeConstraint can be an implementation detail.

Comment on lines +593 to +594
row_proto = schema_pb2.FieldType(
row_type=schema_pb2.RowType(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does schema_pb2.RowType have to be wrapped by schema_pb2.FieldType?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked again, looks like top level of protos are schema_pb2.FieldType I believe

self._fields,
self._schema_id,
self._schema_options,
self._field_options))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Do you it would be clear to explicitly add None for the 5th item (schema_registry)? I was initially looking for the 5 args for from_fields, but only saw 4 😄 . I'm perfectly ok with the current implementation though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, thanks

sdks/python/apache_beam/typehints/row_type.py Outdated Show resolved Hide resolved
class GeneratedClassRowTypeConstraint(RowTypeConstraint):
"""Specialization of RowTypeConstraint which relies on a generated user_type.

Since the generated user_type cannot be pickled, we supply a custom __reduce__
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the generated user_type cannot be pickled

Can you please explain why this is the case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually don't fully understand it, but it's been a consistent issue with the Schema code. Each pickle library (built-in, dill, cloudpickle) fails for a different reason. I filed #22714 to track this, and added a (skipped) test.

sdks/python/apache_beam/typehints/row_type.py Outdated Show resolved Hide resolved
sdks/python/apache_beam/typehints/row_type.py Outdated Show resolved Hide resolved
@chamikaramj
Copy link
Contributor

Probably this resulted in x-lang test suite failures here: #22748

Abacn added a commit to Abacn/beam that referenced this pull request Aug 26, 2022
MarcoRob pushed a commit to MarcoRob/beam that referenced this pull request Sep 5, 2022
* Add (failing) pickling tests

* Add GeneratedClassRowTypeConstraint, plumb options

* Add top-level option conversion functions

* Refactor NamedTuple generation, always create GeneratedClassRowTypeConstraint

* Move registry to apache_beam.typehints.schema_registry

* yapf,lint

* fixup! Move registry to apache_beam.typehints.schema_registry

* Apply suggestions from code review

Co-authored-by: Andy Ye <[email protected]>

* Add None SchemaRegistry

* Add skipped test for pickling generated type

Co-authored-by: Andy Ye <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants