Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor and cleanup yaml MapToFields. #28462

Merged
merged 7 commits into from
Sep 20, 2023
Merged

Conversation

robertwb
Copy link
Contributor

  • Avoid the use of MetaProviders, which was always kind of hacky. We may want to remove this infrastructure altogether as it does not play nicely with provider inference.

  • Split MapToFields into separate mapping, filtering, and exploding operations.

  • Allow MapToFields to act on non-schema'd PCollections.

The various langauge flavors of these UDFs are now handled by a preprocessing step. This will make it easier to extend to other langauges, including in particular possible multiple (equivalent) implementations of javascript to minimize cross-langauge boundary crossings.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

* Avoid the use of MetaProviders, which was always kind of hacky.
  We may want to remove this infrastructure altogether as it
  does not play nicely with provider inference.

* Split MapToFields into separate mapping, filtering, and exploding
  operations.

* Allow MapToFields to act on non-schema'd PCollections.

The various langauge flavors of these UDFs are now handled by a preprocessing
step.  This will make it easier to extend to other langauges, including
in particular possible multiple (equivalent) implementations of javascript to
minimize cross-langauge boundary crossings.
@robertwb
Copy link
Contributor Author

R: @Polber

@github-actions
Copy link
Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

@codecov
Copy link

codecov bot commented Sep 18, 2023

Codecov Report

Merging #28462 (461db4a) into master (220cae7) will increase coverage by 0.01%.
Report is 108 commits behind head on master.
The diff coverage is 80.46%.

@@            Coverage Diff             @@
##           master   #28462      +/-   ##
==========================================
+ Coverage   72.32%   72.33%   +0.01%     
==========================================
  Files         683      683              
  Lines      100709   100915     +206     
==========================================
+ Hits        72834    73001     +167     
- Misses      26297    26336      +39     
  Partials     1578     1578              
Flag Coverage Δ
python 82.81% <80.46%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
sdks/python/apache_beam/yaml/yaml_transform.py 90.03% <62.50%> (-0.33%) ⬇️
sdks/python/apache_beam/yaml/yaml_mapping.py 83.41% <80.35%> (-2.76%) ⬇️
sdks/python/apache_beam/transforms/core.py 91.53% <100.00%> (-0.04%) ⬇️
sdks/python/apache_beam/yaml/yaml_provider.py 70.34% <100.00%> (+0.25%) ⬆️

... and 18 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

sdks/python/apache_beam/yaml/yaml_mapping.py Outdated Show resolved Hide resolved
sdks/python/apache_beam/yaml/yaml_mapping.py Outdated Show resolved Hide resolved
missing = set(fields.values()) - set(input_schema.keys())
if missing:
raise ValueError(
f"Missing language specification or unkown input fields: {missing}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
f"Missing language specification or unkown input fields: {missing}")
f"Unknown input fields: {missing}")

We've already dealt with the language specifications, right? (plus a typo)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The case above rejects callable and name/path specifications of the UDF. At this point we know that they are inline string expressions, but they could still be arbitrary expressions like "a+b" or even constants (though I suppose we could consider supporting some constants as well.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that makes sense

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Unknown" is misspelled though for what it's worth...

Copy link
Contributor

@damccorm damccorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly just had comments around error messages, otherwise this looks good to me

missing = set(fields.values()) - set(input_schema.keys())
if missing:
raise ValueError(
f"Missing language specification or unkown input fields: {missing}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Unknown" is misspelled though for what it's worth...

print('yaml_provider.beam_jar', yaml_provider.beam_jar)
sql_provider = yaml_provider.beam_jar(
urns={'Sql': 'beam:external:java:sql:v1'},
gradle_target='sdks:java:extensions:sql:expansion-service:shadowJar')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this have the version as well similar to what is defined in the standard_providers.yaml? Also, can't the standard_providers.yaml file be removed since there is now an explicit provider?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, this is wrapped by the other provider. We could try to unify them, but there are circular reference issues trying to pull this out from the set of standard providers, which is why it's just re-defined here.

Comment on lines +257 to +266
@maybe_with_exception_handling_transform_fn
def _PyJsFilter(
pcoll, keep: Union[str, Dict[str, str]], language: Optional[str] = None):

if error_handling is None:
error_handling_args = None
input_schema = dict(named_fields_from_element_type(pcoll.element_type))
if isinstance(keep, str) and keep in input_schema:
keep_fn = lambda row: getattr(row, keep)
else:
error_handling_args = {
'dead_letter_tag' if k == 'output' else k: v
for (k, v) in error_handling.items()
}
keep_fn = _as_callable(list(input_schema.keys()), keep, "keep", language)
return pcoll | beam.Filter(keep_fn)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these functions not map the input back to Row? They used to run through beam.Select when MapToFields was 1 function, but with it separated out, why not carry the Select transform over too or at least utilize the Map transform you included in the test file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filter returns the inputs as outputs. From the users point of view they don't care if they're Row objects or NamedTuple objects (though the equality semantics are a bit strict).

@robertwb robertwb merged commit c5e6c79 into apache:master Sep 20, 2023
73 of 75 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants