Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-12385] Handle VARCHAR and other SQL specific logical types in AvroUtils #14858

Closed
wants to merge 7 commits into from
Closed

Conversation

anantdamle
Copy link
Contributor

Handle VARCHAR and other JDBC specific logical types in AvroUtils when converting a Row to GenericRecord.

@jbonofre @timrobertson100 can you help review? How do you suggest adding tests to JDBCIOTest? I can add tests to JdbcIOTest but would require to introduce unnecessary dependencies on MariaDB or MySQL drivers.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @jbonofre @timrobertson100).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.

See the Contributor Guide for more tips on how to make review process smoother.

ValidatesRunner compliance status (on master branch)

Lang ULR Dataflow Flink Samza Spark Twister2
Go --- Build Status Build Status --- Build Status ---
Java Build Status Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Python --- Build Status
Build Status
Build Status
Build Status
Build Status
--- Build Status ---
XLang Build Status Build Status Build Status --- Build Status ---

Examples testing status on various runners

Lang ULR Dataflow Flink Samza Spark Twister2
Go --- --- --- --- --- --- ---
Java --- Build Status
Build Status
Build Status
--- --- --- --- ---
Python --- --- --- --- --- --- ---
XLang --- --- --- --- --- --- ---

Post-Commit SDK/Transform Integration Tests Status (on master branch)

Go Java Python
Build Status Build Status Build Status
Build Status
Build Status

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website Whitespace Typescript
Non-portable Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status Build Status Build Status
Portable --- Build Status Build Status --- --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests

See CI.md for more information about GitHub Actions CI.

@aromanenko-dev
Copy link
Contributor

CC: @iemejia

@iemejia iemejia self-requested a review May 24, 2021 12:07
@codecov
Copy link

codecov bot commented May 26, 2021

Codecov Report

Merging #14858 (79c1932) into master (40326dd) will decrease coverage by 0.01%.
The diff coverage is n/a.

❗ Current head 79c1932 differs from pull request most recent head de939da. Consider uploading reports for the commit de939da to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master   #14858      +/-   ##
==========================================
- Coverage   83.80%   83.78%   -0.02%     
==========================================
  Files         434      434              
  Lines       58266    58260       -6     
==========================================
- Hits        48827    48816      -11     
- Misses       9439     9444       +5     
Impacted Files Coverage Δ
...tes/tox/py38/build/srcs/sdks/python/test_config.py 66.66% <0.00%> (-4.77%) ⬇️
.../python/apache_beam/examples/dataframe/taxiride.py 61.11% <0.00%> (-4.68%) ⬇️
...thon/apache_beam/runners/worker/sdk_worker_main.py 73.04% <0.00%> (-4.65%) ⬇️
...y38/build/srcs/sdks/python/apache_beam/__init__.py 84.21% <0.00%> (-1.51%) ⬇️
...ks/python/apache_beam/runners/worker/data_plane.py 91.21% <0.00%> (-1.22%) ⬇️
...ks/python/apache_beam/runners/worker/sdk_worker.py 89.00% <0.00%> (-0.46%) ⬇️
.../srcs/sdks/python/apache_beam/coders/typecoders.py 94.44% <0.00%> (-0.16%) ⬇️
.../py38/build/srcs/sdks/python/apache_beam/pvalue.py 93.02% <0.00%> (-0.11%) ⬇️
...s/python/apache_beam/examples/snippets/snippets.py 76.70% <0.00%> (-0.11%) ⬇️
...srcs/sdks/python/apache_beam/coders/slow_stream.py 94.49% <0.00%> (-0.10%) ⬇️
... and 18 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 40326dd...de939da. Read the comment docs.

@anantdamle anantdamle changed the title [BEAM-12385] Handle VARCHAR and other JDBC specific logical types in AvroUtils. [BEAM-12385] Handle VARCHAR and Date-time JDBC specific logical types in AvroUtils. May 29, 2021
@anantdamle
Copy link
Contributor Author

@iemejia Added tests for date-time and string (varchar) related Logical types.

@anantdamle
Copy link
Contributor Author

gentle bump up.

Copy link
Contributor

@RyanSkraba RyanSkraba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this -- the Avro conversions look good!

JDBCType jdbcType = JDBCType.valueOf(logicalType.getIdentifier());
Integer size = logicalType.getArgument();

String schemaJson =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, optionally you can create Avro schemas in a more readable way:

return org.apache.avro.SchemaBuilder.builder()
    .stringBuilder().prop("logicalType", jdbcType.name()).prop("maxLength", size).endString();

This is for info -- they should (must!) be equivalent to using the Parser. This snippet is small but builders are a good practice for larger schemas!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RyanSkraba thanks for the suggestion, one specific reason I didn't use the SchemaBuilder is due to the need to use a an Integer (non-String) value for the property. The Schema produced by using SchemaBuilder looks like:
{"type":"string","logicalType":"LONGVARCHAR","maxLength": "50"}
vs. the expected:
{"type":"string","logicalType":"LONGVARCHAR","maxLength": 50}

I think its the same reason, Hive's TypeInfoToSchema#L116 also uses the JSON based parsing approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, I'm surprised -- thanks for pointing this out. It might be an Avro bug! I'm pretty sure that if size is an int it's a JSON number.
Regardless, thanks for the update, let's use the Hive approach then!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hate to be this annoying @anantdamle but since we are trying to 'align' with Hive/Spark maybe it is good that we name the logicalType names to coincide with the ones in the class you mention which are in lowercase.

https://github.com/apache/hive/blob/135629b8d6b538fed092641537034a9fbc59c7a0/serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroSerDe.java#L59

Copy link
Contributor Author

@anantdamle anantdamle Jun 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries @iemejia, concern in using is that Hive only provides varchar, how to then deal with others like longvarchar etc.
If I use lowercase then converting back to JDBCType would become hard.
Do you suggest converting all the string based logical types to just varchar with appropriate maxLength?

The approach in JdbcIO schema is to represent them with Uppercase logical type of JDBC.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the logicalTypes that Hive/Spark define are spelled with lowercase so let's do like them.

It is a bit odd compared with the Java SQL Types (on uppercase), at least SQL should be casing agnostic so it should not matter in that front.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iemejia made the changes, with following mapping as per Hive, as there are only two categories:

Variable Length Strings:
[VARCHAR, LONGVARCHAR, NVARCHAR, LONGNVARCHAR] -> varchar

Fixed Length Strings:
[CHAR, NCHAR] -> char

Copy link
Member

@iemejia iemejia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM Thanks @anantdamle I will merge manually to squash the commits and add one comment about aligning logical types with Hive.

@iemejia iemejia changed the title [BEAM-12385] Handle VARCHAR and Date-time JDBC specific logical types in AvroUtils. [BEAM-12385] Handle VARCHAR and other SQL specific logical types in AvroUtils Jun 7, 2021
iemejia added a commit that referenced this pull request Jun 7, 2021
@iemejia
Copy link
Member

iemejia commented Jun 7, 2021

Merged manually, it will be part of Beam 2.32.0 (sorry if it did not make it into 2.31.0 but the branch was cut just before the PR was merged).

@iemejia iemejia closed this Jun 7, 2021
@anantdamle anantdamle deleted the beam_12385 branch June 8, 2021 15:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants