[BEAM-12385] Handle VARCHAR and other SQL specific logical types in AvroUtils #14858
Conversation
CC: @iemejia
Codecov Report

|          | master | #14858 | +/-    |
|----------|--------|--------|--------|
| Coverage | 83.80% | 83.78% | -0.02% |
| Files    | 434    | 434    |        |
| Lines    | 58266  | 58260  | -6     |
| Hits     | 48827  | 48816  | -11    |
| Misses   | 9439   | 9444   | +5     |

Continue to review the full report at Codecov.
@iemejia Added tests for date-time and string (varchar) related logical types.
Gentle bump.
sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/AvroUtils.java
Thanks for this -- the Avro conversions look good!
…ogicalType preserving the string size characteristics. Add comment in CHANGES.md
JDBCType jdbcType = JDBCType.valueOf(logicalType.getIdentifier());
Integer size = logicalType.getArgument();

String schemaJson = …
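The diff excerpt above is truncated at `schemaJson`; here is a minimal sketch of how the JSON-based construction could continue, assuming the uppercase logicalType spelling shown later in this thread (the class and method names are made up for illustration, not the merged code):

```java
import java.sql.JDBCType;
import org.apache.avro.Schema;

final class VarcharAvroSchemas {

  // Sketch only: write the schema as a JSON string so that maxLength is a
  // JSON number, then let Avro's parser build the Schema object.
  static Schema forJdbcStringType(JDBCType jdbcType, Integer size) {
    String schemaJson =
        String.format(
            "{\"type\": \"string\", \"logicalType\": \"%s\", \"maxLength\": %d}",
            jdbcType.name(), size);
    return new Schema.Parser().parse(schemaJson);
  }
}
```

Going through JSON is what keeps `maxLength` numeric, which is exactly the point debated in the comments below.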
Hey, optionally you can create Avro schemas in a more readable way:
return org.apache.avro.SchemaBuilder.builder()
.stringBuilder().prop("logicalType", jdbcType.name()).prop("maxLength", size).endString();
This is for info -- they should (must!) be equivalent to using the Parser. This snippet is small but builders are a good practice for larger schemas!
@RyanSkraba thanks for the suggestion. One specific reason I didn't use the SchemaBuilder is the need to set an Integer (non-String) value for the property. The schema produced by SchemaBuilder looks like:
{"type":"string","logicalType":"LONGVARCHAR","maxLength": "50"}
vs. the expected:
{"type":"string","logicalType":"LONGVARCHAR","maxLength": 50}
I think that's the same reason Hive's TypeInfoToSchema#L116 also uses the JSON-based parsing approach.
Huh, I'm surprised -- thanks for pointing this out. It might be an Avro bug! I'm pretty sure that if size is an int, it should be serialized as a JSON number.
Regardless, thanks for the update; let's use the Hive approach then!
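For reference, a small check (not part of the PR) of how a parsed schema stores the property, using only `Schema.Parser` and the `JsonProperties` accessors; it assumes an Avro version where `getObjectProp` is available:

```java
import org.apache.avro.Schema;

public class MaxLengthPropCheck {
  public static void main(String[] args) {
    // Parse a schema whose maxLength is a JSON number (the "expected" form above).
    Schema parsed =
        new Schema.Parser()
            .parse("{\"type\":\"string\",\"logicalType\":\"LONGVARCHAR\",\"maxLength\":50}");

    // getProp only returns string-valued properties, so it is null here;
    // getObjectProp returns the underlying JSON value, an Integer.
    System.out.println(parsed.getProp("maxLength"));       // null
    System.out.println(parsed.getObjectProp("maxLength")); // 50
    System.out.println(parsed);                            // maxLength stays unquoted on round-trip
  }
}
```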
I hate to be this annoying @anantdamle, but since we are trying to align with Hive/Spark, maybe it would be good to name the logicalTypes to coincide with the ones in the class you mention, which are lowercase.
No worries @iemejia. The concern is that Hive only provides varchar, so how would we then deal with others like longvarchar? If I use lowercase, converting back to JDBCType would become hard.
Do you suggest converting all the string-based logical types to just varchar with an appropriate maxLength?
The approach in the JdbcIO schema is to represent them with the uppercase JDBC logical type names.
All the logicalTypes that Hive/Spark define are spelled in lowercase, so let's do the same.
It is a bit odd compared with the Java SQL types (uppercase), but SQL should be case-agnostic, so it should not matter on that front.
@iemejia made the changes, with the following mapping as per Hive, since there are only two categories (see the sketch after this list):
- Variable-length strings: VARCHAR, LONGVARCHAR, NVARCHAR, LONGNVARCHAR -> varchar
- Fixed-length strings: CHAR, NCHAR -> char
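A minimal sketch of that mapping, assuming `java.sql.JDBCType` as the input (the class and method names are hypothetical; the merged code may structure this differently):

```java
import java.sql.JDBCType;

final class JdbcStringLogicalTypes {

  // Map JDBC character types to the two Hive-style logical type names
  // described above: variable-length -> "varchar", fixed-length -> "char".
  static String hiveStyleName(JDBCType jdbcType) {
    switch (jdbcType) {
      case VARCHAR:
      case LONGVARCHAR:
      case NVARCHAR:
      case LONGNVARCHAR:
        return "varchar";
      case CHAR:
      case NCHAR:
        return "char";
      default:
        throw new IllegalArgumentException("Not a SQL character type: " + jdbcType);
    }
  }
}
```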
LGTM. Thanks @anantdamle, I will merge manually to squash the commits and add one comment about aligning logical types with Hive.
…specific logical types in AvroUtils
Merged manually; it will be part of Beam 2.32.0 (sorry it did not make it into 2.31.0, but the branch was cut just before this PR was merged).
Handle VARCHAR and other JDBC-specific logical types in AvroUtils when converting a Row to a GenericRecord.
@jbonofre @timrobertson100 can you help review? How do you suggest adding tests? I can add tests to JdbcIOTest, but that would require introducing unnecessary dependencies on MariaDB or MySQL drivers.
R: @jbonofre @timrobertson100