-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Native Trino Ion Format Integration #24511
base: master
Are you sure you want to change the base?
Conversation
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
This is the initial implementation of a Native Trino Ion Hive format. It does not handle the full range of Trino or Ion values, nor the set of coercions that we will likely need to implement. But I think it's a pretty good start. I implemented enough of the scalar types to make the tests pass but there are definitely some edges that need cleaning. I also added the ion-hive-serde as a test dependency so that we can add Ion to TestHiveFileFormats. I chose to avoid building off of the "Line" abstractions. While that could be done, and more closely mimics how the Ion Hive SerDe works, doing so is suboptimal, and for reading is actually more complicated. This is because while Ion is unschema'ed, each value stream has an evolving "encoding context." In Ion 1.0 that is the "Symbol Table" which is a form of dictionary encoding for the field names and other common text values. The other high-level choice was _not_ to use the ion-java-path-extraction library that is used in the Hive SerDe. The main value in that was to allow users to do path navigation in the SerDe config. But using it to go into the Block abstractions without a middle layer (as it uses in Hive SerDe) is complex, and I'm concerned that some data quality issues are effectively silenced. Given Trino's current capability set with nested and complex structures I doubt the value proposition. In fact, using SerDe config for what can be expressed in SQL is actually an anti-pattern.
Converts the field name into lower case for both Decoder and Encoder to ensure case-insensitivity for field names/keys
* Adds `ion.nativetrino` hvie config feature flag * Adds changes for Ion page source to use the feature flag * Adds test changes for feature flag
This change factors all of the setup involved in creating PageSources and FileWriters into a single TestFixture inner class. This reduces some existing duplication and should make it simpler to add/change table properties and hive config in the test.
There are a few changes related to how mistyped or oversize Ion Values are Handled: * Enable Strict or Lax Path Typing This change adds a SerDe Property to support non-strict path typing. It defaults to false to mimic the behavior of the ion-hive-serde which used path extraction for the top-level-values, whether the user had defined any extractions or not. Without support for pathing this is effectively a type-check (or not) for TLVs. With support for extraction the behavior is a little more subtle than that, so I named the property for how it will behave. The name is also consistent with other properties. I also added some tests for nested mistypings and changed the code to consistently throw TrinoExceptions. * IonInts are coerced to Decimals (as Ion Hive does) * Decimals with more whole digits than fit is now an error (oddly this is not an error in Ion Hive) Fairly exhaustive test cases were added for each of the changes. * Ensure Timestamps are Rounded Consistently This change makes it so that Ion Timestamps are rounded using Timestamps.round(). This makes it so that Timestamp down casting works consistently with the other formats. I considered using DecodedTimestamp and the TrinoTimestampEncoders but those didn't cover picos. So this code uses the same pattern they do and the Timestamps.round(). This code simply ignores anything after picos. * Truncate Char and Varchar Columns With this change we truncate text that is longer than the length of the Char or Varchar column. Before this change that caused an error. Note that this is an error with the Hive Serde, but per the public Athena docs and the behavior of the other Hive formats, truncation is preferred.
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
This commit fixes up a few things I noticed when previewing the PR to Trino. * Column/Field name casing should be preserved when writing * Some missing operational/metrics calls in IonPageSource * Throw clearer Exception for errors in IonFileWriter * Move some tests from TestHiveFileFormats to IonPageSourceSmokeTest * Add test for Timestamp Encoding
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
Description
This PR is the implementation of a
PageSourceFactory
andFileWriterFactory
for the Ion data format. It replaces the legacy Ion Hive SerDe.The behavior is mostly backwards compatible with the Hive SerDe, with the exceptions listed below. There are a host of SerDe Properties supported by the Ion Hive SerDe. For this PR we are only supporting the ability to specify Text or Binary encoding. Otherwise only the defaults are accepted.
There is a new SerDe property to provide stricter semantics for mistyped values that defaults to
false
for backwards compatibility. The behavior is subtle so it is described below.It is disabled by default, behind a Hive Config flag:
hive.ion.nativetrino
our intention is to make that enabled by default at some point.Exceptions to Backwards Compatibility
SerDe Properties
What's Next
todo: more and flesh out below...
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(X) Release notes are required, with the following suggested text: