Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Native Trino Ion Format Integration #24511

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

rmarrowstone
Copy link

@rmarrowstone rmarrowstone commented Dec 18, 2024

Description

This PR is the implementation of a PageSourceFactory and FileWriterFactory for the Ion data format. It replaces the legacy Ion Hive SerDe. It enables trino users to read and write Ion datasets. Ion is a richly-typed superset of JSON with a well-defined binary encoding.

Today the Ion Hive SerDe is used by Athena to provide access to Ion datasets. That Ion/Athena integration is used both by internal (to Amazon) teams and external customers. Newer releases of trino do not support the actual Hive SerDes. This implementation provides an Ion integration that will work in trino going forward.

The behavior is mostly backwards compatible with the Hive SerDe, with the exceptions listed below. There are a host of SerDe Properties supported by the Ion Hive SerDe. For this PR we are only supporting the ability to specify Text or Binary encoding. Otherwise only the defaults are accepted.

The new integration is disabled by default, behind a Hive Config flag: hive.ion.nativetrino our intention is to make that enabled by default at some point.

It has not been benchmarked. However, the performance of the existing Ion Hive SerDe is quite poor and this implementation intentionally avoids two of the primary causes of inefficiencies there.

We believe it is ready to be used for any tables with default (implicit or explicit) SerDe Properties.

Preserved Behaviors of Note

  • For the most part, mistyped values result in errors; e.g. an Ion String encountered where an Int is expected is an error.
  • Mistyped Top-Level-Values result in nulls, not errors. A new SerDe property ion.path_extractor.strict was added to address this, it defaults to false.
  • Duplicate keys in Structs/Rows: last one wins
  • There are some built-in and well-defined coercions, e.g. Ion Int to Trino Decimal, truncate an Ion Float (64 bits) to Trino REAL iff doing so is lossless.

Exceptions to Backwards Compatibility

  • Previously Timestamps were simply truncated. Now they are rounded. Athena users will not see a change as Timestamps will still be truncated there.
  • Previously Chars and Varchars longer than a fixed length were errors. Now they are truncated as others are.
  • Now Decimals with oversize whole digits (e.g. 1222.34 in a DECIMAL(3, 2)) are actually an error. Before the behavior was not defined.

SerDe Properties

For this PR if the IonPageSourceFactory sees any SerDe properties that affect decoding it will return Optional.empty. Obviously we could make this an error depending on feedback.

What's Next

  • Support coercion from Any Ion Value to Text (will be added to this PR)
  • Support non-default SerDe Properties (some may be added here)
  • The biggest (as in most complex) of the above is the "path extractor" properties; the ability to define paths (navigating struct fields or list indexes) into complex types for specific columns. (will be addressed in January)

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(X) Release notes are required, with the following suggested text:

## Section
* Ion File Format Integration TODO: more details (cribbed from above)

Copy link

cla-bot bot commented Dec 18, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@github-actions github-actions bot added the hive Hive connector label Dec 18, 2024
Copy link

cla-bot bot commented Dec 18, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

Copy link

cla-bot bot commented Dec 18, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

Copy link

cla-bot bot commented Dec 18, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

1 similar comment
Copy link

cla-bot bot commented Dec 19, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

rmarrowstone and others added 10 commits December 19, 2024 08:49
This is the initial implementation of a Native Trino Ion Hive format.

It does not handle the full range of Trino or Ion values, nor the set
of coercions that we will likely need to implement. But I think it's a
pretty good start. I implemented enough of the scalar types to make
the tests pass but there are definitely some edges that need cleaning.

I also added the ion-hive-serde as a test dependency so that we can
add Ion to TestHiveFileFormats.

I chose to avoid building off of the "Line" abstractions. While that
could be done, and more closely mimics how the Ion Hive SerDe works,
doing so is suboptimal, and for reading is actually more complicated.
This is because while Ion is unschema'ed, each value stream has an
evolving "encoding context." In Ion 1.0 that is the "Symbol Table"
which is a form of dictionary encoding for the field names and other
common text values.

The other high-level choice was _not_ to use the
ion-java-path-extraction library that is used in the Hive SerDe.
The main value in that was to allow users to do path navigation in the
SerDe config. But using it to go into the Block abstractions without
a middle layer (as it uses in Hive SerDe) is complex, and I'm
concerned that some data quality issues are effectively silenced.
Given Trino's current capability set with nested and complex
structures I doubt the value proposition. In fact, using SerDe config
for what can be expressed in SQL is actually an anti-pattern.
Converts the field name into lower case for both Decoder and Encoder to
ensure case-insensitivity for field names/keys
* Adds `ion.nativetrino` hvie config feature flag
* Adds changes for Ion page source to use the feature flag
* Adds test changes for feature flag
This change factors all of the setup involved in creating PageSources
and FileWriters into a single TestFixture inner class.

This reduces some existing duplication and should make it simpler to
add/change table properties and hive config in the test.
There are a few changes related to how mistyped or oversize Ion
Values are Handled:

* Enable Strict or Lax Path Typing

This change adds a SerDe Property to support non-strict path typing.
It defaults to false to mimic the behavior of the ion-hive-serde
which used path extraction for the top-level-values, whether the user
had defined any extractions or not. Without support for pathing
this is effectively a type-check (or not) for TLVs. With support for
extraction the behavior is a little more subtle than that, so I named
the property for how it will behave. The name is also consistent with
other properties.

I also added some tests for nested mistypings and changed the code
to consistently throw TrinoExceptions.

* IonInts are coerced to Decimals (as Ion Hive does)
* Decimals with more whole digits than fit is now an error
  (oddly this is not an error in Ion Hive)

Fairly exhaustive test cases were added for each of the changes.

* Ensure Timestamps are Rounded Consistently

This change makes it so that Ion Timestamps are rounded using
Timestamps.round(). This makes it so that Timestamp down casting works
consistently with the other formats.

I considered using DecodedTimestamp and the TrinoTimestampEncoders
but those didn't cover picos. So this code uses the same pattern they
do and the Timestamps.round(). This code simply ignores anything after
picos.

* Truncate Char and Varchar Columns

With this change we truncate text that is longer than the length of
the Char or Varchar column. Before this change that caused an error.

Note that this is an error with the Hive Serde, but per the public
Athena docs and the behavior of the other Hive formats, truncation
is preferred.
This commit fixes up a few things I noticed when previewing the PR to
Trino.

* Column/Field name casing should be preserved when writing
* Some missing operational/metrics calls in IonPageSource
* Throw clearer Exception for errors in IonFileWriter
* Move some tests from TestHiveFileFormats to IonPageSourceSmokeTest
* Add test for Timestamp Encoding
Copy link

cla-bot bot commented Dec 19, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@rmarrowstone rmarrowstone marked this pull request as ready for review December 19, 2024 16:50
@rmarrowstone
Copy link
Author

Aware of the CLA verification. I believe all 3 authors have at least submitted them at this point.

@rmarrowstone
Copy link
Author

The suite-hive-transactional test failure appears related to ORC and transactions. I have not seen it when running locally or in the fork. I assume it is transient but I don't seem to have permissions to retry it.

@findinpath
Copy link
Contributor

@rmarrowstone could you pls share in the description of the PR the business motivation for this contribution?

What kind of use-cases will the users of trinodb/trino be able to cover with this contribution?

// backwards incompatible. It requires explicitly opting to use this feature.
// TODO: In future this property should change to `true` as default and then the following statement can
// be removed.
.addHiveProperty("hive.ion.nativetrino", "true")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the apparent low test coverage for the new functionality intentional?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe the test coverage for the new functionality actually is low.

We have added the TestIonFormat in lib/trino-hive-formats and the IonPageSourceSmokeTest as well the addition of testIonWithBinaryEncoding to TestHiveFileFormats. And that flag is enabled for them. I see that those tests do run when I build locally.

Can you point me to the artifacts that indicate there is low test coverage?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did check that the TestHiveConnectorTest runs against the ION format and checked the coverage for our tests. There are a few gaps that we can and will address, but on the whole the coverage seems pretty robust.

@rmarrowstone
Copy link
Author

@rmarrowstone could you pls share in the description of the PR the business motivation for this contribution?

What kind of use-cases will the users of trinodb/trino be able to cover with this contribution?

Sure! I mentioned that this replaces the Ion Hive SerDe, which is used in Athena today. But using actual Hive Formats is no longer supported in newer trino releases. I will make that clearer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hive Hive connector
Development

Successfully merging this pull request may close these issues.

4 participants