
Support automatic type coercion in Iceberg table creation #13981

Merged 1 commit into trinodb:master on Oct 10, 2023

Conversation

mdesmet
Contributor

@mdesmet mdesmet commented Sep 3, 2022

Description

The connector can specify the actual Trino types that will be used to store the table for CTAS statements. This permits the core engine to perform the necessary coercions and casts.
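
For illustration, a minimal sketch of the connector-side hook this relies on, matching the two-argument getSupportedType signature quoted later in this review. It is a sketch only, assuming the method lives in a ConnectorMetadata implementation such as IcebergMetadata; the TIME branch is an assumption based on the later discussion of time precision, and the bodies are illustrative rather than the exact PR code.

import io.trino.spi.connector.ConnectorSession;
import io.trino.spi.type.TimeType;
import io.trino.spi.type.TimestampType;
import io.trino.spi.type.TimestampWithTimeZoneType;
import io.trino.spi.type.Type;

import java.util.Optional;

import static io.trino.spi.type.TimeType.TIME_MICROS;
import static io.trino.spi.type.TimestampType.TIMESTAMP_MICROS;
import static io.trino.spi.type.TimestampWithTimeZoneType.TIMESTAMP_TZ_MICROS;

// Sketch only: the connector reports the type it will actually store, and the
// engine adds the corresponding coercion to the CREATE TABLE / CTAS plan.
@Override
public Optional<Type> getSupportedType(ConnectorSession session, Type type)
{
    if (type instanceof TimestampWithTimeZoneType) {
        // Iceberg timestamptz columns are always microsecond precision
        return Optional.of(TIMESTAMP_TZ_MICROS);
    }
    if (type instanceof TimestampType) {
        // Iceberg timestamp columns are always microsecond precision
        return Optional.of(TIMESTAMP_MICROS);
    }
    if (type instanceof TimeType) {
        // assumption: time columns are likewise stored with microsecond precision
        return Optional.of(TIME_MICROS);
    }
    // no coercion needed; the requested type is stored as-is
    return Optional.empty();
}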

Is this change a fix, improvement, new feature, refactoring, or other?

Enhancement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Change to Iceberg connector

How would you describe this change to a non-technical end user or system administrator?

Related issues, pull requests, and links

Documentation

( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Sep 3, 2022
@mdesmet mdesmet force-pushed the feature/typecoercion_iceberg_ctas branch from 6f29241 to 4993fb1 Compare September 4, 2022 09:58
martint
martint previously requested changes Sep 4, 2022
Member

@martint martint left a comment

This change has broader ranging implications around the semantics of CREATE TABLE and CREATE TABLE ... AS, and it's not as simple as adding support for precision <= 6 to the connector.

We had a discussion about this exact topic a couple of years ago, but I can't find it right now. @findepi, do you remember?

@mdesmet
Contributor Author

mdesmet commented Sep 4, 2022

This change has broader ranging implications around the semantics of CREATE TABLE and CREATE TABLE ... AS, and it's not as simple as adding support for precision <= 6 to the connector.

Agreed, I'm not saying this PR is fully done or right at this moment.

The current behaviour of CREATE TABLE is different from CTAS. For example, the user created a table with TIMESTAMP(6), so we might say that the user opted for this precision and is OK with rounding (note that I have not checked the SQL spec on this, but this is what is happening now).

create table iceberg.default.test (a TIMESTAMP(6));

insert into iceberg.default.test values TIMESTAMP '9999-12-31 23:59:59.999999999';
-- no error

select a from iceberg.default.test;
-- value is rounded

Now in the case of a CTAS statement this PR changes the behaviour to distinguish between the target table type and the query type.

The table type can be specified/overridden by the connector. For Iceberg we define the table type as TIMESTAMP(6), and the query type might have lower, the same, or higher precision.

I think what is missing is that we should throw an exception in case of loss of precision. I thought this was the goal of noTruncationCast, but it just does a plain CAST for anything that isn't VARCHAR/CHAR, which might result in precision loss that should be explicit.

As far as I can see, that is the full impact. If a connector doesn't implement metadata.getNewTableMetadata(), there shouldn't be any impact, as the table and query types are the same.
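
For illustration, a hypothetical engine-side sketch (not the actual planner code) of how the table schema can be derived from the query schema once the connector exposes its effective types. The helper name effectiveTableColumns is made up; getSupportedType matches the signature quoted later in this review, and ColumnMetadata with its builder are existing SPI classes.

import io.trino.spi.connector.ColumnMetadata;
import io.trino.spi.connector.ConnectorMetadata;
import io.trino.spi.connector.ConnectorSession;
import io.trino.spi.type.Type;

import java.util.ArrayList;
import java.util.List;

static List<ColumnMetadata> effectiveTableColumns(
        ConnectorMetadata connectorMetadata,
        ConnectorSession session,
        List<ColumnMetadata> queryColumns)
{
    List<ColumnMetadata> tableColumns = new ArrayList<>();
    for (ColumnMetadata queryColumn : queryColumns) {
        // Ask the connector for the type it will actually store; fall back to
        // the query type when the connector has no preference.
        Type tableType = connectorMetadata.getSupportedType(session, queryColumn.getType())
                .orElse(queryColumn.getType());
        // The engine plans CAST(<query column> AS tableType) whenever the two
        // differ, so the connector never has to round or truncate by itself.
        tableColumns.add(ColumnMetadata.builderFrom(queryColumn)
                .setType(tableType)
                .build());
    }
    return tableColumns;
}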

@findepi
Member

findepi commented Sep 5, 2022

We had a discussion about this exact topic a couple of years ago, but I can't find it right now. @findepi, do you remember?

@martint Yeah, I don't know where it was (Slack, PRs, or issues).

AFAIR we discussed being able to CTAS into a JDBC connector, and we decided we would allow users to export data even if we cannot create columns of the requested types (e.g. more or less precision than requested). This was the driving use case (obviously at odds with our desire to create the table schema exactly as requested), and this is how it got implemented in the JDBC connectors.

For reference, this is probably the first commit implementing this logic: 425c3b0 (part of #5342).

Now in the case of a CTAS statement this PR changes the behaviour to distinguish between the target table type and the query type.

This is a good improvement, as it allows us to differentiate between CTAS's implicit schema and CREATE TABLE's explicit schema and handle them differently, addressing the above-mentioned desire to create the table schema with types exactly as requested by the user, or fail if that is not possible.

Involving the engine in the conversion is a good change too. In 425c3b0 (and other such implementations), the connector needs to take care of CAST's rounding logic upon CTAS but doesn't have to do this upon INSERT. This is unnecessary complexity. It is also much harder for connectors like Iceberg and Hive, where the conversion would have to be implemented in the file writer layer or come at a performance cost.

Letting the connector tell the engine "hey, this is the effective table schema" during CTAS frees the connector from that logic.

I think what is missing is that we should throw an exception in case of loss of precision.

Just like in INSERT, we should round.
This is what we deliberately chose to do in JDBC connectors.

cc @losipiuk

@findepi
Member

findepi commented Sep 5, 2022

@mdesmet I'd suggest to separate Allow connectors to specify tableMetadata for CTAS into its own PR. This should come with simplification changes in the JDBC connectors, thus exercising the logic being introduced.
Once we have this, we should return to the Iceberg portion of the changes.

@mdesmet
Contributor Author

mdesmet commented Sep 5, 2022

@mdesmet I'd suggest to separate Allow connectors to specify tableMetadata for CTAS into its own PR. This should come with simplification changes in the JDBC connectors, thus exercising the logic being introduced. Once we have this, we should return to the Iceberg portion of the changes.

Created #13994

@bitsondatadev
Member

@mdesmet I'd suggest to separate Allow connectors to specify tableMetadata for CTAS into its own PR. This should come with simplification changes in the JDBC connectors, thus exercising the logic being introduced. Once we have this, we should return to the Iceberg portion of the changes.

Created #13994

@mdesmet
Any update here?

@mdesmet mdesmet force-pushed the feature/typecoercion_iceberg_ctas branch from 4993fb1 to 9ccbae3 Compare September 4, 2023 17:03
@github-actions github-actions bot added the iceberg Iceberg connector label Sep 4, 2023
@findinpath
Contributor

Build is 🔴

BaseConnectorTest.testSetColumnTypes:2938 expected [time(3)] but found [time(6)]

return Types.TimeType.get();
}
if (type.equals(TIMESTAMP_MICROS)) {
if (type instanceof TimestampType timestampType) {
if (timestampType.getPrecision() > 6) {
Contributor

Considering that the table foo has a column col of type timestamp(6), why would we allow statements like the following:

ALTER TABLE foo ALTER COLUMN col SET DATA TYPE timestamp(3)

?

Contributor Author

My understanding is that even though Iceberg internally stores time values using microsecond precision, we could still allow any precision in Trino.

However, I agree this goes beyond the scope of this PR; I will restore the existing behaviour.
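
For illustration, a hedged sketch of the mapping direction discussed here, loosely following the quoted hunk: every Trino timestamp(p) with p <= 6 maps to Iceberg's single microsecond-precision timestamp type (time is analogous via Types.TimeType.get()). The helper name toIcebergTimestamp and the guard for p > 6 are illustrative assumptions, not the exact PR code.

import io.trino.spi.TrinoException;
import io.trino.spi.type.TimestampType;
import io.trino.spi.type.Type;

import org.apache.iceberg.types.Types;

import static io.trino.spi.StandardErrorCode.NOT_SUPPORTED;

// Sketch: Iceberg has only one (microsecond) timestamp type, so precisions 0..6
// all map to it; the engine-side coercion from getSupportedType is what turns
// timestamp(0..5) columns into valid input here.
static org.apache.iceberg.types.Type toIcebergTimestamp(Type type)
{
    if (type instanceof TimestampType timestampType) {
        if (timestampType.getPrecision() > 6) {
            // illustrative guard: precision beyond microseconds cannot be stored in Iceberg
            throw new TrinoException(NOT_SUPPORTED, "Unsupported timestamp precision: " + timestampType);
        }
        return Types.TimestampType.withoutZone();
    }
    throw new IllegalArgumentException("Not a timestamp type: " + type);
}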

@findinpath
Contributor

Build is still 🔴

Still the same method is failing: TestIcebergParquetConnectorTest>BaseConnectorTest.testSetFieldTypes

@mdesmet mdesmet force-pushed the feature/typecoercion_iceberg_ctas branch from 9ccbae3 to 6bdafb9 Compare September 12, 2023 03:03
Contributor

@findinpath findinpath left a comment

Tested manually on my local machine and it works successfully.
There is still a bit of testing to be added.
Looks good overall.

@mdesmet mdesmet force-pushed the feature/typecoercion_iceberg_ctas branch from 6bdafb9 to b0efb73 Compare September 13, 2023 08:07
@mdesmet mdesmet force-pushed the feature/typecoercion_iceberg_ctas branch 2 times, most recently from 46e23d0 to 3fa3f1f Compare September 15, 2023 06:47
@findinpath
Copy link
Contributor

Build is 🔴

2023-09-15T07:41:29.8421334Z [ERROR] io.trino.plugin.iceberg.TestIcebergMinioOrcConnectorTest.testTimestampPrecisionOnCreateTableAsSelect[TIMESTAMP '1969-12-31 23:59:59.9999994', timestamp(6), TIMESTAMP '1969-12-31 23:59:59.999999'](25) -- Time elapsed: 0.113 s <<< FAILURE!
2023-09-15T07:41:29.8422428Z java.lang.AssertionError: 
2023-09-15T07:41:29.8422880Z For query 20230915_074023_07236_8ek3b: 
2023-09-15T07:41:29.8423359Z  SELECT * FROM test_coercion_show_create_table8cn1wk1x4e
2023-09-15T07:41:29.8423781Z not equal
2023-09-15T07:41:29.8424161Z Actual rows (up to 100 of 1 extra rows shown, 1 rows in total):
2023-09-15T07:41:29.8424588Z     [1970-01-01T00:00:00.999999]
2023-09-15T07:41:29.8424961Z Expected rows (up to 100 of 1 missing rows shown, 1 rows in total):
2023-09-15T07:41:29.8425505Z     [1969-12-31T23:59:59.999999]

@mdesmet mdesmet force-pushed the feature/typecoercion_iceberg_ctas branch from 3fa3f1f to 25b05fb Compare September 18, 2023 12:16
Comment on lines +170 to +177
{
if (setup.sourceValueLiteral().equals("TIMESTAMP '1969-12-31 23:59:59.999999499999'")) {
return Optional.of(setup.withNewValueLiteral("TIMESTAMP '1970-01-01 00:00:00.999999'"));
}
if (setup.sourceValueLiteral().equals("TIMESTAMP '1969-12-31 23:59:59.9999994'")) {
return Optional.of(setup.withNewValueLiteral("TIMESTAMP '1970-01-01 00:00:00.999999'"));
}
return Optional.of(setup);
Contributor Author

@findepi : Do you know if this is something ORC related?

Member

not sure i have context. this being what exactly?

Contributor Author

Providing a timestamp like TIMESTAMP '1969-12-31 23:59:59.9999994' or TIMESTAMP '1969-12-31 23:59:59.999999499999' as an input value for a CTAS statement results in an Iceberg table with the value TIMESTAMP '1970-01-01 00:00:00.999999'.

Is this expected for ORC files?

Member

this looks like a bug. i remember there were some +/-1s for at-the-epoch values in ORC, but @dain will know more

are you able to reproduce this without coercions, i.e. with direct insert of a timestamp(6) value into Iceberg?

Contributor Author

This also seems to happen in Hive:

trino:default> CREATE table r AS SELECT TIMESTAMP '1969-12-31 23:59:59.999' a;
CREATE TABLE: 1 row

Query 20230920_082205_00009_62iqv, FINISHED, 1 node
Splits: 18 total, 18 done (100,00%)
0,42 [0 rows, 0B] [0 rows/s, 0B/s]

trino:default> select * from r;
            a
-------------------------
 1970-01-01 00:00:00.999
(1 row)

Query 20230920_082212_00010_62iqv, FINISHED, 1 node
Splits: 1 total, 1 done (100,00%)
0,31 [1 rows, 259B] [3 rows/s, 849B/s]

@mdesmet
Contributor Author

mdesmet commented Sep 19, 2023

@martint: We have completed the implementation of the engine changes in #13994. Could you dismiss the requested changes on this PR, as this is now just the connector-specific implementation?

return Optional.of(setup.withNewValueLiteral("TIMESTAMP '1970-01-01 00:00:00.999999'"));
}
if (setup.sourceValueLiteral().equals("TIMESTAMP '1969-12-31 23:59:59.9999994'")) {
return Optional.of(setup.withNewValueLiteral("TIMESTAMP '1970-01-01 00:00:00.999999'"));
Contributor

Please add a comment explaining why this happens.

@martint martint dismissed their stale review September 20, 2023 12:24

Required engine changes were implemented elsewhere

@mdesmet
Contributor Author

mdesmet commented Sep 21, 2023

@findepi : PTAL

public Optional<io.trino.spi.type.Type> getSupportedType(ConnectorSession session, io.trino.spi.type.Type type)
{
if (type instanceof TimestampWithTimeZoneType) {
return Optional.of(TIMESTAMP_TZ_MICROS);
Member

Support time types with precison <= 6 in Iceberg CT and CTAS

Change the title:

  • replace "time types" with "date/time types"
  • replace "with precison <= 6" with "with precison != 6"

Member

I changed the PR title to match the commit title.

@mdesmet mdesmet changed the title Support time types with precison <= 6 in Iceberg CT and CTAS Support date/time types with precison != 6 in Iceberg CT and CTAS Oct 9, 2023
@findinpath findinpath requested a review from findepi October 9, 2023 13:13
@findepi findepi changed the title Support date/time types with precison != 6 in Iceberg CT and CTAS Support automatic type coercion in Iceberg table creation Oct 10, 2023
@findepi findepi merged commit 11391fc into trinodb:master Oct 10, 2023
@github-actions github-actions bot added this to the 429 milestone Oct 10, 2023
SemionPar added a commit to SemionPar/trino that referenced this pull request Apr 23, 2024
Similar to trinodb#13981, extend type coercion support to char type.
ebyhr pushed a commit that referenced this pull request Apr 23, 2024
Similar to #13981, extend type coercion support to char type.
Labels
cla-signed iceberg Iceberg connector
Development

Successfully merging this pull request may close these issues.

Support CTAS into Iceberg with timestamp(3)
6 participants