Support automatic type coercion in Iceberg table creation #13981

Merged 1 commit on Oct 10, 2023
@@ -102,6 +102,7 @@
import io.trino.spi.statistics.TableStatisticsMetadata;
import io.trino.spi.type.LongTimestamp;
import io.trino.spi.type.LongTimestampWithTimeZone;
import io.trino.spi.type.TimeType;
import io.trino.spi.type.TimestampType;
import io.trino.spi.type.TimestampWithTimeZoneType;
import io.trino.spi.type.TypeManager;
@@ -266,6 +267,9 @@
import static io.trino.spi.type.BigintType.BIGINT;
import static io.trino.spi.type.DateTimeEncoding.unpackMillisUtc;
import static io.trino.spi.type.DateType.DATE;
import static io.trino.spi.type.TimeType.TIME_MICROS;
import static io.trino.spi.type.TimestampType.TIMESTAMP_MICROS;
import static io.trino.spi.type.TimestampWithTimeZoneType.TIMESTAMP_TZ_MICROS;
import static io.trino.spi.type.Timestamps.MICROSECONDS_PER_MILLISECOND;
import static io.trino.spi.type.UuidType.UUID;
import static java.lang.Math.floorDiv;
@@ -815,6 +819,21 @@ public Optional<ConnectorTableLayout> getNewTableLayout(ConnectorSession session
return getWriteLayout(schema, partitionSpec, false);
}

@Override
public Optional<io.trino.spi.type.Type> getSupportedType(ConnectorSession session, io.trino.spi.type.Type type)
{
if (type instanceof TimestampWithTimeZoneType) {
return Optional.of(TIMESTAMP_TZ_MICROS);
Member:

Support time types with precision <= 6 in Iceberg CT and CTAS

Change title:

  • replace "time types" with "date/time types"
  • replace "with precision <= 6" with "with precision != 6"

Member:

I changed the PR title to match the commit title.

}
if (type instanceof TimestampType) {
return Optional.of(TIMESTAMP_MICROS);
}
if (type instanceof TimeType) {
return Optional.of(TIME_MICROS);
}
return Optional.empty();
}

@Override
public ConnectorOutputTableHandle beginCreateTable(ConnectorSession session, ConnectorTableMetadata tableMetadata, Optional<ConnectorTableLayout> layout, RetryMode retryMode)
{
@@ -128,6 +128,7 @@
import static io.trino.spi.type.TimeZoneKey.getTimeZoneKey;
import static io.trino.spi.type.VarcharType.VARCHAR;
import static io.trino.sql.planner.assertions.PlanMatchPattern.node;
import static io.trino.testing.DataProviders.toDataProvider;
import static io.trino.testing.MaterializedResult.resultBuilder;
import static io.trino.testing.QueryAssertions.assertEqualsIgnoreOrder;
import static io.trino.testing.TestingConnectorSession.SESSION;
@@ -4685,7 +4686,7 @@ public void testSplitPruningForFilterOnNonPartitionColumn(DataMappingTestSetup t
verifySplitCount("SELECT row_id FROM " + tableName + " WHERE col > " + sampleValue,
(format == ORC && testSetup.getTrinoTypeName().contains("timestamp") ? 2 : expectedSplitCount));
verifySplitCount("SELECT row_id FROM " + tableName + " WHERE col < " + highValue,
- (format == ORC && testSetup.getTrinoTypeName().contains("timestamp") ? 2 : expectedSplitCount));
+ (format == ORC && testSetup.getTrinoTypeName().contains("timestamp(6)") ? 2 : expectedSplitCount));
}
}

@@ -4816,14 +4817,6 @@ protected Optional<DataMappingTestSetup> filterDataMappingSmokeTestData(DataMapp
// These types are not supported by Iceberg
return Optional.of(dataMappingTestSetup.asUnsupported());
}

- // According to Iceberg specification all time and timestamp values are stored with microsecond precision.
- if (typeName.equals("time") ||
- typeName.equals("timestamp") ||
- typeName.equals("timestamp(3) with time zone")) {
- return Optional.of(dataMappingTestSetup.asUnsupported());
- }

return Optional.of(dataMappingTestSetup);
}

@@ -6940,9 +6933,154 @@ protected void verifyTableNameLengthFailurePermissible(Throwable e)
assertThat(e).hasMessageMatching("Table name must be shorter than or equal to '128' characters but got .*");
}

@Test(dataProvider = "testTimestampPrecisionOnCreateTableAsSelect")
public void testTimestampPrecisionOnCreateTableAsSelect(TimestampPrecisionTestSetup setup)
{
try (TestTable testTable = new TestTable(
getQueryRunner()::execute,
"test_coercion_show_create_table",
format("AS SELECT %s a", setup.sourceValueLiteral))) {
assertEquals(getColumnType(testTable.getName(), "a"), setup.newColumnType);
assertQuery(
format("SELECT * FROM %s", testTable.getName()),
format("VALUES (%s)", setup.newValueLiteral));
}
}

@Test(dataProvider = "testTimestampPrecisionOnCreateTableAsSelect")
public void testTimestampPrecisionOnCreateTableAsSelectWithNoData(TimestampPrecisionTestSetup setup)
{
try (TestTable testTable = new TestTable(
getQueryRunner()::execute,
"test_coercion_show_create_table",
format("AS SELECT %s a WITH NO DATA", setup.sourceValueLiteral))) {
assertEquals(getColumnType(testTable.getName(), "a"), setup.newColumnType);
}
}

@DataProvider(name = "testTimestampPrecisionOnCreateTableAsSelect")
public Object[][] timestampPrecisionOnCreateTableAsSelectProvider()
{
return timestampPrecisionOnCreateTableAsSelectData().stream()
.map(this::filterTimestampPrecisionOnCreateTableAsSelectProvider)
.flatMap(Optional::stream)
.collect(toDataProvider());
}

protected Optional<TimestampPrecisionTestSetup> filterTimestampPrecisionOnCreateTableAsSelectProvider(TimestampPrecisionTestSetup setup)
{
return Optional.of(setup);
}

private List<TimestampPrecisionTestSetup> timestampPrecisionOnCreateTableAsSelectData()
{
return ImmutableList.<TimestampPrecisionTestSetup>builder()
.add(new TimestampPrecisionTestSetup("TIMESTAMP '1970-01-01 00:00:00'", "timestamp(6)", "TIMESTAMP '1970-01-01 00:00:00.000000'"))
.add(new TimestampPrecisionTestSetup("TIMESTAMP '1970-01-01 00:00:00.9'", "timestamp(6)", "TIMESTAMP '1970-01-01 00:00:00.900000'"))
.add(new TimestampPrecisionTestSetup("TIMESTAMP '1970-01-01 00:00:00.56'", "timestamp(6)", "TIMESTAMP '1970-01-01 00:00:00.560000'"))
.add(new TimestampPrecisionTestSetup("TIMESTAMP '1970-01-01 00:00:00.123'", "timestamp(6)", "TIMESTAMP '1970-01-01 00:00:00.123000'"))
.add(new TimestampPrecisionTestSetup("TIMESTAMP '1970-01-01 00:00:00.4896'", "timestamp(6)", "TIMESTAMP '1970-01-01 00:00:00.489600'"))
.add(new TimestampPrecisionTestSetup("TIMESTAMP '1970-01-01 00:00:00.89356'", "timestamp(6)", "TIMESTAMP '1970-01-01 00:00:00.893560'"))
.add(new TimestampPrecisionTestSetup("TIMESTAMP '1970-01-01 00:00:00.123000'", "timestamp(6)", "TIMESTAMP '1970-01-01 00:00:00.123000'"))
.add(new TimestampPrecisionTestSetup("TIMESTAMP '1970-01-01 00:00:00.999'", "timestamp(6)", "TIMESTAMP '1970-01-01 00:00:00.999000'"))
.add(new TimestampPrecisionTestSetup("TIMESTAMP '1970-01-01 00:00:00.123456'", "timestamp(6)", "TIMESTAMP '1970-01-01 00:00:00.123456'"))
.add(new TimestampPrecisionTestSetup("TIMESTAMP '2020-09-27 12:34:56.1'", "timestamp(6)", "TIMESTAMP '2020-09-27 12:34:56.100000'"))
.add(new TimestampPrecisionTestSetup("TIMESTAMP '2020-09-27 12:34:56.9'", "timestamp(6)", "TIMESTAMP '2020-09-27 12:34:56.900000'"))
.add(new TimestampPrecisionTestSetup("TIMESTAMP '2020-09-27 12:34:56.123'", "timestamp(6)", "TIMESTAMP '2020-09-27 12:34:56.123000'"))
.add(new TimestampPrecisionTestSetup("TIMESTAMP '2020-09-27 12:34:56.123000'", "timestamp(6)", "TIMESTAMP '2020-09-27 12:34:56.123000'"))
.add(new TimestampPrecisionTestSetup("TIMESTAMP '2020-09-27 12:34:56.999'", "timestamp(6)", "TIMESTAMP '2020-09-27 12:34:56.999000'"))
.add(new TimestampPrecisionTestSetup("TIMESTAMP '2020-09-27 12:34:56.123456'", "timestamp(6)", "TIMESTAMP '2020-09-27 12:34:56.123456'"))
.add(new TimestampPrecisionTestSetup("TIMESTAMP '1970-01-01 00:00:00.1234561'", "timestamp(6)", "TIMESTAMP '1970-01-01 00:00:00.123456'"))
.add(new TimestampPrecisionTestSetup("TIMESTAMP '1970-01-01 00:00:00.123456499'", "timestamp(6)", "TIMESTAMP '1970-01-01 00:00:00.123456'"))
.add(new TimestampPrecisionTestSetup("TIMESTAMP '1970-01-01 00:00:00.123456499999'", "timestamp(6)", "TIMESTAMP '1970-01-01 00:00:00.123456'"))
.add(new TimestampPrecisionTestSetup("TIMESTAMP '1970-01-01 00:00:00.1234565'", "timestamp(6)", "TIMESTAMP '1970-01-01 00:00:00.123457'"))
.add(new TimestampPrecisionTestSetup("TIMESTAMP '1970-01-01 00:00:00.111222333444'", "timestamp(6)", "TIMESTAMP '1970-01-01 00:00:00.111222'"))
.add(new TimestampPrecisionTestSetup("TIMESTAMP '1970-01-01 00:00:00.9999995'", "timestamp(6)", "TIMESTAMP '1970-01-01 00:00:01.000000'"))
.add(new TimestampPrecisionTestSetup("TIMESTAMP '1970-01-01 23:59:59.9999995'", "timestamp(6)", "TIMESTAMP '1970-01-02 00:00:00.000000'"))
.add(new TimestampPrecisionTestSetup("TIMESTAMP '1969-12-31 23:59:59.9999995'", "timestamp(6)", "TIMESTAMP '1970-01-01 00:00:00.000000'"))
.add(new TimestampPrecisionTestSetup("TIMESTAMP '1969-12-31 23:59:59.999999499999'", "timestamp(6)", "TIMESTAMP '1969-12-31 23:59:59.999999'"))
.add(new TimestampPrecisionTestSetup("TIMESTAMP '1969-12-31 23:59:59.9999994'", "timestamp(6)", "TIMESTAMP '1969-12-31 23:59:59.999999'"))
.build();
}

public record TimestampPrecisionTestSetup(String sourceValueLiteral, String newColumnType, String newValueLiteral)
{
public TimestampPrecisionTestSetup
{
requireNonNull(sourceValueLiteral, "sourceValueLiteral is null");
requireNonNull(newColumnType, "newColumnType is null");
requireNonNull(newValueLiteral, "newValueLiteral is null");
}

public TimestampPrecisionTestSetup withNewValueLiteral(String newValueLiteral)
{
return new TimestampPrecisionTestSetup(sourceValueLiteral, newColumnType, newValueLiteral);
}
}

@Test(dataProvider = "testTimePrecisionOnCreateTableAsSelect")
public void testTimePrecisionOnCreateTableAsSelect(String inputType, String tableType, String tableValue)
{
try (TestTable testTable = new TestTable(
getQueryRunner()::execute,
"test_coercion_show_create_table",
format("AS SELECT %s a", inputType))) {
assertEquals(getColumnType(testTable.getName(), "a"), tableType);
assertQuery(
format("SELECT * FROM %s", testTable.getName()),
format("VALUES (%s)", tableValue));
}
}

@Test(dataProvider = "testTimePrecisionOnCreateTableAsSelect")
public void testTimePrecisionOnCreateTableAsSelectWithNoData(String inputType, String tableType, String ignored)
{
try (TestTable testTable = new TestTable(
getQueryRunner()::execute,
"test_coercion_show_create_table",
format("AS SELECT %s a WITH NO DATA", inputType))) {
assertEquals(getColumnType(testTable.getName(), "a"), tableType);
}
}

@DataProvider(name = "testTimePrecisionOnCreateTableAsSelect")
public static Object[][] timePrecisionOnCreateTableAsSelectProvider()
{
return new Object[][] {
{"TIME '00:00:00'", "time(6)", "TIME '00:00:00.000000'"},
{"TIME '00:00:00.9'", "time(6)", "TIME '00:00:00.900000'"},
{"TIME '00:00:00.56'", "time(6)", "TIME '00:00:00.560000'"},
{"TIME '00:00:00.123'", "time(6)", "TIME '00:00:00.123000'"},
{"TIME '00:00:00.4896'", "time(6)", "TIME '00:00:00.489600'"},
{"TIME '00:00:00.89356'", "time(6)", "TIME '00:00:00.893560'"},
{"TIME '00:00:00.123000'", "time(6)", "TIME '00:00:00.123000'"},
{"TIME '00:00:00.999'", "time(6)", "TIME '00:00:00.999000'"},
{"TIME '00:00:00.123456'", "time(6)", "TIME '00:00:00.123456'"},
{"TIME '12:34:56.1'", "time(6)", "TIME '12:34:56.100000'"},
{"TIME '12:34:56.9'", "time(6)", "TIME '12:34:56.900000'"},
{"TIME '12:34:56.123'", "time(6)", "TIME '12:34:56.123000'"},
{"TIME '12:34:56.123000'", "time(6)", "TIME '12:34:56.123000'"},
{"TIME '12:34:56.999'", "time(6)", "TIME '12:34:56.999000'"},
{"TIME '12:34:56.123456'", "time(6)", "TIME '12:34:56.123456'"},
{"TIME '00:00:00.1234561'", "time(6)", "TIME '00:00:00.123456'"},
{"TIME '00:00:00.123456499'", "time(6)", "TIME '00:00:00.123456'"},
{"TIME '00:00:00.123456499999'", "time(6)", "TIME '00:00:00.123456'"},
{"TIME '00:00:00.1234565'", "time(6)", "TIME '00:00:00.123457'"},
{"TIME '00:00:00.111222333444'", "time(6)", "TIME '00:00:00.111222'"},
{"TIME '00:00:00.9999995'", "time(6)", "TIME '00:00:01.000000'"},
{"TIME '23:59:59.9999995'", "time(6)", "TIME '00:00:00.000000'"},
{"TIME '23:59:59.9999995'", "time(6)", "TIME '00:00:00.000000'"},
{"TIME '23:59:59.999999499999'", "time(6)", "TIME '23:59:59.999999'"},
{"TIME '23:59:59.9999994'", "time(6)", "TIME '23:59:59.999999'"}};
}
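
The expected values in the two data providers above are consistent with rounding the fractional second half-up to six digits, with TIME values wrapping around at midnight. The following is a minimal sketch of that arithmetic only. `MicrosRoundingSketch` and `roundTime` are hypothetical names used for illustration, and this is not how Trino implements the coercion.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Locale;

class MicrosRoundingSketch
{
    private static final long MICROS_PER_SECOND = 1_000_000L;
    private static final long MICROS_PER_DAY = 24L * 60 * 60 * MICROS_PER_SECOND;

    // Hypothetical helper (not Trino code): rounds the fractional-second digits
    // half-up to microseconds and wraps at midnight, as a TIME value would.
    static String roundTime(int hour, int minute, int second, String fractionDigits)
    {
        BigDecimal fraction = fractionDigits.isEmpty()
                ? BigDecimal.ZERO
                : new BigDecimal("0." + fractionDigits);
        // 0.9999995 rounds to 1.000000, whose unscaled value is 1_000_000 micros
        long micros = fraction.setScale(6, RoundingMode.HALF_UP).unscaledValue().longValueExact();
        long microsOfDay = ((hour * 60L + minute) * 60L + second) * MICROS_PER_SECOND + micros;
        // A carry past 23:59:59.999999 wraps to 00:00:00.000000
        microsOfDay %= MICROS_PER_DAY;

        long secondsOfDay = microsOfDay / MICROS_PER_SECOND;
        return String.format(Locale.ROOT, "%02d:%02d:%02d.%06d",
                secondsOfDay / 3600,
                (secondsOfDay / 60) % 60,
                secondsOfDay % 60,
                microsOfDay % MICROS_PER_SECOND);
    }

    public static void main(String[] args)
    {
        System.out.println(roundTime(0, 0, 0, "1234565"));      // 00:00:00.123457
        System.out.println(roundTime(0, 0, 0, "123456499999")); // 00:00:00.123456
        System.out.println(roundTime(23, 59, 59, "9999995"));   // 00:00:00.000000
    }
}
```

The three calls in `main` mirror rows from `timePrecisionOnCreateTableAsSelectProvider`; the timestamp rows follow the same rounding but carry into the date instead of wrapping.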

@Override
protected Optional<SetColumnTypeSetup> filterSetColumnTypesDataProvider(SetColumnTypeSetup setup)
{
if (setup.sourceColumnType().equals("timestamp(3) with time zone")) {
// The connector returns UTC instead of the given time zone
return Optional.of(setup.withNewValueLiteral("TIMESTAMP '2020-02-12 14:03:00.123000 +00:00'"));
}
switch ("%s -> %s".formatted(setup.sourceColumnType(), setup.newColumnType())) {
case "bigint -> integer":
case "decimal(5,3) -> decimal(5,2)":
@@ -26,6 +26,7 @@
import java.nio.file.Files;
import java.util.List;
import java.util.Map;
import java.util.Optional;

import static com.google.common.base.Preconditions.checkArgument;
import static com.google.common.io.Resources.getResource;
@@ -163,4 +164,16 @@ public void testDropAmbiguousRowFieldCaseSensitivity()
.hasMessageContaining("Error opening Iceberg split")
.hasStackTraceContaining("Multiple entries with same key");
}

@Override
protected Optional<TimestampPrecisionTestSetup> filterTimestampPrecisionOnCreateTableAsSelectProvider(TimestampPrecisionTestSetup setup)
{
if (setup.sourceValueLiteral().equals("TIMESTAMP '1969-12-31 23:59:59.999999499999'")) {
return Optional.of(setup.withNewValueLiteral("TIMESTAMP '1970-01-01 00:00:00.999999'"));
}
if (setup.sourceValueLiteral().equals("TIMESTAMP '1969-12-31 23:59:59.9999994'")) {
return Optional.of(setup.withNewValueLiteral("TIMESTAMP '1970-01-01 00:00:00.999999'"));
Contributor:

Please add a comment explaining why this happens.

}
return Optional.of(setup);
Comment on lines +170 to +177
Contributor Author:

@findepi: Do you know if this is something ORC-related?

Member:

Not sure I have context. This being what exactly?

Contributor Author:

Providing a timestamp such as TIMESTAMP '1969-12-31 23:59:59.9999994' or TIMESTAMP '1969-12-31 23:59:59.999999499999' as the input value of a CTAS statement results in an Iceberg table containing the value TIMESTAMP '1970-01-01 00:00:00.999999'.

Is this expected for ORC files?

Member:

This looks like a bug. I remember there were some +/-1s issues for at-the-epoch values in ORC, but @dain will know more.

Are you able to reproduce this without coercions, i.e. with a direct insert of a timestamp(6) value into Iceberg?

Contributor Author:

This also seems to occur in Hive:

trino:default> CREATE table r AS SELECT TIMESTAMP '1969-12-31 23:59:59.999' a;
CREATE TABLE: 1 row

Query 20230920_082205_00009_62iqv, FINISHED, 1 node
Splits: 18 total, 18 done (100,00%)
0,42 [0 rows, 0B] [0 rows/s, 0B/s]

trino:default> select * from r;
            a
-------------------------
 1970-01-01 00:00:00.999
(1 row)

Query 20230920_082212_00010_62iqv, FINISHED, 1 node
Splits: 1 total, 1 done (100,00%)
0,31 [1 rows, 259B] [3 rows/s, 849B/s]
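
The thread does not establish the cause of this shift. As an illustration only, here is one common way a one-second jump can appear for values just before the epoch: the seconds part is obtained by division that truncates toward zero while the sub-second part is kept non-negative. `EpochSplitSketch` and its methods are hypothetical names, and nothing here is taken from the ORC or Trino code.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

class EpochSplitSketch
{
    private static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss.SSS");

    // Renders a (seconds, millis-of-second) pair as a UTC timestamp string
    static String render(long epochSeconds, long millisOfSecond)
    {
        int nanos = (int) (millisOfSecond * 1_000_000L);
        return LocalDateTime.ofEpochSecond(epochSeconds, nanos, ZoneOffset.UTC).format(FORMAT);
    }

    // Consistent split: both parts use floor semantics
    static String splitWithFloor(long epochMillis)
    {
        return render(Math.floorDiv(epochMillis, 1000L), Math.floorMod(epochMillis, 1000L));
    }

    // Inconsistent split (assumed for illustration): seconds truncate toward zero,
    // but the sub-second part is still the non-negative remainder
    static String splitWithTruncation(long epochMillis)
    {
        return render(epochMillis / 1000L, Math.floorMod(epochMillis, 1000L));
    }

    public static void main(String[] args)
    {
        long epochMillis = -1L; // one millisecond before the epoch
        System.out.println(splitWithFloor(epochMillis));      // 1969-12-31 23:59:59.999
        System.out.println(splitWithTruncation(epochMillis)); // 1970-01-01 00:00:00.999
    }
}
```

The second line printed matches the value returned by the Hive query above, which is why a truncating split is a plausible candidate to check; it is not a confirmed diagnosis.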

}
}
@@ -45,7 +45,9 @@ protected boolean supportsIcebergFileStatistics(String typeName)
protected boolean supportsRowGroupStatistics(String typeName)
{
return !(typeName.equalsIgnoreCase("varbinary") ||
typeName.equalsIgnoreCase("time") ||
typeName.equalsIgnoreCase("time(6)") ||
typeName.equalsIgnoreCase("timestamp(3) with time zone") ||
typeName.equalsIgnoreCase("timestamp(6) with time zone"));
}
