Support changing column types in Hive connector #15938
Conversation
Force-pushed from bddf5dc to 42db233
if (block.isNull(i)) {
    isNull[i] = true;
}
else {
    values[i] = block.getByte(i, 0);
}
I suspect this approach is going to be slower because of the additional null check. You can try adding a JMH benchmark like BenchmarkOrcDecimalReader and compare the performance of the two approaches.
We're also not preserving the mayHaveNull of the original block; that should definitely be done.
We did these types of adaptations in the column readers rather than the page source in the Parquet reader as well.
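A minimal sketch of what a tinyint-to-integer copy that preserves mayHaveNull could look like, assuming the adaptation keeps the shape of the loop above (the method name and surrounding code are illustrative, not the PR's code):

private static Block toIntBlock(Block block)
{
    int positionCount = block.getPositionCount();
    int[] values = new int[positionCount];
    if (!block.mayHaveNull()) {
        // fast path: no null mask to allocate and no per-position null check
        for (int i = 0; i < positionCount; i++) {
            values[i] = block.getByte(i, 0);
        }
        return new IntArrayBlock(positionCount, Optional.empty(), values);
    }
    boolean[] isNull = new boolean[positionCount];
    for (int i = 0; i < positionCount; i++) {
        isNull[i] = block.isNull(i);
        if (!isNull[i]) {
            values[i] = block.getByte(i, 0);
        }
    }
    return new IntArrayBlock(positionCount, Optional.of(isNull), values);
}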
But won't having all the adaptations in the column reader be a bit messy? Or could we have adaptations that can be reused by both the Parquet and ORC file formats?
Also, some coercions are supported by Hive but could be restricted by Iceberg or Delta (not sure) - how would we handle those cases?
The primitive types of ORC and Parquet are different, so the required adaptations won't be the same (e.g. there is no byte/short in Parquet, only int32/int64).
We could create a function in ByteColumnReader to avoid repeating code like the below:
if (type == TINYINT) {
    return new ByteArrayBlock(nextBatchSize, Optional.empty(), values);
}
if (type == INTEGER) {
    return new IntArrayBlock(nextBatchSize, Optional.empty(), convertToIntArray(values));
}
throw new VerifyError("Unsupported type " + type);
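For instance, the repeated dispatch could live in one small helper (a sketch only; buildBlock is a made-up name, while type, nextBatchSize, and convertToIntArray are the reader members used in the snippet above):

private Block buildBlock(Optional<boolean[]> isNull, byte[] values)
{
    // single place that maps the reader's byte data to the requested Trino type
    if (type == TINYINT) {
        return new ByteArrayBlock(nextBatchSize, isNull, values);
    }
    if (type == INTEGER) {
        return new IntArrayBlock(nextBatchSize, isNull, convertToIntArray(values));
    }
    throw new VerifyError("Unsupported type " + type);
}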
These types of column adaptations done in the reader are simple ones meant to resolve minor differences between the file format type and the Trino type (e.g. differing byte counts of numbers, differences in precision of decimals, timestamps, etc.). These should make sense for any connector. Anything complex, like reading a number in the file format as a varchar Trino type, which a connector may or may not want to do, can be done in the connector.
@@ -65,9 +64,6 @@ public static ColumnReader createColumnReader(
    case BOOLEAN:
        return new BooleanColumnReader(type, column, memoryContext.newLocalMemoryContext(ColumnReaders.class.getSimpleName()));
    case BYTE:
        if (type == INTEGER && !column.getAttributes().containsKey("iceberg.id")) {
            throw invalidStreamType(column, type);
Don't require iceberg.id attribute in ORC reader
If I am understanding the change correctly, it could be titled "Fix reading ORC files after column evolved from tinyint to integer"
Am I reading this right?
Sorry for my misleading commit title. Updated it.
Thanks for changing the commit title.
Is TestHiveTransactionalTable.java the only test coverage we have for this "Fix reading ORC files after column evolved from tinyint to integer" change?
@@ -24,7 +24,6 @@
import io.trino.orc.stream.InputStreamSources;
import io.trino.spi.block.Block;
import io.trino.spi.block.ByteArrayBlock;
import io.trino.spi.block.IntArrayBlock;
Convert ORC block within ColumnAdaptation
The commit message bears no rationale. In fact, what's the benefit of this change?
@@ -77,7 +79,7 @@ public ByteColumnReader(Type type, OrcColumn column, LocalMemoryContext memoryCo
        throws OrcCorruptionException
{
    this.type = requireNonNull(type, "type is null");
    verifyStreamType(column, type, t -> t == TINYINT || t == INTEGER);
    verifyStreamType(column, type, t -> t == TINYINT || t == SMALLINT || t == INTEGER || t == BIGINT);
I thought the purpose of "Convert ORC block within ColumnAdaptation" was to have the readers simplified, so I'm surprised to see this change here.
I think I like the non-composed (without column adaptations) approach better though, so no changes requested here.
boolean[] isNull = new boolean[block.getPositionCount()];
short[] values = new short[block.getPositionCount()];
for (int i = 0; i < block.getPositionCount(); i++) {
    if (block.isNull(i)) {
If this had to be of top performance, it would probably want a simpler loop when !block.mayHaveNull().
No change requested, since I don't know the perf expectations.
BTW, would it work to avoid the if here?
isNull[i] = block.isNull(i);
values[i] = block.getByte(i, 0);
cc @dain @sopel39 @raunaqmorarka, who may know if it is legit
I don't get what was gained by adding abstractions at the page source layer for these column adaptations rather than letting the column reader handle it as it does today.
My expectation is that there should be no perf regression from the existing approach (demonstrated through JMH results).
Preserving the mayHaveNull of the original block is definitely required; downstream operators take advantage of that to improve performance.
Avoiding the branch in the above way looks okay to me; however, the contract in Block doesn't explicitly specify any expected behaviour for getters on null positions. So I don't know if that can be safely relied on, or if in the future it may change to throw an exception.
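A rough JMH sketch of the comparison being asked for (class, field, and method names are made up; real benchmarks such as BenchmarkOrcDecimalReader have more elaborate setup, @Fork/@Warmup tuning, and varying null densities):

import java.util.Optional;
import java.util.Random;

import io.trino.spi.block.Block;
import io.trino.spi.block.ByteArrayBlock;
import io.trino.spi.block.IntArrayBlock;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class BenchmarkByteToIntAdaptation
{
    private Block block;

    @Setup
    public void setup()
    {
        byte[] bytes = new byte[1024];
        new Random(42).nextBytes(bytes);
        block = new ByteArrayBlock(bytes.length, Optional.empty(), bytes);
    }

    @Benchmark
    public Block copyWithNullCheck()
    {
        // the PR's approach: branch on isNull for every position
        int positions = block.getPositionCount();
        boolean[] isNull = new boolean[positions];
        int[] values = new int[positions];
        for (int i = 0; i < positions; i++) {
            if (block.isNull(i)) {
                isNull[i] = true;
            }
            else {
                values[i] = block.getByte(i, 0);
            }
        }
        return new IntArrayBlock(positions, Optional.of(isNull), values);
    }

    @Benchmark
    public Block copyWithoutBranch()
    {
        // the branchless variant discussed above; relies on getByte being callable
        // on null positions, which the Block contract does not explicitly guarantee
        int positions = block.getPositionCount();
        boolean[] isNull = new boolean[positions];
        int[] values = new int[positions];
        for (int i = 0; i < positions; i++) {
            isNull[i] = block.isNull(i);
            values[i] = block.getByte(i, 0);
        }
        return new IntArrayBlock(positions, Optional.of(isNull), values);
    }
}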
@@ -2250,7 +2250,10 @@ private List<SetColumnTypeSetup> setColumnTypeSetupData()
{
    return ImmutableList.<SetColumnTypeSetup>builder()
            .add(new SetColumnTypeSetup("tinyint", "TINYINT '127'", "smallint"))
            .add(new SetColumnTypeSetup("tinyint", "TINYINT '127'", "integer"))
This probably belongs to "Support type evolution from tinyint to smallint and bigint in ORC" (currently in "Add test cases for changing numeric column types")
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java (outdated, resolved)
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java (outdated, resolved)
plugin/trino-hive/src/test/java/io/trino/plugin/hive/BaseHiveConnectorTest.java (outdated, resolved)
plugin/trino-hive/src/test/java/io/trino/plugin/hive/BaseHiveConnectorTest.java (outdated, resolved)
plugin/trino-hive/src/test/java/io/trino/plugin/hive/BaseTestHiveOnDataLake.java (outdated, resolved)
...ve/src/test/java/io/trino/plugin/hive/metastore/glue/TestHiveGlueMetastoreCompatibility.java (outdated, resolved)
Force-pushed from 42db233 to b8d8f72
Force-pushed from b8d8f72 to 65d6cb1
Force-pushed from 65d6cb1 to 45c4adc
Rebased on master to resolve conflicts and added test cases for storage formats.
if (storageFormat != HiveStorageFormat.ORC && storageFormat != HiveStorageFormat.PARQUET) {
    throw new TrinoException(NOT_SUPPORTED, "Unsupported storage format for changing column type: " + storageFormat);
Is the check effective?
Can a partitioned table have partitions with different file formats?
BTW, it looks like we should be able to lift the limitation for e.g. TEXTFILE easily.
Please add a code comment explaining the logic: why ORC/Parquet only, and what our stance is on partitioned tables.
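For illustration only, something along these lines; the reasoning in the comment is my reading of this thread, not wording from the PR:

// Only the ORC and Parquet readers currently coerce the stored type to the new declared
// type on read, so other formats (e.g. TEXTFILE) are rejected until they get equivalent
// coercion support. Note this checks the table-level format only; partitions written
// with a different format are not validated individually here.
if (storageFormat != HiveStorageFormat.ORC && storageFormat != HiveStorageFormat.PARQUET) {
    throw new TrinoException(NOT_SUPPORTED, "Unsupported storage format for changing column type: " + storageFormat);
}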
ImmutableList.Builder<SetColumnTypeSetup> setup = ImmutableList.builder();
for (HiveStorageFormat storageFormat : HiveStorageFormat.values()) {
    if (storageFormat == REGEX) {
        // REGEX format is read-only
"Cannot prepare test data with REGEX table"
setup.addAll(super.setColumnTypeSetupData().stream()
        .map(data -> data.withTableProperty("format = '%s'".formatted(storageFormat)))
This is quite a few test cases.
For AVRO, RCBINARY, RCTEXT, SEQUENCEFILE, JSON, TEXTFILE, CSV, [REGEX] it should be enough to have one test case each, e.g. changing from integer to bigint and checking that the message is "Unsupported storage format for changing column type"
(then you can simplify the pattern in verifySetColumnTypeFailurePermissible).
@@ -2442,7 +2442,7 @@ private List<SetColumnTypeSetup> setColumnTypeSetupData()
        .build();
}

public record SetColumnTypeSetup(String sourceColumnType, String sourceValueLiteral, String newColumnType, String newValueLiteral, boolean unsupportedType)
public record SetColumnTypeSetup(String sourceColumnType, String sourceValueLiteral, String newColumnType, String newValueLiteral, boolean unsupportedType, Optional<String> tableProperty)
Optional<String> tableProperty -> String tableProperties
(an empty string means "no properties"; no need to wrap it in an Optional)
public SetColumnTypeSetup withTableProperty(String tableProperty)
{
    return new SetColumnTypeSetup(sourceColumnType, sourceValueLiteral, newColumnType, newValueLiteral, unsupportedType, Optional.of("WITH (%s)".formatted(tableProperty)));
withTableProperty(foo) should be equivalent to instantiating new SetColumnTypeSetup(..., foo); currently the former wraps the value with WITH (..).
Move the SQL formatting logic to where it is used (I'll add a separate comment there).
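A sketch of the suggested shape, assuming the record's last component becomes a plain String tableProperties as proposed above:

public SetColumnTypeSetup withTableProperty(String tableProperty)
{
    // pass the property through untouched; the "WITH (...)" SQL wrapping moves to the call site
    return new SetColumnTypeSetup(sourceColumnType, sourceValueLiteral, newColumnType, newValueLiteral, unsupportedType, tableProperty);
}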
@@ -2365,7 +2365,7 @@ public void testSetColumnTypes(SetColumnTypeSetup setup)

TestTable table;
try {
    table = new TestTable(getQueryRunner()::execute, "test_set_column_type_", " AS SELECT CAST(" + setup.sourceValueLiteral + " AS " + setup.sourceColumnType + ") AS col");
    table = new TestTable(getQueryRunner()::execute, "test_set_column_type_", setup.tableProperty.orElse("") + " AS SELECT CAST(" + setup.sourceValueLiteral + " AS " + setup.sourceColumnType + ") AS col");
String tableConfiguration = "";
if (!setup.tableProperties.isEmpty()) {
    tableConfiguration += " WITH (%s)".formatted(setup.tableProperties);
}
... new TestTable(..., tableConfiguration + " AS SELECT
You must use Hive to gather table statistics with ``ANALYZE TABLE COMPUTE STATISTICS`` after table creation.
pre-existing, but since you're changing this line.... see #15637 (comment)
Sent #16376
public void setColumnType(ConnectorSession session, ConnectorTableHandle tableHandle, ColumnHandle columnHandle, Type type)
{
    HiveTableHandle table = (HiveTableHandle) tableHandle;
    failIfAvroSchemaIsSet(table);
We should fail for CSV files too (CSV is all varchar).
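Something like the following could sit next to failIfAvroSchemaIsSet (a sketch only; the getTableStorageFormat helper is hypothetical, and whatever resolves the table's format would do):

// CSV tables store every column as varchar, so changing the declared type would not
// change how the data is read; reject it like the Avro schema case above.
if (getTableStorageFormat(table) == HiveStorageFormat.CSV) {
    throw new TrinoException(NOT_SUPPORTED, "Changing column types is not supported for CSV tables");
}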
table.getPartitionNames().ifPresent(partitionNames -> {
    if (partitionNames.contains(column.getName())) {
        throw new TrinoException(NOT_SUPPORTED, "Changing partition column types is not supported");
Partition names = names of partitions != names of partitioning columns
if (sourceType instanceof VarcharType || sourceType instanceof CharType) {
    return targetType instanceof VarcharType || targetType instanceof CharType;
Is truncation allowed?
Add a comment
{
    return fields.stream()
            .filter(field -> field.getName().orElseThrow().equals(fieldName))
            .findAny();
We expect at most one; I guess we should fail when there are duplicates:
.collect(toOptional());
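That is, something like this, using Guava's MoreCollectors.toOptional(), which throws when more than one field matches instead of silently picking one (the stream shape follows the snippet above):

return fields.stream()
        .filter(field -> field.getName().orElseThrow().equals(fieldName))
        // import static com.google.common.collect.MoreCollectors.toOptional;
        .collect(toOptional());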
{
    Table oldTable = getExistingTable(databaseName, tableName);
    if (oldTable.getPartitionColumns().stream().anyMatch(column -> column.getName().equals(columnName))) {
        throw new TrinoException(NOT_SUPPORTED, "Changing partition column types is not supported");
Does this check buy us anything wrt the check already done in HiveMetadata?
io.trino.hive.thrift.metastore.Table table = delegate.getTable(databaseName, tableName)
        .orElseThrow(() -> new TableNotFoundException(new SchemaTableName(databaseName, tableName)));
if (table.getPartitionKeys().stream().anyMatch(column -> column.getName().equals(columnName))) {
    throw new TrinoException(NOT_SUPPORTED, "Changing partition column types is not supported");
Does this check buy us anything wrt the check already done in HiveMetadata?
assertThat(onHive("SELECT * FROM " + hiveTableName))
        .containsPattern("[ |]+123[ |]+");
onHive returns String output from a CLI; this isn't really meant to run test queries.
Let's use the product tests setup for that, where we have Hive JDBC and we can run queries normally, without processing textual output of some foreign CLI tool.
* See https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/credentials.html#credentials-default
* on ways to set your AWS credentials which will be needed to run this test.
*/
public class TestHiveGlueMetastoreCompatibility
Is there an existing test class which we could add the test to?
It would be nice to avoid a minute-long setup just to run a second-long test case.
I can't find an existing test class except for TestHiveGlueMetastore extending AbstractTestHiveLocal. Do you want me to use that class instead? I avoided it because I prefer query-based tests.
The field isn't required in Thrift Hive metastore.
Force-pushed from 45c4adc to ec89b8e
// Hive changes the column definition in each partition unless the ALTER TABLE statement contains a partition condition
// Trino doesn't support specifying partitions in ALTER TABLE, so SET DATA TYPE updates all partitions
// https://cwiki.apache.org/confluence/display/hive/languagemanual+ddl#LanguageManualDDL-AlterPartition
metastore.alterPartitions(table.getSchemaName(), table.getTableName(), partitions, OptionalLong.empty());
I am surprised.
First, I wouldn't expect existing partitions to be updated at all.
Second, it will make it impossible to update the type in large tables.
Let's talk more about this.
Closing because of low demand.
@ebyhr Hello!
Description
Support changing column types in Hive connector
Release notes
(x) Release notes are required, with the following suggested text: