
Trino not able to read timestamp column represented as int96 of the iceberg parquet table migrated from Hive table written by Spark #11338

Closed
puchengy opened this issue Mar 5, 2022 · 10 comments · Fixed by #22781
Labels
bug Something isn't working iceberg Iceberg connector

Comments

@puchengy
Contributor

puchengy commented Mar 5, 2022

I found out that Trino 371 is not able to read the timestamp column of an Iceberg table generated by the Iceberg snapshot action.

This is the way to reproduce it: in spark-sql 3.2.0, create a Hive table with a timestamp column, then create an Iceberg table on top of it:

spark-sql> create table my_db.my_tbl (col timestamp);
spark-sql> insert into my_db.my_tbl values (from_unixtime(0, 'yyyy-MM-dd HH:mm:ss'));
spark-sql> CALL iceberg.system.snapshot('my_db.my_tbl', 'iceberg.my_db.my_tbl_ice');

On Trino 371, we can successfully query the Hive table; however, querying the Iceberg table with select * from iceberg.my_db.my_tbl_ice produces the following error:

java.lang.UnsupportedOperationException: io.trino.spi.type.LongTimestampWithTimeZoneType
	at io.trino.spi.type.AbstractType.writeLong(AbstractType.java:91)
	at io.trino.parquet.reader.TimestampColumnReader.readValue(TimestampColumnReader.java:51)
	at io.trino.parquet.reader.PrimitiveColumnReader.lambda$readValues$2(PrimitiveColumnReader.java:242)
	at io.trino.parquet.reader.PrimitiveColumnReader.processValues(PrimitiveColumnReader.java:298)
	at io.trino.parquet.reader.PrimitiveColumnReader.readValues(PrimitiveColumnReader.java:240)
	at io.trino.parquet.reader.PrimitiveColumnReader.readPrimitive(PrimitiveColumnReader.java:229)
	at io.trino.parquet.reader.ParquetReader.readPrimitive(ParquetReader.java:370)
	at io.trino.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:444)
	at io.trino.parquet.reader.ParquetReader.readBlock(ParquetReader.java:427)
	at io.trino.plugin.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:211)
	at io.trino.spi.block.LazyBlock$LazyData.load(LazyBlock.java:390)
	at io.trino.spi.block.LazyBlock$LazyData.getFullyLoadedBlock(LazyBlock.java:369)
	at io.trino.spi.block.LazyBlock.getLoadedBlock(LazyBlock.java:276)
	at io.trino.spi.Page.getLoadedPage(Page.java:293)
	at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:314)
	at io.trino.operator.Driver.processInternal(Driver.java:388)
	at io.trino.operator.Driver.lambda$processFor$9(Driver.java:292)
	at io.trino.operator.Driver.tryWithLock(Driver.java:693)
	at io.trino.operator.Driver.processFor(Driver.java:285)
	at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1092)
	at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
	at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:488)
	at io.trino.$gen.Trino_0a00079____20220305_060513_2.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
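For context, Spark writes Parquet timestamp columns using the legacy INT96 physical type by default, which is what triggers this code path. A minimal sketch of steering new writes away from INT96 (spark.sql.parquet.outputTimestampType is a standard Spark 3.x setting; this only affects files written after it is set, not data already on disk):

# Spark 3.x defaults spark.sql.parquet.outputTimestampType to INT96.
# Switching to TIMESTAMP_MICROS makes new writes use a standard int64
# Parquet timestamp, which avoids the INT96 reader path entirely.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")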
@findepi
Member

findepi commented Mar 5, 2022

LongTimestampWithTimeZoneType should not implement the writeLong method.

Please describe the problem you're trying to solve.

@puchengy puchengy changed the title Support writeLong() for io.trino.spi.type.LongTimestampWithTimeZoneType Trino 371 not able to read timestamp column of the iceberg table generated from iceberg snapshot procedure Mar 6, 2022
@puchengy
Contributor Author

puchengy commented Mar 6, 2022

@findepi thanks, I updated the title and the description to describe the issue and how to reproduce it.

@puchengy puchengy changed the title Trino 371 not able to read timestamp column of the iceberg table generated from iceberg snapshot procedure [Iceberg] Trino 371 not able to read timestamp column of the iceberg table generated from iceberg snapshot procedure Mar 6, 2022
@puchengy puchengy changed the title [Iceberg] Trino 371 not able to read timestamp column of the iceberg table generated from iceberg snapshot procedure [Iceberg Parquet] Trino 371 not able to read timestamp column of the iceberg table generated from iceberg snapshot procedure Mar 6, 2022
@puchengy
Contributor Author

puchengy commented Mar 7, 2022

Update:

  1. The Iceberg table created from Spark 3.2.0 uses Iceberg 0.12.0.
  2. I tried upgrading Trino's Iceberg dependencies to 0.13.1 (backporting Update iceberg to 0.13.1 #11032 to Trino 371); the issue persists.

@puchengy puchengy changed the title [Iceberg Parquet] Trino 371 not able to read timestamp column of the iceberg table generated from iceberg snapshot procedure Trino not able to read Iceberg column of type timestamptz in parquet format Mar 7, 2022
@puchengy
Contributor Author

puchengy commented Mar 8, 2022

I think this is how the issue arises:

  1. Spark 3.2.0 writes the timestamp data type into the Hive Parquet table using the Parquet INT96 physical type (see the sketch after this list for a quick way to confirm this).
  2. After the Hive table is converted to Iceberg, the timestamp column is represented as the Trino type timestamp(6) with time zone, backed by the class LongTimestampWithTimeZoneType.
  3. During reads in Trino, TimestampColumnReader is used to read the INT96 Parquet data: https://github.com/trinodb/trino/blob/master/lib%2Ftrino-parquet%2Fsrc%2Fmain%2Fjava%2Fio%2Ftrino%2Fparquet%2Freader%2FPrimitiveColumnReader.java#L123
  4. As mentioned above, this column is of Trino type LongTimestampWithTimeZoneType, which does not implement the writeLong() method, so the read fails at https://github.com/trinodb/trino/blob/master/lib%2Ftrino-parquet%2Fsrc%2Fmain%2Fjava%2Fio%2Ftrino%2Fparquet%2Freader%2FTimestampColumnReader.java#L51
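A quick way to confirm step 1 above — a minimal sketch, assuming pyarrow is installed and you have local access to one of the table's Parquet data files (the file name is a placeholder):

import pyarrow.parquet as pq

# Placeholder path: point this at one of the table's data files.
schema = pq.ParquetFile("part-00000.snappy.parquet").schema

# Print each column's physical type; the affected timestamp column
# shows up as INT96 rather than INT64.
for i, name in enumerate(schema.names):
    print(name, schema.column(i).physical_type)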

@puchengy puchengy changed the title Trino not able to read Iceberg column of type timestamptz in parquet format Trino not able to read timestamp column of the iceberg table generated from iceberg snapshot procedure Mar 8, 2022
@puchengy
Contributor Author

puchengy commented Mar 8, 2022

I see some related discussion here: apache/iceberg#1138. From that discussion, I think what we might need to do is add INT96 support for Iceberg Parquet.

@puchengy puchengy changed the title Trino not able to read timestamp column of the iceberg table generated from iceberg snapshot procedure Trino not able to read timestamp column represented as int96 of the iceberg parquet table migrated from Hive table written by Spark Mar 8, 2022
@puchengy puchengy self-assigned this Mar 11, 2022
@puchengy puchengy added the bug Something isn't working label Mar 11, 2022
@puchengy puchengy removed their assignment Aug 28, 2023
@mgmarino

Is there an update on this? This is still an issue, e.g.:

  • Given: a Hive table written with Spark/Hive legacy INT96 timestamps
  • Snapshot the Hive table to Iceberg (without a data rewrite) using e.g. the Spark procedures (see the sketch after this comment)
  • The resulting Iceberg table is queryable using Spark, but Trino fails with the error:
Query 20240125_133800_00003_rxbzi failed: Unsupported Trino column type (timestamp(6) with time zone) for Parquet column ([ts] optional int96 ts = 15)

It seems that the issue has been solved in Spark, so I assume it can also be solved in Trino (and I am happy to do so if pointed in the right direction).
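For reference, a minimal sketch of the snapshot step from the list above, using Iceberg's snapshot procedure from PySpark (catalog and table names are placeholders; assumes the Iceberg Spark runtime and an 'ice' catalog are configured):

# snapshot creates an Iceberg table over the existing Hive data files
# without rewriting them, so the INT96 timestamps stay as-is.
spark.sql(
    "CALL ice.system.snapshot(source_table => 'db.tbl', table => 'ice.db.tbl_snap')"
)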

@findepi
Member

findepi commented Jan 31, 2024

#17391 is related

cc @marcinsbd

@mgmarino

mgmarino commented Jan 31, 2024

Thanks for the heads up. It looks like that PR is simply about enabling "migration" of a Hive table with timestamp(3) (are those also written as INT96?), whereas this is about reading an already migrated or snapshotted table (migrated/snapshotted using Spark). The default is to migrate…

I realize I missed the stack trace for my case (which is somewhat different from the one posted above):

	at io.trino.parquet.reader.ColumnReaderFactory.unsupportedException(ColumnReaderFactory.java:356)
	at io.trino.parquet.reader.ColumnReaderFactory.create(ColumnReaderFactory.java:195)
	at io.trino.parquet.reader.ParquetReader.initializeColumnReaders(ParquetReader.java:509)
	at io.trino.parquet.reader.ParquetReader.advanceToNextRowGroup(ParquetReader.java:316)
	at io.trino.parquet.reader.ParquetReader.nextBatch(ParquetReader.java:269)
	at io.trino.parquet.reader.ParquetReader.nextPage(ParquetReader.java:242)
	at io.trino.plugin.hive.parquet.ParquetPageSource.getNextPage(ParquetPageSource.java:102)
	at io.trino.plugin.iceberg.IcebergPageSource.getNextPage(IcebergPageSource.java:127)
	at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:266)
	at io.trino.operator.Driver.processInternal(Driver.java:398)
	at io.trino.operator.Driver.lambda$process$8(Driver.java:301)
	at io.trino.operator.Driver.tryWithLock(Driver.java:704)
	at io.trino.operator.Driver.process(Driver.java:293)
	at io.trino.operator.Driver.processForDuration(Driver.java:264)
	at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:887)
	at io.trino.execution.executor.timesharing.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:187)
	at io.trino.execution.executor.timesharing.TimeSharingTaskExecutor$TaskRunner.run(TimeSharingTaskExecutor.java:565)
	at io.trino.$gen.Trino_437____20240131_153253_2.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)

which points to the unsupported-type check in ColumnReaderFactory.

It looks like it would work if the timestamp did not have a time zone associated with it. Interestingly, the source table also did not have a time zone, so it must have somehow been assigned as UTC during the Spark migration process. I am not sure if that could provide a workaround here until this is fixed.

@mgmarino

mgmarino commented Feb 1, 2024

Ok, realizing that timezone-agnostic INT96 timestamps should be readable led me to a workaround for this issue.

I performed the migration of the Hive table in Spark, but I assume it is probably similar in other tools as well. In this case, Spark was 3.4.1, which was important because 3.4 introduced explicit support for timestamps without time zones.

The basic steps:

  1. Grab the table to migrate, cast the timestamp columns, and save a temporary view. In our case this was valid anyway, because the underlying data was written as timezone-agnostic.
import pyspark.sql.functions as F

# "spark" is the SparkSession. Casting to timestamp_ntz gives the new
# Iceberg schema a timestamp without time zone, which Trino can read.
spark.table("db.tbl").withColumn(
    "ts",
    F.col("ts").cast("timestamp_ntz")
).createOrReplaceTempView("db_tbl_casted")
  2. Create an empty Iceberg table using the above view to define the schema (ice is the name of the Iceberg catalog here in Spark):
spark.sql("""
CREATE OR REPLACE TABLE ice.dest_db.dest_tbl
       USING ICEBERG
       -- PARTITIONED BY (..) required if the table to migrate is partitioned and you wish to preserve this
       -- LOCATION 's3://your-location' possibly required
       AS (SELECT * FROM db_tbl_casted LIMIT 0)
 """)
  3. Run the add_files procedure (note that the original table is referenced!) so the Iceberg metadata correctly references the existing data files:
spark.sql("CALL ice.system.add_files('dest_db.dest_tbl', 'db.tbl')")

Most importantly, Trino can subsequently read the migrated Iceberg table correctly and does not throw errors. (In my case, I checked that I could use the most recent version of Trino locally, as well as e.g. AWS Athena.)

I hope that helps. I'm still not sure whether Trino should support reading INT96 in the case where the schema marks it as a tz-aware timestamp.

@raunaqmorarka
Member

#22781 will allow reading INT96 timestamps as the timestamp(6) with time zone type and fix the error encountered here.
