Trino not able to read timestamp column represented as int96 of the iceberg parquet table migrated from Hive table written by Spark #11338
LongTimestampWithTimeZoneType should not implement the writeLong method. Please describe the problem you're trying to solve.
@findepi thanks, I updated the title as well as the description to describe the issue and the way to reproduce it.
Update: the column is `timestamptz` in parquet format.

I think this is how the issue is generated: the Iceberg schema declares the column as `timestamptz` in parquet format, while the underlying data files written by Spark store it as INT96.
I see some related discussion here: apache/iceberg#1138. From the discussion, I think what we might need to do is to add INT96 support for Iceberg parquet.
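For anyone unfamiliar with the encoding, this is roughly what "INT96 support" means: each value is 12 bytes holding nanoseconds-of-day plus a Julian day number. A minimal, illustrative sketch in Python (this is not Trino's or Iceberg's actual reader code):

```python
import struct
from datetime import datetime, timedelta, timezone

JULIAN_DAY_OF_EPOCH = 2440588  # Julian day number of 1970-01-01

def decode_int96_timestamp(raw: bytes) -> datetime:
    """Decode a 12-byte Parquet INT96 timestamp.

    Little-endian layout: 8 bytes of nanoseconds within the day,
    followed by a 4-byte Julian day number.
    """
    nanos_of_day, julian_day = struct.unpack("<qI", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_EPOCH
    return (datetime(1970, 1, 1, tzinfo=timezone.utc)
            + timedelta(days=days_since_epoch)
            + timedelta(microseconds=nanos_of_day // 1000))

# Example: midnight on the Julian day for 2020-01-01
raw = struct.pack("<qI", 0, 2458850)
print(decode_int96_timestamp(raw))  # 2020-01-01 00:00:00+00:00
```

Note that INT96 carries no zone information, which is part of why mapping it onto Iceberg's `timestamptz` is awkward in the first place.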
Is there an update on this? This is still an issue.

It seems that the issue has been solved in Spark, so I assume it can also be solved in Trino (and I am happy to do so if I am pointed in the right direction).
#17391 is related. cc @marcinsbd
Thanks for the heads up. It looks like that PR is simply about enabling "migration" of a Hive table with timestamp columns.

I realize I missed the stack trace for my case (which is somewhat different than the one posted above):
which points to here. It looks like it would work if the timestamp does not have a timezone associated with it. Interestingly, the source table also did not have a timezone, so it must have somehow been assigned as UTC during the Spark migration process. Not sure if that could provide a workaround here until this is fixed.
Ok, after realizing that timezone-agnostic INT96 timestamps should be readable, I found a workaround for this issue. I performed the migration of the Hive table in Spark, but I assume it is probably similar in other tools as well. In this case, Spark was 3.4.1, which was important because 3.4 introduced explicit support for timestamps without timezones. The basic steps:
```python
import pyspark.sql.functions as F

sc.table("db.tbl").withColumn(
    "ts",
    F.col("ts").cast("timestamp_ntz")
).createOrReplaceTempView("db_tbl_casted")

sc.sql("""
CREATE OR REPLACE TABLE ice.dest_db.dest_tbl
USING ICEBERG
-- PARTITIONED BY (..) required if the table to migrate is partitioned and you wish to preserve this
-- LOCATION 's3://your-location' possibly required
AS (SELECT * FROM db_tbl_casted LIMIT 0)
""")

sc.sql("CALL ice.system.add_files('dest_db.dest_tbl', 'db.tbl')")
```

Most importantly, Trino can subsequently read the migrated Iceberg table correctly and does not throw errors. (In my case, I checked with the most recent version of Trino locally, as well as e.g. AWS Athena.) I hope that helps. I'm still not sure whether Trino should support reading INT96 in the case where the schema marks it as a tz-aware timestamp.
#22781 will allow reading INT96 timestamps as |
I found out that Trino 371 is not able to read the timestamp column of an Iceberg table generated from the Iceberg snapshot action.
This is the way to reproduce it: in spark-sql 3.2.0, create a Hive table with a timestamp column, and create an Iceberg table on top of it.
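Roughly, the steps look like this (database, table, and catalog names are placeholders; the snapshot call assumes an Iceberg catalog named `iceberg` is configured in the Spark session, and the same SQL can be run from the spark-sql shell):

```python
# Create a Hive parquet table with a timestamp column; Spark's default
# parquet encoding writes the timestamp as INT96.
spark.sql("""
    CREATE TABLE my_db.my_tbl (id INT, ts TIMESTAMP)
    STORED AS PARQUET
""")
spark.sql("INSERT INTO my_db.my_tbl VALUES (1, TIMESTAMP '2022-01-01 00:00:00')")

# Create an Iceberg table on top of the Hive table's data files using the
# Iceberg snapshot procedure.
spark.sql("CALL iceberg.system.snapshot('my_db.my_tbl', 'my_db.my_tbl_ice')")
```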
On Trino 371, we can successfully query the Hive table; however, when we query the Iceberg table with
select * from iceberg.my_db.my_tbl_ice
the following error is generated: