HIVE-28518: Iceberg: Fix ClassCastException during in-place migration to Iceberg tables with timestamp columns #5590
Conversation
@ggangadharan, thanks for the PR. I have one question, though.
Thank you for raising this question. Upon investigation, it appears that the issue stems from how the IcebergRecordReader interprets the timestamp column differently for different file formats.
Due to this discrepancy, we are encountering a ClassCastException when working with Parquet tables. As you mentioned, I also believe the root cause lies at the underlying file format/Iceberg level. I’ve attached a screenshot for reference. Please let me know if you need further details or if we should take any additional steps to address this. If it looks okay, please review the PR.
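To make the failure mode concrete, here is a minimal sketch (not the actual patch from this PR; the class and method names are hypothetical) of the kind of defensive coercion that avoids a blind cast when a record reader may hand back either a java.sql.Timestamp, a LocalDateTime, or an OffsetDateTime depending on the file format:

```java
import java.sql.Timestamp;
import java.time.LocalDateTime;
import java.time.OffsetDateTime;

// Hypothetical illustration only; not code from this PR.
// A blind cast such as (LocalDateTime) value throws ClassCastException when a
// different file format's reader returns a java.sql.Timestamp instead.
public final class TimestampCoercion {

  static LocalDateTime toLocalDateTime(Object value) {
    if (value instanceof LocalDateTime) {
      return (LocalDateTime) value;
    }
    if (value instanceof OffsetDateTime) {
      return ((OffsetDateTime) value).toLocalDateTime();
    }
    if (value instanceof Timestamp) {
      return ((Timestamp) value).toLocalDateTime();
    }
    throw new IllegalArgumentException(
        "Unexpected timestamp representation: " + value.getClass().getName());
  }
}
```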
I wonder why everything is ok when we directly create an Iceberg + Parquet table.
Hi @okumin, thank you for taking the time to review the pull request. In the Iceberg Parquet table, the timestamp column is read as LocalDateTime. I’ve attached a screenshot for reference. There is a notable difference in how the timestamp column is stored at the Parquet file-format level.
For clarity, I’ve also included the metadata from parquet-tools for reference: one dump for the Iceberg Parquet table and one for the standard Parquet table.
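For anyone who wants to reproduce the comparison without the attached screenshots, here is a small sketch (an assumed helper, not part of the PR) that uses the parquet-mr footer API to print each top-level column's physical and logical type:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Type;

// Hypothetical helper, not part of the PR: prints the physical and logical type
// of every top-level column so the timestamp encoding difference between a
// Hive-written Parquet file and an Iceberg-written one becomes visible.
public final class PrintParquetColumnTypes {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (ParquetFileReader reader =
             ParquetFileReader.open(HadoopInputFile.fromPath(new Path(args[0]), conf))) {
      MessageType schema = reader.getFooter().getFileMetaData().getSchema();
      for (Type field : schema.getFields()) {
        if (!field.isPrimitive()) {
          continue; // skip nested groups for this illustration
        }
        System.out.println(field.getName()
            + " physical=" + field.asPrimitiveType().getPrimitiveTypeName()
            + " logical=" + field.getLogicalTypeAnnotation());
      }
    }
  }
}
```

With Hive's legacy Parquet writer the timestamp column typically shows up as INT96 with no logical annotation, whereas an Iceberg-written Parquet file typically uses INT64 with a TIMESTAMP(isAdjustedToUTC=false) annotation, which lines up with the reader behaviour described above.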
Thanks. I also remember that Hive didn't follow the regular convention for encoding TIMESTAMP. I don't have an immediate idea on how to fix it.
We should try to fix the regular convention on encoding TIMESTAMP, but that might not fix the case with existing tables. For those, the fix in the current PR seems OK to me.
@ayushtkn Thank you for the feedback. Based on this, I believe the changes are in a good state to proceed unless there are further concerns. @Aggarwal-Raghav @okumin Could you kindly review the code and share your feedback? Your insights would be greatly appreciated to help move this forward. If there are any questions or blockers, feel free to let me know. Thank you for your time and support!
Thank you! That makes it clear. So, is the remaining problem to verify that INT96 is compatible with Iceberg's TIMESTAMP, which means verifying that other query engines or tools can read it as a timestamp? I am trying to check that.
@okumin Thanks for the update. I successfully read the migrated Iceberg table (previously migrated from Hive) using spark.sql in Spark, and it worked as expected. Spark reads the timestamp column as TimestampNTZType. Per the documentation, TimestampNTZType is a timestamp without time zone (TIMESTAMP_NTZ): it represents values comprising the fields year, month, day, hour, minute, and second, and all operations are performed without taking any time zone into account. Ref: https://spark.apache.org/docs/latest/sql-ref-datatypes.html
Attaching the spark3-shell output for reference.
FYI: while reading the string column name, I encountered an error that has been reported here. Since it is related to a Spark/Iceberg issue, we can ignore it for now.
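A minimal sketch of this verification (not the output attached to the PR; the table and column names are assumptions, and it presumes an Iceberg-enabled Spark session), using Spark's Java API to print the inferred schema of the migrated table:

```java
import org.apache.spark.sql.SparkSession;

// Hypothetical re-verification sketch, not from the PR. Assumes a migrated
// table named `db.migrated_tbl` with a timestamp column `ts_col`.
public final class CheckMigratedTimestamp {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("check-migrated-timestamp")
        .getOrCreate();
    // printSchema() should report ts_col as timestamp_ntz (TimestampNTZType)
    // if Spark reads the column without a time zone, as described above.
    spark.sql("SELECT ts_col FROM db.migrated_tbl").printSchema();
    spark.stop();
  }
}
```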
What changes were proposed in this pull request?
This fix resolves a ClassCastException thrown when reading timestamp columns from tables that were migrated in place to Iceberg, improving the stability and reliability of such migrated tables.
Why are the changes needed?
The issue occurred due to incorrect type casting in the timestamp-handling logic, which caused the fetch task on migrated Iceberg tables to fail.
Does this PR introduce any user-facing change?
No
Is the change a dependency upgrade?
No
How was this patch tested?
Qtest - iceberg_inplace_migration_with_timestamp_column.q