Hive cannot read ORC ACID table updated by Trino twice #8268
Comments
Since we update the same key multiple times, it throws this exception. If we do a major compaction after each update, things should work as expected. Will test and raise a PR.
Trino cannot do a major compaction. The table should also be readable when a compaction doesn't happen.
But for tests maybe we could trigger the compaction from Hive after the assertion. Or we would need to apply the update on a different set of columns.
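For reference, one way to trigger that major compaction from the Hive side is plain HiveQL over JDBC. The sketch below is only an illustration under assumed values: the connection URL, credentials, and table name are placeholders, and the actual product-test harness may issue the statement differently.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class TriggerHiveMajorCompaction
{
    public static void main(String[] args)
            throws Exception
    {
        // Placeholder HiveServer2 endpoint and table name; assumes the Hive JDBC
        // driver (org.apache.hive.jdbc.HiveDriver) is on the classpath.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection connection = DriverManager.getConnection(url, "hive", "");
                Statement statement = connection.createStatement()) {
            // Ask Hive to schedule a major compaction of the ACID table.
            statement.execute("ALTER TABLE test_table COMPACT 'major'");
            // Compaction runs asynchronously; a test would poll SHOW COMPACTIONS
            // until the compaction finishes before re-reading the table.
        }
    }
}
```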
I would rather see the bug fixed than worked around in tests, assuming it is indeed a bug. @djsstarburst do you happen to recognize this?
But isn't the bug on the Hive side? https://issues.apache.org/jira/browse/HIVE-22318 It looks like the delete deltas are of the same size, which would create similar record identifiers (as quoted in the JIRA).
Another workaround is to ensure the update is applied to a different set of rows instead of the same set of rows.
Corresponding JIRA in Hive: https://issues.apache.org/jira/browse/HIVE-22318
The JIRA talks about Hive's MERGE statement. Can we assume at this point that there is no bug on the Trino side?
This is seen both during the MERGE statement and when selecting from that table. From the exception and its source, it looks like the issue occurs while reading the ORC file (with a bunch of delete deltas). Ref: https://github.com/apache/hive/blob/d0bbe76ad626244802d062b0a93a9f1cd4fc5f20/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcRawRecordMerger.java#L1225
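Conceptually, that merger performs a sorted merge over the insert and delete deltas keyed by the ACID record identifier and assumes each key appears only once. The sketch below is a simplified illustration of that assumption, not Hive's actual code; the class and method names are invented.

```java
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Simplified stand-in for Hive's ACID record identifier.
record RecordKey(long writeId, int bucket, long rowId) {}

final class MergerSketch
{
    // Merges the keys contributed by all delta files in sorted order and fails
    // when two deltas contribute the same key, mirroring the assumption that
    // breaks when duplicate identifiers are written.
    static void merge(List<List<RecordKey>> deltas)
    {
        Comparator<RecordKey> order = Comparator
                .comparingLong(RecordKey::writeId)
                .thenComparingInt(RecordKey::bucket)
                .thenComparingLong(RecordKey::rowId);

        PriorityQueue<RecordKey> queue = new PriorityQueue<>(order);
        deltas.forEach(queue::addAll);

        RecordKey previous = null;
        while (!queue.isEmpty()) {
            RecordKey current = queue.poll();
            if (previous != null && order.compare(previous, current) == 0) {
                throw new IllegalStateException("two readers produced the same key: " + current);
            }
            previous = current;
        }
    }
}
```

With unique keys this completes silently; feed it two deltas that both contain, say, (writeId=3, bucket=0, rowId=0) and it fails, which is the shape of the problem the Hive reader is reporting.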
Yes! We are able to read the data, and it is the updated data, so it's not a bug on the Trino side.
Full repro steps:
Now in Hive:
However, if I recreate the table in Trino
and then run the INSERTs and UPDATEs in Hive, then the table can be read in Hive.
Or, if I recreate and populate the table in Trino
and then run the UPDATEs in Hive, then the table can again be read in Hive.
To me, the above is quite convincing that it's a problem in how Trino UPDATE creates the delta files.
When we run a query like this on a fresh table
Trino inserts the data into the following directory
And when we insert another row in it
Trino inserts the data into the following directory
So now when we run an update like this
Trino creates a delta directory for each of those directories (delta_0000001_0000001_0000, delta_0000002_0000002_0000). Now when we run another update like this
Trino creates two more directories for the new delta (referring to delta_0000001_0000001_0000, delta_0000002_0000002_0000), but now the deleted-rows information has the same record identifiers. One solution is to introduce a different bucket number for each of the delta directories created, so that similar rowIds could be mapped to a different bucket. Please correct me if I am wrong.
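To make the collision concrete: each of the independent writers numbers its rows on its own, so two delta files written for the same update can describe rows with an identical (writeId, bucket, rowId) triple. A tiny hypothetical illustration, with made-up numbers and class names:

```java
// Hypothetical illustration of the identifier collision described above.
record AcidRowId(long writeId, int bucket, long rowId) {}

final class RowIdCollision
{
    public static void main(String[] args)
    {
        // Two writers handle the same UPDATE (writeId 3) for the same bucket,
        // one per existing delta directory, and both start numbering rows at 0.
        AcidRowId fromWriterA = new AcidRowId(3, 0, 0);
        AcidRowId fromWriterB = new AcidRowId(3, 0, 0);

        // Hive expects these identifiers to be unique within a read, but they are equal.
        System.out.println(fromWriterA.equals(fromWriterB)); // prints "true"
    }
}
```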
I'm surprised that Hive can't read files with the same bucket but different statementIds. Confirming what you found, I used the orc-tools to decode the data files after the two inserts and two updates in the test case by @findepi. The results are below. I guess it's obvious (and I just tested it) that if the two rows were inserted in a single insert transaction the test passes, because there is only one split in the bucket. To avoid producing files with different statementIds, I think Trino UPDATE would have to add an ExchangeNode layer to flow all the splits belonging to a single bucket into one node and one file. @electrum, your thoughts?
This is the source of the Hive error message: https://github.com/apache/hive/blob/rel/release-3.1.2/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcRawRecordMerger.java#L1169
The equality is on the ACID record key, i.e. (writeId, bucket, rowId). Assuming this is the issue, we need to ensure unique row IDs across the writers. I can think of two ways to do this: have a single writer per bucket, or give each writer a disjoint range of row IDs.
It looks like the current row ID generation has a bug where it gets reset for every page (which is not the cause of this issue but needs to be fixed regardless): trino/plugin/trino-hive/src/main/java/io/trino/plugin/hive/orc/OrcFileWriter.java, lines 307 to 314 at 2734d84
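A simplified sketch of the pattern being described, not the actual OrcFileWriter code; the class name is invented. The point is that the row ID counter has to live on the writer and carry over between pages rather than being re-initialized for each page.

```java
// Hypothetical writer fragment illustrating the reset-per-page bug described above.
final class AcidRowIdAssigner
{
    private long nextRowId; // must persist across pages for the lifetime of the writer

    AcidRowIdAssigner(long firstRowId)
    {
        this.nextRowId = firstRowId;
    }

    // The bug described above amounts to re-initializing the counter for every
    // page instead of carrying it forward, so consecutive pages reuse the same IDs.
    long[] assignRowIds(int positionCount)
    {
        long[] rowIds = new long[positionCount];
        for (int i = 0; i < positionCount; i++) {
            rowIds[i] = nextRowId++;
        }
        return rowIds;
    }
}
```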
Note that, long term, we could switch to the first strategy of a single writer per bucket, after MERGE lands and we change the implementation of UPDATE/DELETE to use the merge connector APIs, which support redistribution.
Fixes #8268. The problem was caused by multiple rows having the same (writeId, bucket, rowId). In order to fix this, it is necessary to ensure unique row IDs across writers. To achieve this, different writers get separate row ID ranges within the split assigned to them.
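A minimal sketch of the separated-range idea, with an invented allocator class and an arbitrary range size; how the actual change derives the ranges per split may differ, but the principle is the same: writers draw row IDs from disjoint ranges, so two writers can never produce the same (writeId, bucket, rowId) triple.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical allocator handing out disjoint row ID ranges to the writers that
// share a write ID and bucket; the range size and names are illustrative only.
final class RowIdRangeAllocator
{
    private static final long RANGE_SIZE = 1L << 32;

    private final AtomicLong nextRangeStart = new AtomicLong();

    // Each writer acquires one range and then numbers its rows
    // within [start, start + RANGE_SIZE).
    long acquireRangeStart()
    {
        return nextRangeStart.getAndAdd(RANGE_SIZE);
    }
}
```

Writer A would then emit rowIds 0, 1, 2, ... while writer B starts at 4294967296, so the identifiers stay unique even when both writers handle the same bucket.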
Repro steps in #8267 as a TODO.
Full repro steps in a comment below: #8268 (comment)