This might not be a bug that delta-rs can solve, but maybe someone has experience with this. TL;DR: incompatible delta log commits for OPTIMIZE between delta-rs and Databricks?
Environment
Delta-rs version: 0.19.0
Binding: Rust
Environment:
Cloud provider: AWS
Bug
What happened: I am trying to write to tables from Databricks (we have many) and to optimize them from delta-rs.
I started a job in Databricks that writes into a Delta Lake table, then stopped it. I then compacted the table using delta-rs, specifically .optimize() on a DeltaOps. When I start the stream in Databricks again, I get this error and am unable to write:
```
The transaction log has failed integrity checks. Failed verification at version 14 of:
FileSizeHistogram mismatch in file sizes
FileSizeHistogram mismatch in file counts
Table size (bytes) - Expected: 7478399129 Computed: 7439862311
Number of files - Expected: 109 Computed: 96
SetTransaction mismatch
Caused by: DeltaIllegalStateException: The transaction log has failed integrity checks. Failed verification at version 14 of:
FileSizeHistogram mismatch in file sizes
FileSizeHistogram mismatch in file counts
Table size (bytes) - Expected: 7478399129 Computed: 7439862311
Number of files - Expected: 109 Computed: 96
```
Note that DESCRIBE HISTORY on the table does not fail after the delta-rs commit.
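For reference, the compaction step was roughly the following (a minimal sketch, assuming the table lives on S3 and the crate's s3 feature is enabled; the URI is a placeholder and the exact API surface may differ slightly across delta-rs versions):

```rust
use deltalake::{DeltaOps, DeltaTableError};

#[tokio::main]
async fn main() -> Result<(), DeltaTableError> {
    // Depending on the delta-rs version/features, S3 support may need to be
    // registered before opening the table.
    deltalake::aws::register_handlers(None);

    // Placeholder URI; AWS credentials are read from the environment.
    let ops = DeltaOps::try_from_uri("s3://bucket/path/to/table").await?;

    // Awaiting the OptimizeBuilder runs bin-pack compaction and commits
    // an OPTIMIZE operation to the delta log.
    let (_table, metrics) = ops.optimize().await?;
    println!("optimize metrics: {metrics:?}");
    Ok(())
}
```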
What you expected to happen: Databricks to be able to read the delta log.
More details:
I noticed that the commitInfo object looks pretty different between delta-rs and Databricks. I'm wondering if that's why it fails the integrity check.
A Databricks optimize commit looks like this (from a different table), whereas the one from delta-rs differs.
My hypothesis is that Databricks expects to read some properties but is unable to find them because they don't exist (maybe the same value is stored under a different key). Notice e.g. numAddedFiles vs numFilesAdded.
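For concreteness, here is a hedged sketch of that naming difference (the payloads are made up and heavily abbreviated; only the numAddedFiles/numFilesAdded contrast is taken from the actual commits):

```rust
use serde_json::json;

fn main() {
    // Hypothetical, abbreviated commitInfo payloads; the metric values and
    // the string-vs-number typing are illustrative, not copied from real logs.
    let databricks = json!({
        "operation": "OPTIMIZE",
        "operationMetrics": { "numAddedFiles": "3", "numRemovedFiles": "16" }
    });
    let delta_rs = json!({
        "operation": "OPTIMIZE",
        "operationMetrics": { "numFilesAdded": 3, "numFilesRemoved": 16 }
    });
    // Same information under different keys, so a reader that looks up one
    // naming scheme will not find the other's metrics.
    println!("{databricks}\n{delta_rs}");
}
```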
I saw issue #2087, but that seems to have been fixed.
I tried manually editing the commit to this (renamed keys to numAddedFiles etc. and added some other properties that Databricks has), but it fails with the same error.
I think you should create a Databricks support ticket for this or an issue in the Spark Delta repo. I honestly have no clue what their integrity check is supposed to do, especially since delta-rs has no issue writing to optimized tables.
From the second commit info, it clearly says that 96 files survived if you take considered + added - removed, so I'm not sure where Spark Delta is getting its total from.
Yeah, good idea! Thanks!
Embarrassingly, I'm not able to reproduce this now, either, so I'm starting to doubt myself here.