Databricks fails integrity check after compacting with delta-rs #2839

Closed
vegarsti opened this issue Sep 2, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@vegarsti
Contributor

vegarsti commented Sep 2, 2024

This might not be a bug that delta-rs can solve, but maybe someone has experience with this. TL;DR: are the delta log commits that delta-rs and Databricks write for OPTIMIZE incompatible?

Environment

Delta-rs version: 0.19.0

Binding: Rust

Environment:

  • Cloud provider: AWS

Bug

What happened: I am trying to write to a table from Databricks (we have many) and to optimize these tables from delta-rs.
I started a job in Databricks writing into a Delta Lake table, then stopped it. I then compacted the table using delta-rs, specifically .optimize() on a DeltaOps. When I start the stream in Databricks again, I get this error and am unable to write.

The transaction log has failed integrity checks. Failed verification at version 14 of:
FileSizeHistogram mismatch in file sizes
FileSizeHistogram mismatch in file counts
Table size (bytes) - Expected: 7478399129 Computed: 7439862311
Number of files - Expected: 109 Computed: 96
SetTransaction mismatch
Caused by: DeltaIllegalStateException: The transaction log has failed integrity checks. Failed verification at version 14 of:
FileSizeHistogram mismatch in file sizes
FileSizeHistogram mismatch in file counts
Table size (bytes) - Expected: 7478399129 Computed: 7439862311
Number of files - Expected: 109 Computed: 96
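
For reference, the compaction step looks roughly like the following. This is only a minimal sketch of the delta-rs 0.19 Rust call, not the exact job code; the table URI and target size are placeholders:

// Minimal sketch: compact a Delta table with delta-rs (Rust binding, 0.19).
// The S3 URI and target size below are illustrative, not the real values.
use deltalake::DeltaOps;

#[tokio::main]
async fn main() -> Result<(), deltalake::DeltaTableError> {
    // Needed so s3:// URIs resolve when the "s3" feature is enabled
    // (credentials come from the usual AWS environment variables).
    deltalake::aws::register_handlers(None);

    let ops = DeltaOps::try_from_uri("s3://my-bucket/path/to/table").await?;

    // OptimizeBuilder is awaited directly and returns the updated table
    // plus the metrics that end up in the commitInfo operationMetrics.
    let (_table, metrics) = ops
        .optimize()
        .with_target_size(104_857_600) // ~100 MiB, matching "targetSize" in the commit
        .await?;

    println!("{metrics:?}");
    Ok(())
}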

Note that DESCRIBE HISTORY on the table does not fail after the delta-rs commit.

What you expected to happen: Databricks to be able to read the delta log.

More details:

I noticed that the commitInfo object looks pretty different between delta-rs and Databricks. I'm wondering if that's why it fails the integrity check.

A Databricks OPTIMIZE commit looks like this (from a different table):

{
  "commitInfo": {
    "timestamp": 1725025378204,
    "userId": "2843865834037528",
    "userName": "[email protected]",
    "operation": "OPTIMIZE",
    "operationParameters": {
      "predicate": "[]",
      "auto": false,
      "clusterBy": "[]",
      "zOrderBy": "[]",
      "batchId": "0"
    },
    "readVersion": 56,
    "isolationLevel": "SnapshotIsolation",
    "isBlindAppend": false,
    "operationMetrics": {
      "numRemovedFiles": "57",
      "numRemovedBytes": "128939735",
      "p25FileSize": "49587739",
      "numDeletionVectorsRemoved": "0",
      "minFileSize": "49587739",
      "numAddedFiles": "2",
      "maxFileSize": "78320319",
      "p75FileSize": "78320319",
      "p50FileSize": "78320319",
      "numAddedBytes": "127908058"
    },
    "tags": {
      "delta.rowTracking.preserved": "false"
    },
    "engineInfo": "Databricks-Runtime/15.3.x-photon-scala2.12",
    "txnId": "96f7c1b5-98e1-476f-9524-dadb04223f56"
  }
}

whereas one from delta-rs looks like this:

{
  "commitInfo": {
    "timestamp": 1725021299379,
    "operation": "OPTIMIZE",
    "operationParameters": {
      "predicate": "[]",
      "targetSize": "104857600"
    },
    "operationMetrics": {
      "filesAdded": "{\"avg\":43499800.0,\"max\":54133760,\"min\":32865840,\"totalFiles\":2,\"totalSize\":86999600}",
      "filesRemoved": "{\"avg\":8966035.214285715,\"max\":55106878,\"min\":12030,\"totalFiles\":14,\"totalSize\":125524493}",
      "numBatches": 320,
      "numFilesAdded": 2,
      "numFilesRemoved": 14,
      "partitionsOptimized": 2,
      "preserveInsertionOrder": true,
      "totalConsideredFiles": 108,
      "totalFilesSkipped": 94
    },
    "clientVersion": "delta-rs.0.19.0",
    "readVersion": 13
  }
}

My hypothesis is that Databricks expects to read certain properties but cannot find them, either because they don't exist or because the same value is stored under a different key. Notice e.g. numAddedFiles vs numFilesAdded.
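
A few differences stand out when comparing the two commits above:

  • numAddedFiles / numRemovedFiles (Databricks) vs numFilesAdded / numFilesRemoved (delta-rs)
  • numAddedBytes / numRemovedBytes are top-level scalars in the Databricks metrics, whereas delta-rs only reports byte totals inside the filesAdded / filesRemoved JSON strings
  • the Databricks commit also carries minFileSize, maxFileSize, the p25/p50/p75 file size percentiles, numDeletionVectorsRemoved, and a txnId, none of which appear in the delta-rs commit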

I saw issue #2087, which looks related, but that seems to have been fixed.

I tried manually editing the commit to the following (renamed the keys to numAddedFiles etc. and added some other properties that Databricks includes), but it fails with the same error.

{
  "commitInfo": {
    "timestamp": 1725021299379,
    "operation": "OPTIMIZE",
    "operationParameters": {
      "predicate": "[]",
      "targetSize": "104857600"
    },
    "operationMetrics": {
      "filesAdded": "{\"avg\":43499800.0,\"max\":54133760,\"min\":32865840,\"totalFiles\":2,\"totalSize\":86999600}",
      "filesRemoved": "{\"avg\":8966035.214285715,\"max\":55106878,\"min\":12030,\"totalFiles\":14,\"totalSize\":125524493}",
      "numBatches": 320,
      "numAddedFiles": "2",
      "numRemovedBytes": "125524493",
      "numAddedBytes": "86999600",
      "numRemovedFiles": "14",
      "partitionsOptimized": 2,
      "preserveInsertionOrder": true,
      "totalConsideredFiles": 108,
      "totalFilesSkipped": 94,
      "numDeletionVectorsRemoved": "0",
      "minFileSize": "32865840",
      "maxFileSize": "54133760"
    },
    "clientVersion": "delta-rs.0.19.0",
    "readVersion": 13,
    "isolationLevel": "SnapshotIsolation",
    "isBlindAppend": false
  }
}
@vegarsti vegarsti added the bug Something isn't working label Sep 2, 2024
@ion-elgreco
Collaborator

ion-elgreco commented Sep 2, 2024

I think you should create a Databricks support ticket for this or an issue in the Spark Delta repo. I honestly have no clue what their integrity check is supposed to do, especially since delta-rs has no issue writing to optimized tables.

From the second commitInfo it's clear that 96 files survived if you take considered + added - removed, so I'm not sure where Spark Delta is getting its expected total from.
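
(Concretely, plugging in the numbers from the delta-rs commit: 108 totalConsideredFiles - 14 numFilesRemoved + 2 numFilesAdded = 96, which matches the "Computed: 96" in the error, while Databricks expected 109.)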

@vegarsti vegarsti closed this as completed Sep 3, 2024
@vegarsti
Contributor Author

vegarsti commented Sep 3, 2024

I think you should create a Databricks support ticket for this or an issue in the Spark Delta repo. I honestly have no clue what their integrity check is supposed to do, especially since delta-rs has no issue writing to optimized tables.

From the second commitInfo it's clear that 96 files survived if you take considered + added - removed, so I'm not sure where Spark Delta is getting its expected total from.

Yeah, good idea! Thanks!

Embarrassingly, I'm not able to reproduce this now either, so I'm starting to doubt myself here.
