
After dt.optimize.compact(), commitInfo.operationMetrics.filesAdded is a JSON map when other readers (e.g. Databricks) expect a string #2087

Closed
mattslack-db opened this issue Jan 16, 2024 · 6 comments
Labels: bug (Something isn't working)
Milestone: Rust v0.18

Comments

mattslack-db commented Jan 16, 2024

Environment

Delta-rs version:
rust-v0.16.5
Python deltalake 0.15.1

Binding:
Python binding, but I believe this is not specific to Python, as the Python method is just a wrapper over the Rust implementation

Environment:

  • Cloud provider:
    Azure
  • OS:
    Ubuntu
  • Other:
    Databricks 14.1

Bug

What happened:
Running dt.optimize.compact() creates an entry in the delta log which is not readable by Databricks Runtime.

Example (log entry truncated):
"commitInfo":{"timestamp":1705439516034,"operation":"OPTIMIZE","operationParameters":{"targetSize":"104857600"},"operationMetrics":{"filesAdded":{"avg":50424.6,"max":50517,"min":50335,"totalFiles":5,"totalSize":252123},"filesRemoved":{"avg":14010.447154471545,"max":15213,"min":13942,"totalFiles":123,"totalSize":1723285},"numBatches":123,"numFilesAdded":5,"numFilesRemoved":123,"partitionsOptimized":5,"preserveInsertionOrder":true,"totalConsideredFiles":123,"totalFilesSkipped":0},"clientVersion":"

What you expected to happen:
Databricks Runtime (and Delta Sharing) expect filesAdded to be a string rather than a nested JSON object.

How to reproduce it:
This can be reproduced by running dt.optimize.compact() against a Delta table after some appends, and then running DESCRIBE HISTORY against the same table in Databricks.
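A minimal reproduction sketch (the table path and data below are hypothetical; assumes the deltalake and pandas Python packages are installed):

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Hypothetical local path; any supported object store URI works the same way.
path = "/tmp/repro_table"

# A few small appends so there is something to compact.
for i in range(5):
    write_deltalake(path, pd.DataFrame({"id": range(i * 10, (i + 1) * 10)}), mode="append")

dt = DeltaTable(path)
dt.optimize.compact()

# The OPTIMIZE entry shows filesAdded/filesRemoved as nested maps, which is
# what DESCRIBE HISTORY in Databricks then fails to read.
print(dt.history(1))
```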


@mattslack-db mattslack-db added the bug Something isn't working label Jan 16, 2024
Blajda (Collaborator) commented Jan 17, 2024

Hi @mattslack-db thanks for reporting this issue.
Does a specification exist for the expected format of operationMetrics? Testing integration with Spark can be a bit annoying, so having a spec to test against would help.

ion-elgreco (Collaborator) commented

@Blajda this is part of commitInfo, which is free-form according to the protocol. It's what we discussed last week in the call.

@mattslack-db we are actually following the protocol here, but delta-spark is not, since it assumes a specific dtype. We are looking at aligning this and moving some of these assumptions about the expected schema into the protocol.

mattslack-db (Author) commented

@ion-elgreco agreed. According to the protocol, "Implementations are free to store any valid JSON-formatted data via the commitInfo action." So although Databricks can make assumptions about the schema of commitInfo (in this case, that operationMetrics is a map<string, string>), it shouldn't fall over when that schema is not followed.
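As an illustration of that tolerant-reader argument, a minimal sketch (a hypothetical helper, not Databricks code) that accepts both shapes by coercing non-string metric values back to strings:

```python
import json

def normalize_operation_metrics(metrics: dict) -> dict:
    """Coerce operationMetrics into the map<string, string> shape that strict
    readers assume, instead of failing on nested JSON values."""
    normalized = {}
    for key, value in metrics.items():
        # Strings pass through; nested maps (as delta-rs wrote them) and
        # bare scalars are re-encoded as JSON strings.
        normalized[key] = value if isinstance(value, str) else json.dumps(value)
    return normalized
```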

roeap (Collaborator) commented Jan 27, 2024

The commit info, or rather having arbitrarily nested data in the actions, is quite a nuisance. So while the protocol gives us that freedom (for now), I believe delta-rs should also avoid this and just dump JSON payloads as strings where applicable.
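Concretely, that suggestion amounts to something like this minimal sketch (illustrative only, using the metric values from the report above; the actual fix landed in the Rust serializer, see the PR referenced below):

```python
import json

files_added = {"avg": 50424.6, "max": 50517, "min": 50335,
               "totalFiles": 5, "totalSize": 252123}

operation_metrics = {
    # The nested detail struct is dumped as a JSON string...
    "filesAdded": json.dumps(files_added),
    # ...so every value in operationMetrics stays a flat scalar or string.
    "numFilesAdded": 5,
}
print(json.dumps({"operationMetrics": operation_metrics}))
```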

mayanksingh2298 commented

I've been noticing something similar: after I run a compact() command over my Delta Lake table stored in S3, I'm unable to run a Glue crawler on it to produce the metadata catalog (it gives an Internal Service Exception without any further debug info).

At first I thought it was a transient AWS error, but now it seems to be associated with this bug.

Do we have any plans to solve this? Also, would using the Apache Spark OPTIMIZE avoid it?

mayanksingh2298 commented

The same thing happens with filesRemoved.

I verified my last comment by manually rewriting the filesAdded and filesRemoved fields as strings instead of JSON objects, and AWS Glue then crawled the table successfully.

It's not limited to Glue and Databricks; a lot of other query engines I work with expect the same thing, and everything fell into place because of @mattslack-db's one comment. Thank you so much!

Please patch this :)

@rtyler rtyler added this to the Rust v0.18 milestone Feb 6, 2024
ion-elgreco pushed a commit that referenced this issue Mar 24, 2024
…2317)

I am by no means a Rust developer and haven't touched it in years, so please let me know if there's a better way to go about this. The Rust z_order and optimize.compact already serialize the metrics before they are passed back to Python, which then deserializes them, so the Python behavior of returning a dict has not changed, which I think is what we want.

# Description
Adds a custom serializer and Display implementation for the
`MetricDetails` fields, namely `filesAdded` and `filesRemoved` so that
those fields are written as strings instead of a struct to the commit
log. Query engines expect these fields to be strings on reads.

I had trouble getting the pyspark tests running locally, but here is an
example optimize commit log that gets written with these changes:

```
{"commitInfo":{"timestamp":1711125995487,"operation":"OPTIMIZE","operationParameters":{"targetSize":"104857600","predicate":"[]"},"clientVersion":"delta-rs.0.17.1","readVersion":10,"operationMetrics":{"filesAdded":"{\"avg\":19956.0,\"max\":19956,\"min\":19956,\"totalFiles\":1,\"totalSize\":19956}","filesRemoved":"{\"avg\":4851.833333333333,\"max\":10358,\"min\":3734,\"totalFiles\":6,\"totalSize\":29111}","numBatches":6,"numFilesAdded":1,"numFilesRemoved":6,"partitionsOptimized":1,"preserveInsertionOrder":true,"totalConsideredFiles":6,"totalFilesSkipped":0}}}
```
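Downstream Python code that still wants the structured metrics can round-trip the string with one json.loads call, e.g. (a sketch; the table path is the hypothetical one from the reproduction above):

```python
import json
from deltalake import DeltaTable

dt = DeltaTable("/tmp/repro_table")
last_commit = dt.history(1)[0]

# With the fix, filesAdded is a JSON *string* in the commit log, so
# structured access is one json.loads away.
files_added = json.loads(last_commit["operationMetrics"]["filesAdded"])
print(files_added["totalFiles"], files_added["totalSize"])
```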

# Related Issue(s)
- #2087

# Documentation

N/A