After dt.optimize.compact(), commitInfo.operationMetrics.filesAdded is a JSON map when other readers (e.g. Databricks) expect a string #2087
Comments
Hi @mattslack-db, thanks for reporting this issue.
@Blajda this is part of the commitInfo, which is a free format according to the protocol. It's what we discussed last week on the call. @mattslack-db, we are actually following the protocol, but delta-spark is not, since it assumes a dtype. We are looking at aligning this and moving some of the assumptions about the expected schema into the protocol.
@ion-elgreco agreed, according to the protocol, "Implementations are free to store any valid JSON-formatted data via the commitInfo action.", so although Databricks can make assumptions about the schema of `commitInfo`, those assumptions are not guaranteed by the protocol.
The commit info, or rather having arbitrarily nested data in the actions, is quite a nuisance. So while the protocol gives us the freedom (for now), I believe delta-rs should also avoid this and just dump JSON payloads as strings where applicable.
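A minimal Python sketch of that convention, with nested metrics shaped like the example logs in this thread (the structures are illustrative, not the actual delta-rs internals):

```python
import json

# Illustrative nested metrics, mirroring the filesAdded payload from this
# thread's example commit logs (hypothetical values, not delta-rs internals).
files_added = {"avg": 19956.0, "max": 19956, "min": 19956,
               "totalFiles": 1, "totalSize": 19956}

commit_info = {
    "operation": "OPTIMIZE",
    "operationMetrics": {
        # Dumping the nested payload as a JSON *string* keeps
        # operationMetrics flat, which string-expecting readers can handle.
        "filesAdded": json.dumps(files_added),
        "numFilesAdded": 1,
    },
}

print(json.dumps({"commitInfo": commit_info}))
# ...,"filesAdded": "{\"avg\": 19956.0, ...}",...
```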
I've been noticing something similar after I run a compact() command over my delta lake stored in S3. At first I thought it was a transient error with AWS, but now it seems to be associated with this bug. Do we have any plans to solve this? Also, would using the Apache Spark optimize solve this?
The same thing happens with Glue. I even verified my last comment by manually editing the fields. It's not just limited to Glue and Databricks; a lot of other query engines I work with expect the same thing, and everything got resolved because of @mattslack-db's one comment. Thank you so much! Please patch this :)
…2317) I am by no means a Rust developer and haven't touched it in years, so please let me know if there's a better way to go about this. The Rust z_order and optimize.compact already serialize the metrics before they are passed back to Python, which then deserializes them, so the Python behavior of expecting this as a Dict has not changed, which I think is what we want.

# Description

Adds a custom serializer and Display implementation for the `MetricDetails` fields, namely `filesAdded` and `filesRemoved`, so that those fields are written as strings instead of a struct to the commit log. Query engines expect these fields to be strings on reads.

I had trouble getting the pyspark tests running locally, but here is an example optimize commit log that gets written with these changes:

```
{"commitInfo":{"timestamp":1711125995487,"operation":"OPTIMIZE","operationParameters":{"targetSize":"104857600","predicate":"[]"},"clientVersion":"delta-rs.0.17.1","readVersion":10,"operationMetrics":{"filesAdded":"{\"avg\":19956.0,\"max\":19956,\"min\":19956,\"totalFiles\":1,\"totalSize\":19956}","filesRemoved":"{\"avg\":4851.833333333333,\"max\":10358,\"min\":3734,\"totalFiles\":6,\"totalSize\":29111}","numBatches":6,"numFilesAdded":1,"numFilesRemoved":6,"partitionsOptimized":1,"preserveInsertionOrder":true,"totalConsideredFiles":6,"totalFilesSkipped":0}}}
```

# Related Issue(s)

- #2087

# Documentation

N/A
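As a usage-level illustration of that round trip (the table path below is hypothetical), a sketch of how the Python binding still surfaces the optimize metrics as a plain dict:

```python
from deltalake import DeltaTable

dt = DeltaTable("path/to/table")  # hypothetical table URI

# Per the PR description, the Rust side serializes the metrics and the Python
# binding deserializes them again, so callers still receive a plain dict.
metrics = dt.optimize.compact()
print(metrics["numFilesAdded"], metrics["numFilesRemoved"])
```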
Environment
Delta-rs version:
rust-v0.16.5
Python deltalake 0.15.1
Binding:
Python binding, but I believe this is not specific to Python, as it is just a wrapper over the underlying Rust method.
Environment:
Azure
Ubuntu
Databricks 14.1
Bug
What happened:
Running `dt.optimize.compact()` creates an entry in the delta log which is not readable by Databricks Runtime. Example:
"commitInfo":{"timestamp":1705439516034,"operation":"OPTIMIZE","operationParameters":{"targetSize":"104857600"},"operationMetrics":{"filesAdded":{"avg":50424.6,"max":50517,"min":50335,"totalFiles":5,"totalSize":252123},"filesRemoved":{"avg":14010.447154471545,"max":15213,"min":13942,"totalFiles":123,"totalSize":1723285},"numBatches":123,"numFilesAdded":5,"numFilesRemoved":123,"partitionsOptimized":5,"preserveInsertionOrder":true,"totalConsideredFiles":123,"totalFilesSkipped":0},"clientVersion":"
What you expected to happen:
Databricks Runtime (and Delta Sharing) expect `filesAdded` to be a string.
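A reader that wants to tolerate both encodings, the nested map older delta-rs versions wrote and the string these engines expect, could normalize the field on read. A minimal sketch, assuming the commitInfo entry has already been parsed:

```python
import json

def normalize_files_added(operation_metrics: dict):
    """Return filesAdded as a dict whether it was logged as a map or a string."""
    value = operation_metrics["filesAdded"]
    if isinstance(value, str):
        # The encoding Databricks Runtime expects: a JSON payload in a string.
        return json.loads(value)
    return value  # older delta-rs wrote the nested map directly

# String encoding (post-fix commit log):
print(normalize_files_added({"filesAdded": '{"avg": 19956.0, "totalFiles": 1}'}))
# Map encoding (pre-fix commit log):
print(normalize_files_added({"filesAdded": {"avg": 19956.0, "totalFiles": 1}}))
```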
How to reproduce it:
This can be reproduced in Databricks by running `dt.optimize.compact()` against a Delta table after some appends and then running `DESCRIBE HISTORY` against the same table.

More details:
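A minimal end-to-end reproduction sketch with the Python `deltalake` package (the local path and toy data are assumptions; the original report used Azure storage):

```python
import json
import pathlib

import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

path = "/tmp/compact_repro"  # assumed local path for illustration

# A few appends so compaction has something to do.
for i in range(3):
    write_deltalake(path, pa.table({"id": [i]}), mode="append")

dt = DeltaTable(path)
dt.optimize.compact()

# The last log entry is the OPTIMIZE commit; on affected versions,
# operationMetrics.filesAdded here is a JSON map rather than a string.
log_file = sorted(pathlib.Path(path, "_delta_log").glob("*.json"))[-1]
for line in log_file.read_text().splitlines():
    entry = json.loads(line)
    if "commitInfo" in entry:
        files_added = entry["commitInfo"]["operationMetrics"]["filesAdded"]
        print(type(files_added), files_added)
```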