
After dt.optimize.compact(), commitInfo.operationMetrics.filesAdded is a JSON map when other readers (e.g. Databricks) expect a string #2087

Closed
mattslack-db opened this issue Jan 16, 2024 · 6 comments
Labels: bug (Something isn't working)
Milestone: Rust v0.18

Comments

mattslack-db commented Jan 16, 2024

Environment

Delta-rs version:
rust-v0.16.5
Python deltalake 0.15.1

Binding:
Python binding, but I believe this is not specific to Python, as the Python method is just a wrapper over the Rust implementation

Environment:

  • Cloud provider:
    Azure
  • OS:
    Ubuntu
  • Other:
    Databricks 14.1

Bug

What happened:
Running dt.optimize.compact() creates an entry in the delta log which is not readable by Databricks Runtime.

Example (log entry truncated):
"commitInfo":{"timestamp":1705439516034,"operation":"OPTIMIZE","operationParameters":{"targetSize":"104857600"},"operationMetrics":{"filesAdded":{"avg":50424.6,"max":50517,"min":50335,"totalFiles":5,"totalSize":252123},"filesRemoved":{"avg":14010.447154471545,"max":15213,"min":13942,"totalFiles":123,"totalSize":1723285},"numBatches":123,"numFilesAdded":5,"numFilesRemoved":123,"partitionsOptimized":5,"preserveInsertionOrder":true,"totalConsideredFiles":123,"totalFilesSkipped":0},"clientVersion":"

What you expected to happen:
Databricks Runtime (and Delta Sharing) expect filesAdded to be a string rather than a nested JSON object.

How to reproduce it:
This can be reproduced by running dt.optimize.compact() against a Delta table after some appends, and then running DESCRIBE HISTORY against the same table in Databricks.
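A minimal reproduction sketch (the table path and data below are hypothetical; assumes the deltalake and pandas Python packages are installed):

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Hypothetical local path; any supported object store URI works the same way.
path = "/tmp/repro_table"

# A few small appends so there is something to compact.
for i in range(5):
    write_deltalake(path, pd.DataFrame({"id": range(i * 10, (i + 1) * 10)}), mode="append")

dt = DeltaTable(path)
dt.optimize.compact()

# The OPTIMIZE entry shows filesAdded/filesRemoved as nested maps, which is
# what DESCRIBE HISTORY in Databricks then fails to read.
print(dt.history(1))
```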


@mattslack-db mattslack-db added the bug Something isn't working label Jan 16, 2024
Blajda (Collaborator) commented Jan 17, 2024

Hi @mattslack-db thanks for reporting this issue.
Does a specification exist for the expected format of operationMetrics? Testing integration with Spark can be a bit annoying, so having a spec to test against would help.

ion-elgreco (Collaborator) commented

@Blajda this is part of commitInfo, which is free-form according to the protocol. It's what we discussed last week in the call.

@mattslack-db we are actually following the protocol here, but delta-spark is not, since it assumes a specific dtype. We are looking at aligning this and moving some of these assumptions about the expected schema into the protocol.

mattslack-db (Author) commented

@ion-elgreco agreed. According to the protocol, "Implementations are free to store any valid JSON-formatted data via the commitInfo action." So although Databricks can make assumptions about the schema of commitInfo (in this case, that operationMetrics is a map<string, string>), it shouldn't fall over when that schema is not followed.
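As an illustration of that tolerant-reader argument, a minimal sketch (a hypothetical helper, not Databricks code) that accepts both shapes by coercing non-string metric values back to strings:

```python
import json

def normalize_operation_metrics(metrics: dict) -> dict:
    """Coerce operationMetrics into the map<string, string> shape that strict
    readers assume, instead of failing on nested JSON values."""
    normalized = {}
    for key, value in metrics.items():
        # Strings pass through; nested maps (as delta-rs wrote them) and
        # bare scalars are re-encoded as JSON strings.
        normalized[key] = value if isinstance(value, str) else json.dumps(value)
    return normalized
```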

roeap (Collaborator) commented Jan 27, 2024

The commit info, or rather having arbitrarily nested data in the actions, is quite a nuisance. So while the protocol gives us that freedom (for now), I believe delta-rs should also avoid this and just dump JSON payloads as strings where applicable.
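Concretely, that suggestion amounts to something like this minimal sketch (illustrative only, using the metric values from the report above; the actual fix landed in the Rust serializer, see the PR referenced below):

```python
import json

files_added = {"avg": 50424.6, "max": 50517, "min": 50335,
               "totalFiles": 5, "totalSize": 252123}

operation_metrics = {
    # The nested detail struct is dumped as a JSON string...
    "filesAdded": json.dumps(files_added),
    # ...so every value in operationMetrics stays a flat scalar or string.
    "numFilesAdded": 5,
}
print(json.dumps({"operationMetrics": operation_metrics}))
```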

mayanksingh2298 commented

I've been noticing something similar: after I run a compact() command over my Delta Lake table stored in S3, I'm unable to run a Glue crawler on it to produce the metadata catalog (it gives an Internal Service Exception without any further debug info).

At first I thought it was a transient AWS error, but now it seems to be associated with this bug.

Do we have any plans to solve this? Also, would using the Apache Spark OPTIMIZE avoid it?

mayanksingh2298 commented

The same thing happens with filesRemoved.

I verified my last comment by manually rewriting the filesAdded and filesRemoved fields as strings instead of JSON objects, and AWS Glue then crawled the table successfully.

It's not limited to Glue and Databricks; a lot of other query engines I work with expect the same thing, and everything fell into place because of @mattslack-db's one comment. Thank you so much!

Please patch this :)

@rtyler rtyler added this to the Rust v0.18 milestone Feb 6, 2024
ion-elgreco pushed a commit that referenced this issue Mar 24, 2024
…2317)

I am by no means a Rust developer and haven't touched it in years, so please let me know if there's a better way to go about this. The Rust z_order and optimize.compact already serialize the metrics before they are passed back to Python, which then deserializes them, so the Python behavior of returning a dict has not changed, which I think is what we want.

# Description
Adds a custom serializer and Display implementation for the
`MetricDetails` fields, namely `filesAdded` and `filesRemoved` so that
those fields are written as strings instead of a struct to the commit
log. Query engines expect these fields to be strings on reads.

I had trouble getting the pyspark tests running locally, but here is an
example optimize commit log that gets written with these changes:

```
{"commitInfo":{"timestamp":1711125995487,"operation":"OPTIMIZE","operationParameters":{"targetSize":"104857600","predicate":"[]"},"clientVersion":"delta-rs.0.17.1","readVersion":10,"operationMetrics":{"filesAdded":"{\"avg\":19956.0,\"max\":19956,\"min\":19956,\"totalFiles\":1,\"totalSize\":19956}","filesRemoved":"{\"avg\":4851.833333333333,\"max\":10358,\"min\":3734,\"totalFiles\":6,\"totalSize\":29111}","numBatches":6,"numFilesAdded":1,"numFilesRemoved":6,"partitionsOptimized":1,"preserveInsertionOrder":true,"totalConsideredFiles":6,"totalFilesSkipped":0}}}
```
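Downstream Python code that still wants the structured metrics can round-trip the string with one json.loads call, e.g. (a sketch; the table path is the hypothetical one from the reproduction above):

```python
import json
from deltalake import DeltaTable

dt = DeltaTable("/tmp/repro_table")
last_commit = dt.history(1)[0]

# With the fix, filesAdded is a JSON *string* in the commit log, so
# structured access is one json.loads away.
files_added = json.loads(last_commit["operationMetrics"]["filesAdded"])
print(files_added["totalFiles"], files_added["totalSize"])
```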

# Related Issue(s)
- #2087

# Documentation

N/A