
Delta Lake MERGE/UPDATE/DELETE on Databricks should trigger optimized write and auto compaction #10417

Open
jlowe opened this issue Feb 13, 2024 · 5 comments
Assignees
Labels
feature request New feature or request

Comments

@jlowe
Member

jlowe commented Feb 13, 2024

https://docs.databricks.com/en/delta/tune-file-size.html states that Delta Lake MERGE, UPDATE, and DELETE operations will always trigger optimized write and auto compaction behavior as of 10.4 LTS, and this cannot be disabled. The RAPIDS Accelerator forms of these operations should mimic this behavior.
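As context for the requested behavior, the optimized write and auto compaction features referenced above are controlled on Databricks by session-level confs documented in the linked tune-file-size page. A minimal, illustrative invocation (the job file name is hypothetical; on 10.4 LTS+ these behaviors apply to MERGE/UPDATE/DELETE regardless of these settings):

```shell
# Config fragment, for illustration only: enable optimized write and
# auto compaction session-wide for a Delta Lake job on Databricks.
# On DBR 10.4 LTS+, MERGE/UPDATE/DELETE trigger these behaviors
# unconditionally; these confs affect other write paths.
spark-submit \
  --conf spark.databricks.delta.optimizeWrite.enabled=true \
  --conf spark.databricks.delta.autoCompact.enabled=true \
  your_delta_job.py   # hypothetical job file
```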

@jlowe jlowe added feature request New feature or request ? - Needs Triage Need team to review and classify and removed feature request New feature or request labels Feb 13, 2024
@mattahrens mattahrens added feature request New feature or request and removed ? - Needs Triage Need team to review and classify labels Feb 13, 2024
@jlowe
Member Author

jlowe commented Feb 13, 2024

Note that this should also remove the repartition by partition key for partitioned tables when writing a MERGE, since we're going to turn around and repartition for the optimized write anyway.

@andygrove andygrove added the ? - Needs Triage Need team to review and classify label Apr 1, 2024
@andygrove andygrove removed their assignment Apr 1, 2024
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Apr 2, 2024
@jlowe
Member Author

jlowe commented Apr 2, 2024

Note that for MERGE the user can specify spark.databricks.delta.merge.repartitionBeforeWrite.enabled=false to avoid repartitioning by the partition key when merging into a small number of partitions, which would otherwise funnel all of the write data through just a few tasks. This is not exactly semantically equivalent to optimized write and auto compaction, but it can avoid the terrible write performance in that partitioned-write case.
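The workaround above can be sketched as a job-level conf; this is a config fragment under the assumption the job runs on Databricks, and the job file name is hypothetical:

```shell
# Config fragment, for illustration only: disable the pre-write
# repartition for MERGE so that a merge touching only a few partitions
# is not funneled through a small number of tasks.
spark-submit \
  --conf spark.databricks.delta.merge.repartitionBeforeWrite.enabled=false \
  merge_job.py   # hypothetical job file
```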

@liurenjie1024
Collaborator

Hi @jlowe, Delta OSS has added support for optimized write: delta-io/delta#2145. I think we can always enable optimized write after porting this?

@jlowe
Member Author

jlowe commented Apr 18, 2024

This is a Databricks-specific behavior per the doc linked above, not a behavior in OSS Delta Lake, at least for the versions of OSS Delta Lake that we support. There are already separate issues tracking the OSS versions of optimized write and auto compact, see #10397 and #10398, respectively, but I do not see them as relevant to this issue. We already support optimized write and auto compact on Databricks.

@liurenjie1024 liurenjie1024 self-assigned this Apr 19, 2024
@liurenjie1024
Collaborator

I'll take this.
