[Feature Request] add WHEN NOT MATCHED BY SOURCE/TARGET clause support #1364
Comments
Is there any workaround or current solution (perhaps not in SQL, but in the Scala/Python API)?
Thanks for creating this issue @geoHeil! Copying over a suggested approach from the Slack thread:
Otherwise, the feature request is a good one, so we'll leave it open here if anyone from the community would like to work on it.
A small example. Step 1: initial write:
First UPSERT (including SCD2 soft delete):
Next (2nd) UPSERT (including SCD2 soft delete):
Now try to achieve the SCD2 behaviour (closing the old row and inserting a new row) via MERGE INTO:
Now trying the MERGE INTO:
Questions:
Some of the questions (around easing the handling of deletions) can be partially handled with:
now when applying this function:
Works as expected
works as expected
works as expected
This is not working as expected. The old record is invalidated, but the new record (which should open a fresh row starting from that date) is not written. When re-executing a second time:
the 67 appears. How can I fix the MERGE INTO conditions so the value appears straight away? Is there perhaps some additional MATCH condition required? How can I specify a
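One possible explanation for the symptom described above, offered as an assumption since the actual merge statement is not shown here: within a single MERGE, a source row that matches a target row can only trigger a matched action (update or delete), so it can close the old row but cannot also insert its replacement. A commonly used pattern is to stage each changed source row twice, once under its real key (to close the old row) and once under a merge key that never matches (to take the insert path). The plain-Python simulation below only illustrates that idea; `scd2_merge` and the column names `key`, `value`, `valid_from`, `valid_to` are hypothetical and it does not call Delta Lake.

```python
from datetime import date


def scd2_merge(target, source, today):
    """Simulate one SCD2 upsert of a full dump (source) into a history table (target).

    target: list of dicts with "key", "value", "valid_from", "valid_to"
            (valid_to is None while the row is still open).
    source: dict mapping key -> value for today's full dump.
    """
    result = [dict(row) for row in target]  # work on copies, leave the input untouched
    open_rows = {row["key"]: row for row in result if row["valid_to"] is None}

    # Stage each source row under its real key. A changed key is staged a second
    # time with merge key None, which never matches and so takes the insert path.
    staged = [(key, key, value) for key, value in source.items()]
    staged += [
        (None, key, value)
        for key, value in source.items()
        if key in open_rows and open_rows[key]["value"] != value
    ]

    inserts = []
    for merge_key, key, value in staged:
        matched = open_rows.get(merge_key)
        if matched is None:
            # NOT MATCHED (by target): open a fresh row.
            inserts.append(
                {"key": key, "value": value, "valid_from": today, "valid_to": None}
            )
        elif matched["value"] != value:
            # MATCHED and changed: close the old row only.
            matched["valid_to"] = today

    # NOT MATCHED BY SOURCE: the key vanished from the dump, so close its open row.
    for key, row in open_rows.items():
        if key not in source:
            row["valid_to"] = today

    return result + inserts


# Illustrative data only (not the dataset from this thread).
history = [{"key": "a", "value": 66, "valid_from": date(2022, 1, 1), "valid_to": None}]
history = scd2_merge(history, {"a": 67}, date(2022, 1, 2))
```

With the second staged copy in place, the old row is closed and the new value is opened in the same pass, rather than only on a re-run.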
This feature seems neat! Unfortunately, we are already fully booked in our H2 roadmap, and since you've indicated you are unable to contribute a solution, there's not much we can do right now. When we begin our 2023 roadmap planning, please bring up this issue again so that we can get the community's input on how desired it is! Cheers.
FYI I actually did all the work a couple of years ago and have a branch with this implemented for the Scala API only here: https://github.com/tripl-ai/delta. At the time the PR was rejected to this repo, but if you are motivated the code could be updated for the latest Delta (not by me).
I created a design doc to implement support for WHEN NOT MATCHED BY SOURCE clauses: [Design Doc] WHEN NOT MATCHED BY SOURCE. This enables selectively updating or deleting target rows that have no matches in the source table based on the merge condition.
API
A new
Usage example:
This merge invocation will:
More details on the API and the implementation proposal can be found in the design doc. The SQL API will be shipped with Spark 3.4, see apache/spark#38400.
Project Plan
… BY SOURCE clause in merge commands. Support for the clause was introduced in #1511 using the Scala Delta Table API; this patch extends the Python API to support the new clause. See the corresponding feature request: #1364.
Adding Python tests covering WHEN NOT MATCHED BY SOURCE to test_deltatable.py. The extended API for NOT MATCHED BY SOURCE mirrors existing clauses (MATCHED/NOT MATCHED).
Usage:
```python
# Parentheses allow the method chain to span several lines.
(
    dt.merge(source, "key = k")
    .whenNotMatchedBySourceDelete(condition="value > 0")
    .whenNotMatchedBySourceUpdate(set={"value": "value + 0"})
    .execute()
)
```
Closes #1533
GitOrigin-RevId: 76c7aea481fdbbf47af36ef7251ed555749954ac
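Applied to the daily full-dump case from the original request, the Python clauses from the commit message above might be combined roughly as follows. This is a minimal sketch under assumptions: the function name `soft_delete_missing_rows` and the columns `key`, `value`, `is_current`, `valid_from`, `valid_to` are hypothetical, `delta_table` is expected to be a `DeltaTable` and `source_df` a Spark DataFrame, and it only covers new keys and vanished keys (changed values would still need extra handling).

```python
def soft_delete_missing_rows(delta_table, source_df):
    """Insert keys that are new in the dump and close rows whose key vanished.

    Assumes a release of Delta Lake whose Python API includes the
    whenNotMatchedBySource* clauses. Column names are hypothetical.
    """
    (
        delta_table.alias("t")
        .merge(source_df.alias("s"), "t.key = s.key AND t.is_current = true")
        # Source rows without an open target row: open a new row.
        .whenNotMatchedInsert(
            values={
                "key": "s.key",
                "value": "s.value",
                "is_current": "true",
                "valid_from": "current_date()",
                "valid_to": "null",
            }
        )
        # Open target rows without a source row: soft delete instead of removing.
        # The condition keeps already closed history rows untouched.
        .whenNotMatchedBySourceUpdate(
            condition="t.is_current = true",
            set={"is_current": "false", "valid_to": "current_date()"},
        )
        .execute()
    )
```

Using an update rather than `whenNotMatchedBySourceDelete` keeps the history of removed keys, which is what an SCD2 table needs.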
Solved by #1511
Just posting for visibility, from #1511:
SQL support is in #1740
Feature request
https://delta-users.slack.com/archives/CJ70UCSHM/p1661955032288519
Overview
WHEN NOT MATCHED BY SOURCE/TARGET clause support
Motivation
Feature parity with other popular SQL databases; ease of use.
Further details
Each day I get a full dump of a table. However, this data needs to be cleaned and in particular, compressed using the SCD2 style approach to be easily consumable downstream.
Unfortunately, I do not get changesets or a NULL value for a key in case of deletions. A deleted row (including its key) simply no longer appears in the dump.
The links:
show me that something along these lines (https://www.mssqltips.com/sqlservertip/1704/using-merge-in-sql-server-to-insert-update-and-delete-at-the-same-time):
will not work with Delta/Spark as the WHEN NOT MATCHED clause does not seem to support the BY SOURCE | TARGET extension.
How can I still calculate the SCD2 representation?
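To make the row categories in this question concrete, here is a small plain-Python sketch of how a merge partitions keys when the source is a full dump. The function name `classify_keys` is hypothetical and the example does not use Spark or Delta; it only illustrates the terminology, where "not matched by source" is the set of target keys that have disappeared from the dump.

```python
def classify_keys(target_keys, source_keys):
    """Partition keys the way MERGE clauses see them."""
    target_keys, source_keys = set(target_keys), set(source_keys)
    return {
        # Present on both sides: candidates for WHEN MATCHED.
        "matched": target_keys & source_keys,
        # Only in the source: WHEN NOT MATCHED [BY TARGET], typically an insert.
        "not_matched_by_target": source_keys - target_keys,
        # Only in the target: WHEN NOT MATCHED BY SOURCE, i.e. deleted upstream.
        "not_matched_by_source": target_keys - source_keys,
    }


# Yesterday's table holds keys a and b; today's dump holds a and c.
categories = classify_keys(["a", "b"], ["a", "c"])
```

For an SCD2 table, the keys in `not_matched_by_source` are the ones whose open row needs to be closed.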
An example case/dataset:
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?