Duplicate Rows detected during snapshot #2642
Comments
This issue sounds like the crux of the matter:
Which snapshot strategy are you using? With either strategy, the way to end up with duplicate rows in the snapshot is for the source query itself to return exact duplicate records.
If the underlying table is liable to contain exact duplicate records, you can add logic to the snapshot query or an intermediate model to remove those duplicates.
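For example, here is a minimal sketch of deduplicating inside the snapshot query itself, assuming a hypothetical raw orders source with an order_id key and an updated_at column:
{% snapshot orders_snapshot %}
{{ config(target_schema='snapshots', unique_key='order_id', strategy='timestamp', updated_at='updated_at') }}

-- keep exactly one row per order_id, preferring the most recently updated copy
select *
from (
    select
        *,
        row_number() over (partition by order_id order by updated_at desc) as dedupe_rank
    from {{ source('raw', 'orders') }}
) as deduped
where dedupe_rank = 1

{% endsnapshot %}
The same row_number() filter could instead live in a staging model that the snapshot selects from, which keeps the helper column out of the snapshot table.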
I've encountered this exact same issue.
Hey @atvaccaro, it is technically possible to force this. Such an approach doesn't use a snapshot strategy to detect changes; it's just append-only, so the snapshot simply grows with every run rather than tracking what changed.
It's not something I'd ever recommend, but you're welcome to implement it in your own project.
I see duplicates too, in rather large numbers (in one table, 20% of rows appear twice in the snapshot).
We're using Redshift, and I checked the underlying data. Any idea what this could be?
@josarago Just confirming: do those duplicate rows have the exact same value of dbt_scd_id?
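For anyone wanting to check the same thing, a quick query along these lines works (the snapshot table name here is hypothetical):
select dbt_scd_id, count(*) as duplicate_count
from analytics.snapshots.orders_snapshot
group by dbt_scd_id
having count(*) > 1
order by duplicate_count desc;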
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
I have the same issue. Should I create a separate issue for this?
We also seem to observe duplicate rows in our snapshot (a single row from the source table becomes two rows), although the circumstances of how it happened are unclear at the moment.
@muscovitebob @urwa we're looking into this again. One hypothesis is that using multiple threads could cause this. Are you also using more than one?
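For what it's worth, the thread count can be forced down while testing this hypothesis, e.g. by invoking the snapshot with an explicit flag:
dbt snapshot --threads 1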
We also run into this issue every now and then. Our config is the same as @urwa's.
I think in my case the issue may have been caused by running two instances of dbt concurrently. We have been migrating Airflow instances and had a dbt DAG running on both instances at one point. I suspect that the snapshot command ran at the same time on both by accident and that this is the root cause in my case.
I have the same issue on Exasol, with the error message "Unable to get a stable set of rows in the source tables", and there are duplicate lines in the temp table before the merge even though the sources are clean. I figured out that a single quote within a varchar column caused the problem; after excluding all rows with single quotes in the string, the duplicates were gone.
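A simple way to find the offending rows is a filter like this (table and column names are hypothetical; the single quote inside the literal is escaped by doubling it):
-- find rows whose text column contains a single quote
select *
from source_table
where varchar_column like '%''%';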
We have a somewhat similar issue and ran into the error as well. I checked my snapshot table with a query and found exactly identical rows for a large number of records. I can manually remove them, but I would like to understand why it happened in the first place.
I just saw that I have commented on this thread before.
Hello, I know this is a closed issue, but I wanted to report that I'm seeing this weird behavior as well. I have a table that I'm snapshotting.
We recently had the same error, and eventually we found it was due to duplicate rows in the underlying table in the raw database, which was tricky to identify at first. We worked through a series of troubleshooting steps before finding it.
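A quick sanity check for that kind of duplication is to compare the total row count against the number of distinct unique-key values in the raw table (database, schema, table, and key names are hypothetical):
select
    count(*) as total_rows,
    count(distinct order_id) as distinct_keys
from raw_db.raw_schema.orders;
-- if total_rows is greater than distinct_keys, the raw table contains duplicate keys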
@ddppi First of all, I loved the recursive structure of your troubleshooting steps. Secondly, some good news: we have a timely discussion going on in #6089 that contains a proposal to proactively search for duplicates before merging into the final snapshot table; the summary of the current proposal is in that thread. We'd welcome you to join that discussion and contribute any feedback you have.
Glad that it is working for you! Thank you for reporting behavior that you didn't expect; I'll do my best to respond below. I didn't actually confirm any of this -- just thinking about it conceptually -- so please forgive me if I end up getting some of the details wrong!
You mentioned that your table has unique rows, which should theoretically produce a new snapshot row for every change if you are using either of the two snapshot strategies (timestamp or check) with an appropriate configuration. There is at least one situation for each strategy, however, that would lead to it not creating a new row.
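For reference, the check strategy is configured roughly like this in a snapshot block, as opposed to the timestamp strategy shown earlier (table and column names are hypothetical):
{% snapshot customers_snapshot %}
{{ config(target_schema='snapshots', unique_key='customer_id', strategy='check', check_cols=['status', 'email']) }}

select * from {{ source('raw', 'customers') }}

{% endsnapshot %}
Setting check_cols='all' instead of a column list makes dbt compare every column when deciding whether to write a new snapshot row.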
@dbeatty10 It's been a little while since I had the issue. Yes, I had been using the "timestamp" strategy.
We have been battling a dbt bug for several months now that we had hoped was solved in the 0.17.0 release.
Consistently, the snapshot of one of our tables breaks with the following error:
Database Error in snapshot user_campaign_audit (snapshots/user_campaign_audit.sql) 100090 (42P18): Duplicate row detected during DML action
Checking our snapshot table, there are indeed multiple rows with identical dbt_scd_ids. The table being snapshotted changes its schema with relatively high frequency. It's a core table that feeds a lot of downstream tables, so new columns are added fairly often. We also run a production dbt run every time we merge a branch into our master branch (we are running dbt on a GitLab CI/CD flow), so the snapshot can run multiple times a day. Our current approach to fix this is to create a copy of the snapshot table, reduce it to its distinct records, and then use that as the production version of the table. Something like:
-- build a deduplicated copy of the snapshot table
create table broken_audit_table as (select distinct * from audit_table);
-- atomically swap the deduplicated copy in as the production table (Snowflake)
alter table broken_audit_table swap with audit_table;
-- hand ownership back to the dbt role so dbt can keep managing the table
grant ownership on audit_table to role dbt;
Let me know if there is any more detail I can provide. Full stack is Fivetran/Snowflake/dbt.