
Duplicate Rows detected during snapshot #2642

Closed
dave-connors3 opened this issue Jul 23, 2020 · 19 comments
Labels
snapshots Issues related to dbt's snapshot functionality stale Issues that have gone stale

Comments

@dave-connors3

We have been battling a dbt bug for several months now that we were hopeful was solved in the release of 0.17.0.

Consistently, the snapshot of a table we have breaks due to the following error:

Database Error in snapshot user_campaign_audit (snapshots/user_campaign_audit.sql) 100090 (42P18): Duplicate row detected during DML action

Checking our snapshot table, there are indeed multiple rows with identical dbt_scd_ids. The table being snapshotted changes its schema with relatively high frequency: it's a core table that feeds a lot of downstream tables, so new columns are added fairly often. We also run a production dbt run every time we merge a branch into our master branch (we are running dbt on a GitLab CI/CD flow), so the snapshot can run multiple times a day.
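For anyone hitting the same error, a sketch of the duplicate check described above (the schema-qualified table name is illustrative; substitute your own snapshot table):

```sql
-- Find dbt_scd_ids that appear more than once in the snapshot table.
select dbt_scd_id, count(*) as n
from snapshots.user_campaign_audit
group by dbt_scd_id
having count(*) > 1;
```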

Our current approach to fix this is to create a copy of the snapshot table, reduce it to every distinct record, and then use that as the production version of the table. Something like:

create table broken_audit_table as (select distinct * from audit_table);
alter table broken_audit_table swap with audit_table;
grant ownership on audit_table to role dbt;

Let me know if there is any more detail I can provide. Full stack is Fivetran/Snowflake/dbt

@jtcohen6
Contributor

This issue sounds like the crux of the matter:

there are indeed multiple rows with identical dbt_scd_ids

Which snapshot strategy are you using? If it's the timestamp strategy, dbt creates a hash of the configured unique_key and updated_at. If it's the check strategy, dbt creates a hash of the configured unique_key and the current timestamp.

Therefore, the ways to end up with duplicate dbt_scd_id values are:

  • timestamp strategy: there exist multiple rows in your underlying table with the same values of unique key + updated timestamp
  • check strategy: multiple dbt snapshot invocations at exactly the same time (unlikely)
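For the timestamp case, a quick check against the source table can confirm this (a sketch; `id` and `updated_at` stand in for your configured unique_key and updated_at columns, and the table name is illustrative):

```sql
-- Rows sharing the same unique key and updated timestamp will
-- collide into the same dbt_scd_id when snapshotted.
select id, updated_at, count(*) as n
from my_source_table
group by id, updated_at
having count(*) > 1;
```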

If the underlying table is liable to having exact duplicate records, you can add logic to the snapshot query or an intermediate model to remove those duplicates.
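One hedged sketch of such deduplication logic, written inside the snapshot query itself (uses Snowflake-style `qualify`; all source, table, and column names are illustrative):

```sql
{% snapshot my_table_snapshot %}
    {{
        config(
          target_schema='snapshots',
          unique_key='id',
          strategy='timestamp',
          updated_at='updated_at'
        )
    }}

    -- Keep exactly one row per (id, updated_at) pair so that the
    -- derived dbt_scd_id stays unique.
    select * from {{ source('my_source', 'my_table') }}
    qualify row_number() over (
        partition by id, updated_at
        order by updated_at
    ) = 1

{% endsnapshot %}
```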

@atvaccaro

I've encountered this exact same issue with the check strategy. I can only assume that we ended up with duplicate Airflow runs at the same time. I wish snapshots had the option to just fully copy the table each time; and then we can reconstruct the type 2 dimension in a model rather than have dbt itself handle the merging. Then snapshots would be less likely to fail and lose data.

@jtcohen6
Contributor

Hey @atvaccaro, it is technically possible to force dbt snapshot to fully copy the table each time. You'd do this by setting the unique_key to something globally unique, e.g. a surrogate key that hashes together your actual primary key with the current_date.

Such an approach doesn't use a snapshot strategy to detect changes; it's just append-only. That means the snapshot is:

  • Much less efficient, since you're storing duplicate data in the table many times over.
  • No longer idempotent: given the same raw data in the table, running dbt snapshot once or many times returns different results. (Rather than recording only the rows that are different, dbt will append another entire copy of the table.)

It's not something I'd ever recommend, but you're welcome to implement it in your own project.
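For completeness, a sketch of what that append-only setup could look like (not recommended, per the above; all names are illustrative, and the `md5`/`current_date` syntax assumes Snowflake):

```sql
{% snapshot my_table_full_copy %}
    {{
        config(
          target_schema='snapshots',
          strategy='timestamp',
          unique_key='surrogate_key',
          updated_at='snapshotted_at'
        )
    }}

    -- Hashing the real key with the current date makes every row look
    -- new on each day's run, so each snapshot appends a full copy.
    select
        *,
        md5(id || '-' || current_date) as surrogate_key,
        current_timestamp as snapshotted_at
    from {{ source('my_source', 'my_table') }}

{% endsnapshot %}
```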

@josarago

  • timestamp strategy: there exist multiple rows in your underlying table with the same values of unique key + updated timestamp

I see duplicates too, in rather large numbers (in one table, 20% of rows appear twice in the snapshot) when using a timestamp strategy:

{% snapshot snapshot_my_table %}
    {{
        config(
          target_schema='snapshots',
          unique_key='id',
          strategy='timestamp',
          updated_at='updated_at'
        )
    }}

    select * from {{ source('schema', 'my_table') }}

{% endsnapshot %}

We're using Redshift, and I checked that all of the ids in my_table are unique, so any (id, updated_at) pair has to be unique too.

Any idea what this could be?

@jtcohen6
Contributor

I see duplicates too, in rather large numbers (in one table, 20% of rows appear twice in the snapshot)

@josarago Just confirming, do those duplicate rows have the exact same value of dbt_scd_id? Is there any chance you're running more than one dbt snapshot command simultaneously?

@github-actions
Contributor

github-actions bot commented Nov 6, 2021

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Nov 6, 2021
@urwa

urwa commented Jan 19, 2022

I have the same issue, with my config looking like this:

{{
    config(
        target_database='XXX',
        target_schema='XXX',
        strategy='check',
        unique_key='extraction_id',
        check_cols='all'
    )
}}

Should I create a separate issue for this?

  • dbt_scd_id is the same; the rows are 100% duplicates.
  • dbt snapshot is being run only once, since I am running it manually.

@muscovitebob

We also seem to observe duplicate rows in our snapshot (single row from source table becomes two rows), although the circumstances of how it happened are unclear at the moment. We are using the 'timestamp' strategy.

@josarago

@muscovitebob @urwa we're looking into this again. One hypothesis is that using multiple threads could cause this. Are you also using more than 1?

@leoch20

leoch20 commented Feb 14, 2022

We also run into this issue every now and then. Our config is the same as @urwa's

@muscovitebob

@muscovitebob @urwa we're looking into this again. One hypothesis is that using multiple threads could cause this. Are you also using more than 1?

I think in my case the issue may have been caused by running two instances of dbt concurrently. We have been migrating Airflow instances and had a dbt DAG running on both instances at one point. I suspect that the snapshot command ran at the same time on both by accident, and this is the root cause in my case.

@anitsche-bt

I have the same issue on Exasol, with the error message "Unable to get a stable set of rows in the source tables", and there are duplicate lines in the temp table before the merge, even though the sources are clean. I figured out that a single quote within a varchar column caused the problem; after excluding all rows with single quotes in the string, the duplicates were gone.

@urwa

urwa commented Sep 12, 2022

We have a somewhat similar issue. I got the error:

 UPDATE/MERGE must match at most one source row for each target row

I checked my snapshot table with the following query:

SELECT * EXCEPT(row_number)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id) AS row_number
  FROM `my_snapshot_table`
  WHERE dbt_valid_to IS NULL
)
WHERE row_number > 1

And I found exactly identical rows for a lot of ids.

I can manually remove them, but I would like to understand why it happened in the first place.

We are using the check strategy. In dbt_project.yml it looks like this:

snapshots:
  bigquery:
    history_table:
      +target_schema: history
      +unique_key: id
      +strategy: check
      +check_cols: [col1, col2, col3, ...]
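As for removing the exact duplicates manually, one possible sketch in BigQuery (back up the snapshot first; the table name is illustrative):

```sql
-- Rebuild the snapshot table keeping only distinct rows.
-- This only works when the duplicates are byte-for-byte identical.
create or replace table `my_snapshot_table` as
select distinct *
from `my_snapshot_table`;
```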

@urwa

urwa commented Sep 12, 2022

I just saw I have commented on this thread before. We are using threads=10 for no reason in particular.
Is there a best practice here that we should follow?

@sarah-tuva

Hello, I know this is a closed issue, but I wanted to report that I'm seeing this weird behavior as well. I have a table that I'm snapshotting with the timestamp strategy. The unique_key is made up of several columns. The intention is that this is a full snapshot each time the model is run. The table that is being snapshotted has all unique rows meaning dbt_scd_id is a unique key. I was able to resolve this issue by adding the updated_at column to the unique_key config. In theory, this shouldn't matter since the dbt_scd_id is already a concatenation of unique_key and updated_at. Regardless, it has resolved the issue. shrug
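For reference, a sketch of the workaround @sarah-tuva describes, folding the updated_at column into the unique_key expression (all column, source, and table names are illustrative):

```sql
{% snapshot my_full_snapshot %}
    {{
        config(
          target_schema='snapshots',
          strategy='timestamp',
          -- updated_at folded into the key, so every new version of a
          -- row gets a distinct unique_key (and hence dbt_scd_id).
          unique_key="col_a || '-' || col_b || '-' || updated_at",
          updated_at='updated_at'
        )
    }}

    select * from {{ source('my_source', 'my_table') }}

{% endsnapshot %}
```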

@ddppi

ddppi commented Jan 31, 2023

We recently had the same error, and eventually we found it was due to duplicates in the underlying table in the RAW DB, which were tricky to identify initially.

Some of our experience while we troubleshot our error:

  1. The actual SQL query used to create the snapshot table can be found in .<my-local-project-folder>\logs\dbt.log. We copied the query out and ran it in Snowflake, which helped tremendously in our case to quickly pinpoint the issue.
  2. The dbt model creates a table in CONFORMED from RAW, and the snapshot snapshots this dbt model. When we started investigating the duplicates, we first looked at the source table in CONFORMED instead of RAW, and we couldn't find any duplicate records there. The tricky part is that our pipeline runs the dbt run and snapshot commands in a temporary database first, before building everything in CONFORMED. Long story short: we should have looked at the source table in RAW, even though the snapshot "appears" to create snapshots from the tables in CONFORMED.
  3. If you don't understand what I was saying in point 2, refer back to point 1 :)

@dbeatty10
Contributor

dbeatty10 commented Jan 31, 2023

@ddppi first of all, I loved the recursive structure of your troubleshooting steps. Here's a time lapse of me walking through them making sure I understand everything correctly:
[image: time lapse]

Secondly, some good news!

We have a timely discussion going on in #6089 that contains a proposal to proactively search for duplicates before merging into the final snapshot table. The summary of the current proposal is here. Would welcome you to join that discussion and contribute any feedback you have.

@dbeatty10
Contributor

@sarah-tuva

The intention is that this is a full snapshot each time the model is run. The table that is being snapshotted has all unique rows meaning dbt_scd_id is a unique key. I was able to resolve this issue by adding the updated_at column to the unique_key config. In theory, this shouldn't matter since the dbt_scd_id is already a concatenation of unique_key and updated_at.

Glad that it is working for you!

Thank you for reporting behavior that you didn't expect. I'll do my best to respond below. I didn't actually confirm any of this -- just thinking about it conceptually -- so please forgive me if I end up getting some of the details wrong!

To do a full snapshot every time (no matter what), adding the updated_at column to the unique_key config is the solution I'd expect. But there are other configurations that I think would also lead to a full snapshot every time, and presumably you were trying to use one of those configurations.

You mentioned that your table has unique rows, which should theoretically do a full snapshot if you are doing either of the following:

  1. strategy='check' and check_cols='all'; or
  2. strategy='timestamp' and updated_at='your_constantly_updating_timestamp'

There is at least one situation for each that would lead to it not creating a new row:

  1. if the columns don't actually change day-to-day
  2. if your_constantly_updating_timestamp doesn't actually change day-to-day

@sarah-tuva

@dbeatty10 It's been a little while since I had the issue. Yes, I had been using the "timestamp" strategy. The updated_at field was a timestamp with millisecond precision so it was definitely changing between runs.

@dbeatty10 dbeatty10 added the snapshots Issues related to dbt's snapshot functionality label Mar 3, 2023