
Duplicate Rows detected during snapshot #2642

Closed
dave-connors3 opened this issue Jul 23, 2020 · 19 comments
Labels
snapshots Issues related to dbt's snapshot functionality stale Issues that have gone stale

Comments

@dave-connors3

We have been battling a dbt bug for several months now that we were hopeful was solved in the release of 0.17.0.

Consistently, the snapshot of a table we have breaks due to the following error:

Database Error in snapshot user_campaign_audit (snapshots/user_campaign_audit.sql) 100090 (42P18): Duplicate row detected during DML action

Checking our snapshot table, there are indeed multiple rows with identical dbt_scd_ids. The table being snapshotted changes its schema with relatively high frequency: it's a core table that feeds a lot of downstream tables, so new columns are added fairly often. We also run a production dbt run every time we merge a branch into our master branch (we are running dbt on a GitLab CI/CD flow), so the snapshot can run multiple times a day.
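For anyone hitting the same error, a sketch of the duplicate check described above (the schema-qualified table name is illustrative; substitute your own snapshot table):

```sql
-- Find dbt_scd_ids that appear more than once in the snapshot table.
select dbt_scd_id, count(*) as n
from snapshots.user_campaign_audit
group by dbt_scd_id
having count(*) > 1;
```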

Our current approach to fix this is to create a copy of the snapshot table, reduce it to every distinct record, and then use that as the production version of the table. Something like:

create table broken_audit_table as (select distinct * from audit_table);
alter table broken_audit_table swap with audit_table;
grant ownership on audit_table to role dbt;

Let me know if there is any more detail I can provide. Full stack is Fivetran/Snowflake/dbt

@jtcohen6
Contributor

This issue sounds like the crux of the matter:

there are indeed multiple rows with identical dbt_scd_ids

Which snapshot strategy are you using? If it's the timestamp strategy, dbt creates a hash of the configured unique_key and updated_at. If it's the check strategy, dbt creates a hash of the configured unique_key and the current timestamp.

Therefore, the ways to end up with duplicate dbt_scd_id values are:

  • timestamp strategy: there exist multiple rows in your underlying table with the same values of unique key + updated timestamp
  • check strategy: multiple dbt snapshot invocations at exactly the same time (unlikely)
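For the timestamp case, a quick check against the source table can confirm this (a sketch; `id` and `updated_at` stand in for your configured unique_key and updated_at columns, and the table name is illustrative):

```sql
-- Rows sharing the same unique key and updated timestamp will
-- collide into the same dbt_scd_id when snapshotted.
select id, updated_at, count(*) as n
from my_source_table
group by id, updated_at
having count(*) > 1;
```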

If the underlying table is liable to having exact duplicate records, you can add logic to the snapshot query or an intermediate model to remove those duplicates.
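One hedged sketch of such deduplication logic, written inside the snapshot query itself (uses Snowflake-style `qualify`; all source, table, and column names are illustrative):

```sql
{% snapshot my_table_snapshot %}
    {{
        config(
          target_schema='snapshots',
          unique_key='id',
          strategy='timestamp',
          updated_at='updated_at'
        )
    }}

    -- Keep exactly one row per (id, updated_at) pair so that the
    -- derived dbt_scd_id stays unique.
    select * from {{ source('my_source', 'my_table') }}
    qualify row_number() over (
        partition by id, updated_at
        order by updated_at
    ) = 1

{% endsnapshot %}
```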

@atvaccaro

I've encountered this exact same issue with the check strategy. I can only assume that we ended up with duplicate Airflow runs at the same time. I wish snapshots had the option to just fully copy the table each time; and then we can reconstruct the type 2 dimension in a model rather than have dbt itself handle the merging. Then snapshots would be less likely to fail and lose data.

@jtcohen6
Contributor

Hey @atvaccaro, it is technically possible to force dbt snapshot to fully copy the table each time. You'd do this by setting the unique_key to something globally unique, e.g. a surrogate key that hashes together your actual primary key with the current_date.

Such an approach doesn't use a snapshot strategy to detect changes; it's just append-only. That means the snapshot is:

  • Much less efficient, since you're storing duplicate data in the table many times over.
  • No longer idempotent: given the same raw data in the table, running dbt snapshot once or many times returns different results. (Rather than recording only the rows that are different, dbt will append another entire copy of the table.)

It's not something I'd ever recommend, but you're welcome to implement it in your own project.
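For completeness, a sketch of what that append-only setup could look like (not recommended, per the above; all names are illustrative, and the `md5`/`current_date` syntax assumes Snowflake):

```sql
{% snapshot my_table_full_copy %}
    {{
        config(
          target_schema='snapshots',
          strategy='timestamp',
          unique_key='surrogate_key',
          updated_at='snapshotted_at'
        )
    }}

    -- Hashing the real key with the current date makes every row look
    -- new on each day's run, so each snapshot appends a full copy.
    select
        *,
        md5(id || '-' || current_date) as surrogate_key,
        current_timestamp as snapshotted_at
    from {{ source('my_source', 'my_table') }}

{% endsnapshot %}
```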

@josarago

  • timestamp strategy: there exist multiple rows in your underlying table with the same values of unique key + updated timestamp

I see duplicates too, in rather large numbers (in one table, 20% of rows appear twice in the snapshot) when using a timestamp strategy:

{% snapshot snapshot_my_table %}
    {{
        config(
          target_schema='snapshots',
          unique_key='id',
          strategy='timestamp',
          updated_at='updated_at'
        )
    }}

    select * from {{ source('schema', 'my_table') }}

{% endsnapshot %}

We're using Redshift, and I checked that all of the ids in my_table are unique, so any (id, updated_at) pair has to be unique too.

Any idea what this could be?

@jtcohen6
Contributor

I see duplicates too, in rather large numbers (in one table, 20% of rows appear twice in the snapshot)

@josarago Just confirming, do those duplicate rows have the exact same value of dbt_scd_id? Is there any chance you're running more than one dbt snapshot command simultaneously?

@github-actions
Contributor

github-actions bot commented Nov 6, 2021

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Nov 6, 2021
@urwa

urwa commented Jan 19, 2022

I have the same issue, with my config looking like this:

{{
    config(
        target_database='XXX',
        target_schema='XXX',
        strategy='check',
        unique_key='extraction_id',
        check_cols='all'
    )
}}

Should I create a separate issue for this?

  • dbt_scd_id is the same; the rows are 100% duplicates.
  • dbt snapshot is being run only once, since I am running it manually.

@muscovitebob

We also seem to observe duplicate rows in our snapshot (single row from source table becomes two rows), although the circumstances of how it happened are unclear at the moment. We are using the 'timestamp' strategy.

@josarago

@muscovitebob @urwa we're looking into this again. One hypothesis is that using multiple threads could cause this. Are you also using more than 1?

@leoch20

leoch20 commented Feb 14, 2022

We also run into this issue every now and then. Our config is the same as @urwa's

@muscovitebob

@muscovitebob @urwa we're looking into this again. One hypothesis is that using multiple threads could cause this. Are you also using more than 1?

I think in my case the issue may have been caused by running two instances of dbt concurrently. We have been migrating Airflow instances and had a dbt DAG running on both instances at one point. I suspect that the snapshot command ran at the same time on both by accident, and this is the root cause in my case.

@anitsche-bt

I have the same issue on Exasol, with the error message "Unable to get a stable set of rows in the source tables", and there are duplicate lines in the temp table before the merge, even though the sources are clean. I figured out that a single quote within a varchar column caused the problem; after excluding all rows with single quotes in the string, the duplicates were gone.

@urwa

urwa commented Sep 12, 2022

We have a somewhat similar issue. I got the error:

 UPDATE/MERGE must match at most one source row for each target row

I checked my snapshot table with the following query:

SELECT * EXCEPT(row_number)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id) AS row_number
  FROM `my_snapshot_table`
  WHERE dbt_valid_to IS NULL
)
WHERE row_number > 1

And I found exactly identical rows for a lot of ids.

I can manually remove them, but I would like to understand why it happened in the first place.

We are using the check strategy. In dbt_project.yml it looks like this:

snapshots:
  bigquery:
    history_table:
      +target_schema: history
      +unique_key: id
      +strategy: check
      +check_cols: [col1, col2, col3, ...]
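As for removing the exact duplicates manually, one possible sketch in BigQuery (back up the snapshot first; the table name is illustrative):

```sql
-- Rebuild the snapshot table keeping only distinct rows.
-- This only works when the duplicates are byte-for-byte identical.
create or replace table `my_snapshot_table` as
select distinct *
from `my_snapshot_table`;
```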

@urwa

urwa commented Sep 12, 2022

I just saw I have commented on this thread before. We are using threads=10 for no reason in particular.
Is there a best practice here that we should follow?

@sarah-tuva

Hello, I know this is a closed issue, but I wanted to report that I'm seeing this weird behavior as well. I have a table that I'm snapshotting with the timestamp strategy. The unique_key is made up of several columns. The intention is that this is a full snapshot each time the model is run. The table that is being snapshotted has all unique rows meaning dbt_scd_id is a unique key. I was able to resolve this issue by adding the updated_at column to the unique_key config. In theory, this shouldn't matter since the dbt_scd_id is already a concatenation of unique_key and updated_at. Regardless, it has resolved the issue. shrug
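For reference, a sketch of the workaround @sarah-tuva describes, folding the updated_at column into the unique_key expression (all column, source, and table names are illustrative):

```sql
{% snapshot my_full_snapshot %}
    {{
        config(
          target_schema='snapshots',
          strategy='timestamp',
          -- updated_at folded into the key, so every new version of a
          -- row gets a distinct unique_key (and hence dbt_scd_id).
          unique_key="col_a || '-' || col_b || '-' || updated_at",
          updated_at='updated_at'
        )
    }}

    select * from {{ source('my_source', 'my_table') }}

{% endsnapshot %}
```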

@ddppi

ddppi commented Jan 31, 2023

We recently had the same error, and eventually we found it was due to duplicates in the underlying table in the RAW DB, which were tricky to identify initially.

Some of our experience while we troubleshot our error:

  1. The actual SQL query used to create the snapshot table can be found in .<my-local-project-folder>\logs\dbt.log. We copied the query out and ran it in Snowflake, which helped tremendously in our case to quickly pinpoint the issue.
  2. The dbt model creates a table in CONFORMED from RAW, and the snapshot snapshots this dbt model. When we started investigating the duplicates, we first looked at the source table in CONFORMED instead of RAW, and we couldn't find any duplicate records there. The tricky part is that our pipeline runs the dbt run and snapshot commands in a temporary database first, before building everything in CONFORMED. Long story short: we should have looked at the source table in RAW, even though the snapshot "appears" to create snapshots from the tables in CONFORMED.
  3. If you don't understand what I was saying in point 2, refer back to point 1 :)

@dbeatty10
Contributor

dbeatty10 commented Jan 31, 2023

@ddppi first of all, I loved the recursive structure of your troubleshooting steps. Here's a time lapse of me walking through them making sure I understand everything correctly:
[image: time lapse]

Secondly, some good news!

We have a timely discussion going on in #6089 that contains a proposal to proactively search for duplicates before merging into the final snapshot table. The summary of the current proposal is here. Would welcome you to join that discussion and contribute any feedback you have.

@dbeatty10
Contributor

@sarah-tuva

The intention is that this is a full snapshot each time the model is run. The table that is being snapshotted has all unique rows meaning dbt_scd_id is a unique key. I was able to resolve this issue by adding the updated_at column to the unique_key config. In theory, this shouldn't matter since the dbt_scd_id is already a concatenation of unique_key and updated_at.

Glad that it is working for you!

Thank you for reporting behavior that you didn't expect. I'll do my best to respond below. I didn't actually confirm any of this -- just thinking about it conceptually -- so please forgive me if I end up getting some of the details wrong!

To do a full snapshot every time (no matter what), adding the updated_at column to the unique_key config is the solution I'd expect. But there are other configurations that I think would also lead to a full snapshot every time, and presumably you were trying to use one of those configurations.

You mentioned that your table has unique rows, which should theoretically do a full snapshot if you are doing either of the following:

  1. strategy='check' and check_cols='all'; or
  2. strategy='timestamp' and updated_at='your_constantly_updating_timestamp'

There is at least one situation for each that would lead to it not creating a new row:

  1. if the columns don't actually change day-to-day
  2. if your_constantly_updating_timestamp doesn't actually change day-to-day

@sarah-tuva

@dbeatty10 It's been a little while since I had the issue. Yes, I had been using the "timestamp" strategy. The updated_at field was a timestamp with millisecond precision so it was definitely changing between runs.

@dbeatty10 dbeatty10 added the snapshots Issues related to dbt's snapshot functionality label Mar 3, 2023