Add deduplication option #38

Open: wants to merge 3 commits into base: master

mikeperello-scopely

Description of change

Some taps, such as the DynamoDB one (which uses DynamoDB Streams), can generate duplicate records. To solve this, a query runs before the load into BigQuery and removes all duplicates based on an attribute (typically a timestamp).

Manual QA steps

To run the deduplication, specify the following attributes as environment variables:

  • deduplication_property: the attribute to deduplicate on (typically a timestamp).
    When duplicates exist, the query keeps, by default, the one with the largest deduplication_property, or a random one among those that share the largest value.
    The order can also be changed, for example to keep the one with the smallest value.
    • deduplication_order:
      • DESC: Default option. Keeps the one with the largest deduplication_property, or a random one among those that share the largest value.
      • ASC: Keeps the one with the smallest deduplication_property, or a random one among those that share the smallest value.
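
As a rough illustration of the settings above, here is a minimal sketch of how they could be read and turned into a BigQuery deduplication query. This is not the code from this PR: the helper names `read_dedup_settings` and `build_dedup_query`, the table name, and the use of `_id` as the column that identifies a duplicate are all assumptions made for the example. It relies on `ROW_NUMBER()`, which picks an arbitrary row among ties, in line with the "random element" behaviour described above.

```python
import os

VALID_ORDERS = {"ASC", "DESC"}


def read_dedup_settings(env=os.environ):
    """Return (property, order), or None when deduplication is not configured.

    Hypothetical helper: the variable names follow the PR description.
    """
    prop = env.get("deduplication_property")
    if not prop:
        return None
    # DESC is the default order, as described in the PR.
    order = env.get("deduplication_order", "DESC").upper()
    if order not in VALID_ORDERS:
        raise ValueError(f"deduplication_order must be ASC or DESC, got {order!r}")
    return prop, order


def build_dedup_query(table, key_columns, prop, order="DESC"):
    """Build a query that keeps one row per key, ranked by `prop` in `order`.

    Identifiers are interpolated directly, so this sketch is only suitable
    for trusted configuration values.
    """
    keys = ", ".join(key_columns)
    return (
        "SELECT * EXCEPT(row_num) FROM ("
        f"SELECT *, ROW_NUMBER() OVER (PARTITION BY {keys} ORDER BY {prop} {order}) AS row_num "
        f"FROM `{table}`) WHERE row_num = 1"
    )


if __name__ == "__main__":
    settings = read_dedup_settings({"deduplication_property": "date"})
    if settings is not None:
        # "project.dataset.table" and "_id" are placeholders for illustration.
        print(build_dedup_query("project.dataset.table", ["_id"], *settings))
```

Ranking inside a window and keeping only the first row per partition leaves the choice between ASC and DESC as a single keyword in the query.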

Additional info

For example, let's say we have the following data coming from the tap:

#_id, name, email, city, date
#1, Mike, NULL, NULL, 2022-07-19 09:00:00
#1, Mike, [email protected], NULL, 2022-07-19 09:10:00
#2, Scopely, [email protected], NULL, 2022-07-19 09:05:00
#1, Mike, [email protected], Barcelona, 2022-07-19 09:15:00

In this case, with date as the deduplication_property and the default deduplication_order (DESC), we would have:

#_id, name, email, city, date
#1, Mike, [email protected], Barcelona, 2022-07-19 09:15:00
#2, Scopely, [email protected], NULL, 2022-07-19 09:05:00

If we set the deduplication_order to ASC, we would have:

#_id, name, email, city, date
#1, Mike, NULL, NULL, 2022-07-19 09:00:00
#2, Scopely, [email protected], NULL, 2022-07-19 09:05:00
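
The example above can also be reproduced in memory. The sketch below is only an illustration of the expected semantics, not the query this PR runs: `deduplicate` is a hypothetical helper, `_id` is assumed to be the column that identifies a duplicate, and the email values are kept exactly as they appear in the example.

```python
from datetime import datetime

# Rows from the example above.
ROWS = [
    {"_id": 1, "name": "Mike", "email": None, "city": None, "date": "2022-07-19 09:00:00"},
    {"_id": 1, "name": "Mike", "email": "[email protected]", "city": None, "date": "2022-07-19 09:10:00"},
    {"_id": 2, "name": "Scopely", "email": "[email protected]", "city": None, "date": "2022-07-19 09:05:00"},
    {"_id": 1, "name": "Mike", "email": "[email protected]", "city": "Barcelona", "date": "2022-07-19 09:15:00"},
]


def _parse(value):
    """Parse the timestamp format used in the example."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")


def deduplicate(rows, key, prop, order="DESC"):
    """Keep one row per `key`: the largest `prop` for DESC, the smallest for ASC.

    On ties the first row seen is kept, which is one arbitrary choice.
    """
    best = {}
    for row in rows:
        current = best.get(row[key])
        if current is None:
            best[row[key]] = row
            continue
        if order == "DESC":
            better = _parse(row[prop]) > _parse(current[prop])
        else:
            better = _parse(row[prop]) < _parse(current[prop])
        if better:
            best[row[key]] = row
    return sorted(best.values(), key=lambda r: r[key])


if __name__ == "__main__":
    for order in ("DESC", "ASC"):
        print(order)
        for row in deduplicate(ROWS, "_id", "date", order):
            print(" ", row)
```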

@RuslanBergenov RuslanBergenov self-assigned this Feb 9, 2023
2 participants