[Delta] Correcting late arriving SCD2 #41

Open
christophergrant opened this issue Jan 21, 2023 · 0 comments

christophergrant commented Jan 21, 2023

Late arriving data, especially when operating in real time, can be a plague.

With late arriving data, we can end up with situations where we get nonsensical SCD2s like:

| id | start_date | end_date |
| --- | --- | --- |
| 1 | 2023-01-01 | 2023-01-03 |
| 1 | 2023-01-02 | 2023-01-03 |
| 1 | 2023-01-03 | null |

In this situation, rows 2 and 3 arrived in the data set on time, and row 1 arrived after them for whatever reason. The problem with this output is that the validity intervals overlap: we have 01 -> 03 and 02 -> 03, so on 2023-01-02 two rows for the same id both look valid. The duplicate 03 end date ruins our time tracking. What we want is 01 -> 02 and 02 -> 03, with each row's end_date equal to the next row's start_date.
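To make the desired output concrete, here is a minimal sketch in plain Python (not Spark or Delta code) of one way to repair the example above: sort each id's rows by start_date and set every end_date to the next row's start_date. The function name `repair_end_dates` and the dict-based row layout are hypothetical, and the sketch assumes start_date values are unique per id and that the latest row's end_date is already correct.

```python
from datetime import date
from itertools import groupby
from operator import itemgetter


def repair_end_dates(rows):
    """Return copies of rows where each end_date equals the next start_date for the same id.

    Hypothetical helper: assumes unique start_date per id and leaves the
    latest row's end_date untouched.
    """
    ordered = sorted(rows, key=itemgetter("id", "start_date"))
    repaired = []
    for _, group in groupby(ordered, key=itemgetter("id")):
        versions = list(group)
        # Pair every version with its successor; the last version has none.
        for current, nxt in zip(versions, versions[1:] + [None]):
            fixed = dict(current)
            if nxt is not None:
                fixed["end_date"] = nxt["start_date"]
            repaired.append(fixed)
    return repaired


# The three rows from the table above, with null represented as None.
rows = [
    {"id": 1, "start_date": date(2023, 1, 1), "end_date": date(2023, 1, 3)},
    {"id": 1, "start_date": date(2023, 1, 2), "end_date": date(2023, 1, 3)},
    {"id": 1, "start_date": date(2023, 1, 3), "end_date": None},
]
fixed_rows = repair_end_dates(rows)
```

With the sample rows, the first row's end_date becomes 2023-01-02 and the other two rows are unchanged. On a real table the same idea would likely be expressed as a window function over id ordered by start_date, but that is outside what this sketch shows.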

This issue tracks work on two approaches, either or both of which may be implemented: detecting the problem at the point of ingestion and SCD2 transformation, so it is addressed in real time, and a separate function that scans over existing data and corrects it after the fact.
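As a rough illustration of the after-the-fact scan, the sketch below flags adjacent versions of the same id whose intervals overlap, meaning a row is still open or ends after the next row has already started. It is plain Python over in-memory dicts, not a Delta Lake implementation; `find_overlapping_versions` and the row layout are hypothetical names chosen for this example.

```python
from collections import defaultdict
from datetime import date


def find_overlapping_versions(rows):
    """Return (id, start_date, next_start_date) for each pair of adjacent overlapping versions.

    Hypothetical helper: a version overlaps its successor when its end_date
    is None or later than the successor's start_date.
    """
    by_id = defaultdict(list)
    for row in rows:
        by_id[row["id"]].append(row)

    problems = []
    for key, versions in by_id.items():
        versions.sort(key=lambda r: r["start_date"])
        for current, nxt in zip(versions, versions[1:]):
            end = current["end_date"]
            if end is None or end > nxt["start_date"]:
                problems.append((key, current["start_date"], nxt["start_date"]))
    return problems


# The rows from the table above: only the 01 -> 03 row overlaps its successor.
sample = [
    {"id": 1, "start_date": date(2023, 1, 1), "end_date": date(2023, 1, 3)},
    {"id": 1, "start_date": date(2023, 1, 2), "end_date": date(2023, 1, 3)},
    {"id": 1, "start_date": date(2023, 1, 3), "end_date": None},
]
overlaps = find_overlapping_versions(sample)
```

The same check could run at ingestion time against the incoming row and its neighbours for that id, which would cover the real-time option. How that would be wired into the SCD2 transformation is left open here.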

As an aside, this is probably why the SC in SCD2 stands for Slowly Changing, so we can avoid situations like this as much as possible. However, there is obviously great utility in being able to display the history of a particular entity in a single snapshot of a Delta Lake table, for example.
