Late arriving data, especially when operating in real time, can be a plague.
With late arriving data, we can end up with situations where we get nonsensical SCD2s like:
| id | start_date | end_date   |
|----|------------|------------|
| 1  | 2023-01-01 | 2023-01-03 |
| 1  | 2023-01-02 | 2023-01-03 |
| 1  | 2023-01-03 | null       |
In this situation, rows 2 and 3 arrived in the data set on time. Row 1 arrived later than rows 2 and 3 for whatever reason. The problem with this output is that the dates don't make sense: we have 01 -> 03 and 02 -> 03, so rows 1 and 2 both end on 03 and their validity periods overlap. The duplicate 03 ruins our time tracking. What we want is 01 -> 02, and 02 -> 03.
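A minimal sketch of how this inconsistency could be detected, assuming the rows are available as plain Python dicts with `id`, `start_date`, and `end_date` keys. The function name `find_inconsistent_rows` is hypothetical and not part of any existing API; it only illustrates the check that each row's `end_date` should equal the next row's `start_date` for the same `id`.

```python
from datetime import date
from itertools import groupby


def find_inconsistent_rows(rows):
    """Return rows whose end_date differs from the next start_date for the same id."""
    bad = []
    # Order versions chronologically within each id.
    ordered = sorted(rows, key=lambda r: (r["id"], r["start_date"]))
    for _, group in groupby(ordered, key=lambda r: r["id"]):
        versions = list(group)
        # Compare each version with the one that follows it.
        for current, nxt in zip(versions, versions[1:]):
            if current["end_date"] != nxt["start_date"]:
                bad.append(current)
    return bad


# The example from this issue (hypothetical in-memory representation).
rows = [
    {"id": 1, "start_date": date(2023, 1, 1), "end_date": date(2023, 1, 3)},
    {"id": 1, "start_date": date(2023, 1, 2), "end_date": date(2023, 1, 3)},
    {"id": 1, "start_date": date(2023, 1, 3), "end_date": None},
]
inconsistent = find_inconsistent_rows(rows)
```

With the example data, only the late-arriving row (01 -> 03) is flagged, because its `end_date` skips past the 02 row's `start_date`.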
This issue tracks work on two possible approaches: detecting and addressing the problem in real time, at the point of ingestion and SCD2 transformation, and/or adding a separate function that scans existing data and corrects it after the fact.
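The after-the-fact correction could look roughly like the sketch below. It assumes rows held as plain Python dicts and that `start_date` is trustworthy, so each `end_date` can be recomputed from the next `start_date` for the same `id`. The name `repair_end_dates` is hypothetical, and a real implementation against a Delta Lake table would need to express the same logic with that table's own read and write operations.

```python
from datetime import date
from itertools import groupby


def repair_end_dates(rows):
    """Return new rows where each end_date equals the next start_date for the same id."""
    repaired = []
    # Order versions chronologically within each id.
    ordered = sorted(rows, key=lambda r: (r["id"], r["start_date"]))
    for _, group in groupby(ordered, key=lambda r: r["id"]):
        versions = list(group)
        for i, row in enumerate(versions):
            if i + 1 < len(versions):
                # Close this version where the next one begins.
                new_end = versions[i + 1]["start_date"]
            else:
                # Assumption: the latest version keeps whatever end_date it had.
                new_end = row["end_date"]
            repaired.append({**row, "end_date": new_end})
    return repaired


# The example from this issue (hypothetical in-memory representation).
history = [
    {"id": 1, "start_date": date(2023, 1, 2), "end_date": date(2023, 1, 3)},
    {"id": 1, "start_date": date(2023, 1, 3), "end_date": None},
    {"id": 1, "start_date": date(2023, 1, 1), "end_date": date(2023, 1, 3)},  # arrived late
]
fixed = repair_end_dates(history)
```

On the example data this produces 01 -> 02, 02 -> 03, and 03 -> null. The input rows are left unmodified, which makes it easy to diff the original against the repaired output before writing anything back.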
As an aside, this is probably why the SC in SCD2 stands for Slowly Changing, so we can avoid situations like this as much as possible. However, there is obviously great utility in being able to display the history of a particular entity in a single snapshot of a Delta Lake table, for example.