Warn if updated_at field for snapshot is not same datatype as what's returned in snapshot_get_time()
#10234
Comments
[Adding example from snapshot discussion]

Regarding the naïve timestamp approach: I feel this is a pretty big problem because it can result in the wrong data being written.

In our case, we use Fivetran to load Snowflake. We use a timestamp snapshot strategy with the _FIVETRAN_SYNCED column as the updated_at config. We also include "where _fivetran_deleted = FALSE" on our snapshots, so when a row is deleted in the source the snapshot indicates the row is no longer valid.

The _FIVETRAN_SYNCED column is defined as a timestamp_tz, and therefore when the initial snapshot occurs the dbt_valid_from and dbt_valid_to columns are also defined as timestamp_tz. When a record is then deleted from the source, the dbt snapshot sets the dbt_valid_to column to snapshot_get_time(). However, snapshot_get_time() returns a timestamp_ntz data type in the UTC time zone. This UTC time is then implicitly converted to a timestamp_tz, which stores the UTC clock value and attaches the time zone offset of the session, which in our case is currently -0500. The stored time is now 5 hours in the future, which is not the correct time.

As mentioned, I feel this is a bug and should be addressed in dbt-core.

Some ideas. I realize some of these ideas are breaking changes, so maybe a new snapshot strategy should be introduced that allows users to make the switch when ready.

Workarounds

We found that overriding snapshot_get_time() was the best workaround for us. I don't like the solution and it is a little risky, but we changed the Snowflake command to:

to_varchar(convert_timezone('UTC', current_timestamp()),'YYYY-MM-DD HH24:MI:SS.FF3 TZHTZM')

This converts the timestamp to a varchar which retains the offset information, so when it is stored to timestamp_tz or timestamp_ltz the offset is correct. When stored to a timestamp_ntz column it is also correct and in UTC because the offset is trimmed away.

As mentioned, this is risky because I am not sure of all the places snapshot_get_time() is used, and changing the data type could be problematic.

Originally posted by @jvanpee in #7018 (comment)
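To see why a string that carries its own offset survives both kinds of column, here is a minimal Python sketch of the workaround's logic. It is only a stand-in for Snowflake's behavior as described in the comment above, not a reproduction of it: the helper names (`to_offset_text`, `store_as_tz`, `store_as_ntz`) and the sample instant are hypothetical, and Python's `%f` prints six fractional digits rather than the three of `FF3`.

```python
from datetime import datetime, timezone

# Illustrative format, loosely mirroring 'YYYY-MM-DD HH24:MI:SS.FF3 TZHTZM'.
FMT = "%Y-%m-%d %H:%M:%S.%f %z"

def to_offset_text(instant: datetime) -> str:
    """Stand-in for to_varchar(...): render an aware instant as text with its offset."""
    return instant.strftime(FMT)

def store_as_tz(text: str) -> datetime:
    """Stand-in for storing into a tz-aware column: the offset in the text is honored."""
    return datetime.strptime(text, FMT)

def store_as_ntz(text: str) -> datetime:
    """Stand-in for storing into a naive column: the offset is trimmed away."""
    return datetime.strptime(text, FMT).replace(tzinfo=None)

# Hypothetical sample value for convert_timezone('UTC', current_timestamp()).
NOW_UTC = datetime(2024, 6, 1, 12, 0, 0, tzinfo=timezone.utc)

text = to_offset_text(NOW_UTC)
print(store_as_tz(text) == NOW_UTC)  # the instant is preserved
print(store_as_ntz(text))            # UTC clock time, no offset attached
```

Because the offset travels inside the text, the session time zone never gets a chance to reinterpret the value, which is the property the workaround relies on.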
I have confirmed
|
There are differences in what is returned by
|
Opened a new issue in dbt-labs/docs.getdbt.com: dbt-labs/docs.getdbt.com#6012 |
Description
There are 4 different semantics for timestamp datatypes - for simplicity let’s say 2 categories:
ltz: uses your session timezone

One of the trickiest things that can happen within snapshots is when there are naive timestamps (rather than aware). In those cases, we need clear ways for the user to configure a "mutual agreement" with the dbt system on how to interpret those timestamps when they are actively involved in the snapshot configuration.
Up until this point, the approach has been that naive timestamps must be UTC.
Some users may want to use an adjacent column that contains the relevant UTC offset or time zone (or configure the time zone globally for the dbt project (like this) or specifically for one model).
Example of snapshot - updated_at when using strategy=timestamp:

PROBLEM
When the initial snapshot occurs, the dbt_valid_from and dbt_valid_to columns inherit the data type of whatever you've provided for the updated_at config (could be a date, could be a naive timestamp, etc.). But once a record is changed, dbt updates the dbt_valid_to column to snapshot_get_time(), which returns (in Snowflake) a timestamp_ntz data type in the UTC time zone. This is then implicitly converted to the original inherited data type, which can lead to incorrect data if the user's updated_at field is NOT a timestamp_ntz data type in the UTC time zone (incorrect comparisons; may coerce the timestamp differently than the user expects).

Naive vs. aware timestamps in Snowflake:
https://docs.google.com/presentation/d/14qrx8YbnGnoP2FgnxSD5Z8fxcnoYq4DKRrfHBDanbfA/edit#slide=id.g20f6efbe802_0_34
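The arithmetic behind the problem can be sketched in a few lines of Python. This only simulates the implicit conversion as the issue describes it (keep the clock digits, attach the session offset); the function name `implicit_cast_to_tz`, the sample instant, and the -05:00 session offset are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

def implicit_cast_to_tz(naive: datetime, session_offset_hours: int) -> datetime:
    """Simulate the cast described above: same clock digits, session offset attached."""
    session_tz = timezone(timedelta(hours=session_offset_hours))
    return naive.replace(tzinfo=session_tz)

# The real moment the record changed, as an aware UTC instant (hypothetical value).
true_instant = datetime(2024, 6, 1, 12, 0, 0, tzinfo=timezone.utc)

# What snapshot_get_time() is described as returning: UTC clock time with no offset.
naive_utc = true_instant.replace(tzinfo=None)

# What lands in a tz-aware dbt_valid_to column when the session offset is -05:00.
stored = implicit_cast_to_tz(naive_utc, -5)
drift = stored - true_instant

print(stored.isoformat())  # 2024-06-01T12:00:00-05:00
print(drift)               # 5:00:00
```

The stored value names an instant five hours after the true one, which matches the offset reported in the comments on this issue.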
SOLUTION
Let’s go with Idea 1 for now, maybe some magic in the future
Idea 1 - put burden of using correct data type on user:
updated_at field is, see if it’s compatible
updated_at already accepts a SQL statement [though we should add docs for this]:

Idea 2 - magic stuff to automatically convert:
updated_at_field is, see if it’s compatible
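The compatibility check that both ideas start from could look roughly like the sketch below. This is not dbt-core code: `normalize_type`, `updated_at_type_warning`, and the default `TIMESTAMP_NTZ` (taken from the Snowflake behavior described above) are hypothetical names and assumptions, and a real check would also need to handle adapter-specific type aliases.

```python
import re
from typing import Optional

def normalize_type(data_type: str) -> str:
    """Upper-case a type name and drop a trailing precision such as '(9)'."""
    return re.sub(r"\(.*\)$", "", data_type.strip()).upper()

def updated_at_type_warning(
    updated_at_type: str,
    snapshot_time_type: str = "TIMESTAMP_NTZ",  # assumed return type of snapshot_get_time()
) -> Optional[str]:
    """Return a warning message when the two types differ, otherwise None."""
    actual = normalize_type(updated_at_type)
    expected = normalize_type(snapshot_time_type)
    if actual == expected:
        return None
    return (
        f"updated_at is {actual} but snapshot_get_time() returns {expected}; "
        "dbt_valid_to may be implicitly converted and shifted."
    )

print(updated_at_type_warning("timestamp_ntz(9)"))  # None
print(updated_at_type_warning("TIMESTAMP_TZ(9)"))   # warning text
```

Under Idea 1 the message would simply be surfaced to the user, who then casts updated_at in the snapshot config; under Idea 2 the same comparison would decide whether an automatic conversion is needed.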