Bug/postgres performance #126

Merged
merged 16 commits into from
May 14, 2024

Contributor

@fivetran-catfritz fivetran-catfritz commented May 11, 2024

PR Overview

This PR will address the following Issue/Feature:

This PR will result in the following new package version:

  • v1.7.3: non-breaking, since there are no schema changes and the results match the prior version

Please provide the finalized CHANGELOG entry which details the relevant changes included in this PR:

Performance Improvements

  • Updated the sequence of JSON parsing for model fivetran_platform__audit_table to reduce runtime.

Bug Fixes

  • Updated model fivetran_platform__audit_user_activity to correct the JSON parsing used to determine column email.

Under the hood

  • Updated logic for macro fivetran_log_lookback to align with logic used in similar macros in other packages.

PR Checklist

Basic Validation

Please acknowledge that you have successfully performed the following commands locally:

  • dbt run --full-refresh && dbt test
  • dbt run (if incremental models are present) && dbt test

Before marking this PR as "ready for review" the following have been applied:

  • The appropriate issue has been linked, tagged, and properly assigned
  • All necessary documentation and version upgrades have been applied
  • docs were regenerated (unless this PR does not include any code or yml updates)
  • BuildKite integration tests are passing
  • Detailed validation steps have been provided below

Detailed Validation

Please share any and all of your validation steps:

  • See internal ticket for data quality validation

  • Runtime comparison:

    • Note: this was tested on a subset of the log data limited to about 250k rows, obtained by applying this filter to the staging log model:
where cast(created_at as date) < cast('2024-05-10' as date)
and cast(created_at as date) >= cast('2024-05-07' as date)

The two runs produced these results:

v1.7.2 run: ~ 1hr
Screenshot 2024-05-11 at 2 17 09 AM

v1.7.3 run: ~ 10 mins
Screenshot 2024-05-11 at 1 00 29 AM

If you had to summarize this PR in an emoji, which would it be?

🌳

Contributor

@fivetran-joemarkiewicz fivetran-joemarkiewicz left a comment

@fivetran-catfritz Thanks for investigating and working through this update! It is also great to see the performance improvements it brought with our sample data set. I am hopeful this will reduce runtimes for our customers using Postgres.

I do have one small callout: please include a change in the CHANGELOG, and take a look at my thoughts on the changed datatype in the JSON parse macro. Those are relatively small and I am comfortable approving this release. Let me know if you have any questions about my comment. Thanks!

@@ -28,7 +28,7 @@
{% macro postgres__fivetran_log_json_parse(string, string_path) %}

case when {{ string }} ~ '^\s*[\{].*[\}]?\s*$' -- Postgres has no native json check, so this will check the string for indicators of a JSON object
- then {{ string }}::json #>> '{ {%- for s in string_path -%}{{ s }}{%- if not loop.last -%},{%- endif -%}{%- endfor -%} }'
+ then {{ string }}::jsonb #>> '{ {%- for s in string_path -%}{{ s }}{%- if not loop.last -%},{%- endif -%}{%- endfor -%} }'
Contributor

I know this is incredibly small, but we should call this out as a change in the CHANGELOG. Also, I am fairly certain this won't be breaking, but can you confirm that this cast change won't alter any column datatypes for Postgres users? I tested it myself and didn't see any issues when running prod and then dev right after, but I just want to double check.

Contributor Author

Added to Under the Hood. I also tested this, and it makes sense that there is no issue: the end result in either version is a string, and the cast to json (now jsonb) is only an intermediate step.
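
To illustrate the point about the intermediate cast, here is a rough Python analogue of the macro's logic. It is a sketch, not the package's code: the function name `json_parse`, the `actor`/`email` keys, and the choice to return `None` for non-JSON input are assumptions made for illustration, and it only walks object keys (no array indexes). The regular expression is the one from the diff; `re.DOTALL` is used to approximate Postgres, where `.` also matches newlines.

```python
import json
import re

# Same pattern as the macro's guard. It only hints that the string looks like
# a JSON object; it does not validate the JSON.
JSON_OBJECT_HINT = re.compile(r'^\s*[\{].*[\}]?\s*$', re.DOTALL)


def json_parse(string, string_path):
    """Hypothetical stand-in for the macro: value at string_path as text, else None."""
    if string is None or not JSON_OBJECT_HINT.match(string):
        return None  # assumption: input that fails the guard yields NULL

    # Intermediate parsed form, playing the role of the ::json / ::jsonb cast.
    value = json.loads(string)

    # Walk the path one key at a time, like '{key1,key2}' after #>>.
    for key in string_path:
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]

    if value is None:
        return None
    # The caller always receives text: strings unquoted, other values as JSON text.
    return value if isinstance(value, str) else json.dumps(value)


print(json_parse('{"actor": {"email": "user@example.com"}}', ["actor", "email"]))
print(json_parse("not json", ["actor"]))
```

Whatever the intermediate representation is, the caller only ever sees text or a null, which is why swapping the cast type would not be expected to change downstream column types.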

CHANGELOG.md Outdated
- Updated the sequence of JSON parsing for model `fivetran_platform__audit_table` to reduce runtime.

## Bug Fixes
- Updated model `fivetran_platform__audit_user_activity` to correct the JSON parsing used to determine column `email`.
Contributor

Could be good to note that this was causing `fivetran_platform__audit_user_activity` to potentially have 0 rows.

@fivetran-catfritz fivetran-catfritz merged commit 8b325b8 into main May 14, 2024
10 checks passed
4 participants