Improve Performance of Source/Model/Exposure extraction #20

Closed
kgpayne opened this issue Apr 15, 2021 · 2 comments
Comments

@kgpayne
Contributor

kgpayne commented Apr 15, 2021

With ~1 year of historical `manifest.json` and `run_results.json` data, we have started experiencing timeouts when running `--full-refresh` on `dbt_artifacts`.


```
2021-04-15 15:43:49: 2021-04-15 15:43:49,243 - root - INFO - Database Error in model dim_dbt__sources (models/incremental/dim_dbt__sources.sql)
2021-04-15 15:43:49: 2021-04-15 15:43:49,243 - root - INFO - 000630 (57014): Statement reached its statement or warehouse timeout of 1,200 second(s) and was canceled.
```

Possible solutions:

  • Archive old data.
  • Replace the 3 individual models that extract Sources, Models and (soon) Exposures with 1 wide table, so JSON extraction happens once per artefact rather than at least 3 times.
  • Snapshot the extracted json(?)
  • Unpack all required fields in the COPY command executed by either the run-operation or by Snowpipe (see the sketch after this list). This is probably the most performant but least flexible option. What happens when we want to extract more details from previous runs? Gets a bit difficult 😬 It does, however, solve the case where JSON objects become too big to fit into a single VARIANT (16MB).
  • Do something with the orchestration layer to avoid ever running full-refresh on these models, effectively relying on incremental models to extract JSON values only once per artefact. We'd probably use tags.
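
A minimal sketch of the COPY-time unpacking idea, assuming a Snowflake stage named `@dbt_artifacts_stage` and a pre-created flat landing table (all names here are hypothetical, not the actual `dbt_artifacts` schema). Note that Snowflake COPY transformations only allow simple expressions over `$1`, so per-node flattening of sources/models/exposures would still need a downstream step:

```sql
-- Hypothetical flat landing table: scalar metadata unpacked at load time,
-- with the raw document retained for anything we decide to extract later.
copy into dbt_artifacts_flat (generated_at, command_invocation_id, artifact_file, data)
from (
    select
        $1:metadata:generated_at::timestamp_ntz,  -- unpacked once, on load
        $1:metadata:invocation_id::string,
        metadata$filename,                        -- source file, for traceability
        $1                                        -- raw document, for later re-extraction
    from @dbt_artifacts_stage
)
file_format = (type = 'json')
pattern = '.*manifest\\.json';
```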
@kgpayne
Contributor Author

kgpayne commented Jul 7, 2021

@NiallRees FYI. We had resolved to try unpacking required fields in the COPY command here, in the hope of solving our 'too much history' problem at the same time as #29 🤔 It's less flexible, but it ensures that JSON extraction happens on load, meaning the base tables are flat and therefore not a problem during a full refresh.
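
For illustration, once the base table is flat, a downstream incremental model reduces to a plain column selection, so even a `--full-refresh` no longer re-parses JSON. This is just a sketch under that assumption; the model and column names are hypothetical, not the actual `dbt_artifacts` code:

```sql
-- models/incremental/dim_dbt__sources.sql (illustrative shape only)
{{ config(materialized='incremental') }}

select
    command_invocation_id,
    generated_at,
    node_id,
    name
from {{ ref('stg_dbt__artifacts_flat') }}  -- hypothetical flat base table
{% if is_incremental() %}
-- on incremental runs, only pick up artefacts newer than what's already loaded
where generated_at > (select max(generated_at) from {{ this }})
{% endif %}
```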

@alanmcruickshank
Contributor

I propose that I try to fix this at the same time as #62. Proposed approach there.

NiallRees pushed a commit that referenced this issue Jul 21, 2022
All new and modified files for invocations vertical