
Add workaround for temporary tables in remote database when running incremental model #326

Merged

Conversation

@guenp (Collaborator) commented Jan 30, 2024

A few months ago, an issue was raised in the MotherDuck community Slack: CREATE TEMPORARY TABLE statements issued when using incremental models were causing slowdowns in dbt.

Summary of the underlying issue:

  • when using a remote MotherDuck duckdb instance, all temporary tables are created locally
  • dbt incremental models create a temporary table for appending tuples to an existing table
  • this means that whenever an incremental model is run against a MotherDuck database ("md:my_database"), a new table is created locally, tuples are transported from the server into the new local table, and then sent back to the server when the final INSERT INTO is executed. This adds nontrivial round-trip time, causing the slowdown.

This PR contains a quick suggestion for a workaround until we support temporary tables in MotherDuck. It creates a "mock" temporary table in the database and deletes it afterwards.
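To make the workaround concrete, here is a minimal sketch of the pattern using plain duckdb from Python; the schema, table, and column names (my_schema, src, target, dbt_tmp, updated_at) are hypothetical, and the real logic lives in the adapter's incremental.sql macro:

```python
import duckdb

# Connect to a remote MotherDuck database ("md:my_database" is the
# example path from the issue summary above).
con = duckdb.connect("md:my_database")

# A true temporary table would live on the *local* machine, forcing
# tuples to round-trip between MotherDuck and the client:
#   con.execute("CREATE TEMPORARY TABLE tmp AS SELECT ...")

# The workaround creates a regular ("mock" temp) table server-side
# instead, so the staged rows never leave MotherDuck:
con.execute("""
    CREATE TABLE my_schema.dbt_tmp AS
    SELECT * FROM my_schema.src
    WHERE updated_at > now() - INTERVAL 1 DAY
""")

# The incremental merge then runs entirely server-side ...
con.execute("INSERT INTO my_schema.target SELECT * FROM my_schema.dbt_tmp")

# ... and the mock temp table is deleted afterwards.
con.execute("DROP TABLE my_schema.dbt_tmp")
```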
Some thoughts for @jwills and @nicku33:

  • The mock temp table is created in the same database/schema as the target table(s). We might want to change that? DuckDB currently stores temp tables in a schema named temp.
  • It might make sense to move the logic to the MotherDuck extension and instead drop the table server-side whenever the connection closes so we don't have to rely on dbt to run successfully after a mock temp table is created

@jwills (Collaborator) commented Jan 30, 2024

The approach here makes sense to me-- how are we testing this out?

@guenp (Collaborator, Author) commented Jan 30, 2024

> The approach here makes sense to me-- how are we testing this out?

Well, I'm working on adding a unit test for this, based on the manual test I ran. I noticed there are test files test_motherduck.py and test_incremental.py, but they're still empty. I think it's out of scope for this PR to fully implement test_incremental.py for all backends, so I can add the test to test_motherduck.py for now. Does that sound good?

Also, I saw that the md tox environment is currently not tested in CI/CD (for obvious reasons), so I'll just aim for a local pass if that works for you.

@jwills (Collaborator) commented Jan 31, 2024 via email

Resolved review threads on: tox.ini, dbt/adapters/duckdb/environments/__init__.py, dbt/include/duckdb/macros/materializations/incremental.sql (two threads), tests/conftest.py
@guenp (Collaborator, Author) commented Jan 31, 2024

Thanks @jwills for the review! I'll address your comments shortly. I also wanted to note that I talked to @nicku33 about this:

> The mock temp table is created in the same database/schema as the target table(s). We might want to change that?

We decided to go with a separate schema dbt_temp for the temp_relations used by incremental models. This way users won't accidentally query them in BI tools.

What are your thoughts?

@jwills (Collaborator) commented Jan 31, 2024

> Thanks @jwills for the review! I'll address your comments shortly. I also wanted to note that I talked to @nicku33 about this:
>
> > The mock temp table is created in the same database/schema as the target table(s). We might want to change that?
>
> We decided to go with a separate schema dbt_temp for the temp_relations used by incremental models. This way users won't accidentally query them in BI tools.
>
> What are your thoughts?

I am good with that!

@jwills (Collaborator) commented Feb 1, 2024

@guenp apologies to pile this on, but would you mind changing this code to also use your new is_motherduck function? Just realized I forgot I had introduced some duplication of this already! https://github.com/duckdb/dbt-duckdb/blob/master/dbt/adapters/duckdb/environments/local.py#L53
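For context, a minimal sketch of what such a helper could look like, assuming MotherDuck databases are recognized by their connection-string prefix (the exact check in the adapter may differ):

```python
def is_motherduck(database_path: str) -> bool:
    """Return True if a DuckDB database path points at MotherDuck."""
    # MotherDuck connection strings use the md: / motherduck: scheme,
    # e.g. "md:my_database". The prefix check here is an assumption.
    return database_path.startswith(("md:", "motherduck:"))
```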

@guenp (Collaborator, Author) commented Feb 1, 2024

> @guenp apologies to pile this on, but would you mind changing this code to also use your new is_motherduck function? Just realized I forgot I had introduced some duplication of this already! https://github.com/duckdb/dbt-duckdb/blob/master/dbt/adapters/duckdb/environments/local.py#L53

Ah yes, I totally forgot about that, thank you for the reminder! It's done.

I've now pushed all my changes. Mainly what I've added since you last reviewed, in addition to addressing your comments:

  • DuckDBAdapter.get_temp_relation_path(model). This method is used in the incremental.sql macro and creates a new unique name for a temp_relation that is dropped at the end of the macro, with identifier dbt_temp.<schema_name>__<target_relation_name>.

  • DuckDBAdapter.post_model_hook(config, context). This method fixes a leak that arises if the incremental macro doesn't run successfully (e.g. when it hits a CompilationError, as it does in this test) and never reaches the lines where it deletes the temp_relation, leaving an orphaned mock temp table behind. The method makes sure to drop the temp_relation if it still exists after the model has failed (see the try-except block that intercepts this here).
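A rough sketch of how these two pieces fit together; the signatures and the connection handling are simplified assumptions, and the actual methods in dbt/adapters/duckdb/impl.py go through dbt's adapter machinery:

```python
TEMP_SCHEMA = "dbt_temp"  # separate schema so BI tools don't pick these up


def get_temp_relation_path(model):
    # Unique identifier for the mock temp table:
    # dbt_temp.<schema_name>__<target_relation_name>
    return (model.database, TEMP_SCHEMA, f"{model.schema}__{model.identifier}")


def post_model_hook(connection, temp_relation_path):
    # Runs after every model, successful or not. If the incremental
    # macro failed before its own cleanup step, the mock temp table
    # would otherwise be orphaned; drop it defensively here.
    database, schema, identifier = temp_relation_path
    connection.execute(
        f'DROP TABLE IF EXISTS "{database}"."{schema}"."{identifier}"'
    )
```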

The behavior now looks something like this:
(1) While running the unit test: [screenshot]
(2) After the unit test has finished, all test-related and "temp" tables are cleaned up: [screenshot]

@nicku33 and I decided against dropping dbt_temp in the end because it might interfere with concurrent processes.
We also discussed making the name dbt_temp configurable. I was thinking of implementing this via the model config, but do you know if there's a more global way to configure this?

@jwills (Collaborator) commented Feb 1, 2024

The model config is the right place to use for setting/overriding the temp schema name, and that setting can be set globally for all models if need be by specifying it in e.g. the dbt_project.yml file; see https://docs.getdbt.com/reference/model-configs#model-specific-configurations
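For illustration, assuming the setting is exposed as a model config key named temp_schema_name (the key name is illustrative, not confirmed in this thread), a global override in dbt_project.yml would look something like:

```yaml
# dbt_project.yml -- key name is illustrative
models:
  my_project:
    +temp_schema_name: my_dbt_temp
```

The same key could then also be overridden per model via a {{ config(...) }} block.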

@guenp (Collaborator, Author) commented Feb 1, 2024

Oops, I didn't actually run the unit tests with the default profile. 😁
It looks like it failed because of the post_model_hook! I need to stop for the day, but I'll push a fix and also implement the temp schema name config.

@guenp (Collaborator, Author) commented Feb 1, 2024

All checks are green! I also added tests/functional/plugins/test_motherduck.py to the MotherDuck tests. I didn't realize your CI/CD ran against a live version of the service, so this will help make sure I don't break our product ;)

guenp requested a review from jwills on February 1, 2024
Resolved review thread on: dbt/adapters/duckdb/impl.py
jwills merged commit c14627a into duckdb:master on Feb 1, 2024 (29 of 30 checks passed)
@jwills (Collaborator) commented Feb 1, 2024

Thank you very much @guenp!

@guenp (Collaborator, Author) commented Feb 1, 2024

Yay! 🎉 Could we hold off on pushing a release until @nicku33 has had the chance to test it on a real-life workload? We just want to make sure this workaround is good on our end as well. Thanks!

guenp deleted the guen/eco-23-dbt-incremental-broken-on-md branch on February 1, 2024
@jwills (Collaborator) commented Feb 1, 2024

of course, will wait for you all to sign off on it first

@guenp (Collaborator, Author) commented Feb 15, 2024

We had the chance to do a full end-to-end test with "real life" data, so this is signed off on our end!
FYI, here's what we tested:

  1. Using a source from an internal share, make a reasonably sized aggregate model, something on the order of 1M rows, for incremental runs. Run it once to get a datetime aggregate for, say, 14 days, then modify it so that the incremental logic kicks in for a good-sized row set and has to update rows.
  2. Same, but for missing rows: modify the target table so that upon incremental refresh, new rows are filled in.
  3. Same, but for deleted rows: modify the source (maybe using a copy of a bunch of rows) so that the update produces the correct gaps.
  4. Test the above for the two different incremental strategies: delete+insert and append.
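For reference, the two strategies exercised in step 4 differ roughly as follows (a simplified sketch with hypothetical table and key names, not the adapter's actual macro code):

```python
import duckdb

con = duckdb.connect("md:my_database")  # example MotherDuck path

# Both strategies stage new/changed rows in the mock temp table first.
con.execute("CREATE TABLE dbt_temp.main__target AS SELECT * FROM main.new_rows")

strategy = "delete+insert"  # or "append"

if strategy == "delete+insert":
    # delete+insert: remove rows that are about to be replaced, keyed
    # on the model's unique_key, before inserting the staged rows.
    con.execute("""
        DELETE FROM main.target
        WHERE event_date IN (SELECT event_date FROM dbt_temp.main__target)
    """)

# append (and the second half of delete+insert): insert the staged rows.
con.execute("INSERT INTO main.target SELECT * FROM dbt_temp.main__target")

# Drop the mock temp table when done.
con.execute("DROP TABLE dbt_temp.main__target")
```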
