dbt_utils.deduplicate macro returns zero rows when deduplicating a CTE with NULL column (Redshift) #713
Comments
@yauhen-sobaleu thanks for opening this! If there are two null rows and a non-null row, does it still lose the null rows? I assume so, but your repro steps don't say one way or the other.
I have the same issue. I think the natural join is the culprit here: it joins on all of your selected columns, and if any of those columns contain NULL values, those rows never match and are dropped.
Yes, the problem is that the natural join is not null-safe. What I had to do was add another function parameter that explicitly accepts the list of columns to select from. Using the deduplicate macro from dbt-utils is quite dangerous right now: you never know when NULLs will come along and corrupt your data.
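The underlying semantics, as a standalone illustration (plain SQL, nothing dbt-specific):

```sql
-- NULL = NULL evaluates to NULL rather than TRUE, so any equality-based
-- join (including a natural join) silently drops rows whose join
-- columns are NULL.
select 1 as matched
where null = null;  -- returns zero rows
```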
OK, thanks for clarifying! I'd welcome a PR that resolves this (and an addition to the integration tests to ensure it stays fixed).
For the record, another user hit this issue. See the Slack thread here: https://getdbt.slack.com/archives/CBSQTAPLG/p1679007195056989 TL;DR: the switch in #548 to a select distinct natural join not only introduced this NULL bug, but also very poor performance compared to a more sane approach, plus a future bug where data types that are not comparable (such as JSON in Postgres) will cause the natural join to fail. My humble suggestion is to revert to the old method.
I also hit the same issue and found this ticket. I am using DuckDB, and the problem likewise comes from the natural join on columns containing NULL values.
Agreed with @PaulGVernon here that the default implementation is problematic. Furthermore, I don't think the natural join approach can ever work the way we would want: a natural join can never match rows whose join columns contain NULL, since NULL = NULL never evaluates to true.
This idea is compelling; a couple of alternative ideas come to mind as well.
OK, I've just spent an afternoon staring at the old and new versions of this macro, and no simple solution has fallen from the sky. I would love to be able to revert to one of the earlier approaches.
Unless I'm missing something, these would also require an explicit list of columns to be passed in. So a new macro, then? It strikes me that the call could stay simple in both cases. For a model:
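A sketch of that invocation (model and column names are hypothetical):

```sql
{{ dbt_utils.deduplicate(
    relation=ref('my_model'),
    partition_by='user_id',
    order_by='updated_at desc',
) }}
```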
For a CTE:
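And for the CTE case, perhaps something along these lines (again hypothetical names; the CTE is referenced by its name as a string):

```sql
with filtered as (
    select *
    from {{ ref('my_model') }}
    where status = 'active'
)

{{ dbt_utils.deduplicate(
    relation='filtered',
    partition_by='user_id',
    order_by='updated_at desc',
) }}
```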
What do y'all think?
😂 Reflections on the following paths forward:
Move deduplicate to dbt-core

dbt-utils is caught in this funny place where it only supports a small handful of (popular) databases. dbt-utils really works best when it has only a single implementation of a macro and no adapter-specific overrides. It can get caught in difficult places (like this one) when nearly every SQL dialect needs its own implementation. That's where dbt-core + adapters can shine. Whenever adapter-specific variants are added to a dbt package, it's the perfect moment to stop and consider: there's probably some abstraction that should be in Core! Adding deduplicate would follow that pattern: since there's not a ubiquitous solution that would work across all dialects, the default implementation in dbt-core would probably be to raise a not-implemented error.

New macro

To support arbitrary CTEs, there are two main options for databases that don't support something akin to qualify. Each has its downsides, but the latter seems especially unpleasant. Overall, the experience for end users of the long tail of dbt adapters is likely to be much better if we move deduplicate to dbt-core.
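A default that raises when no adapter-specific implementation exists could follow dbt's usual dispatch pattern; a minimal sketch (macro names and message are illustrative, not an actual proposal):

```sql
{% macro deduplicate(relation, partition_by, order_by) %}
    {# Route to an adapter-specific implementation if one exists #}
    {{ return(adapter.dispatch('deduplicate', 'dbt')(relation, partition_by, order_by)) }}
{% endmacro %}

{% macro default__deduplicate(relation, partition_by, order_by) %}
    {# No dialect-agnostic, null-safe implementation exists: fail loudly #}
    {{ exceptions.raise_compiler_error(
        'deduplicate is not implemented for adapter ' ~ adapter.type()
    ) }}
{% endmacro %}
```

Adapters with a null-safe construct (e.g. qualify) would then supply their own override.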
Some more info... support for qualify varies across warehouses.
Qualify is supported in Redshift now.
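For databases that do support it, a qualify-based deduplication needs neither a column list nor a self-join, which is why it sidesteps the NULL problem entirely. A sketch with illustrative table and column names:

```sql
select *
from events
qualify row_number() over (
    partition by user_id
    order by updated_at desc
) = 1
```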
@dbeatty10 can you help review the PR? Thank you!
Describe the bug
dbt_utils.deduplicate returns zero rows when it deduplicates a CTE with a column that consists entirely of NULL values.

Steps to reproduce
Given model code as follows:
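A minimal sketch of such a model (CTE and column names are illustrative; the key ingredient is a column that is entirely NULL):

```sql
with input_data as (
    select
        1 as id,
        cast(null as varchar) as always_null
)

{{ dbt_utils.deduplicate(
    relation='input_data',
    partition_by='id',
    order_by='id',
) }}
```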
it is compiled by dbt on the Redshift adapter into the following SQL:
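Following the select distinct + natural join approach discussed above, the compiled query takes roughly this shape (a simplified sketch; the exact output differs):

```sql
with input_data as (
    select
        1 as id,
        cast(null as varchar) as always_null
)

select distinct data.*
from input_data as data
natural join (
    select
        _inner.*,
        row_number() over (
            partition by id
            order by id
        ) as rn
    from input_data as _inner
) as row_numbered
where row_numbered.rn = 1
```

The natural join matches on every shared column, including always_null; since the predicate data.always_null = row_numbered.always_null evaluates to NULL rather than TRUE, no rows survive the join.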
Expected results

The deduplicated rows of the CTE are returned.
Actual results

Zero rows are returned.
Screenshots and log output
System information
The contents of your packages.yml file:

Which database are you using dbt with?

Redshift
The output of dbt --version: