Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Should dbt_metrics support null-safe joins? In other words, is null = null? #164

Closed
3 tasks done
callum-mcdata opened this issue Oct 28, 2022 · 5 comments
Closed
3 tasks done
Labels
enhancement New feature or request triage

Comments

@callum-mcdata
Copy link
Contributor

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

As shown in some slack threads and some Github Issues, some customers are running into issues where they expect their NULL values to be correctly represented in the output. This is not currently supported in dbt_metrics because we don't use NULL-SAFE operators for joins. Should we support this as a config option or macro parameter?

Describe alternatives you've considered

The easy alternative to this is to just state that null values need to be overwritten for metric calculations.

Who will this benefit?

Anyone who has null values in their dimensions

Are you interested in contributing this feature?

Maybe

Anything else?

No response

@callum-mcdata callum-mcdata added the enhancement New feature or request label Oct 28, 2022
@josepfranquetf
Copy link

josepfranquetf commented Jan 4, 2023

Hi @callum-mcdata, I think it would make sense to support null-safe joins. A sensible case would be when you have a derived metric which is a division between two metrics. If you have only one dimension which has a null value (not strange), the metric could not be calculated. I really think that being worried of nulls can create friction to the user.

@callum-mcdata
Copy link
Contributor Author

Hey @josepfranquetf. Do you mind laying out an example with data for the use case that you mentioned above? I'm not sure I fully understand the implications on derived metrics. I've got some opinions in this area after testing of my own but want to make sure I hear from users before settling on a path.

@callum-mcdata callum-mcdata added the awaiting_response Waiting on a response label Jan 5, 2023
@github-actions github-actions bot added triage and removed awaiting_response Waiting on a response labels Jan 5, 2023
@josepfranquetf
Copy link

Hey @callum-mcdata. Imagine that you are running a metrics.calculate macro with one dimension (marketing_source) and a derived metric (cac), which is calculated as investment/number_of_customers. Additionally, marketing_source takes 3 different values: Facebook, Google and null.
In this case, cac would be null when marketing_source is null because null-safe joins are not supported (Although, investment and number_of_customers have positive values when marketing_source is null).

@opalh
Copy link

opalh commented Jan 9, 2023

Hey @callum-mcdata, I was struggling with this issue before managed to understand what was happening. From my point of view, since metrics are created to have multiple dimensions it is problematic that you lose rows if one of the dimensions is null and for me, it doesn't seem reasonable to replace all nulls in my columns to be able to use metrics with dimensions.
Also, it was quite confusing to understand where the difference is coming from since the null rows were there but they had zero in the metric.

@callum-mcdata
Copy link
Contributor Author

callum-mcdata commented Jan 9, 2023

No commitment that we'll be implementing this but I wanted to dig into the method that we would use to resolve this issue.

Using A IS [ NOT ] DISTINCT FROM B

All righty, let's do some digging here for all our supported adapters:

I suspect that this is the best syntax to use that is consistent across all the supported adapters. We could potentially need to figure out an alternative solution for Redshift but that is fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request triage
Projects
None yet
Development

No branches or pull requests

3 participants