Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[41] Fix count on all null VALUES clause #70

Merged
merged 1 commit into from
Oct 25, 2024

Conversation

findepi
Copy link
Collaborator

@findepi findepi commented Oct 22, 2024

Cherry pick apache#13029

Before the change, the ValuesExec containing NullArray would incorrectly report column statistics as being non-null, which would misinform AggregateStatistics optimizer and fold count(always_null) into row count instead of 0.

This commit fixes the column statistics derivation for values with NullArray and therefore fixes execution of logical plans with count over such values.

Note that the bug was not reproducible using DataFusion SQL frontend, because in DataFusion SQL the VALUES (NULL) doesn't have type DataType:Null (it has some apparently arbitrarily picked type instead).

As a follow-up, all usages of Array:null_count should be inspected. The function can easily be misused (it returns "physical nulls", which do not exist for null type).

@findepi findepi requested review from vgapeyev and milevin October 22, 2024 06:51
@github-actions github-actions bot added the core label Oct 22, 2024
Copy link

@vgapeyev vgapeyev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing cross-repo work!

@findepi findepi force-pushed the findepi/count-null-only-41 branch from e46a004 to 3545fbe Compare October 23, 2024 11:46
@findepi
Copy link
Collaborator Author

findepi commented Oct 23, 2024

rebased on top of #71

* Test Count accumulator with all-nulls

* Fix count on null values

Before the change, the `ValuesExec` containing `NullArray` would
incorrectly report column statistics as being non-null, which would
misinform `AggregateStatistics` optimizer and fold `count(always_null)`
into row count instead of 0.

This commit fixes the column statistics derivation for values with
`NullArray` and therefore fixes execution of logical plans with count
over such values.

Note that the bug was not reproducible using DataFusion SQL frontend,
because in DataFusion SQL the `VALUES (NULL)` doesn't have type
`DataType:Null` (it has some apparently arbitrarily picked type
instead).

As a follow-up, all usages of `Array:null_count` should be inspected.
The function can easily be misused (it returns "physical nulls", which
do not exist for null type).
@findepi findepi force-pushed the findepi/count-null-only-41 branch from 3545fbe to 3ff9b52 Compare October 24, 2024 18:55
@findepi findepi merged commit 89d1c8e into sdf-labs:findepi/41 Oct 25, 2024
24 checks passed
@findepi findepi deleted the findepi/count-null-only-41 branch October 25, 2024 09:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants