-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix count on all null VALUES
clause
#13029
Conversation
@@ -156,7 +156,11 @@ pub fn compute_record_batch_statistics( | |||
for partition in batches.iter() { | |||
for batch in partition { | |||
for (stat_index, col_index) in projection.iter().enumerate() { | |||
null_counts[stat_index] += batch.column(*col_index).null_count(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The meaning of this line was apparently changed in apache/arrow-rs#4691
At least it seems so from this diff line apache/arrow-rs@979a070#diff-0cfddb6ef017ce20c5e7f528095956b2433cf2ea2d17e28ba4b515b7c1bd2e57L179
9d23575
to
dae2610
Compare
Before the change, the `ValuesExec` containing `NullArray` would incorrectly report column statistics as being non-null, which would misinform `AggregateStatistics` optimizer and fold `count(always_null)` into row count instead of 0. This commit fixes the column statistics derivation for values with `NullArray` and therefore fixes execution of logical plans with count over such values. Note that the bug was not reproducible using DataFusion SQL frontend, because in DataFusion SQL the `VALUES (NULL)` doesn't have type `DataType:Null` (it has some apparently arbitrarily picked type instead). As a follow-up, all usages of `Array:null_count` should be inspected. The function can easily be misused (it returns "physical nulls", which do not exist for null type).
dae2610
to
a97d895
Compare
null_counts[stat_index] += batch | ||
.column(*col_index) | ||
.logical_nulls() | ||
.map(|nulls| nulls.null_count()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be simplified back if we had something like apache/arrow-rs#6608
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh yeah, I don't know why arrow-rs
made that choice. To me the "physical number of nulls" seems kinda useless. I care about the semantics, not the implementation details 😄
The reason is performance -- I agree it is quite confusing. THe rationale is that arrow-rs I think tries to let the user have maximal control, but that does make the API harder to reason about sometime. I tried to clarify this in the docs, but I am also still not happy with how easy it is to get confused: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I haven't read the Arrow spec so I'm not sure why you would do that (encode nulls in the child arrays) |
* Test Count accumulator with all-nulls * Fix count on null values Before the change, the `ValuesExec` containing `NullArray` would incorrectly report column statistics as being non-null, which would misinform `AggregateStatistics` optimizer and fold `count(always_null)` into row count instead of 0. This commit fixes the column statistics derivation for values with `NullArray` and therefore fixes execution of logical plans with count over such values. Note that the bug was not reproducible using DataFusion SQL frontend, because in DataFusion SQL the `VALUES (NULL)` doesn't have type `DataType:Null` (it has some apparently arbitrarily picked type instead). As a follow-up, all usages of `Array:null_count` should be inspected. The function can easily be misused (it returns "physical nulls", which do not exist for null type).
* Test Count accumulator with all-nulls * Fix count on null values Before the change, the `ValuesExec` containing `NullArray` would incorrectly report column statistics as being non-null, which would misinform `AggregateStatistics` optimizer and fold `count(always_null)` into row count instead of 0. This commit fixes the column statistics derivation for values with `NullArray` and therefore fixes execution of logical plans with count over such values. Note that the bug was not reproducible using DataFusion SQL frontend, because in DataFusion SQL the `VALUES (NULL)` doesn't have type `DataType:Null` (it has some apparently arbitrarily picked type instead). As a follow-up, all usages of `Array:null_count` should be inspected. The function can easily be misused (it returns "physical nulls", which do not exist for null type).
* Test Count accumulator with all-nulls * Fix count on null values Before the change, the `ValuesExec` containing `NullArray` would incorrectly report column statistics as being non-null, which would misinform `AggregateStatistics` optimizer and fold `count(always_null)` into row count instead of 0. This commit fixes the column statistics derivation for values with `NullArray` and therefore fixes execution of logical plans with count over such values. Note that the bug was not reproducible using DataFusion SQL frontend, because in DataFusion SQL the `VALUES (NULL)` doesn't have type `DataType:Null` (it has some apparently arbitrarily picked type instead). As a follow-up, all usages of `Array:null_count` should be inspected. The function can easily be misused (it returns "physical nulls", which do not exist for null type).
* Test Count accumulator with all-nulls * Fix count on null values Before the change, the `ValuesExec` containing `NullArray` would incorrectly report column statistics as being non-null, which would misinform `AggregateStatistics` optimizer and fold `count(always_null)` into row count instead of 0. This commit fixes the column statistics derivation for values with `NullArray` and therefore fixes execution of logical plans with count over such values. Note that the bug was not reproducible using DataFusion SQL frontend, because in DataFusion SQL the `VALUES (NULL)` doesn't have type `DataType:Null` (it has some apparently arbitrarily picked type instead). As a follow-up, all usages of `Array:null_count` should be inspected. The function can easily be misused (it returns "physical nulls", which do not exist for null type).
Before the change, the
ValuesExec
containingNullArray
wouldincorrectly report column statistics as being non-null, which would
misinform
AggregateStatistics
optimizer and foldcount(always_null)
into row count instead of 0.
This commit fixes the column statistics derivation for values with
NullArray
and therefore fixes execution of logical plans with countover such values.
Note that the bug was not reproducible using DataFusion SQL frontend,
because in DataFusion SQL the
VALUES (NULL)
doesn't have typeDataType:Null
(it has some apparently arbitrarily picked typeinstead).
As a follow-up, all usages of
Array:null_count
should be inspected.The function can easily be misused (it returns "physical nulls", which
do not exist for null type).