Fix count on all null `VALUES` clause #13029

findepi · 2024-10-21T10:26:54Z

Before this change, a ValuesExec containing a NullArray incorrectly
reported its column statistics as non-null. This misled the
AggregateStatistics optimizer, which folded count(always_null)
into the row count instead of 0.

This commit fixes the column statistics derivation for values with
NullArray and therefore fixes execution of logical plans with count
over such values.

Note that the bug was not reproducible using the DataFusion SQL frontend,
because in DataFusion SQL VALUES (NULL) doesn't have type
DataType::Null (it has some apparently arbitrarily picked type
instead).

As a follow-up, all usages of Array::null_count should be inspected.
The function can easily be misused (it returns "physical nulls", which
do not exist for the null type).
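
To make the failure mode concrete, here is a minimal sketch in plain Rust. It does not use arrow-rs or DataFusion: `Column`, `physical_null_count`, `logical_null_count`, and `folded_count` are hypothetical stand-ins, and the "rows minus nulls" folding rule is an assumption about how a count could be derived from statistics.

```rust
/// Hypothetical, simplified stand-in for an Arrow array (not the arrow-rs API).
enum Column {
    /// A column of the null type: it has a length but no validity bitmap.
    NullTyped { len: usize },
    /// A typed column where `None` marks a null recorded in the bitmap.
    Int64(Vec<Option<i64>>),
}

impl Column {
    fn len(&self) -> usize {
        match self {
            Column::NullTyped { len } => *len,
            Column::Int64(values) => values.len(),
        }
    }

    /// Counts only nulls that are physically recorded in a validity bitmap.
    /// A null-typed column has no bitmap, so this reports 0.
    fn physical_null_count(&self) -> usize {
        match self {
            Column::NullTyped { .. } => 0,
            Column::Int64(values) => values.iter().filter(|v| v.is_none()).count(),
        }
    }

    /// Counts every slot that is semantically null.
    /// For a null-typed column that is every row.
    fn logical_null_count(&self) -> usize {
        match self {
            Column::NullTyped { len } => *len,
            Column::Int64(_) => self.physical_null_count(),
        }
    }
}

/// Assumed folding rule: count(col) = number of rows - number of nulls.
fn folded_count(num_rows: usize, null_count: usize) -> usize {
    num_rows - null_count
}
```

With a three-row null-typed column, feeding the physical count into `folded_count` yields 3, while the logical count yields the expected 0.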

Relates to: Fix count(null) and count(distinct null) #8511

findepi · 2024-10-21T10:29:26Z

datafusion/physical-plan/src/common.rs

@@ -156,7 +156,11 @@ pub fn compute_record_batch_statistics(
    for partition in batches.iter() {
        for batch in partition {
            for (stat_index, col_index) in projection.iter().enumerate() {
-                null_counts[stat_index] += batch.column(*col_index).null_count();


The meaning of this line was apparently changed in apache/arrow-rs#4691.
At least it seems so from this diff line: apache/arrow-rs@979a070#diff-0cfddb6ef017ce20c5e7f528095956b2433cf2ea2d17e28ba4b515b7c1bd2e57L179

findepi · 2024-10-21T11:09:52Z

datafusion/physical-plan/src/common.rs

+                null_counts[stat_index] += batch
+                    .column(*col_index)
+                    .logical_nulls()
+                    .map(|nulls| nulls.null_count())


This could be simplified back if we had something like apache/arrow-rs#6608
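
A rough sketch of what such a simplification might look like, in plain Rust without arrow-rs. `NullMask`, `ColumnLike`, `AllNull`, `NoNulls`, and `total_logical_nulls` are hypothetical names, and treating a missing mask as zero nulls is an assumption of this sketch rather than something shown in the diff above.

```rust
/// Hypothetical stand-in for a validity mask; `true` marks a null slot.
struct NullMask(Vec<bool>);

impl NullMask {
    fn null_count(&self) -> usize {
        self.0.iter().filter(|is_null| **is_null).count()
    }
}

/// Hypothetical trait modelling only the part of an array relevant here.
trait ColumnLike {
    /// Returns a mask of semantically null slots, or `None` if there are none.
    fn logical_nulls(&self) -> Option<NullMask>;

    /// Convenience wrapper so callers do not repeat the `map` chain.
    /// Assumption: a missing mask means zero nulls.
    fn logical_null_count(&self) -> usize {
        self.logical_nulls()
            .map(|nulls| nulls.null_count())
            .unwrap_or(0)
    }
}

/// A column where every slot is null, like a null-typed array.
struct AllNull {
    len: usize,
}

impl ColumnLike for AllNull {
    fn logical_nulls(&self) -> Option<NullMask> {
        Some(NullMask(vec![true; self.len]))
    }
}

/// A column that has no nulls and therefore no mask.
struct NoNulls;

impl ColumnLike for NoNulls {
    fn logical_nulls(&self) -> Option<NullMask> {
        None
    }
}

/// Accumulates null counts across columns, as a statistics pass might.
fn total_logical_nulls(columns: &[&dyn ColumnLike]) -> usize {
    columns.iter().map(|c| c.logical_null_count()).sum()
}
```

With a helper like this, the accumulation site shrinks back to a single method call per column.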

findepi · 2024-10-21T11:13:49Z

@alamb @joroKr21 please take a look

joroKr21

Ugh yeah, I don't know why arrow-rs made that choice. To me the "physical number of nulls" seems kinda useless. I care about the semantics, not the implementation details 😄

alamb · 2024-10-21T15:23:15Z

> Ugh yeah, I don't know why arrow-rs made that choice. To me the "physical number of nulls" seems kinda useless. I care about the semantics, not the implementation details 😄

The reason is performance, though I agree it is quite confusing. The rationale, I think, is that arrow-rs tries to give the user maximal control, but that sometimes makes the API harder to reason about.

I tried to clarify this in the docs, but I am also still not happy with how easy it is to get confused:
https://docs.rs/arrow/latest/arrow/array/trait.Array.html#tymethod.nulls

alamb

Thank you @findepi and @joroKr21 -- this makes sense to me

joroKr21 · 2024-10-21T15:30:17Z

> I tried to clarify this in the docs, but I am also still not happy with how easy it is to get confused:
> https://docs.rs/arrow/latest/arrow/array/trait.Array.html#tymethod.nulls

> The physical representation is efficient, but is sometimes non intuitive for certain array types such as those with nullable child arrays like DictionaryArray::values, RunArray::values or UnionArray, or without a null buffer, such as NullArray.

Yeah, I haven't read the Arrow spec so I'm not sure why you would do that (encode nulls in the child arrays)
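
As an illustration of the dictionary case mentioned in the quoted docs, here is a small plain-Rust model. `DictColumn` and its methods are hypothetical and only mimic the idea of keys pointing into a values array that may itself contain nulls.

```rust
/// Hypothetical, simplified dictionary-encoded column (not the arrow-rs type).
struct DictColumn {
    /// One key per row; `None` is a null recorded on the keys themselves.
    keys: Vec<Option<usize>>,
    /// Dictionary values; an entry may itself be null.
    values: Vec<Option<&'static str>>,
}

impl DictColumn {
    /// Nulls visible in the keys' own validity information only.
    fn physical_null_count(&self) -> usize {
        self.keys.iter().filter(|k| k.is_none()).count()
    }

    /// Rows that are semantically null: a null key, or a key that
    /// points at a null dictionary value.
    fn logical_null_count(&self) -> usize {
        self.keys
            .iter()
            .filter(|k| match **k {
                None => true,
                Some(i) => self.values[i].is_none(),
            })
            .count()
    }
}
```

In this model, a row whose key points at a null dictionary entry is missed by the physical count but included in the logical one.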

Test Count accumulator with all-nulls

1e046f5

github-actions bot added the physical-expr, optimizer, core, and functions labels Oct 21, 2024

findepi force-pushed the findepi/count-null-only branch 3 times, most recently from 9d23575 to dae2610 Compare October 21, 2024 10:56

findepi force-pushed the findepi/count-null-only branch from dae2610 to a97d895 Compare October 21, 2024 11:00

github-actions bot removed the optimizer label Oct 21, 2024

This was referenced Oct 21, 2024

Prevent take_optimizable from discarding arbitrary plan node #13030

Closed

Add Array::logical_null_count for inspecting number of null values apache/arrow-rs#6608

Merged

joroKr21 approved these changes Oct 21, 2024

alamb approved these changes Oct 21, 2024

alamb changed the title from "Fix count on null values" to "Fix count on all null `VALUES` clause" Oct 21, 2024

alamb merged commit 34fbe8e into apache:main Oct 21, 2024
26 checks passed

findepi deleted the findepi/count-null-only branch October 22, 2024 06:41

findepi mentioned this pull request Oct 22, 2024

[41] Fix count on all null VALUES clause sdf-labs/arrow-datafusion#70

Merged
