
Add ColumnStatistics::Sum #14074

Merged (10 commits) into apache:main on Jan 28, 2025
Conversation

@gatesn (Contributor) commented on Jan 10, 2025

Which issue does this PR close?

This PR adds a sum statistic to DataFusion.

Future uses include optimizing aggregation functions (sum, avg, count); see https://github.com/apache/datafusion/pull/13736/files for examples.

Are there any user-facing changes?

The ColumnStatistics struct has an extra public sum_value field.
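For illustration, a minimal sketch of how a provider with pre-computed totals might populate the new field (the other field names follow the existing ColumnStatistics struct; the concrete values here are made up):

```rust
use datafusion_common::stats::{ColumnStatistics, Precision};
use datafusion_common::ScalarValue;

// Hypothetical example: a source that already knows the column's totals
// can report an exact sum alongside min/max and null count.
fn example_column_stats() -> ColumnStatistics {
    ColumnStatistics {
        null_count: Precision::Exact(0),
        min_value: Precision::Exact(ScalarValue::Int64(Some(1))),
        max_value: Precision::Exact(ScalarValue::Int64(Some(100))),
        sum_value: Precision::Exact(ScalarValue::Int64(Some(5_050))),
        distinct_count: Precision::Absent,
    }
}
```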

The github-actions bot added the physical-expr, core, common, and proto labels on Jan 10, 2025
@alamb changed the title from "Add a sum statistic" to "Add a ColumnStatistics::Sum" on Jan 12, 2025
@alamb (Contributor) left a comment

Thank you @gatesn -- I think this is a nice addition.

It looks like the cargo fmt test is failing

Ideally we would add unit test coverage for Precision::multiply, Precision::sub, and Precision::cast_to before we merge.

Thanks again -- excited to see this working

FYI @suremarc @berkaysynnada / @ozankabak as this changes statistics and I think you are already working on things related to that:
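For reference, a rough sketch of what that unit coverage might look like (whether multiply is exercised on Precision<usize> or Precision<ScalarValue> is an assumption here; cast_to follows the signature added in this PR):

```rust
use arrow::datatypes::DataType;
use datafusion_common::stats::Precision;
use datafusion_common::{Result, ScalarValue};

#[test]
fn multiply_preserves_exactness() {
    // Exact * Exact stays Exact; anything involving Absent collapses to Absent.
    assert_eq!(
        Precision::Exact(3_usize).multiply(&Precision::Exact(7_usize)),
        Precision::Exact(21_usize)
    );
    assert_eq!(
        Precision::Exact(3_usize).multiply(&Precision::Absent),
        Precision::Absent
    );
}

#[test]
fn cast_to_keeps_exactness() -> Result<()> {
    // Casting an exact Int32 value to Int64 should stay exact.
    let sum = Precision::Exact(ScalarValue::Int32(Some(42)));
    assert_eq!(
        sum.cast_to(&DataType::Int64)?,
        Precision::Exact(ScalarValue::Int64(Some(42)))
    );
    Ok(())
}
```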

@@ -436,6 +492,8 @@ pub struct ColumnStatistics {
pub max_value: Precision<ScalarValue>,
/// Minimum value of column
pub min_value: Precision<ScalarValue>,
/// Sum value of a column
pub sum_value: Precision<ScalarValue>,

As I think we mentioned in #13736, my only real concern with this addition is that it will make ColumnStatistics even bigger (each ScalarValue is quite large already, and ColumnStatistics get copied a lot).

However, I think the "right" fix for that is to move to a different statistics representation (e.g. Arc<ColumnStatistics>), so I don't see this as a blocker.
datafusion/common/src/stats.rs (resolved review thread)
@alamb changed the title from "Add a ColumnStatistics::Sum" to "Add ColumnStatistics::Sum" on Jan 12, 2025
(_, _) => Precision::Absent,
}
}

/// Casts the value to the given data type, propagating exactness information.
pub fn cast_to(&self, data_type: &DataType) -> Result<Precision<ScalarValue>> {
@gatesn (author) commented on Jan 12, 2025

@alamb one question I have is whether this should return a Result, or whether we should assume that a failed cast implies overflow and therefore return Precision::Absent.

The caller (currently in cross-join) unwraps to Absent; I just didn't know whether to internalize that here.

Edit: I decided it was better to propagate the error and allow the caller to decide. It was more useful in a couple of places.
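A small sketch of the two call-site styles being weighed here (the helper names are hypothetical; cast_to follows the signature shown above):

```rust
use arrow::datatypes::DataType;
use datafusion_common::stats::Precision;
use datafusion_common::{Result, ScalarValue};

// Option A: swallow the failure and treat the statistic as unavailable,
// which is what the cross-join call site effectively does today.
fn cast_or_absent(v: &Precision<ScalarValue>, dt: &DataType) -> Precision<ScalarValue> {
    v.cast_to(dt).unwrap_or(Precision::Absent)
}

// Option B (what the PR settled on): propagate the error and let each
// caller decide whether a failed cast means "overflow" or a real bug.
fn cast_strict(v: &Precision<ScalarValue>, dt: &DataType) -> Result<Precision<ScalarValue>> {
    v.cast_to(dt)
}
```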

@gatesn (author) left a comment

Added some tests; hopefully that appeases the linter!

@berkaysynnada (Contributor) commented

> FYI @suremarc @berkaysynnada / @ozankabak as this changes statistics and I think you are already working on things related to that:

We've started to refactor. The design is complete, and the implementation is in progress.

I’ve taken a look at this and have some questions. For example, are we planning to add many types of functions to statistics, or is there a defined list of statistics that can be inferred from the sources or have meaningful applications in optimizer rules? If we agree that these kinds of extensions to column statistics are indeed useful and obtainable, then we can proceed with merging this. We would also ensure it is included in the new setup.

FYI @ozankabak

@alamb (Contributor) commented on Jan 13, 2025

> We've started to refactor. The design is complete, and the implementation is in progress.

Thanks! Is there anywhere I can follow along @berkaysynnada (I am particularly interested in what the final API / representation looks like)

@berkaysynnada (Contributor) commented

> We've started to refactor. The design is complete, and the implementation is in progress.
>
> Thanks! Is there anywhere I can follow along @berkaysynnada (I am particularly interested in what the final API / representation looks like)

I've reached out to you via Discord.

@alamb (Contributor) commented on Jan 14, 2025

> We've started to refactor. The design is complete, and the implementation is in progress.
>
> Thanks! Is there anywhere I can follow along @berkaysynnada (I am particularly interested in what the final API / representation looks like)
>
> I've reached you via discord

For anyone else who is interested, the draft PR in the synnada fork is here:

@gatesn (author) commented on Jan 15, 2025

Looks like I got hit by some new ColumnStatistics tests on main. Should be fixed now 🤞

@berkaysynnada can you expand on the rationale for the V2 stats? I understand that it's more expressive, but I can't see in the PR or Notion how those distributions might actually be used? Is this for join planning?

My understanding is I would no longer define a "min" or a "max" for a column. But there doesn't seem to be a place to define null count or sum?

@berkaysynnada (Contributor) commented

> Looks like I got hit by some new ColumnStatistics tests on main. Should be fixed now 🤞
>
> @berkaysynnada can you expand on the rationale for the V2 stats? I understand that it's more expressive, but I can't see in the PR or Notion how those distributions might actually be used? Is this for join planning?
>
> My understanding is I would no longer define a "min" or a "max" for a column. But there doesn't seem to be a place to define null count or sum?

You can still define min or max. We are not replacing Statistics with Statistics_v2; the new representation actually replaces the Precision and Interval objects. We plan to rename the execution plan API from fn statistics(&self) -> Statistics to fn statistics(&self) -> TableStatistics, which is still structured as:

pub struct TableStatistics {
    pub num_rows: Statistics,
    pub total_byte_size: Statistics,
    pub column_statistics: Vec<ColumnStatistics>,
}

and

pub struct ColumnStatistics {
    pub null_count: Statistics,
    pub max_value: Statistics,
    pub min_value: Statistics,
    pub distinct_count: Statistics,
}

What we are trying to address is that indeterminate quantities are currently handled in a target-dependent way. For example, when the caller requires an estimate, an indeterminate statistic is stored as a mean value; however, if bounds are required, that indeterminism is stored as an interval.

Our goal is to consolidate all forms of indeterminism and structure them with a strong mathematical foundation. This way, every user can utilize the statistics in their intended way. We aim to preserve and sustain all possible helpful quantities wherever feasible.

We are also constructing a robust evaluation and back-propagation mechanism (similar to interval arithmetic, evaluate_bounds, and propagate_constraints). With this mechanism, any kind of expression—whether projection-based (evaluation only) or filter-based (evaluation followed by propagation)—will automatically resolve using the new statistics.
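For context, the wrapper being consolidated is today's Precision enum (simplified here, with trait bounds omitted), which forces every indeterminate quantity into one of three coarse states:

```rust
// Today's representation in datafusion/common/src/stats.rs (simplified).
pub enum Precision<T> {
    Exact(T),   // the value is known exactly
    Inexact(T), // the value is an estimate (e.g. a mean or sampled value)
    Absent,     // nothing is known
}
```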

@alamb (Contributor) commented on Jan 22, 2025

@berkaysynnada can we merge this PR in now? Or shall we wait for the statistics revamp that is underway?

@berkaysynnada (Contributor) commented

> @berkaysynnada can we merge this PR in now? Or shall we wait for the statistics revamp that is underway?

No need to wait for the underway PR, as it does not depend on which statistics an operator has. It is about how these statistics are stored, computed, and used.

But still, I wonder whether we're planning to support a wide variety of statistical quantities -- like sum -- or whether there is a specific set of statistics that can be inferred from the sources or have practical applications in optimizer rules?

If we agree that extending column statistics in this way is both useful and feasible for any user, we can move forward with merging this. We’ll also make sure it’s integrated into the new setup.

@gatesn (author) commented on Jan 23, 2025

I can't think of any other statistical quantities that would immediately help operators, so from our perspective it's only "sum" (we may also use sum to mean true-count for booleans).

If this lands I can follow up with a PR to start using it in SUM, AVG operators. I guess the more contentious API change was adding compute_statistics to the Expr trait: https://github.com/apache/datafusion/pull/13736/files#diff-2b3f5563d9441d3303b57e58e804ab07a10d198973eed20e7751b5a20b955e42R156-R158

@berkaysynnada is this something that would also remain compatible with the V2 API? I believe it is

@berkaysynnada (Contributor) commented on Jan 23, 2025

> I can't think of any other statistical quantities that would immediately help operators, so from our perspective it's only "sum" (we may also use sum to mean true-count for booleans).
>
> If this lands I can follow up with a PR to start using it in SUM, AVG operators. I guess the more contentious API change was adding compute_statistics to the Expr trait: https://github.com/apache/datafusion/pull/13736/files#diff-2b3f5563d9441d3303b57e58e804ab07a10d198973eed20e7751b5a20b955e42R156-R158
>
> @berkaysynnada is this something that would also remain compatible with the V2 API? I believe it is

As I understand it, the whole statistics concept was created to help with optimization decisions by informing the optimizer rules about the data that flows into any execution plan node. What I couldn't understand is how "sum" information is helpful in any kind of optimization process.

> to start using it in SUM, AVG operators

Please correct me if I misunderstand your intention here and in #13736: you propose to add this "sum" info so that a result can be produced from it as normal batch data? Why can't you just use an AggregateExec with a sum accumulator?

As I said, the V2 API does not change which kinds of statistics are preserved in the Statistics struct; it is more about consolidating the Precision and Interval objects to represent and compute any kind of statistical quantity.

@gatesn (author) commented on Jan 23, 2025

Statistics can be helpful for optimizer rules, but they also allow short-circuiting computations. For example, min/max can be used to avoid evaluating a filter over a record batch and quickly throw away the whole thing.

Sum statistics help with short-circuiting aggregation functions. For example, SELECT SUM(a) FROM foo becomes a constant time operation. Similarly, AVG(a) can be computed with sum / row count.

> Why cannot you just use an AggregateExec having a sum accumulator?

Because our file format already stores a pre-computed sum.
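To make the intended optimization concrete, a hypothetical sketch (the helper name and the exact-only policy are assumptions, not the PR's API) of answering AVG from statistics instead of scanning:

```rust
use datafusion_common::stats::Precision;
use datafusion_common::ScalarValue;

// Hypothetical: derive AVG(a) = SUM(a) / COUNT(a) from exact statistics
// (ignoring nulls for simplicity), falling back to a normal AggregateExec
// scan otherwise.
fn avg_from_stats(
    sum_value: &Precision<ScalarValue>,
    num_rows: &Precision<usize>,
) -> Option<ScalarValue> {
    match (sum_value, num_rows) {
        (Precision::Exact(ScalarValue::Int64(Some(sum))), Precision::Exact(n)) if *n > 0 => {
            Some(ScalarValue::Float64(Some(*sum as f64 / *n as f64)))
        }
        // Inexact or absent statistics cannot safely short-circuit the query.
        _ => None,
    }
}
```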

@berkaysynnada (Contributor) commented on Jan 23, 2025

> Statistics can be helpful for optimizer rules, but they also allow short-circuiting computations. For example, min/max can be used to avoid evaluating a filter over a record batch and quickly throw away the whole thing.
>
> Sum statistics help with short-circuiting aggregation functions. For example, SELECT SUM(a) FROM foo becomes a constant time operation. Similarly, AVG(a) can be computed with sum / row count.
>
> > Why cannot you just use an AggregateExec having a sum accumulator?
>
> Because our file format already stores a pre-computed sum.

Thanks for the explanation. I see the reason now, and it makes sense when you have such pre-computed values.

@alamb (Contributor) commented on Jan 23, 2025

I merged this branch up from main and triggered the CI again. If there are no additional concerns I hope to merge this in a day or two

@gatesn (author) commented on Jan 28, 2025

Any other blockers @alamb ? Thanks for hustling this through

@ozankabak (Contributor) commented

LGTM

@alamb (Contributor) commented on Jan 28, 2025

> Any other blockers @alamb ? Thanks for hustling this through

I am somewhat overwhelmed with

And I haven't had a chance to fully think about the downstream implications of this PR, or had the bandwidth to pull the trigger and potentially add some other issues to the 45 release.

So no blockers from me; I just hadn't gotten up the guts to merge it yet.

@alamb (Contributor) commented on Jan 28, 2025

WFT let's do it and keep things moving

@alamb merged commit f8063e8 into apache:main on Jan 28, 2025 (25 checks passed)
@alamb (Contributor) commented on Jan 28, 2025

And I broke the build 🤦 . Fix PR:
