builtins: use Youngs-Crammer algorithm for aggregation functions #55268

mneverov · 2020-10-06T20:13:29Z

builtins: use Youngs-Crammer algorithm for aggregation functions

This commit replaces the existing algorithm for calculating correlation with the Youngs-Cramer algorithm, ported from PostgreSQL, to reduce rounding errors. It also introduces the base structure for statistical aggregate functions.
It also amends the aggregate builtin tests so that functions with several arguments can be tested.

Release note: None

Relates to #41274

cockroach-teamcity · 2020-10-06T20:13:32Z

All committers have signed the CLA.

rohany · 2020-10-06T20:36:32Z

cc @yuzefovich

blathers-crl · 2020-10-07T20:03:18Z

Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

We notice you have more than one commit in your PR. We try to break logical changes into separate commits, but commits such as "fix typo" or "address review commits" should be squashed into one commit and pushed with --force.
Please ensure your git commit message contains a release note.
When CI has completed, please ensure no errors have appeared.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

mneverov · 2020-10-11T13:20:35Z

The tests for multiple nodes and with distributed SQL mode "on" are failing because the correlation calculation depends on the order of the values.
This is true for the old method as well as for the Youngs-Cramer algorithm ported from Postgres.

Consider the following example:

-- Postgres

CREATE TABLE aggtest
(
    a double precision,
    b double precision
);

insert into aggtest (a, b)
values (0, 0.09561),
       (12, 124.78),
       (42, 324.78),
       (42, 324.78),
       (42, 384.78),
       (42, 324.78),
       (56, 7.8),
       (56, 7.8),
       (56, 7.8),
       (56, 7.8),
       (56, 324.78),
       (87, 127.001),
       (100, 99.097),
       (100, 99.097),
       (100, 99.097);

SELECT corr(b, a)
FROM aggtest;

-- PG:             -0.14162191454732073
-- Old algorithm:  -0.14162191454732112
-- Youngs-Cramer:  -0.14162191454732073

truncate aggtest;

-- the same values in different order
insert into aggtest (a, b)
values (56, 7.8),
       (56, 7.8),
       (42, 324.78),
       (56, 7.8),
       (100, 99.097),
       (42, 324.78),
       (56, 7.8),
       (100, 99.097),
       (100, 99.097),
       (42, 324.78),
       (56, 324.78),
       (42, 384.78),
       (12, 124.78),
       (0, 0.09561),
       (87, 127.001)
;

SELECT corr(b, a)
FROM aggtest;

-- PG:             -0.1416219145473208
-- Old algorithm:  -0.14162191454732093
-- Youngs-Cramer:  -0.1416219145473208
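
The order sensitivity in the results above comes down to floating-point addition not being associative. A minimal Go sketch of that effect, using a hypothetical `sum` helper that adds values left to right the way a streaming aggregate would:

```go
package main

import "fmt"

// sum adds the values left to right, one at a time.
func sum(vals []float64) float64 {
	var s float64
	for _, v := range vals {
		s += v
	}
	return s
}

func main() {
	// The same three values, summed in two different orders.
	forward := sum([]float64{0.1, 0.2, 0.3})
	backward := sum([]float64{0.3, 0.2, 0.1})
	fmt.Println(forward == backward, forward, backward)
}
```

The two totals differ only in the last bit or so, which is the same scale as the differences between the two `corr` runs above.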

My understanding, based on the design doc, is that it's possible to define an order on the output stream, so that intervals from different nodes would merge as if the operation was executed locally (i.e. rows will be ordered based on ordinal number).

@yuzefovich could you please confirm that my understanding is correct and advise how I can define the order on the corr op?

yuzefovich

Thanks for taking this on and for the thorough investigation of this precision issue! I agree with your assessment that the order of float operands seems to matter, so we need to think of some remedy.

> My understanding, based on the design doc, is that it's possible to define an order on the output stream, so that intervals from different nodes would merge as if the operation was executed locally (i.e. rows will be ordered based on ordinal number).

I think your understanding is correct, but I don't like the idea of requiring an ordering just to get a deterministic result when we're talking about such tiny differences in precision (beyond the 15 digits documented by Postgres). I'll confirm with my colleagues tomorrow, but I think we should just modify the tests to allow for some deviation, i.e. instead of something like

query R
SELECT corr(y, x)::decimal FROM statistics_agg_test
----
0.045228963191363145

we would have something like

query B
SELECT abs(corr(y, x)::decimal-0.045228963191363145) < power(10, -15) FROM statistics_agg_test
----
true

My hesitation about imposing an ordering comes from the fact that it will likely have performance implications (we might need to plan a sort before the aggregator), and we're already satisfying the precision requirements mentioned by Postgres. I believe some non-determinism is acceptable in this case. What do you think?

Reviewed 1 of 1 files at r1, 1 of 1 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mneverov)

pkg/sql/sem/builtins/aggregate_builtins.go, line 1739 at r1 (raw file):

}

/*

nit: we tend to use // for all comments, not /* and */.

pkg/sql/sem/builtins/aggregate_builtins.go, line 1758 at r1 (raw file):

 * to minimize code space instead.
 */

nit: we usually don't have an empty line between a comment and a struct.

pkg/sql/sem/builtins/aggregate_builtins.go, line 1759 at r1 (raw file):

 */

type regressionAccumulate struct {

super nit: I'd probably do s/accumulate/accumulator/.

pkg/sql/sem/builtins/aggregate_builtins.go, line 1892 at r1 (raw file):

	}

	err = a.regAcc.add(y, x)

nit: could combine these two lines into one.

pkg/sql/sem/builtins/aggregate_builtins.go, line 1910 at r1 (raw file):

// Reset implements tree.AggregateFunc interface.
func (a *corrAggregate) Reset(context.Context) {
	a.regAcc.n = 0

nit: could replace all the lines with this

a.regAcc = regressionAccumulate{}

yuzefovich · 2020-10-13T21:09:30Z

Another idea, suggested by @asubiotto, is to adjust the logic test harness itself to introduce a separate type marker for floats (separate from decimals) that allows for a deviation on the order of 1e-15. I'll look into that shortly.

yuzefovich · 2020-10-13T22:53:16Z

I opened up #55530, and I would be interested if you could try it out (note that you'll need to modify the query invocation to use F type instead of R for floating point numbers that are non-deterministic).

mneverov · 2020-10-14T06:22:37Z

> I opened up #55530, and I would be interested if you could try it out (note that you'll need to modify the query invocation to use F type instead of R for floating point numbers that are non-deterministic).

@yuzefovich thanks!
I just tried it, and all logic tests for the aggregate test case pass. I will update this PR when #55530 is merged into master.

> My hesitation about imposing an ordering comes from the fact that it will likely have performance implications (we might need to plan a sort before the aggregator), and we're already satisfying the precision requirements mentioned by Postgres. I believe some non-determinism is acceptable in this case. What do you think?

I haven't ever needed 16 digits of precision, so I'm OK with this. Maybe it is worth mentioning the deviation in the docs?

55530: logictest: introduce a separate matcher for floats r=yuzefovich a=yuzefovich

This commit adds another type annotation `F` intended to be used for floating point numbers which have a separate matcher that allows for some deviation in the precision. Namely, the matcher checks that the expected and the actual numbers match in 15 significant decimal digits. Note that I think the matching algorithm introduces some rounding errors, so it might not work perfectly in all scenarios, yet it should be good enough in rare test cases when we need it.

Informs: #55268.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
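
To make the idea of matching in 15 significant decimal digits concrete, here is a rough Go sketch. It is not the matcher added in #55530; the helper `matchSigDigits` is hypothetical and simply compares the two numbers after formatting each in scientific notation with the requested number of significant digits.

```go
package main

import (
	"fmt"
	"strconv"
)

// matchSigDigits is a hypothetical helper: it reports whether a and b are
// identical once each is rounded to `digits` significant decimal digits.
func matchSigDigits(a, b float64, digits int) bool {
	// The 'e' format keeps one digit before the decimal point, so the
	// precision argument is digits-1.
	fa := strconv.FormatFloat(a, 'e', digits-1, 64)
	fb := strconv.FormatFloat(b, 'e', digits-1, 64)
	return fa == fb
}

func main() {
	// The two Postgres results quoted earlier in this thread.
	fmt.Println(matchSigDigits(-0.14162191454732073, -0.1416219145473208, 15))
	// A value that differs in the 14th significant digit.
	fmt.Println(matchSigDigits(-0.14162191454732073, -0.14162191454733, 15))
}
```

A comparison of this kind can still reject two very close values that happen to fall on opposite sides of a rounding boundary, which fits the caveat in the commit message that the matcher may not work perfectly in all scenarios.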

yuzefovich · 2020-10-15T16:10:55Z

#55530 has just merged.

> I haven't ever needed 16 digits of precision, so I'm OK with this. Maybe it is worth mentioning the deviation in the docs?

I think that floating point precision is a pretty confusing topic, and talking about it in the docs might only add to the confusion. I'd expect most people to not care whether there is non-determinism (depending on the distribution of the execution) in the result with float type, so I'd probably not call it out.

blathers-crl · 2020-10-16T17:09:32Z

Thank you for updating your pull request.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

mneverov

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)

pkg/sql/sem/builtins/aggregate_builtins.go, line 1759 at r1 (raw file):

Previously, yuzefovich wrote…

super nit: I'd probably do s/accumulate/accumulator/.

Sorry, I didn't get it. Could you please explain this one, if it is still relevant with the latest changes?

pkg/sql/sem/builtins/aggregate_builtins.go, line 1892 at r1 (raw file):

Previously, yuzefovich wrote…

nit: could combine these two lines into one.

👍

mneverov · 2020-10-17T05:47:15Z

@yuzefovich could you please have another look?

The PR got quite large, so I split it in two: this one ports the algorithm from Postgres so that it can be reused in other aggregation functions later. I also found that some tests for corr were missing.
I can do the actual implementation of covar_pop as a separate PR, or as another commit to this one and then squash after the review.

yuzefovich

Thanks! LGTM with some nits.

I think it's reasonable to have this PR as a standalone, and then work on covar_pop in a different PR.

nit: you should update the PR title to the commit title.

Reviewed 4 of 4 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @mneverov)

pkg/sql/sem/builtins/aggregate_builtins.go, line 1759 at r1 (raw file):

Previously, mneverov (Max Neverov) wrote…

Sorry, I didn't get it. Could you please explain this one, if it is still relevant with the latest changes?

The notation s/abc/xyz/ means "replace abc with xyz" (comes from sed command).

My nit comes from the fact that I think that "accumulate" is not a noun (unlike "aggregate"), and for the struct names we prefer nouns, so regressionAccumulator or regressionAccumulatorBase would sound better to me.

pkg/sql/sem/builtins/aggregate_builtins.go, line 1789 at r3 (raw file):

// Reset implements tree.AggregateFunc interface.
func (a *regressionAccumulateBase) Reset(context.Context) {
	a.n = 0

nit: all these lines could be replaced with *a = regressionAccumulateBase{}.

pkg/sql/sem/builtins/aggregate_builtins.go, line 1802 at r3 (raw file):

// Size implements tree.AggregateFunc interface.
func (a *regressionAccumulateBase) Size() int64 {
	return sizeOfCorrAggregate

nit: we should define a separate sizeOfRegressionAccumulateBase and use it here.

pkg/sql/sem/builtins/aggregate_builtins_test.go, line 75 at r3 (raw file):

}

func pivotArgs(args ...[]tree.Datum) [][]tree.Datum {

nit: why is this function called "pivot"? It seems to me it's more like "flattenArgs".

mneverov

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @yuzefovich)

pkg/sql/sem/builtins/aggregate_builtins.go, line 1759 at r1 (raw file):

Previously, yuzefovich wrote…

The notation s/abc/xyz/ means "replace abc with xyz" (comes from sed command).

My nit comes from the fact that I think that "accumulate" is not a noun (unlike "aggregate"), and for the struct names we prefer nouns, so regressionAccumulator or regressionAccumulatorBase would sound better to me.

got it, thanks.

pkg/sql/sem/builtins/aggregate_builtins.go, line 1789 at r3 (raw file):

Previously, yuzefovich wrote…

nit: all these lines could be replaced with *a = regressionAccumulateBase{}.

missed that, fixed, thanks

pkg/sql/sem/builtins/aggregate_builtins_test.go, line 75 at r3 (raw file):

Previously, yuzefovich wrote…

nit: why is this function called "pivot"? It seems to me it's more like "flattenArgs".

Thanks, "flattenArgs" indeed describes what it does better.

yuzefovich

bors r=yuzefovich

Reviewed 2 of 2 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)

craig · 2020-10-17T20:14:34Z

Build succeeded:

GitHub CI (Cockroach)

blathers-crl bot added the O-community (Originated from the community) and X-blathers-triaged (blathers was able to find an owner) labels Oct 6, 2020

blathers-crl bot requested a review from rohany October 6, 2020 20:13

rohany requested review from yuzefovich and removed request for rohany October 6, 2020 20:36

cockroachdb deleted a comment from blathers-crl bot Oct 6, 2020

yuzefovich requested a review from a team October 7, 2020 21:05

yuzefovich reviewed Oct 13, 2020

yuzefovich mentioned this pull request Oct 13, 2020

logictest: introduce a separate matcher for floats #55530

Merged

mneverov force-pushed the pop_covar branch from 89b1e28 to c565771 Compare October 16, 2020 17:09

mneverov force-pushed the pop_covar branch 2 times, most recently from d5d8ab1 to 6b24abc Compare October 16, 2020 19:28

mneverov commented Oct 17, 2020

mneverov changed the title ~~WIP Support aggregate functions~~ Support aggregate functions Oct 17, 2020

mneverov marked this pull request as ready for review October 17, 2020 05:47

yuzefovich approved these changes Oct 17, 2020

mneverov force-pushed the pop_covar branch from 6b24abc to 3e08686 Compare October 17, 2020 17:54

mneverov changed the title ~~Support aggregate functions~~ builtins: use Youngs-Crammer algorithm for aggregation functions Oct 17, 2020

mneverov commented Oct 17, 2020

yuzefovich approved these changes Oct 17, 2020

craig bot merged commit d752fa2 into cockroachdb:master Oct 17, 2020

mneverov deleted the pop_covar branch October 18, 2020 09:50

mneverov mentioned this pull request Oct 28, 2020

builtins: implement covar_pop and covar_samp aggregation functions #55707

Merged
